Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi all,

Forgive me. Rather than post more code in this thread, I've created a gist with revised code that resolves some inconsistencies in what I posted here earlier:

https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3

Again, this isn't a full-featured CSV parser by any means; it assumes fairly uniform CSV. Its contribution is that it is a fairly concise XQuery implementation that works around the absence of lookahead/lookbehind regex support in XPath.

Joe
Re: [basex-talk] csv:parse in the age of XQuery 3.1
And corrected query body:

```xquery
let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row) ! replace(., '\s+', '_')
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell}
  }
```
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Sorry, a typo crept in. Here's the corrected function:

```xquery
declare function local:get-cells($row as xs:string) as xs:string* {
  (: workaround for the lack of lookahead support in XPath: end the row with a comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($string-to-analyze, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr = "1"]
  return
    if (matches($group, '^".+"$'))
    then replace($group, '^"([^"]+)"$', '$1')
    else $group/string()
};
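The trailing-comma trick in the function above can be illustrated outside XQuery. The sketch below is a rough Python translation (the function name `get_cells` and the use of `re.finditer` are my own; `fn:analyze-string` has no direct Python equivalent), shown only to make the regex's behavior concrete:

```python
import re

def get_cells(row: str) -> list[str]:
    """Split one CSV row, tolerating quoted cells that contain commas.

    Mirrors the fn:analyze-string workaround: since the regex cannot
    look ahead for a delimiter, we append a trailing comma so every
    cell -- including the last -- is followed by one.
    """
    cells = []
    # group 1 is either a (possibly repeated) quoted run or an unquoted cell
    for m in re.finditer(r'(("[^"]*")+|[^,]*),', row + ','):
        cell = m.group(1)
        # strip the surrounding quotes from quoted cells
        if re.fullmatch(r'".+"', cell):
            cell = cell[1:-1]
        cells.append(cell)
    return cells

print(get_cells('"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572'))
```

Appending the comma means the pattern never needs a lookahead to detect end-of-row; the last cell is matched by the same cell-then-comma rule as all the others.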
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi Christian,

Yes, that sounds like the culprit. Searching back through my files, Adam Retter responded on exist-open (at http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with the lack of lookahead support in XPath by pointing to an XSLT he adapted for CSV parsing, https://github.com/digital-preservation/csv-tools/blob/master/csv-to-xml_v3.xsl. I adapted this technique to XQuery, and it works on the sample case in my earlier email.

Joe

```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string {
  (: workaround lack of lookahead support in XPath: end row with comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($row, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr="1"]
  return
    if (matches($group, '^".+"$'))
    then replace($group, '^"([^"]+)"$', '$1')
    else $group/string()
};

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row)
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell}
  }
```

On Mon, Sep 12, 2016 at 10:11 AM, Christian Grün wrote:

>> Christian: I tried removing the quote escaping but still get an error.
>> Here's a small test to reproduce:
>>
>>   fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
>
> I assume it’s the lookbehind assertion that is not allowed in XQuery
> (but I should definitely spend more time on it to give you a better
> answer..).
Re: [basex-talk] csv:parse in the age of XQuery 3.1
> Christian: I tried removing the quote escaping but still get an error.
> Here's a small test to reproduce:
>
>   fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

I assume it’s the lookbehind assertion that is not allowed in XQuery (but I should definitely spend more time on it to give you a better answer..).
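Christian's diagnosis can be sanity-checked in any PCRE-style engine. The sketch below uses Python's `re` module (an assumption purely for illustration; it behaves like Java's `java.util.regex` here): the original pattern compiles because those engines support the `(?<=,)` lookbehind, while an assertion-free rewrite of the cell pattern is valid under XSD regex rules too:

```python
import re

row = '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'

# PCRE-style engines accept the lookbehind `(?<=,)`, so this compiles here,
# whereas XSD regexes (the basis of XPath/XQuery regexes) define no
# lookahead/lookbehind at all, hence BaseX's FORX0002 error.
pcre_pattern = re.compile(r'(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
print(pcre_pattern.search(row) is not None)

# An assertion-free alternative (also valid XSD regex syntax): append a
# trailing comma so every cell, including the last, is followed by a delimiter.
xsd_safe = re.compile(r'(("[^"]*")+|[^,]*),')
cells = [m.group(1) for m in xsd_safe.finditer(row + ',')]
print(cells)
```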
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi all,

Christian: I completely agree, CSV is a nightmare. One way to reduce the headaches (in, say, developing an EXPath CSV library) might be to require that the CSV pass validation by a tool such as http://digital-preservation.github.io/csv-validator/. Adam Retter presented his work on CSV Schema and CSV Validator at http://slides.com/adamretter/csv-validation. This might require the user to fix issues in the CSV first, but it would reduce the scope of variation considerably. I notice that the Jackson CSV parser leverages the notion of a schema in its imports: https://github.com/FasterXML/jackson-dataformat-csv.

Hans-Jürgen: Thanks for the pointer to your library - it looks fantastic. I look forward to trying it out.

Liam: Thanks for the info about XQuery's additional regex handling beyond XSD.

And, lastly, to keep this post still BaseX-related...

Christian: I tried removing the quote escaping but still get an error. Here's a small test to reproduce:

```xquery
xquery version "3.1";

let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
```

Joe

On Mon, Sep 12, 2016 at 7:29 AM, Christian Grün wrote:
> I didn’t check the regex in general, but one reason I think why it
> fails is the escaped quote. For example, the following query is
> illegal in XQuery 3.1…
>
>   matches('a"b', 'a\"b')
>
> …whereas the following one is ok:
>
>   matches('a"b', 'a"b')
>
> On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau wrote:
>> Cordial thanks, Liam - I was not aware of that!
>>
>> @Joe: Rule of life: when one is especially sure to be right, one is surely
>> wrong, and so was I, and right were you(r first two characters).
>>
>> Liam R. E. Quin wrote on Monday, 12 September 2016, 5:54:
>>
>> Hans-Jürgen wrote:
>>
>>> Already the first two characters (? render the expression invalid:
>>> (1) An unescaped ? is an occurrence indicator, making the preceding
>>> entity optional
>>> (2) An unescaped ( is used for grouping, it does not represent anything
>>> => there is no entity preceding the ? which the ? could make optional
>>> => error
>>
>> Actually (?: ) is a non-capturing group, defined in XPath 3.0 and
>> XQuery 3.0, based on the same syntax in other languages.
>>
>> This extension, like a number of others, is useful because the
>> expression syntax defined by XSD doesn't make use of capturing groups
>> (there's no \1 or $1 or whatever), and so it doesn't need non-capturing
>> groups, but in XPath and XQuery they are used.
>>
>> See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
>>
>> Liam
>>
>> --
>> Liam R. E. Quin
>> The World Wide Web Consortium (W3C)
Re: [basex-talk] csv:parse in the age of XQuery 3.1
I didn’t check the regex in general, but one reason I think why it fails is the escaped quote. For example, the following query is illegal in XQuery 3.1…

  matches('a"b', 'a\"b')

…whereas the following one is ok:

  matches('a"b', 'a"b')

On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau wrote:
> Cordial thanks, Liam - I was not aware of that!
>
> @Joe: Rule of life: when one is especially sure to be right, one is surely
> wrong, and so was I, and right were you(r first two characters).
>
> Liam R. E. Quin wrote on Monday, 12 September 2016, 5:54:
>
> Hans-Jürgen wrote:
>
>> Already the first two characters (? render the expression invalid:
>> (1) An unescaped ? is an occurrence indicator, making the preceding
>> entity optional
>> (2) An unescaped ( is used for grouping, it does not represent anything
>> => there is no entity preceding the ? which the ? could make optional
>> => error
>
> Actually (?: ) is a non-capturing group, defined in XPath 3.0 and
> XQuery 3.0, based on the same syntax in other languages.
>
> This extension, like a number of others, is useful because the
> expression syntax defined by XSD doesn't make use of capturing groups
> (there's no \1 or $1 or whatever), and so it doesn't need non-capturing
> groups, but in XPath and XQuery they are used.
>
> See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
>
> Liam
>
> --
> Liam R. E. Quin
> The World Wide Web Consortium (W3C)
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Cordial thanks, Liam - I was not aware of that!

@Joe: Rule of life: when one is especially sure to be right, one is surely wrong, and so was I, and right were you(r first two characters).

Liam R. E. Quin wrote on Monday, 12 September 2016, 5:54:

Hans-Jürgen wrote:

> Already the first two characters (? render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding
> entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?: ) is a non-capturing group, defined in XPath 3.0 and XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the expression syntax defined by XSD doesn't make use of capturing groups (there's no \1 or $1 or whatever), and so it doesn't need non-capturing groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax

Liam

--
Liam R. E. Quin
The World Wide Web Consortium (W3C)
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hans-Jürgen wrote:

> Already the first two characters (? render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding
> entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?: ) is a non-capturing group, defined in XPath 3.0 and XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the expression syntax defined by XSD doesn't make use of capturing groups (there's no \1 or $1 or whatever), and so it doesn't need non-capturing groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax

Liam

--
Liam R. E. Quin
The World Wide Web Consortium (W3C)
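Liam's point about `(?: )` is easy to demonstrate in any regex engine that exposes captures. A small Python sketch (my own illustration, not from the thread):

```python
import re

# A capturing group both groups and records what it matched;
# a non-capturing group (?:...) only groups, e.g. so a quantifier
# can apply to it, without creating a numbered capture.
capturing     = re.match(r'(ab)+c', 'ababc')
non_capturing = re.match(r'(?:ab)+c', 'ababc')

print(capturing.groups())      # the last repetition of the group is kept
print(non_capturing.groups())  # nothing is captured
```

In XPath this distinction matters for fn:replace and fn:analyze-string, which number only the capturing groups ($1, fn:group/@nr, …).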
Re: [basex-talk] csv:parse in the age of XQuery 3.1
@Hans-Jürgen… Nice work, thanks for the hint!

On Sun, Sep 11, 2016 at 10:23 PM, Hans-Juergen Rennau wrote:
> Joe, just in case it is of interest to you: the TopicTools framework,
> downloadable at
>
>   https://github.com/hrennau/topictools
>
> contains an XQuery-implemented, full-featured csv parser (module
> _csvParser.xqm, 212 lines). Writing XQuery tools using the framework, the
> parser is automatically added to your application code and you can declare
> command-line parameters of your tool to have (among many others) a csv-based
> data type. Having declared the parameter to have such a type (e.g. csvURI or
> csvFOX), you can forget csv, as your application code sees nothing else but
> XML. More offline, if this is interesting to you.
>
> Cheers,
> Hans
>
> PS - illustrating ...
>
> Tool invocations:
>
>   basex -b
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi Joe,

My concern is that a single regex, no matter how complex, won't do justice to parsing arbitrary CSV data. The CSV input we got so far for testing was simply too diverse (I spent 10% of my time implementing a basic CSV parser in BaseX, and 90% examining these special cases and implementing custom extensions). As an example, some comments regarding quotes:

* Quotes can serve as delimiters for a single cell, but they can also be escaped: "Hi \"John\""
* Quotes can also occur inside a string. In Excel, those quotes will be escaped via double quotes… Hi ""John"". Other parsers prefer backslashes.
* Real-life CSV data is regularly corrupt, so we'd also need to tolerate missing trailing quotes and other special cases.

On the other hand, parsing tabular texts is basically no big deal, and our current Java implementation [1] could surely be ported to XQuery.

Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/io/parse/csv/CsvParser.java

On Sun, Sep 11, 2016 at 9:44 PM, Joe Wicentowski wrote:
> Hans-Jürgen,
>
> I figured as much. I wonder if we can come up with an XSD-compliant regex
> for this purpose? It may not give us a full-featured CSV parser, but would
> handle reasonably uniform cases.
>
> Joe
>
> Sent from my iPhone
>
> On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau" wrote:
>
>> Joe, concerning your regex, I would complain, too! Already the first two
>> characters
>>
>>   (?
>>
>> render the expression invalid:
>> (1) An unescaped ? is an occurrence indicator, making the preceding entity
>> optional
>> (2) An unescaped ( is used for grouping, it does not represent anything
>> => there is no entity preceding the ? which the ? could make optional =>
>> error
>>
>> Please keep in mind that the regex flavor supported by XPath is the regex
>> flavor defined by the XSD spec. There are a few constructs used in Perl & Co
>> which are not defined in XPath regex.
>>
>> What concerns the CSV implementation, I came to realize my error: the
>> BaseX implementation *is* Java code, not XQuery code - the xqm module just
>> contains the function signature, marked "external".
>>
>> Cheers,
>> Hans
>>
>> Joe Wicentowski wrote on Sunday, 11 September 2016, 21:27:
>>
>> Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and
>> Christian.
>>
>> The other day, short of a comprehensive solution, I went in search of
>> a regex that would handle quoted values that contain commas that
>> shouldn't serve as delimiters. I found one that worked in eXist but
>> not in BaseX.
>>
>> Source for the regex: http://stackoverflow.com/a/13259681/659732
>>
>> The query:
>>
>> ```
>> xquery version "3.1";
>>
>> let $csv := 'Author,Title,ISBN,Binding,Year Published
>> Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
>> James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
>> Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
>> "Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
>> Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
>> let $lines := tokenize($csv, '\n')
>> let $header-row := fn:head($lines)
>> let $body-rows := fn:tail($lines)
>> let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
>> for $row in $body-rows
>> let $cells := fn:analyze-string($row,
>>   '(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
>> return
>>   element Book {
>>     for $cell at $count in $cells
>>     return element {$headers[$count]} {$cell/string()}
>>   }
>> ```
>>
>> It produces the desired results:
>>
>> <Book>
>>   <Author>Jeannette Walls</Author>
>>   <Title>The Glass Castle</Title>
>>   <ISBN>074324754X</ISBN>
>>   <Binding>Paperback</Binding>
>>   <YearPublished>2006</YearPublished>
>> </Book>
>> <Book>
>>   <Author>James Surowiecki</Author>
>>   <Title>The Wisdom of Crowds</Title>
>>   <ISBN>9780385503860</ISBN>
>>   <Binding>Paperback</Binding>
>>   <YearPublished>2005</YearPublished>
>> </Book>
>> <Book>
>>   <Author>Lawrence Lessig</Author>
>>   <Title>The Future of Ideas</Title>
>>   <ISBN>9780375505782</ISBN>
>>   <Binding>Paperback</Binding>
>>   <YearPublished>2002</YearPublished>
>> </Book>
>> <Book>
>>   <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
>>   <Title>Execution</Title>
>>   <ISBN>9780609610572</ISBN>
>>   <Binding>Hardcover</Binding>
>>   <YearPublished>2002</YearPublished>
>> </Book>
>> <Book>
>>   <Author>Kurt Vonnegut</Author>
>>   <Title>Slaughterhouse-Five</Title>
>>   <ISBN>9780791059258</ISBN>
>>   <Binding>Paperback</Binding>
>>   <YearPublished>1999</YearPublished>
>> </Book>
>>
>> Unfortunately BaseX complains about the regex, with the following error:
>>
>>   Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
>>   expression: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?.
>>
>> Without a column location, I'm unable to tell where the problem is.
>> Is there something used in this expression that BaseX doesn't support?
>>
>> On the topic of the potential memory pitfalls of a pure XQuery
>> solution for our hypothetical EXPath library, I think the primary
>> problem is that the entire CSV has to be loaded into memory. I wonder
>> if implementations could use the new `fn:unparsed-text-lines()`
>> function from XQuery 3.0 to stream the CSV through XQuery without
>> requiring the entire thing to be in memory? Or
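The two quote-escaping conventions Christian mentions earlier in this message (Excel-style doubled quotes vs. backslash escapes) can be seen side by side with Python's csv module; this sketch is my own illustration of the incompatibility, not BaseX code:

```python
import csv
import io

# Excel-style: quotes inside a quoted cell are doubled ("") -- the
# csv module's default dialect (doublequote=True) handles this.
excel_row = '"Hi ""John""",2002\n'
cells = next(csv.reader(io.StringIO(excel_row)))
print(cells)  # ['Hi "John"', '2002']

# Backslash-style: other producers escape quotes as \" instead, which
# needs a different dialect (doublequote off, backslash as escapechar).
backslash_row = '"Hi \\"John\\"",2002\n'
cells2 = next(csv.reader(io.StringIO(backslash_row),
                         doublequote=False, escapechar='\\'))
print(cells2)
```

A single regex has to commit to one of these dialects up front, which is part of why a robust parser ends up as a configurable state machine rather than one pattern.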
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Joe, just in case it is of interest to you: the TopicTools framework, downloadable at

  https://github.com/hrennau/topictools

contains an XQuery-implemented, full-featured csv parser (module _csvParser.xqm, 212 lines). Writing XQuery tools using the framework, the parser is automatically added to your application code and you can declare command-line parameters of your tool to have (among many others) a csv-based data type. Having declared the parameter to have such a type (e.g. csvURI or csvFOX), you can forget csv, as your application code sees nothing else but XML. More offline, if this is interesting to you.

Cheers,
Hans

PS - illustrating ...

Tool invocations:

  basex -b
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hans-Jürgen,

I figured as much. I wonder if we can come up with an XSD-compliant regex for this purpose? It may not give us a full-featured CSV parser, but would handle reasonably uniform cases.

Joe

Sent from my iPhone

On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau" wrote:

Joe, concerning your regex, I would complain, too! Already the first two characters

  (?

render the expression invalid:
(1) An unescaped ? is an occurrence indicator, making the preceding entity optional
(2) An unescaped ( is used for grouping, it does not represent anything
=> there is no entity preceding the ? which the ? could make optional => error

Please keep in mind that the regex flavor supported by XPath is the regex flavor defined by the XSD spec. There are a few constructs used in Perl & Co which are not defined in XPath regex.

What concerns the CSV implementation, I came to realize my error: the BaseX implementation *is* Java code, not XQuery code - the xqm module just contains the function signature, marked "external".

Cheers,
Hans

Joe Wicentowski wrote on Sunday, 11 September 2016, 21:27:

Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.

The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
  '(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
  element Book {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell/string()}
  }
```

It produces the desired results:

<Book>
  <Author>Jeannette Walls</Author>
  <Title>The Glass Castle</Title>
  <ISBN>074324754X</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2006</YearPublished>
</Book>
<Book>
  <Author>James Surowiecki</Author>
  <Title>The Wisdom of Crowds</Title>
  <ISBN>9780385503860</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2005</YearPublished>
</Book>
<Book>
  <Author>Lawrence Lessig</Author>
  <Title>The Future of Ideas</Title>
  <ISBN>9780375505782</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
  <Title>Execution</Title>
  <ISBN>9780609610572</ISBN>
  <Binding>Hardcover</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Kurt Vonnegut</Author>
  <Title>Slaughterhouse-Five</Title>
  <ISBN>9780791059258</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

  Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
  expression: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is. Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery solution for our hypothetical EXPath library, I think the primary problem is that the entire CSV has to be loaded into memory. I wonder if implementations could use the new `fn:unparsed-text-lines()` function from XQuery 3.0 to stream the CSV through XQuery without requiring the entire thing to be in memory? Or are we basically setting ourselves up for the EXPath solution being a wrapper around an external library written in a lower-level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>>   map {
>>     1: map {
>>       "City": "Newton",
>>       "Name": "John"
>>     },
>>     2: map {
>>       "City": "Oldtown",
>>       "Name": "Jack"
>>     }
>>   }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order. I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Joe, concerning your regex, I would complain, too! Already the first two characters

  (?

render the expression invalid:
(1) An unescaped ? is an occurrence indicator, making the preceding entity optional
(2) An unescaped ( is used for grouping, it does not represent anything
=> there is no entity preceding the ? which the ? could make optional => error

Please keep in mind that the regex flavor supported by XPath is the regex flavor defined by the XSD spec. There are a few constructs used in Perl & Co which are not defined in XPath regex.

What concerns the CSV implementation, I came to realize my error: the BaseX implementation *is* Java code, not XQuery code - the xqm module just contains the function signature, marked "external".

Cheers,
Hans

Joe Wicentowski wrote on Sunday, 11 September 2016, 21:27:

Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.

The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
  '(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
  element Book {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell/string()}
  }
```

It produces the desired results:

<Book>
  <Author>Jeannette Walls</Author>
  <Title>The Glass Castle</Title>
  <ISBN>074324754X</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2006</YearPublished>
</Book>
<Book>
  <Author>James Surowiecki</Author>
  <Title>The Wisdom of Crowds</Title>
  <ISBN>9780385503860</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2005</YearPublished>
</Book>
<Book>
  <Author>Lawrence Lessig</Author>
  <Title>The Future of Ideas</Title>
  <ISBN>9780375505782</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
  <Title>Execution</Title>
  <ISBN>9780609610572</ISBN>
  <Binding>Hardcover</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Kurt Vonnegut</Author>
  <Title>Slaughterhouse-Five</Title>
  <ISBN>9780791059258</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

  Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
  expression: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is. Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery solution for our hypothetical EXPath library, I think the primary problem is that the entire CSV has to be loaded into memory. I wonder if implementations could use the new `fn:unparsed-text-lines()` function from XQuery 3.0 to stream the CSV through XQuery without requiring the entire thing to be in memory? Or are we basically setting ourselves up for the EXPath solution being a wrapper around an external library written in a lower-level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>>   map {
>>     1: map {
>>       "City": "Newton",
>>       "Name": "John"
>>     },
>>     2: map {
>>       "City": "Oldtown",
>>       "Name": "Jack"
>>     }
>>   }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order. I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>>   array {
>>     map {
>>       "City": "Newton",
>>       "Name": "John"
>>     },
>>     map {
>>       "City": "Oldtown",
>>       "Name": "Jack"
>>     }
>>   }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV?
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.

The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
  '(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
  element Book {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell/string()}
  }
```

It produces the desired results:

<Book>
  <Author>Jeannette Walls</Author>
  <Title>The Glass Castle</Title>
  <ISBN>074324754X</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2006</YearPublished>
</Book>
<Book>
  <Author>James Surowiecki</Author>
  <Title>The Wisdom of Crowds</Title>
  <ISBN>9780385503860</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2005</YearPublished>
</Book>
<Book>
  <Author>Lawrence Lessig</Author>
  <Title>The Future of Ideas</Title>
  <ISBN>9780375505782</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
  <Title>Execution</Title>
  <ISBN>9780609610572</ISBN>
  <Binding>Hardcover</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Kurt Vonnegut</Author>
  <Title>Slaughterhouse-Five</Title>
  <ISBN>9780791059258</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

  Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
  expression: (?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is. Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery solution for our hypothetical EXPath library, I think the primary problem is that the entire CSV has to be loaded into memory. I wonder if implementations could use the new `fn:unparsed-text-lines()` function from XQuery 3.0 to stream the CSV through XQuery without requiring the entire thing to be in memory? Or are we basically setting ourselves up for the EXPath solution being a wrapper around an external library written in a lower-level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>>   map {
>>     1: map {
>>       "City": "Newton",
>>       "Name": "John"
>>     },
>>     2: map {
>>       "City": "Oldtown",
>>       "Name": "Jack"
>>     }
>>   }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order. I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>>   array {
>>     map {
>>       "City": "Newton",
>>       "Name": "John"
>>     },
>>     map {
>>       "City": "Oldtown",
>>       "Name": "Jack"
>>     }
>>   }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV? Putting spec and CSV in the same sentence is dangerous,
>> since CSV is a notoriously under-specified format: "The CSV file
>> format is not standardized" (see
>> https://en.wikipedia.org/wiki/Comma-separated_values). But perhaps
>> there is a common enough need for CSV parsing that such a spec would
>> benefit the community? I thought I'd start by asking here, since
>> BaseX's seems to be the most developed (or only?) CSV module in
>> XQuery.
>>
>> Then there's the question of how to approach implementations of such a
>> spec. While XQuery is probably capable of parsing and serializing
>> small enough CSV, CSVs do get large, and naive processing with XQuery
>> would tend to run into memory issues (as I found with xqjson). This
>> means implementations would tend to be written in a lower-level language.
>> eXist, for example, uses
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi Joe,

Thanks for your mail. You are completely right, using an array would be the natural choice with csv:parse. It’s mostly due to backward compatibility that we didn’t update the function.

@All: I’m pretty sure that all of us would like having an EXPath spec for parsing CSV data. We still need one volunteer to make it happen ;) Anyone out there?

Cheers
Christian

On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski wrote:
> Dear BaseX developers,
>
> I noticed in example 3 under
> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
> option { 'format': 'map' } returns a map of maps, with hardcoded row
> numbers:
>
> map {
>   1: map {
>     "City": "Newton",
>     "Name": "John"
>   },
>   2: map {
>     "City": "Oldtown",
>     "Name": "Jack"
>   }
> }
>
> Using maps, which are unordered, to represent something ordered like
> rows in a CSV, hardcoded row numbers are necessary for reassembling
> the map in document order. I assume this was a necessary approach
> when the module was developed in the map-only world of XQuery 3.0.
> Now that 3.1 supports arrays, might an array of maps be a closer fit
> for CSV parsing?
>
> array {
>   map {
>     "City": "Newton",
>     "Name": "John"
>   },
>   map {
>     "City": "Oldtown",
>     "Name": "Jack"
>   }
> }
>
> I'm also curious, do you know of any efforts to create an EXPath spec
> for CSV? Putting spec and CSV in the same sentence is dangerous,
> since CSV is a notoriously under-specified format: "The CSV file
> format is not standardized" (see
> https://en.wikipedia.org/wiki/Comma-separated_values). But perhaps
> there is a common enough need for CSV parsing that such a spec would
> benefit the community? I thought I'd start by asking here, since
> BaseX's seems to be the most developed (or only?) CSV module in
> XQuery.
>
> Then there's the question of how to approach implementations of such a
> spec. While XQuery is probably capable of parsing and serializing
> small enough CSV, CSVs do get large and naive processing with XQuery
> would tend to run into memory issues (as I found with xqjson). This
> means implementations would tend to write in a lower-level language.
> eXist, for example, uses Jackson for fn:parse-json(). I see Jackson
> has a CSV extension too:
> https://github.com/FasterXML/jackson-dataformat-csv. Any thoughts on
> the suitability of XQuery for the task?
>
> Joe
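For what it's worth, the array-of-maps shape Joe proposes can already be derived from the documented map-of-maps output of csv:parse. A minimal sketch, assuming the numeric row keys shown in the wiki example:

```xquery
(: sketch: rebuild row order from the numeric keys of the 'map' format,
   yielding the array-of-maps shape proposed above :)
let $input := 'Name,City
John,Newton
Jack,Oldtown'
let $rows := csv:parse($input, map { 'header': true(), 'format': 'map' })
return array {
  for $i in sort(map:keys($rows))
  return $rows($i)
}
```

(fn:sort() is XQuery 3.1; on older versions an `order by` clause over the keys achieves the same.)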
Re: [basex-talk] csv:parse in the age of XQuery 3.1
What concerns me, I definitely want the CSV as XML. But the performance problems have certainly nothing to do with XML versus CSV (I often deal with > 300 MB XML, which is parsed very fast!) - it is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.

Kind regards,
Hans-Jürgen

Marc van Grootel <marc.van.groo...@gmail.com> schrieb am 15:55 Donnerstag, 8. September 2016:

I'm currently dealing with CSV a lot as well. I tend to use the format=map approach but not nearly as large as 22 MB CSV yet. I'm wondering if, or how much more efficient it is to deal with this type of data as arrays and map data structures versus XML. For most processing I can leave serializing to XML to the very end. And if too large I would probably also chunk it before storing the end result. Intuitively I would think that dealing with CSV as maps/arrays should be much faster and less memory intensive.

--Marc
Re: [basex-talk] csv:parse in the age of XQuery 3.1
As it so happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV format. The CSV file is 70 MB. Here is what I observe loading this CSV file into BaseX a few different ways.

1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.

2. Command script – The CSV was loaded and converted to XML in about 10 seconds.

SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
create database csvtest1 "path\to\file.csv"

3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.

db:create('csvtest2',
  csv:parse(file:read-text('path\to\file.csv'),
    map { 'encoding': 'windows-1252', 'header': true() }),
  'file.csv')

4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.

csv:parse(file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true() })

5. XQuery (parsing only) using map – The CSV file was parsed in about 6 seconds.

csv:parse(file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' })

These alternate methods are, from what I can see, pretty equivalent, except for the last one, which produces a map instead of XML. At what point, i.e. how much data in CSV format, would using a map start to offer benefits beyond mere convenience?

I came across an example in the documentation that gave me an error message. The Command Line example at http://docs.basex.org/wiki/Parsers#CSV_Parser has

SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space

When trying this in BaseX 8.2.3 I get an error message:

Error: PARSER: csv Unknown option 'lines'.

The “lines” option is not listed in the CSV Module parser documentation at http://docs.basex.org/wiki/CSV_Module#Options.
I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?

Vincent

From: Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.groo...@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1
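For reference, here is one plausible correction of the documentation example, untested, assuming the `lines` option was simply dropped in the 8.x parser and the remaining options are as listed in the CSV Module documentation:

SET PARSER csv
SET CSVPARSER encoding=utf-8, header=false, separator=space
SET CREATEFILTER *.csv
CREATE DB example path/to/file.csv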
Re: [basex-talk] csv:parse in the age of XQuery 3.1
I'm currently dealing with CSV a lot as well. I tend to use the format=map approach but not nearly as large as 22 MB CSV yet. I'm wondering if, or how much more efficient it is to deal with this type of data as arrays and map data structures versus XML. For most processing I can leave serializing to XML to the very end. And if too large I would probably also chunk it before storing the end result. Intuitively I would think that dealing with CSV as maps/arrays should be much faster and less memory intensive. --Marc
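As a concrete illustration of the workflow Marc describes (process as maps, serialize to XML only at the end) here is a sketch; the 'format': 'map' output shape and the column names are taken from the wiki example, and the filter condition is hypothetical:

```xquery
(: keep the data as maps during processing; serialize to XML at the end :)
let $input := 'Name,City
John,Newton
Jack,Oldtown'
let $rows := csv:parse($input, map { 'header': true(), 'format': 'map' })
let $filtered :=
  for $i in sort(map:keys($rows))
  let $row := $rows($i)
  where $row('City') = 'Newton'
  return $row
return element csv {
  for $row in $filtered
  return element record {
    for $key in map:keys($row)
    return element { $key } { $row($key) }
  }
}
```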
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Joe, just to back you: I believe that an EXPath spec for CSV processing would be *extremely* useful! (There is hardly a format as ubiquitous as CSV.) And I had similar experience concerning the performance - concretely, a 22 MB file proved to be simply unprocessable! Which means that BaseX support for CSV is only partial. So I ardently hope for the BaseX team to enable the parsing of large CSV, and I hope for an initiative pulling CSV into EXPath!

Kind regards,
Hans-Jürgen

Joe Wicentowski schrieb am 6:14 Donnerstag, 8. September 2016:

Dear BaseX developers,

I noticed in example 3 under http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with option { 'format': 'map' } returns a map of maps, with hardcoded row numbers:

map {
  1: map {
    "City": "Newton",
    "Name": "John"
  },
  2: map {
    "City": "Oldtown",
    "Name": "Jack"
  }
}

Using maps, which are unordered, to represent something ordered like rows in a CSV, hardcoded row numbers are necessary for reassembling the map in document order. I assume this was a necessary approach when the module was developed in the map-only world of XQuery 3.0. Now that 3.1 supports arrays, might an array of maps be a closer fit for CSV parsing?

array {
  map {
    "City": "Newton",
    "Name": "John"
  },
  map {
    "City": "Oldtown",
    "Name": "Jack"
  }
}

I'm also curious, do you know of any efforts to create an EXPath spec for CSV? Putting spec and CSV in the same sentence is dangerous, since CSV is a notoriously under-specified format: "The CSV file format is not standardized" (see https://en.wikipedia.org/wiki/Comma-separated_values). But perhaps there is a common enough need for CSV parsing that such a spec would benefit the community? I thought I'd start by asking here, since BaseX's seems to be the most developed (or only?) CSV module in XQuery.

Then there's the question of how to approach implementations of such a spec.
While XQuery is probably capable of parsing and serializing small enough CSV, CSVs do get large and naive processing with XQuery would tend to run into memory issues (as I found with xqjson). This means implementations would tend to write in a lower-level language. eXist, for example, uses Jackson for fn:parse-json(). I see Jackson has a CSV extension too: https://github.com/FasterXML/jackson-dataformat-csv. Any thoughts on the suitability of XQuery for the task? Joe
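One way to keep memory bounded in pure XQuery, at least for simple input, is to avoid materializing the whole file as a single string. A sketch using BaseX's File Module, assuming rows contain no quoted commas and no embedded newlines:

```xquery
(: stream the file line by line rather than reading it in one piece;
   tokenize() here handles only unquoted fields :)
for $line in file:read-text-lines('path/to/large.csv')
return element record {
  for $cell in tokenize($line, ',')
  return element cell { $cell }
}
```

Whether a BaseX implementation can evaluate this in streaming fashion, rather than materializing the sequence of lines, is a separate question for the processor.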
[basex-talk] csv:parse in the age of XQuery 3.1
Dear BaseX developers,

I noticed in example 3 under http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with option { 'format': 'map' } returns a map of maps, with hardcoded row numbers:

map {
  1: map {
    "City": "Newton",
    "Name": "John"
  },
  2: map {
    "City": "Oldtown",
    "Name": "Jack"
  }
}

Using maps, which are unordered, to represent something ordered like rows in a CSV, hardcoded row numbers are necessary for reassembling the map in document order. I assume this was a necessary approach when the module was developed in the map-only world of XQuery 3.0. Now that 3.1 supports arrays, might an array of maps be a closer fit for CSV parsing?

array {
  map {
    "City": "Newton",
    "Name": "John"
  },
  map {
    "City": "Oldtown",
    "Name": "Jack"
  }
}

I'm also curious, do you know of any efforts to create an EXPath spec for CSV? Putting spec and CSV in the same sentence is dangerous, since CSV is a notoriously under-specified format: "The CSV file format is not standardized" (see https://en.wikipedia.org/wiki/Comma-separated_values). But perhaps there is a common enough need for CSV parsing that such a spec would benefit the community? I thought I'd start by asking here, since BaseX's seems to be the most developed (or only?) CSV module in XQuery.

Then there's the question of how to approach implementations of such a spec. While XQuery is probably capable of parsing and serializing small enough CSV, CSVs do get large and naive processing with XQuery would tend to run into memory issues (as I found with xqjson). This means implementations would tend to write in a lower-level language. eXist, for example, uses Jackson for fn:parse-json(). I see Jackson has a CSV extension too: https://github.com/FasterXML/jackson-dataformat-csv. Any thoughts on the suitability of XQuery for the task?

Joe