Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi all,

Forgive me. Rather than post more code in this thread, I've created a gist with revised code that resolves some inconsistencies in what I posted here earlier.

https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3

Again, this isn't a full-featured CSV parser by any means; it assumes fairly uniform CSV. Its contribution is that it is a fairly concise XQuery implementation that works around the absence of lookahead/lookbehind regex support in XPath.

Joe
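For CSV that needs a fuller parser, BaseX's own CSV Module may be the simpler route. The following is only a minimal sketch: it assumes BaseX (the `csv` prefix is bound by BaseX itself, so it will not run unchanged on other processors) and assumes the `header` option behaves as described in the BaseX documentation. The shortened sample data is made up for illustration.

```xquery
xquery version "3.1";

(: BaseX-specific: csv:parse comes from the BaseX CSV Module. :)
let $csv := 'Author,Title,Year Published
Jeannette Walls,The Glass Castle,2006
"Larry Bossidy, Ram Charan, Charles Burck",Execution,2002'
(: 'header': true() asks the parser to treat the first line as column names :)
return csv:parse($csv, map { 'header': true() })
```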
Re: [basex-talk] csv:parse in the age of XQuery 3.1
And corrected query body:

```xquery
let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
(: element names cannot contain whitespace, so "Year Published" becomes "Year_Published" :)
let $headers := local:get-cells($header-row) ! replace(., '\s+', '_')
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell}
  }
```
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Sorry, a typo crept in. Here's the corrected function:

```xquery
(: returns one string per cell, hence the xs:string* return type :)
declare function local:get-cells($row as xs:string) as xs:string* {
  (: workaround lack of lookahead support in XPath: end row with comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($string-to-analyze, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr="1"]
  return
    (: strip the surrounding quotes from quoted cells :)
    if (matches($group, '^".+"$')) then
      replace($group, '^"([^"]+)"$', '$1')
    else
      $group/string()
};
```
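To see why the trailing comma matters: every alternative in the pattern must be followed by a comma, so the last cell of a row only matches once a comma has been appended. A minimal sketch, where `local:cells` is a hypothetical helper introduced only for this illustration and the sample row is shortened:

```xquery
xquery version "3.1";

(: Hypothetical helper for illustration: collect group 1 of every match. :)
declare function local:cells($s as xs:string) as xs:string* {
  fn:analyze-string($s, '(("[^"]*")+|[^,]*),')//fn:group[@nr = "1"]/string()
};

let $row := '"Bossidy, Charan",Execution,2002'
return (
  count(local:cells($row)),        (: 2: the last cell, 2002, is left unmatched :)
  count(local:cells($row || ","))  (: 3: every cell is followed by a comma :)
)
```

The query should return the sequence `2, 3`.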
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi Christian,

Yes, that sounds like the culprit. Searching back through my files, I found that Adam Retter had responded on exist-open (at http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with the lack of lookahead support in XPath by pointing to an XSLT he adapted for CSV parsing, https://github.com/digital-preservation/csv-tools/blob/master/csv-to-xml_v3.xsl. I adapted this technique to XQuery, and it works on the sample case in my earlier email.

Joe

```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string {
  (: workaround lack of lookahead support in XPath: end row with comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($row, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr="1"]
  return
    if (matches($group, '^".+"$')) then
      replace($group, '^"([^"]+)"$', '$1')
    else
      $group/string()
};

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row)
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell}
  }
```

On Mon, Sep 12, 2016 at 10:11 AM, Christian Grün wrote:
>> Christian: I tried removing the quote escaping but still get an error.
>> Here's a small test to reproduce:
>>
>> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
>
> I assume it’s the lookbehind assertion that is not allowed in XQuery
> (but I should definitely spend more time on it to give you a better
> answer..).
Re: [basex-talk] csv:parse in the age of XQuery 3.1
> Christian: I tried removing the quote escaping but still get an error.
> Here's a small test to reproduce:
>
> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

I assume it’s the lookbehind assertion that is not allowed in XQuery (but I should definitely spend more time on it to give you a better answer..).
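One way to check this assumption in isolation is to wrap a tiny lookbehind pattern in try/catch. This is only a sketch: the XPath regex grammar has no lookaround constructs, so a conforming processor is expected to raise `err:FORX0002` (invalid regular expression), but the exact message, and whether the error surfaces at compile time or at run time, is left to the implementation. The `err` prefix is declared explicitly because the specification does not predeclare it.

```xquery
xquery version "3.1";

declare namespace err = "http://www.w3.org/2005/xqt-errors";

try {
  (: (?<=,) is a lookbehind assertion, which the XPath regex syntax lacks :)
  matches('a,b', '(?<=,)b')
} catch err:FORX0002 {
  'regex rejected: ' || $err:description
}
```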
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Hi all,

Christian: I completely agree, CSV is a nightmare. One way to reduce the headaches (in, say, developing an EXPath CSV library) might be to require that CSV pass validation by a tool such as http://digital-preservation.github.io/csv-validator/. Adam Retter presented his work on CSV Schema and CSV Validator at http://slides.com/adamretter/csv-validation. This might require the user to fix issues in the CSV first, but would reduce the scope of variation considerably. I notice that the Jackson CSV parser leverages the notion of a schema in its imports: https://github.com/FasterXML/jackson-dataformat-csv.

Hans-Jürgen: Thanks for the pointer to your library - it looks fantastic. I look forward to trying it out.

Liam: Thanks for the info about XQuery's additional regex handling beyond XSD.

And, lastly, to keep this post still BaseX-related...

Christian: I tried removing the quote escaping but still get an error. Here's a small test to reproduce:

```xquery
xquery version "3.1";

let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return
  fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
```

Joe

On Mon, Sep 12, 2016 at 7:29 AM, Christian Grün wrote:
> I didn’t check the regex in general, but one reason I think why it
> fails is the escaped quote. For example, the following query is
> illegal in XQuery 3.1…
>
> matches('a"b', 'a\"b')
>
> …whereas the following one is ok:
>
> matches('a"b', 'a"b')
Re: [basex-talk] csv:parse in the age of XQuery 3.1
I didn’t check the regex in general, but one reason I think why it fails is the escaped quote. For example, the following query is illegal in XQuery 3.1…

    matches('a"b', 'a\"b')

…whereas the following one is ok:

    matches('a"b', 'a"b')

On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau wrote:
> Cordial thanks, Liam - I was not aware of that!
>
> @Joe: Rule of life: when one is especially sure to be right, one is surely
> wrong, and so was I, and right were you(r first two characters).
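For reference, a short sketch of the ways a literal double quote can be written in a pattern without the backslash escape, which the XPath regex syntax does not define. All three calls rely only on standard XQuery string-literal rules:

```xquery
xquery version "3.1";

(
  matches('a"b', 'a"b'),       (: quote inside an apostrophe-delimited literal :)
  matches('a"b', "a""b"),      (: doubled quote inside a quote-delimited literal :)
  matches('a"b', 'a&quot;b')   (: predefined entity reference :)
)
```

Each of the three expressions should evaluate to `true`.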
Re: [basex-talk] csv:parse in the age of XQuery 3.1
Cordial thanks, Liam - I was not aware of that!

@Joe: Rule of life: when one is especially sure to be right, one is surely wrong, and so was I, and right were you(r first two characters).

Liam R. E. Quin wrote at 5:54 on Monday, 12 September 2016:

Hans-Jürgen wrote:
> ! Already the first two characters (? render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?: ) is a non-capturing group, defined in XPath 3.0 and XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the expression syntax defined by XSD doesn't make use of capturing groups (there's no \1 or $1 or whatever), and so it doesn't need non-capturing groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax

Liam

--
Liam R. E. Quin
The World Wide Web Consortium (W3C)
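A small sketch of what Liam describes, using made-up sample data: a non-capturing group takes part in the match but is skipped when groups are numbered, so `$1` and `$2` below refer to the month and the day rather than the year. It assumes a processor that supports the XPath 3.0 regex syntax.

```xquery
xquery version "3.1";

(: (?:\d{4}) matches the year without claiming a group number :)
replace('2016-09-12', '(?:\d{4})-(\d{2})-(\d{2})', '$2/$1')
```

This should return `12/09`.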