Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi all,

Forgive me.  Rather than post more code in this thread, I've created a
gist with revised code that resolves some inconsistencies in what I
posted here earlier.

  https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3

Again, this isn't a full-featured CSV parser by any means; it assumes
fairly uniform CSV.  Its contribution is that it is a fairly concise
XQuery implementation that works around the absence of
lookahead/lookbehind regex support in XPath.

Joe


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
And corrected query body:

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row) ! replace(., '\s+', '_')
for $row in $body-rows
let $cells := local:get-cells($row)
return
element row {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell}
}


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Sorry, a typo crept in.  Here's the corrected function:

declare function local:get-cells($row as xs:string) as xs:string* {
(: workaround lack of lookahead support in XPath: end row with comma :)
let $string-to-analyze := $row || ","
let $analyze := fn:analyze-string($string-to-analyze, '(("[^"]*")+|[^,]*),')
for $group in $analyze//fn:group[@nr="1"]
return
if (matches($group, '^".+"$')) then
replace($group, '^"([^"]+)"$', '$1')
else
$group/string()
};
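
As a minimal sketch of how the function above might be exercised on a single row: the copy below is lightly adapted (it declares the return type as xs:string*, since one row yields several cells, and compares @nr with a number so the test holds whether or not the result is typed). The sample row is kept on one line, and the "position: value" output format is only an illustration.

```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string* {
  (: append a comma so that every cell, including the last, ends with one :)
  let $analyze := fn:analyze-string($row || ",", '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr = 1]
  return
    if (matches($group, '^".+"$')) then
      (: strip the surrounding quotes from a quoted cell :)
      replace($group, '^"([^"]+)"$', '$1')
    else
      $group/string()
};

(: sample row taken from the thread, on a single line :)
let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
for $cell at $pos in local:get-cells($row)
return $pos || ": " || $cell
```

On this row the query should yield five strings, with the quoted author list kept together as the first cell.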


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi Christian,

Yes, that sounds like the culprit.  Searching back through my files,
Adam Retter responded on exist-open (at
http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with
the lack of lookahead support in XPath, by pointing to an XSLT he
adapted for CSV parsing,
https://github.com/digital-preservation/csv-tools/blob/master/csv-to-xml_v3.xsl.
I adapted this technique to XQuery, and it works on the sample case in
my earlier email.

Joe

```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string* {
(: workaround lack of lookahead support in XPath: end row with comma :)
let $string-to-analyze := $row || ","
let $analyze := fn:analyze-string($row, '(("[^"]*")+|[^,]*),')
for $group in $analyze//fn:group[@nr="1"]
return
if (matches($group, '^".+"$')) then
replace($group, '^"([^"]+)"$', '$1')
else
$group/string()
};

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row)
for $row in $body-rows
let $cells := local:get-cells($row)
return
element row {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell}
}
```

On Mon, Sep 12, 2016 at 10:11 AM, Christian Grün wrote:
>> Christian: I tried removing the quote escaping but still get an error.
>> Here's a small test to reproduce:
>>
>> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
>
> I assume it’s the lookbehind assertion that is not allowed in XQuery
> (but I should definitely spend more time on it to give you a better
> answer..).


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Christian Grün
> Christian: I tried removing the quote escaping but still get an error.
> Here's a small test to reproduce:
>
> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

I assume it’s the lookbehind assertion that is not allowed in XQuery
(but I should definitely spend more time on it to give you a better
answer..).
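
A small sketch of how one might confirm this in isolation: the query below uses a deliberately tiny lookbehind pattern (not the one from the thread) and catches the error. It assumes the processor reports an unsupported regex construct with the standard error code err:FORX0002 (invalid regular expression); the message text is implementation-specific.

```xquery
xquery version "3.1";

declare namespace err = "http://www.w3.org/2005/xqt-errors";

try {
  (: (?<=,) is a lookbehind assertion, which is not part of the
     XPath regex syntax :)
  fn:matches("a,b", "(?<=,)b")
} catch err:FORX0002 {
  "regex rejected: " || $err:description
}
```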


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi all,

Christian: I completely agree, CSV is a nightmare.  One way to reduce
the headaches (in, say, developing an EXPath CSV library) might be to
require that CSV pass validation by a tool such as
http://digital-preservation.github.io/csv-validator/.  Adam Retter
presented his work on CSV Schema and CSV Validator at
http://slides.com/adamretter/csv-validation.  This might require the
user to fix issues in the CSV first, but would reduce the scope of
variation considerably.  I notice that the Jackson CSV parser
leverages the notion of a schema in its imports:
https://github.com/FasterXML/jackson-dataformat-csv.

Hans-Jürgen: Thanks for the pointer to your library - it looks
fantastic.  I look forward to trying it out.

Liam: Thanks for the info about XQuery's additional regex handling beyond XSD.

And, lastly, to keep this post BaseX-related...

Christian: I tried removing the quote escaping but still get an error.
Here's a small test to reproduce:

xquery version "3.1";

let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return
fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

Joe

On Mon, Sep 12, 2016 at 7:29 AM, Christian Grün wrote:
> I didn’t check the regex in general, but I think one reason it fails
> is the escaped quote. For example, the following query is illegal in
> XQuery 3.1…
>
>   matches('a"b', 'a\"b')
>
> …whereas the following one is OK:
>
>   matches('a"b', 'a"b')


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Christian Grün
I didn’t check the regex in general, but I think one reason it fails
is the escaped quote. For example, the following query is illegal in
XQuery 3.1…

  matches('a"b', 'a\"b')

…whereas the following one is OK:

  matches('a"b', 'a"b')
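
To put the two cases side by side, here is a short sketch. It assumes the illegal escape is reported as err:FORX0002, and it also shows the doubled-quote form needed when the string literal itself is delimited by double quotes.

```xquery
xquery version "3.1";

declare namespace err = "http://www.w3.org/2005/xqt-errors";

(
  (: a double quote needs no escaping inside a single-quoted literal :)
  matches('a"b', 'a"b'),
  (: inside a double-quoted literal, the quote is doubled instead :)
  matches("a""b", "a""b"),
  (: a backslash before the quote is not a valid regex escape :)
  try { matches('a"b', 'a\"b') } catch err:FORX0002 { "FORX0002" }
)
```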





Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Hans-Juergen Rennau
Cordial thanks, Liam - I was not aware of that!
@Joe: Rule of life: when one is especially sure to be right, one is surely 
wrong, and so was I, and right were you(r first two characters).
 

Liam R. E. Quin wrote at 5:54 on Monday, 12 September 2016:

Hans-Jürgen wrote:
> Already the first two characters
>     (?
> render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding
>     entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?:  ) is a non-capturing group, defined in XPath 3.0 and
XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the
expression syntax defined by XSD doesn't make use of capturing groups
(there's no \1 or $1 or whatever), and so it doesn't need non-capturing 
groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax

Liam


-- 
Liam R. E. Quin 
The World Wide Web Consortium (W3C)
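
As a brief illustration of the point about non-capturing groups: the sketch below uses an arbitrary date string (not from the thread) to show that a (?:...) group is skipped when capture groups are numbered.

```xquery
xquery version "3.1";

(: (?:...) groups without capturing, so $1 and $2 refer to the
   month and day groups, not to the year :)
replace("2016-09-12", "(?:\d{4})-(\d{2})-(\d{2})", "$2/$1")
```

With the year in a non-capturing group, the replacement produces "12/09".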