Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi all,

Forgive me.  Rather than post more code in this thread, I've created a
gist with revised code that resolves some inconsistencies in what I
posted here earlier.

  https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3

Again, this isn't a full-featured CSV parser by any means; it assumes
fairly uniform CSV.  Its contribution is a fairly concise
XQuery implementation that works around the absence of
lookahead/lookbehind regex support in XPath.

Joe


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
And the corrected query body:

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row) ! replace(., '\s+', '_')
for $row in $body-rows
let $cells := local:get-cells($row)
return
element row {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell}
}


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Sorry, a typo crept in.  Here's the corrected function:

declare function local:get-cells($row as xs:string) as xs:string* {
(: workaround lack of lookahead support in XPath: end row with comma :)
let $string-to-analyze := $row || ","
let $analyze := fn:analyze-string($string-to-analyze, '(("[^"]*")+|[^,]*),')
for $group in $analyze//fn:group[@nr="1"]
return
if (matches($group, '^".+"$')) then
replace($group, '^"([^"]+)"$', '$1')
else
$group/string()
};
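
For anyone following along, here is a quick sanity check of the function above (just a sketch; it assumes any XQuery 3.1 processor with fn:analyze-string, and the row with embedded commas is the tricky one):

```xquery
(: sanity check: a quoted cell with embedded commas must stay intact :)
let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return local:get-cells($row)
(: expected: "Larry Bossidy, Ram Charan, Charles Burck",
   "Execution", "9780609610572", "Hardcover", "2002" :)
```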


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi Christian,

Yes, that sounds like the culprit.  Searching back through my files,
Adam Retter responded on exist-open (at
http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with
the lack of lookahead support in XPath, by pointing to an XSLT he
adapted for CSV parsing,
https://github.com/digital-preservation/csv-tools/blob/master/csv-to-xml_v3.xsl.
I adapted this technique to XQuery, and it works on the sample case in
my earlier email.

Joe

```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string* {
(: workaround lack of lookahead support in XPath: end row with comma :)
let $string-to-analyze := $row || ","
let $analyze := fn:analyze-string($row, '(("[^"]*")+|[^,]*),')
for $group in $analyze//fn:group[@nr="1"]
return
if (matches($group, '^".+"$')) then
replace($group, '^"([^"]+)"$', '$1')
else
$group/string()
};

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row)
for $row in $body-rows
let $cells := local:get-cells($row)
return
element row {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell}
}
```

On Mon, Sep 12, 2016 at 10:11 AM, Christian Grün
 wrote:
>> Christian: I tried removing the quote escaping but still get an error.
>> Here's a small test to reproduce:
>>
>> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
>
> I assume it’s the lookbehind assertion that is not allowed in XQuery
> (but I should definitely spend more time on it to give you a better
> answer..).


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Christian Grün
> Christian: I tried removing the quote escaping but still get an error.
> Here's a small test to reproduce:
>
> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

I assume it’s the lookbehind assertion that is not allowed in XQuery
(but I should definitely spend more time on it to give you a better
answer..).


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Joe Wicentowski
Hi all,

Christian: I completely agree, CSV is a nightmare.  One way to reduce
the headaches (in, say, developing an EXPath CSV library) might be to
require that CSV pass validation by a tool such as
http://digital-preservation.github.io/csv-validator/.  Adam Retter
presented his work on CSV Schema and CSV Validator at
http://slides.com/adamretter/csv-validation.  This might require the
user to fix issues in the CSV first, but would reduce the scope of
variation considerably.  I notice that the Jackson CSV parser
leverages the notion of a schema in its imports:
https://github.com/FasterXML/jackson-dataformat-csv.

Hans-Jürgen: Thanks for the pointer to your library - it looks
fantastic.  I look forward to trying it out.

Liam: Thanks for the info about XQuery's additional regex handling beyond XSD.

And, lastly, to keep this post BaseX-related...

Christian: I tried removing the quote escaping but still get an error.
Here's a small test to reproduce:

xquery version "3.1";

let $row := '"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002'
return
fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')

Joe

On Mon, Sep 12, 2016 at 7:29 AM, Christian Grün
 wrote:
> I didn’t check the regex in general, but one reason I think why it
> fails is the escaped quote. For example, the following query is
> illegal in XQuery 3.1…
>
>   matches('a"b', 'a\"b')
>
> …whereas the following one is ok:
>
>   matches('a"b', 'a"b')
>
>
>
> On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau  wrote:
>> Cordial thanks, Liam - I was not aware of that!
>>
>> @Joe: Rule of life: when one is especially sure to be right, one is surely
>> wrong, and so was I, and right were you(r first two characters).
>>
>>
>> Liam R. E. Quin  schrieb am 5:54 Montag, 12.September 2016:
>>
>>
>> Hans-Jürgen wrote:
>>
>>> ! Already the first two characters
>>>     (?
>>> render the expression invalid:
>>> (1) An unescaped ? is an occurrence indicator, making the preceding entity optional
>>> (2) An unescaped ( is used for grouping, it does not represent anything
>>> => there is no entity preceding the ? which the ? could make optional
>>> => error
>>
>>
>> Actually (?:  ) is a non-capturing group, defined in XPath 3.0 and
>> XQuery 3.0, based on the same syntax in other languages.
>>
>> This extension, like a number of others, is useful because the
>> expression syntax defined by XSD doesn't make use of capturing groups
>> (there's no \1 or $1 or whatever), and so it doesn't need non-capturing
>> groups, but in XPath and XQuery they are used.
>>
>> See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
>>
>> Liam
>>
>>
>> --
>> Liam R. E. Quin 
>> The World Wide Web Consortium (W3C)
>>
>>
>>


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Christian Grün
I didn’t check the regex in general, but one reason I think why it
fails is the escaped quote. For example, the following query is
illegal in XQuery 3.1…

  matches('a"b', 'a\"b')

…whereas the following one is ok:

  matches('a"b', 'a"b')



On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau  wrote:
> Cordial thanks, Liam - I was not aware of that!
>
> @Joe: Rule of life: when one is especially sure to be right, one is surely
> wrong, and so was I, and right were you(r first two characters).
>
>
> Liam R. E. Quin  schrieb am 5:54 Montag, 12.September 2016:
>
>
> Hans-Jürgen wrote:
>
>> ! Already the first two characters
>>     (?
>> render the expression invalid:
>> (1) An unescaped ? is an occurrence indicator, making the preceding entity optional
>> (2) An unescaped ( is used for grouping, it does not represent anything
>> => there is no entity preceding the ? which the ? could make optional
>> => error
>
>
> Actually (?:  ) is a non-capturing group, defined in XPath 3.0 and
> XQuery 3.0, based on the same syntax in other languages.
>
> This extension, like a number of others, is useful because the
> expression syntax defined by XSD doesn't make use of capturing groups
> (there's no \1 or $1 or whatever), and so it doesn't need non-capturing
> groups, but in XPath and XQuery they are used.
>
> See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
>
> Liam
>
>
> --
> Liam R. E. Quin 
> The World Wide Web Consortium (W3C)
>
>
>


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-12 Thread Hans-Juergen Rennau
Cordial thanks, Liam - I was not aware of that!
@Joe: Rule of life: when one is especially sure to be right, one is surely 
wrong, and so was I, and right were you(r first two characters).
 

Liam R. E. Quin  schrieb am 5:54 Montag, 12.September 2016:
 

Hans-Jürgen wrote:

> ! Already the first two characters
>     (?
> render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?:  ) is a non-capturing group, defined in XPath 3.0 and
XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the
expression syntax defined by XSD doesn't make use of capturing groups
(there's no \1 or $1 or whatever), and so it doesn't need non-capturing 
groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax

Liam


-- 
Liam R. E. Quin 
The World Wide Web Consortium (W3C)


   

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Liam R. E. Quin
Hans-Jürgen wrote:

> ! Already the first two characters
>     (?
> render the expression invalid:
> (1) An unescaped ? is an occurrence indicator, making the preceding entity optional
> (2) An unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ? which the ? could make optional
> => error

Actually (?:  ) is a non-capturing group, defined in XPath 3.0 and
XQuery 3.0, based on the same syntax in other languages.

This extension, like a number of others, is useful because the
expression syntax defined by XSD doesn't make use of capturing groups
(there's no \1 or $1 or whatever), and so it doesn't need non-capturing 
groups, but in XPath and XQuery they are used.

See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
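
A tiny illustration of the difference (an editor's sketch; should work in any XPath 3.0+ processor):

```xquery
(: (?:...) repeats without capturing; (...) captures for $1 in replace() :)
matches('abcabc', '^(?:abc)+$'),                           (: true :)
replace('2016-09-11', '^(?:\d{4})-(\d{2})-\d{2}$', '$1')   (: 09 :)
```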

Liam


-- 
Liam R. E. Quin 
The World Wide Web Consortium (W3C)


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Christian Grün
@Hans-Jürgen… Nice work, thanks for the hint!


On Sun, Sep 11, 2016 at 10:23 PM, Hans-Juergen Rennau  wrote:
> Joe, just in case it is of interest to you: the TopicTools framework,
> downloadable at
>
>https://github.com/hrennau/topictools
>
> contains an XQuery-implemented, full-featured csv parser (module
> _csvParser.xqm, 212 lines). Writing XQuery tools using the framework, the
> parser is automatically added to your application code and you can declare
> command-line parameters of your tool to have (among many others) a csv-based
> data type. Having declared the parameter to have such a type (e.g. csvURI or
> csvFOX), you can forget csv, as your application code sees nothing else but
> XML. More offline, if this is interesting to you.
>
> Cheers,
> Hans
>
> PS - illustrating ...
>
> Tool invocations:
>basex -b 

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Christian Grün
Hi Joe,

My concern is that a single regex, no matter how complex, won’t do
justice to arbitrary CSV data. The CSV input we have received so far for
testing was simply too diverse (I spent 10% of my time implementing
a basic CSV parser in BaseX, and 90% examining these
special cases and implementing custom extensions). As an example, some
comments regarding quotes:

* Quotes can serve as delimiters for a single cell, but they can also
be escaped: "Hi \"John\""

* Quotes can also occur inside a string. In Excel, those quotes will
be escaped via double quotes… Hi ""John"". Other parsers prefer
backslashes.

* Real-life CSV data is regularly corrupt, so we’d also need to
tolerate missing trailing quotes and other special cases.
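
To make the Excel case concrete, here is a sketch of un-escaping doubled quotes once a cell has been isolated (the function name is illustrative, not BaseX API):

```xquery
(: Excel escapes a literal quote by doubling it inside a quoted cell :)
declare function local:unescape-excel($cell as xs:string) as xs:string {
  replace($cell, '""', '"')
};
(: local:unescape-excel('Hi ""John""') returns: Hi "John" :)
```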

On the other hand, parsing tabular texts is basically no big deal, and
our current Java implementation [1] could surely be ported to XQuery.

Christian

[1] 
https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/io/parse/csv/CsvParser.java
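
As a rough idea of what such a port might look like, here is a character-scanning sketch using fold-left. It is a deliberate simplification of the state machine in the Java parser, not the actual BaseX code: comma delimiter only, quote characters consumed purely as cell delimiters (escaped quotes are not restored), no CR/LF handling.

```xquery
declare function local:parse-line($line as xs:string) as xs:string* {
  let $end :=
    fold-left(
      string-to-codepoints($line),
      map { 'cells': (), 'cur': '', 'q': false() },
      function($s, $cp) {
        let $c := codepoints-to-string($cp)
        return
          if ($c = '"') then
            (: toggle quoted mode; the quote character itself is dropped :)
            map:put($s, 'q', not($s?q))
          else if ($c = ',' and not($s?q)) then
            (: an unquoted comma ends the current cell :)
            map { 'cells': ($s?cells, $s?cur), 'cur': '', 'q': false() }
          else
            map:put($s, 'cur', $s?cur || $c)
      }
    )
  return ($end?cells, $end?cur)
};

(: local:parse-line('a,"b,c",d') returns ("a", "b,c", "d") :)
```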




On Sun, Sep 11, 2016 at 9:44 PM, Joe Wicentowski  wrote:
> Hans-Jürgen,
>
> I figured as much. I wonder if we can come up with an xsd-compliant regex
> for this purpose? It may not give us a full-featured CSV parser, but would
> handle reasonably uniform cases.
>
> Joe
>
> Sent from my iPhone
>
>
>
>
> On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau"
>  wrote:
>
>> Joe, concerning your regex, I would complain, too! Already the first two
>> characters
>> (?
>> render the expression invalid:
>> (1) An unescaped ? is an occurrence indicator, making the preceding entity
>> optional
>> (2) An unescaped ( is used for grouping, it does not repesent anything
>> => there is no entity preceding the ? which the ? could make optional =>
>> error
>>
>> Please keep in mind that the regex flavor supported by XPath is the regex
>> flavor defined by the XSD spec. There are a few constructs used in Perl & Co
>> which are not defined in XPath regex.
>>
>> What concerns the CSV implementation, I came to realize my error: the
>> BaseX implementation *is* Java code, not XQuery code - the xqm module just
>> contains the function signature, marked "external".
>>
>> Cheers,
>> Hans
>>
>>
>> Joe Wicentowski  schrieb am 21:27 Sonntag, 11.September
>> 2016:
>>
>>
>> Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and
>> Christian.
>>
>> The other day, short of a comprehensive solution, I went in search of
>> a regex that would handle quoted values that contain commas that
>> shouldn't serve as delimiters.  I found one that worked in eXist but
>> not in BaseX.
>>
>> Source for the regex: http://stackoverflow.com/a/13259681/659732
>>
>> The query:
>>
>> ```
>> xquery version "3.1";
>>
>> let $csv := 'Author,Title,ISBN,Binding,Year Published
>> Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
>> James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
>> Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
>> "Larry Bossidy, Ram Charan, Charles
>> Burck",Execution,9780609610572,Hardcover,2002
>> Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
>> let $lines := tokenize($csv, '\n')
>> let $header-row := fn:head($lines)
>> let $body-rows := fn:tail($lines)
>> let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
>> for $row in $body-rows
>> let $cells := fn:analyze-string($row,
>> '(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
>> return
>> element Book {
>>   for $cell at $count in $cells
>>   return element {$headers[$count]} {$cell/string()}
>> }
>> ```
>>
>> It produces the desired results:
>>
>> <Book>
>>     <Author>Jeannette Walls</Author>
>>     <Title>The Glass Castle</Title>
>>     <ISBN>074324754X</ISBN>
>>     <Binding>Paperback</Binding>
>>     <YearPublished>2006</YearPublished>
>> </Book>
>> <Book>
>>     <Author>James Surowiecki</Author>
>>     <Title>The Wisdom of Crowds</Title>
>>     <ISBN>9780385503860</ISBN>
>>     <Binding>Paperback</Binding>
>>     <YearPublished>2005</YearPublished>
>> </Book>
>> <Book>
>>     <Author>Lawrence Lessig</Author>
>>     <Title>The Future of Ideas</Title>
>>     <ISBN>9780375505782</ISBN>
>>     <Binding>Paperback</Binding>
>>     <YearPublished>2002</YearPublished>
>> </Book>
>> <Book>
>>     <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
>>     <Title>Execution</Title>
>>     <ISBN>9780609610572</ISBN>
>>     <Binding>Hardcover</Binding>
>>     <YearPublished>2002</YearPublished>
>> </Book>
>> <Book>
>>     <Author>Kurt Vonnegut</Author>
>>     <Title>Slaughterhouse-Five</Title>
>>     <ISBN>9780791059258</ISBN>
>>     <Binding>Paperback</Binding>
>>     <YearPublished>1999</YearPublished>
>> </Book>
>>
>> Unfortunately BaseX complains about the regex, with the following error:
>>
>> Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
>> expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.
>>
>> Without a column location, I'm unable to tell where the problem is.
>> Is there something used in this expression that BaseX doesn't support?
>>
>> On the topic of the potential memory pitfalls of a pure XQuery
>> solution for our hypothetical EXPath library, I think the primary
>> problem is that the entire CSV has to be loaded into memory.  I wonder
>> if implementations could use the new `fn:unparsed-text-lines()`
>> function from XQuery 3.0 to stream the CSV through XQuery without
>> requiring the entire thing to be in memory?  Or 

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Hans-Juergen Rennau
Joe, just in case it is of interest to you: the TopicTools framework, 
downloadable at
   https://github.com/hrennau/topictools
contains an XQuery-implemented, full-featured csv parser (module 
_csvParser.xqm, 212 lines). Writing XQuery tools using the framework, the 
parser is automatically added to your application code and you can declare 
command-line parameters of your tool to have (among many others) a csv-based 
data type. Having declared the parameter to have such a type (e.g. csvURI or 
csvFOX), you can forget csv, as your application code sees nothing else but 
XML. More offline, if this is interesting to you.
Cheers,
Hans

PS - illustrating ...

Tool invocations:   basex -b 

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Joe Wicentowski
Hans-Jürgen,
I figured as much. I wonder if we can come up with an xsd-compliant regex for 
this purpose? It may not give us a full-featured CSV parser, but would handle 
reasonably uniform cases.
Joe

Sent from my iPhone




On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau"  
wrote:










Joe, concerning your regex, I would complain, too! Already the first two
characters
    (?
render the expression invalid:
(1) An unescaped ? is an occurrence indicator, making the preceding entity optional
(2) An unescaped ( is used for grouping, it does not represent anything
=> there is no entity preceding the ? which the ? could make optional => error

Please keep in mind that the regex flavor supported by XPath is the regex 
flavor defined by the XSD spec. There are a few constructs used in Perl & Co 
which are not defined in XPath regex.

As for the CSV implementation, I came to realize my error: the BaseX
implementation *is* Java code, not XQuery code - the xqm module just contains
the function signature, marked "external".

Cheers,
Hans
 

Joe Wicentowski  schrieb am 21:27 Sonntag, 11.September 
2016:
  

 Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and 
Christian.

The other day, short of a comprehensive solution, I went in search of
a regex that would handle quoted values that contain commas that
shouldn't serve as delimiters.  I found one that worked in eXist but
not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
'(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
    element Book {
      for $cell at $count in $cells
      return element {$headers[$count]} {$cell/string()}
    }
```

It produces the desired results:

<Book>
    <Author>Jeannette Walls</Author>
    <Title>The Glass Castle</Title>
    <ISBN>074324754X</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2006</YearPublished>
</Book>
<Book>
    <Author>James Surowiecki</Author>
    <Title>The Wisdom of Crowds</Title>
    <ISBN>9780385503860</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2005</YearPublished>
</Book>
<Book>
    <Author>Lawrence Lessig</Author>
    <Title>The Future of Ideas</Title>
    <ISBN>9780375505782</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
    <Title>Execution</Title>
    <ISBN>9780609610572</ISBN>
    <Binding>Hardcover</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Kurt Vonnegut</Author>
    <Title>Slaughterhouse-Five</Title>
    <ISBN>9780791059258</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is.
Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery
solution for our hypothetical EXPath library, I think the primary
problem is that the entire CSV has to be loaded into memory.  I wonder
if implementations could use the new `fn:unparsed-text-lines()`
function from XQuery 3.0 to stream the CSV through XQuery without
requiring the entire thing to be in memory?  Or are we basically
setting ourselves up for the EXPath solution being a wrapper around an
external library written in a lower level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün
 wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski  wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>> map {
>>     1: map {
>>         "City": "Newton",
>>         "Name": "John"
>>     },
>>     2: map {
>>         "City": "Oldtown",
>>         "Name": "Jack"
>>     }
>> }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order.  I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Hans-Juergen Rennau
Joe, concerning your regex, I would complain, too! Already the first two
characters
    (?
render the expression invalid:
(1) An unescaped ? is an occurrence indicator, making the preceding entity optional
(2) An unescaped ( is used for grouping, it does not represent anything
=> there is no entity preceding the ? which the ? could make optional => error

Please keep in mind that the regex flavor supported by XPath is the regex 
flavor defined by the XSD spec. There are a few constructs used in Perl & Co 
which are not defined in XPath regex.

As for the CSV implementation, I came to realize my error: the BaseX
implementation *is* Java code, not XQuery code - the xqm module just contains
the function signature, marked "external".

Cheers,
Hans
 

Joe Wicentowski  schrieb am 21:27 Sonntag, 11.September 
2016:
 

 Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and 
Christian.

The other day, short of a comprehensive solution, I went in search of
a regex that would handle quoted values that contain commas that
shouldn't serve as delimiters.  I found one that worked in eXist but
not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
'(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
    element Book {
      for $cell at $count in $cells
      return element {$headers[$count]} {$cell/string()}
    }
It produces the desired results:


    Jeannette Walls
    The Glass Castle
    074324754X
    Paperback
    2006


    James Surowiecki
    The Wisdom of Crowds
    9780385503860
    Paperback
    2005


    Lawrence Lessig
    The Future of Ideas
    9780375505782
    Paperback
    2002


    Larry Bossidy, Ram Charan, Charles Burck
    Execution
    9780609610572
    Hardcover
    2002


    Kurt Vonnegut
    Slaughterhouse-Five
    9780791059258
    Paperback
    1999


Unfortunately BaseX complains about the regex, with the following error:

Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is.
Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery
solution for our hypothetical EXPath library, I think the primary
problem is that the entire CSV has to be loaded into memory.  I wonder
if implementations could use the new `fn:unparsed-text-lines()`
function from XQuery 3.0 to stream the CSV through XQuery without
requiring the entire thing to be in memory?  Or are we basically
setting ourselves up for the EXPath solution being a wrapper around an
external library written in a lower level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün
 wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski  wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>> map {
>>    1: map {
>>        "City": "Newton",
>>        "Name": "John"
>>    },
>>    2: map {
>>        "City": "Oldtown",
>>        "Name": "Jack"
>>    }
>> }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order.  I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>> array {
>>    map {
>>        "City": "Newton",
>>        "Name": "John"
>>    },
>>    map {
>>        "City": "Oldtown",
>>        "Name": "Jack"
>>    }
>> }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV?  

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Joe Wicentowski
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.

The other day, short of a comprehensive solution, I went in search of
a regex that would handle quoted values that contain commas that
shouldn't serve as delimiters.  I found one that worked in eXist but
not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
'(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
element Book {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell/string()}
}
```

It produces the desired results:

<Book>
    <Author>Jeannette Walls</Author>
    <Title>The Glass Castle</Title>
    <ISBN>074324754X</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2006</YearPublished>
</Book>
<Book>
    <Author>James Surowiecki</Author>
    <Title>The Wisdom of Crowds</Title>
    <ISBN>9780385503860</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2005</YearPublished>
</Book>
<Book>
    <Author>Lawrence Lessig</Author>
    <Title>The Future of Ideas</Title>
    <ISBN>9780375505782</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
    <Title>Execution</Title>
    <ISBN>9780609610572</ISBN>
    <Binding>Hardcover</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Kurt Vonnegut</Author>
    <Title>Slaughterhouse-Five</Title>
    <ISBN>9780791059258</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is.
Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery
solution for our hypothetical EXPath library, I think the primary
problem is that the entire CSV has to be loaded into memory.  I wonder
if implementations could use the new `fn:unparsed-text-lines()`
function from XQuery 3.0 to stream the CSV through XQuery without
requiring the entire thing to be in memory?  Or are we basically
setting ourselves up for the EXPath solution being a wrapper around an
external library written in a lower level language?
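
A sketch of what the streaming approach might look like (the file URI 'books.csv' is hypothetical, and the naive tokenize only holds for rows without quoted commas or embedded newlines):

```xquery
(: stream rows lazily via fn:unparsed-text-lines() instead of
   slurping the whole file into one string :)
for $line at $n in fn:unparsed-text-lines('books.csv')
where $n gt 1  (: skip the header row :)
return
  <row>{
    for $cell at $i in tokenize($line, ',')
    return <cell n="{$i}">{$cell}</cell>
  }</row>
```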

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün
 wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski  wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>> map {
>> 1: map {
>> "City": "Newton",
>> "Name": "John"
>> },
>> 2: map {
>> "City": "Oldtown",
>> "Name": "Jack"
>> }
>> }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order.  I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>> array {
>> map {
>> "City": "Newton",
>> "Name": "John"
>> },
>> map {
>> "City": "Oldtown",
>> "Name": "Jack"
>> }
>> }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV?  Putting spec and CSV in the same sentence is dangerous,
>> since CSV is a notoriously under-specified format: "The CSV file
>> format is not standardized" (see
>> https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
>> there is a common enough need for CSV parsing that such a spec would
>> benefit the community?  I thought I'd start by asking here, since
>> BaseX's seems to be the most developed (or only?) CSV module in
>> XQuery.
>>
>> Then there's the question of how to approach implementations of such a
>> spec.  While XQuery is probably capable of parsing and serializing
>> small enough CSV, CSVs do get large and naive processing with XQuery
>> would tend to run into memory issues (as I found with xqjson).  This
>> means implementations would tend to write in a lower-level language.
>> eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
>> has a CSV extension too:
>> https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
>> the suitability of XQuery for the task?
>>
>> Joe

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-11 Thread Christian Grün
Hi Joe,

Thanks for your mail. You are completely right, using an array would
be the natural choice with csv:parse. It’s mostly due to backward
compatibility that we didn’t update the function.

@All: I’m pretty sure that all of us would like having an EXPath spec
for parsing CSV data. We still need one volunteer to make it happen ;)
Anyone out there?

Cheers
Christian


On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski  wrote:
> Dear BaseX developers,
>
> I noticed in example 3 under
> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
> option { 'format': 'map' } returns a map of maps, with hardcoded row
> numbers:
>
> map {
> 1: map {
> "City": "Newton",
> "Name": "John"
> },
> 2: map {
> "City": "Oldtown",
> "Name": "Jack"
> }
> }
>
> Using maps, which are unordered, to represent something ordered like
> rows in a CSV, hardcoded row numbers are necessary for reassembling
> the map in document order.  I assume this was a necessary approach
> when the module was developed in the map-only world of XQuery 3.0.
> Now that 3.1 supports arrays, might an array of maps be a closer fit
> for CSV parsing?
>
> array {
> map {
> "City": "Newton",
> "Name": "John"
> },
> map {
> "City": "Oldtown",
> "Name": "Jack"
> }
> }
>
> I'm also curious, do you know of any efforts to create an EXPath spec
> for CSV?  Putting spec and CSV in the same sentence is dangerous,
> since CSV is a notoriously under-specified format: "The CSV file
> format is not standardized" (see
> https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
> there is a common enough need for CSV parsing that such a spec would
> benefit the community?  I thought I'd start by asking here, since
> BaseX's seems to be the most developed (or only?) CSV module in
> XQuery.
>
> Then there's the question of how to approach implementations of such a
> spec.  While XQuery is probably capable of parsing and serializing
> small enough CSV, CSVs do get large and naive processing with XQuery
> would tend to run into memory issues (as I found with xqjson).  This
> means implementations would tend to write in a lower-level language.
> eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
> has a CSV extension too:
> https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
> the suitability of XQuery for the task?
>
> Joe


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-08 Thread Hans-Juergen Rennau
What concerns me, I definitely want the CSV as XML. But the performance
problems have certainly nothing to do with XML versus CSV (I often deal with >
300 MB XML, which is parsed very fast!) - it is the parsing operation itself
which, if I'm not mistaken, is handled by XQuery code and which must be shifted
into the Java implementation.

Kind regards,
Hans-Jürgen

Marc van Grootel <marc.van.groo...@gmail.com> schrieb am 15:55 Donnerstag, 8.September 2016:

I'm currently dealing with CSV a lot as well. I tend to use the
format=map approach but not nearly as large as 22 MB CSV yet. I'm
wondering if, or how much more efficient it is to deal with this type
of data as arrays and map data structures versus XML. For most
processing I can leave serializing to XML to the very end. And if too
large I would probably also chunk it before storing the end result.

Intuitively I would think that dealing with CSV as maps/arrays should
be much faster and less memory intensive.

--Marc

Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-08 Thread Lizzi, Vincent
As it so happens, I just received a 20.5 MB Excel file which I am loading into
BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV
format. The CSV file is 70 MB. Here is what I observe loading this CSV file
into BaseX a few different ways.


1.   BaseX GUI – Using “Create Database” with input format CSV, the CSV was 
loaded and converted to XML in a few seconds.


2.   Command script – The CSV was loaded and converted to XML in about 10 
seconds.

SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
create database csvtest1 "path\to\file.csv"


3.   XQuery – The CSV was loaded and converted to XML in about 20 seconds.

db:create('csvtest2', csv:parse(file:read-text('path\to\file.csv'),
map{'encoding': 'windows-1252', 'header': true()}), 'file.csv')


4.   XQuery (parsing only) – CSV file was parsed in about 4 seconds.

csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252',
'header': true()})


5.   XQuery (parsing only) using map – The CSV file was parsed in about 6 
seconds.

csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252',
'header': true(), 'format': 'map'})

These alternate methods are, from what I can see, pretty equivalent, except for
the last one, which produces a map instead of XML. At what point, i.e. how much
data in CSV format, would using a map start to offer benefits beyond mere
convenience?


I came across an example in the documentation that gave me an error message. 
The Command Line example at http://docs.basex.org/wiki/Parsers#CSV_Parser has


SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space

When trying this in BaseX 8.2.3 I get an error message:

Error: PARSER: csv Unknown option 'lines'.

The “lines” option is not listed in the CSV Module parser documentation at 
http://docs.basex.org/wiki/CSV_Module#Options.

I didn’t want to correct the example in the documentation without checking 
whether it is actually incorrect. Does this example need to be updated?

Vincent



From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Hans-Juergen 
Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.groo...@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1

What concerns me, I definitely want the CSV as XML. But the performance 
problems have certainly nothing to do with XML versus CSV (I often deal with > 
300 MB XML, which is parsed very fast!) - it is the parsing operation itself 
which, if I'm not mistaken, is handled by XQuery code and which must be shifted 
into the Java implementation.

Kind regards,
Hans-Jürgen

Marc van Grootel 
<marc.van.groo...@gmail.com<mailto:marc.van.groo...@gmail.com>> schrieb am 
15:55 Donnerstag, 8.September 2016:

I'm currently dealing with CSV a lot as well. I tend to use the
format=map approach but not nearly as large as 22 MB CSV yet. I'm
wondering if, or how much more efficient it is to deal with this type
of data as arrays and map data structures versus XML. For most
processing I can leave serializing to XML to the very end. And if too
large I would probably also chunk it before storing the end result.

Intuitively I would think that dealing with CSV as maps/arrays should
be much faster and less memory intensive.


--Marc





Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-08 Thread Marc van Grootel
I'm currently dealing with CSV a lot as well. I tend to use the
format=map approach but not nearly as large as 22 MB CSV yet. I'm
wondering if, or how much more efficient it is to deal with this type
of data as arrays and map data structures versus XML. For most
processing I can leave serializing to XML to the very end. And if too
large I would probably also chunk it before storing the end result.

Intuitively I would think that dealing with CSV as maps/arrays should
be much faster and less memory intensive.


--Marc
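Marc's approach - keep the rows as maps while processing and serialize to XML
only at the very end - could look roughly like this (a sketch, assuming $rows
holds the result of csv:parse with format=map, and that the header names are
valid element names):

```xquery
element csv {
  for $i in 1 to map:size($rows)
  return element record {
    (: one child element per column; assumes headers are valid QNames :)
    map:for-each($rows($i), function($name, $value) {
      element { $name } { $value }
    })
  }
}
```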


Re: [basex-talk] csv:parse in the age of XQuery 3.1

2016-09-08 Thread Hans-Juergen Rennau
Joe, just to back you: I believe that an EXPath spec for CSV processing would 
be *extremely* useful! (There is hardly a format as ubiquitous as CSV.)
And I had a similar experience concerning performance - concretely, a 22 MB 
file proved to be simply unprocessable! Which means that BaseX support for CSV 
is only partial.

So I ardently hope for the BaseX team to enable the parsing of large CSV, and I 
hope for an initiative pulling CSV into EXPath!
Kind regards,
Hans-Jürgen

Joe Wicentowski schrieb am 6:14 Donnerstag, 8.September 2016:

Dear BaseX developers,

I noticed in example 3 under
http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
option { 'format': 'map' } returns a map of maps, with hardcoded row
numbers:

map {
    1: map {
        "City": "Newton",
        "Name": "John"
    },
    2: map {
        "City": "Oldtown",
        "Name": "Jack"
    }
}

Using maps, which are unordered, to represent something ordered like
rows in a CSV, hardcoded row numbers are necessary for reassembling
the map in document order.  I assume this was a necessary approach
when the module was developed in the map-only world of XQuery 3.0.
Now that 3.1 supports arrays, might an array of maps be a closer fit
for CSV parsing?

array {
    map {
        "City": "Newton",
        "Name": "John"
    },
    map {
        "City": "Oldtown",
        "Name": "Jack"
    }
}

I'm also curious, do you know of any efforts to create an EXPath spec
for CSV?  Putting spec and CSV in the same sentence is dangerous,
since CSV is a notoriously under-specified format: "The CSV file
format is not standardized" (see
https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
there is a common enough need for CSV parsing that such a spec would
benefit the community?  I thought I'd start by asking here, since
BaseX's seems to be the most developed (or only?) CSV module in
XQuery.

Then there's the question of how to approach implementations of such a
spec.  While XQuery is probably capable of parsing and serializing
small enough CSV, CSVs do get large and naive processing with XQuery
would tend to run into memory issues (as I found with xqjson).  This
means implementations would tend to write in a lower-level language.
eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
has a CSV extension too:
https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
the suitability of XQuery for the task?

Joe



[basex-talk] csv:parse in the age of XQuery 3.1

2016-09-07 Thread Joe Wicentowski
Dear BaseX developers,

I noticed in example 3 under
http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
option { 'format': 'map' } returns a map of maps, with hardcoded row
numbers:

map {
1: map {
"City": "Newton",
"Name": "John"
},
2: map {
"City": "Oldtown",
"Name": "Jack"
}
}

Using maps, which are unordered, to represent something ordered like
rows in a CSV, hardcoded row numbers are necessary for reassembling
the map in document order.  I assume this was a necessary approach
when the module was developed in the map-only world of XQuery 3.0.
Now that 3.1 supports arrays, might an array of maps be a closer fit
for CSV parsing?

array {
map {
"City": "Newton",
"Name": "John"
},
map {
"City": "Oldtown",
"Name": "Jack"
}
}
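With an array, ordering is intrinsic to the structure, so no synthetic row
keys are needed. A sketch of what consuming such a result could look like
(hypothetical, since csv:parse does not currently return arrays):

```xquery
let $rows := array {
  map { "City": "Newton",  "Name": "John" },
  map { "City": "Oldtown", "Name": "Jack" }
}
(: array:size and positional lookup preserve row order without explicit keys :)
for $i in 1 to array:size($rows)
return $rows($i)("Name")
(: yields "John", then "Jack" :)
```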

I'm also curious, do you know of any efforts to create an EXPath spec
for CSV?  Putting spec and CSV in the same sentence is dangerous,
since CSV is a notoriously under-specified format: "The CSV file
format is not standardized" (see
https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
there is a common enough need for CSV parsing that such a spec would
benefit the community?  I thought I'd start by asking here, since
BaseX's seems to be the most developed (or only?) CSV module in
XQuery.

Then there's the question of how to approach implementations of such a
spec.  While XQuery is probably capable of parsing and serializing
small enough CSV, CSVs do get large and naive processing with XQuery
would tend to run into memory issues (as I found with xqjson).  This
means implementations would tend to write in a lower-level language.
eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
has a CSV extension too:
https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
the suitability of XQuery for the task?

Joe