Hi all,
Forgive me. Rather than post more code in this thread, I've created a
gist with revised code that resolves some inconsistencies in what I
posted here earlier.
https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3
Again, this isn't a full-featured CSV parser by any means; it
And corrected query body:
let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram
Sorry, a typo crept in. Here's the corrected function:
declare function local:get-cells($row as xs:string) as xs:string {
(: workaround lack of lookahead support in XPath: end row with comma :)
let $string-to-analyze := $row || ","
let $analyze :=
Hi Christian,
Yes, that sounds like the culprit. Searching back through my files,
Adam Retter responded on exist-open (at
http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with
the lack of lookahead support in XPath, by pointing to an XSLT he
adapted for CSV parsing,
> Christian: I tried removing the quote escaping but still get an error.
> Here's a small test to reproduce:
>
> fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
I assume it’s the lookbehind assertion that is not allowed in XQuery
(but I should definitely spend more
Hi all,
Christian: I completely agree, CSV is a nightmare. One way to reduce
the headaches (in, say, developing an EXPath CSV library) might be to
require that CSV pass validation by a tool such as
http://digital-preservation.github.io/csv-validator/. Adam Retter
presented his work on CSV
I didn’t check the regex in general, but I think one reason it
fails is the escaped quote. For example, the following query is
illegal in XQuery 3.1…
matches('a"b', 'a\"b')
…whereas the following one is OK:
matches('a"b', 'a"b')
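To make the point above concrete, here is a minimal sketch (not taken from the thread). It assumes a conforming XQuery 3.1 processor, where \" is not among the escapes the regex dialect allows and the error code FORX0002 is raised for an invalid pattern. A processor may also report the invalid pattern at compile time instead of as a catchable error.

```xquery
xquery version "3.1";
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: A double quote has no special meaning in the regex, so it needs no backslash.
   Inside a double-quoted XQuery string literal it is written by doubling it. :)
matches("a""b", "a""b"),

(: The backslash-quote escape is not a valid regex escape here,
   so an invalid-pattern error is expected. :)
try {
  matches('a"b', 'a\"b')
} catch err:FORX0002 {
  'invalid regular expression: ' || $err:description
}
```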
On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau
Cordial thanks, Liam - I was not aware of that!
@Joe: Rule of life: when one is especially sure to be right, one is surely
wrong, and so was I, and right were you(r first two characters).
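For readers following along: XPath 3.0 and later accept the non-capturing group syntax (?:...), so the leading (? is not what breaks the pattern, whereas lookbehind such as (?<=,) is not part of the XPath regex dialect. The following is a minimal sketch with made-up sample strings, assuming a conforming XQuery 3.1 processor. Processors that pass patterns through to a richer native regex engine may accept the second pattern instead of raising FORX0002.

```xquery
xquery version "3.1";
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: Non-capturing group: valid in XPath 3.0 and later. :)
matches('a,b', '(?:a),b'),

(: Lookbehind: not part of the XPath regex dialect,
   so a conforming processor is expected to reject the pattern. :)
try {
  matches('a,b', '(?<=,)b')
} catch err:FORX0002 {
  'lookbehind rejected: ' || $err:description
}
```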
Liam R. E. Quin wrote at 5:54 on Monday, 12 September 2016:
Hans-Jürgen wrote:
> Already the first two characters
> (? render the expression invalid: (1) An unescaped ? is an
> occurrence indicator, making the preceding entity optional; (2) An
> unescaped ( is used for grouping, it does not represent anything
> => there is no entity preceding the ?
@Hans-Jürgen… Nice work, thanks for the hint!
On Sun, Sep 11, 2016 at 10:23 PM, Hans-Juergen Rennau wrote:
> Joe, just in case it is of interest to you: the TopicTools framework,
> downloadable at
>
> https://github.com/hrennau/topictools
>
> contains an XQuery-implemented,
Hi Joe,
My concern is that a single regex, no matter how complex, won’t be
up to parsing arbitrary CSV data. The CSV input we got so far for
testing was simply too diverse (I spent 10% of my time on
implementing a basic CSV parser in BaseX, and 90% on examining these
special cases, and
Joe, just in case it is of interest to you: the TopicTools framework,
downloadable at
https://github.com/hrennau/topictools
contains an XQuery-implemented, full-featured csv parser (module
_csvParser.xqm, 212 lines). Writing XQuery tools using the framework, the
parser is automatically added
Hans-Jürgen,
I figured as much. I wonder if we can come up with an xsd-compliant regex for
this purpose? It may not give us a full-featured CSV parser, but would handle
reasonably uniform cases.
Joe
Sent from my iPhone
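As an illustration of what an XSD-compliant pattern for reasonably uniform rows might look like, here is a minimal sketch. It is not the code from the gist: the function name local:split-row and the sample row are hypothetical. It borrows the trailing-comma idea mentioned elsewhere in the thread so that no lookahead or lookbehind is needed, and it assumes well-formed input (balanced quotes, no line breaks inside the row). Text that does not fit the pattern is silently ignored.

```xquery
xquery version "3.1";

(: Hypothetical helper: split one CSV row into cells using only
   constructs available in the XPath regex dialect. :)
declare function local:split-row($row as xs:string) as xs:string* {
  (: Append a comma so that every cell, including the last one,
     is terminated by a delimiter. :)
  let $terminated := $row || ','
  (: A cell is either a quoted string (with "" as an escaped quote)
     or a run of characters containing neither quote nor comma. :)
  let $pattern := '("([^"]|"")*"|[^",]*),'
  for $match in analyze-string($terminated, $pattern)/fn:match
  (: Drop the trailing delimiter from the matched text. :)
  let $cell := substring($match, 1, string-length($match) - 1)
  return
    if (starts-with($cell, '"'))
    (: Strip the surrounding quotes and unescape doubled quotes. :)
    then replace(substring($cell, 2, string-length($cell) - 2), '""', '"')
    else $cell
};

(: Hypothetical sample row: the quoted first cell contains a comma. :)
local:split-row('"Doe, Jane",Some Title,2002')
```

With this sample row the function should return three strings, the first being Doe, Jane with its comma preserved. As Christian notes in the thread, real-world CSV is far more diverse than this handles.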
On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau"
Joe, concerning your regex, I would complain, too! Already the first two
characters
(? render the expression invalid: (1) An unescaped ? is an occurrence
indicator, making the preceding entity optional; (2) An unescaped ( is used for
grouping, it does not represent anything
=> there is no
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.
The other day, short of a comprehensive solution, I went in search of
a regex that would handle quoted values that contain commas that
shouldn't serve as delimiters. I found one that worked in eXist but
not in
Hi Joe,
Thanks for your mail. You are completely right, using an array would
be the natural choice with csv:parse. It’s mostly due to backward
compatibility that we didn’t update the function.
@All: I’m pretty sure that all of us would like having an EXPath spec
for parsing CSV data. We still
..@mailman.uni-konstanz.de] On Behalf Of Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.groo...@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1
As for me, I definitely want the CSV as XML. But the performance
problems certainly have nothing to do with XML versus CSV (I often deal with
XML larger than 300 MB, which is parsed
As for me, I definitely want the CSV as XML. But the performance
problems certainly have nothing to do with XML versus CSV (I often deal with
XML larger than 300 MB, which is parsed very fast!) - it is the parsing
operation itself which, if I'm not mistaken, is handled by XQuery code and which
I'm currently dealing with CSV a lot as well. I tend to use the
format=map approach, but have not yet worked with files nearly as large
as 22 MB. I'm
wondering if, or how much more efficient it is to deal with this type
of data as arrays and map data structures versus XML. For most
processing I can leave serializing
Joe, just to back you: I believe that an EXPath spec for CSV processing would
be *extremely* useful! (There is hardly a format as ubiquitous as CSV.)
And I had a similar experience with performance: concretely, a 22 MB
file proved to be simply unprocessable! Which means that BaseX