s/provide and/provide any/ig ,-)

On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen <helmut...@googlemail.com> wrote:
> Hi,
>
> just looked at your code. Definitely an improvement :-)
>
> The problem with the double-quotes is, that the delimiter (let's say ',')
> might be part of the column value. The goal is to process something like
> this without any tricky configuration
>
> name1,name2,name3
> val1,"val2,...",val3
> ...
>
> The user should not have to provide and before-hand knowledge regarding the
> column layout or the encoding of the CSV file. Ideally the only thing that
> has to be specified is firstLineHasFieldnames="true" separator=";".
> Autodetecting the separator and encoding would be even more elegant.
>
> If nobody else has this in the works I will start building such a patch
> next week.
>
> Best Regards
>
>
> On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James <james.d...@ingrambook.com> wrote:
>
>> Helmut,
>>
>> I recently submitted SOLR-2549
>> (https://issues.apache.org/jira/browse/SOLR-2549) to handle both
>> fixed-width and delimited flat files. To be honest, I only needed
>> fixed-width support for my app so this might not support everything you
>> mention for delimited files, but it should be a good start.
>>
>> In particular, you might need to enhance this to handle the double quotes
>> (I had thought a delimiter regex along these lines might handle it:
>> (?:[\"]?[,]|[\"]$) ... note this is a sample I just cooked up quick and no
>> doubt has errors, and maybe as you say a simple regex might not work at all
>> ) ... I also didn't do anything with encodings but I'm not sure this will be
>> an issue either...
>>
>> James Dyer
>> E-Commerce Systems
>> Ingram Content Group
>> (615) 213-4311
>>
>> -----Original Message-----
>> From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
>> Sent: Thursday, June 09, 2011 2:32 PM
>> To: solr-user@lucene.apache.org
>> Subject: Processing/Indexing CSV
>>
>> Hi,
>>
>> there seems to be no way to index CSV using the DataImportHandler.
>>
>> Using a combination of
>> LineEntityProcessor
>> <http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
>> and RegexTransformer
>> <http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
>> as proposed in
>> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
>> is not working for real world CSV files.
>>
>> E.g. many CSV files have double-quotes enclosing some but not all columns -
>> there is no elegant way to segment this using a simple regular expression.
>>
>> As CSV is still very common esp. in E-Commerce scenarios, I propose that
>> Solr provides a CSVEntityProcessor that:
>> 1) Handles the case of CSV files with/without and with some double-quote
>>    enclosed columns
>> 2) Allows for a configurable column separator (';',',','\t' etc.)
>> 3) Allows for a leading row containing column headings
>> 4) If there is a leading row with column headings provides a possibility to
>>    address columns by their column names and map them to Solr fields (similar
>>    to the XPathEntityProcessor)
>> 5) Auto-detects encoding of the file (UTF-8 etc.)
>>
>> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>>
>> If there is no such entity processor in the works I will develop one ...
>> So please let me know.
>>
>> Regards
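
For illustration only: the quoting problem discussed in this thread can be sketched in a few lines of Python. This is not Solr code and not the proposed CSVEntityProcessor; it only shows why splitting on the separator breaks on quoted columns, and how a quote-aware reader maps values to the names in the heading row. The names read_csv_rows, SAMPLE, separator and first_line_has_fieldnames are hypothetical, chosen to mirror the separator and firstLineHasFieldnames attributes suggested above.

```python
import csv
import io

# Hypothetical sample: the second column is quoted and contains the separator.
SAMPLE = 'name1;name2;name3\nval1;"val2;with;separators";val3\n'


def read_csv_rows(text, separator=",", first_line_has_fieldnames=True):
    """Yield one dict per data row, honouring double-quoted columns."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar='"')
    rows = iter(reader)

    fieldnames = None
    if first_line_has_fieldnames:
        # The heading row supplies the keys used for every later row.
        fieldnames = next(rows, None)
        if fieldnames is None:
            return  # empty input, nothing to yield

    for row in rows:
        if not row:
            continue  # skip blank lines
        # Without a heading row, fall back to positional names.
        names = fieldnames or [f"column{i + 1}" for i in range(len(row))]
        yield dict(zip(names, row))


if __name__ == "__main__":
    # A naive split cuts the quoted column into pieces ...
    print(SAMPLE.splitlines()[1].split(";"))
    # ... while the quote-aware reader keeps it intact.
    for record in read_csv_rows(SAMPLE, separator=";"):
        print(record)
```

Encoding detection and separator auto-detection are left out of this sketch; both would need extra logic on top of the reader.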