Hi,

just looked at your code. Definitely an improvement :-)

The problem with the double quotes is that the delimiter (let's say ',')
might be part of a column value. The goal is to process something like
this without any tricky configuration:

name1,name2,name3
val1,"val2,...",val3
...
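
To make the quoting problem concrete, here is a minimal sketch of a
quote-aware line splitter. The class and method names (CsvLineSplitter,
split) are made up for illustration and are not part of Solr or SOLR-2549.
It also assumes that a doubled quote ("") inside a quoted column stands for
a literal quote, and that a record never spans more than one line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing Solr class.
final class CsvLineSplitter {

    // Splits one line into columns; separators inside double quotes are kept as data.
    static List<String> split(String line, char separator) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // Assumption: "" inside a quoted column is an escaped quote.
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // includes separators inside quotes
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote, not part of the value
            } else if (c == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last column has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        // The sample row from above: the comma inside the quotes stays in column 2.
        System.out.println(split("val1,\"val2,...\",val3", ','));
    }
}
```

A character-by-character scan like this is one way around the "no simple
regex" problem, because it keeps track of whether it is inside a quoted
column.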

The user should not have to provide any beforehand knowledge regarding the
column layout or the encoding of the CSV file. Ideally the only things that
have to be specified are firstLineHasFieldnames="true" and separator=";".
Autodetecting the separator and encoding would be even more elegant.
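
For the separator, one possible heuristic (just a sketch, not a worked-out
design) is to count each candidate character outside of double quotes in
the first line and pick the most frequent one. SeparatorGuesser and its
candidate list are hypothetical; ties go to the candidate listed first, and
encoding detection is not covered here.

```java
// Hypothetical helper, not an existing Solr class.
final class SeparatorGuesser {

    // Assumed candidate set; ties are resolved in this order.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    // Returns the candidate that occurs most often outside quotes in the header line.
    static char guess(String headerLine) {
        char best = CANDIDATES[0];
        int bestCount = -1;
        for (char candidate : CANDIDATES) {
            int count = countOutsideQuotes(headerLine, candidate);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    // Counts occurrences of target that are not enclosed in double quotes.
    static int countOutsideQuotes(String line, char target) {
        boolean inQuotes = false;
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // a doubled quote toggles twice, so state is unchanged
            } else if (c == target && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(guess("name1;name2;name3"));
    }
}
```

Looking at a single line can obviously guess wrong, so an explicit
separator setting should still be able to override it.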

If nobody else has this in the works I will start building such a patch next
week.

Best Regards


On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James <james.d...@ingrambook.com> wrote:

> Helmut,
>
> I recently submitted SOLR-2549 (
> https://issues.apache.org/jira/browse/SOLR-2549) to handle both
> fixed-width and delimited flat files.  To be honest, I only needed
> fixed-width support for my app so this might not support everything you
> mention for delimited files, but it should be a good start.
>
> In particular, you might need to enhance this to handle the double quotes
> (I had thought a delimiter regex along these lines might handle it:
>  (?:[\"]?[,]|[\"]$)  ... note this is a sample I just cooked up quick and no
> doubt has errors, and maybe as you say a simple regex might not work at all
> ) ... I also didn't do anything with encodings but I'm not sure this will be
> an issue either...
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -----Original Message-----
> From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
> Sent: Thursday, June 09, 2011 2:32 PM
> To: solr-user@lucene.apache.org
> Subject: Processing/Indexing CSV
>
> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.
>
> Using a combination of LineEntityProcessor
> <http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor> and
> RegexTransformer
> <http://wiki.apache.org/solr/DataImportHandler#RegexTransformer> as
> proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real-world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files in which none, some, or all of the columns are
> enclosed in double quotes
> 2) Allows for a configurable column separator (';',',','\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings provides a possibility to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works I will develop one ... So
> please let me know.
>
> Regards
>
