s/provide and/provide any/ig ,-)

On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen <
helmut...@googlemail.com> wrote:

> Hi,
>
> just looked at your code. Definitely an improvement :-)
>
> The problem with the double quotes is that the delimiter (let's say ',')
> might be part of the column value. The goal is to process something like
> this without any tricky configuration:
>
> name1,name2,name3
> val1,"val2,...",val3
> ...
>
> The user should not have to provide and before-hand knowledge regarding the
> column layout or the encoding of the CSV file. Ideally the only thing that
> has to be specified is firstLineHasFieldnames="true" separator=";".
> Autodetecting the separator and encoding would be even more elegant.
>
> If nobody else has this in the works I will start building such a patch
> next week.
>
> Best Regards
>
>
> On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James <james.d...@ingrambook.com> wrote:
>
>> Helmut,
>>
>> I recently submitted SOLR-2549 (
>> https://issues.apache.org/jira/browse/SOLR-2549) to handle both
>> fixed-width and delimited flat files.  To be honest, I only needed
>> fixed-width support for my app so this might not support everything you
>> mention for delimited files, but it should be a good start.
>>
>> In particular, you might need to enhance this to handle the double quotes.
>> I had thought a delimiter regex along these lines might handle it:
>>  (?:[\"]?[,]|[\"]$)
>> Note this is a sample I just cooked up quickly and no doubt has errors, and
>> maybe, as you say, a simple regex might not work at all. I also didn't do
>> anything with encodings, but I'm not sure this will be an issue either...
>>
>> James Dyer
>> E-Commerce Systems
>> Ingram Content Group
>> (615) 213-4311
>>
>> -----Original Message-----
>> From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
>> Sent: Thursday, June 09, 2011 2:32 PM
>> To: solr-user@lucene.apache.org
>> Subject: Processing/Indexing CSV
>>
>> Hi,
>>
>> there seems to be no way to index CSV using the DataImportHandler.
>>
>> Using a combination of
>> LineEntityProcessor<
>> http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
>>  and RegexTransformer<
>> http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
>> as proposed in
>> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
>> is not working for real-world CSV files.
>>
>> E.g. many CSV files have double quotes enclosing some but not all columns;
>> there is no elegant way to segment this using a simple regular expression.
>>
>> As CSV is still very common esp. in E-Commerce scenarios, I propose that
>> Solr provides a CSVEntityProcessor that:
>> 1) Handles CSV files in which all, none, or only some of the columns are
>> enclosed in double quotes
>> 2) Allows for a configurable column separator (';',',','\t' etc.)
>> 3) Allows for a leading row containing column headings
>> 4) If there is a leading row with column headings, provides a way to
>> address columns by their column names and map them to Solr fields (similar
>> to the XPathEntityProcessor)
>> 5) Auto-detects encoding of the file (UTF-8 etc.)
>>
>> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>>
>> If there is no such entity processor in the works I will develop one ...
>> So please let me know.
>>
>> Regards
>>
>
>
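
For illustration only, here is a minimal sketch of the quote-aware parsing discussed in this thread. It is written in Python using the standard csv module, not Solr or DataImportHandler code. The function name read_csv_rows and its parameters are hypothetical; they loosely mirror the firstLineHasFieldnames and separator attributes proposed above. Encoding and separator auto-detection are not attempted here.

```python
import csv
import io


def read_csv_rows(text, separator=",", first_line_has_fieldnames=True):
    """Yield one dict per data row, keyed by column name (or by index).

    Hypothetical helper: the name and parameters are assumptions made
    for this sketch, not part of any Solr API.
    """
    # csv.reader treats a separator inside double quotes as part of the value.
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar='"')
    rows = iter(reader)

    header = None
    if first_line_has_fieldnames:
        header = next(rows, None)
        if header is None:
            return  # empty input, nothing to yield

    for row in rows:
        if not row:
            continue  # skip blank lines
        # Without a header row, fall back to positional keys "0", "1", ...
        keys = header if header is not None else [str(i) for i in range(len(row))]
        yield dict(zip(keys, row))


sample = 'name1,name2,name3\nval1,"val2,with,commas",val3\n'

# A plain split on the separator breaks the quoted value into pieces.
naive_fields = sample.splitlines()[1].split(",")

# The quote-aware reader keeps the quoted value in one column.
parsed_rows = list(read_csv_rows(sample))
```

With the sample above, the plain split produces five fragments for the data line, while the quote-aware reader returns three columns addressed by the names from the first line. A semicolon-separated file would be read by passing separator=";".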
