On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen
<helmut...@googlemail.com> wrote:
> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.

Looking over the features you want, it looks like you're starting from
a CSV file (as opposed to CSV stored in a database).
Is there a reason that you need to use DIH and can't directly use the
CSV loader?
http://wiki.apache.org/solr/UpdateCSV
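
For illustration, a minimal sketch of sending a semicolon-separated file
to the CSV loader from Python. It assumes a default example install at
http://localhost:8983/solr with the handler mapped to /update/csv, and it
uses the separator, encapsulator, header and commit parameters as described
on the wiki page above. The helper names build_csv_update_url and post_csv
are hypothetical; adjust the URL and parameters to your setup.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_csv_update_url(base_url, separator=",", encapsulator='"', commit=True):
    """Build the request URL for the CSV loader (hypothetical helper)."""
    params = {
        "separator": separator,        # column separator, e.g. ';' or '\t'
        "encapsulator": encapsulator,  # character that may enclose a value
        "header": "true",              # first line holds the field names
        "commit": "true" if commit else "false",
    }
    # urlencode percent-escapes characters such as ';' and '"'
    return base_url.rstrip("/") + "/update/csv?" + urlencode(params)


def post_csv(url, csv_bytes):
    """POST raw CSV bytes to the loader and return the HTTP status code."""
    req = Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urlopen(req) as resp:
        return resp.status


# Assumed base URL of a local example install; not taken from this thread.
url = build_csv_update_url("http://localhost:8983/solr", separator=";")
```

Calling post_csv(url, open("products.csv", "rb").read()) would then send
the file, where products.csv is a placeholder name.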


-Yonik
http://www.lucidimagination.com



> Using a combination of LineEntityProcessor
> <http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
> and RegexTransformer
> <http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
> as proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real-world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files in which all, some, or none of the columns are
> enclosed in double quotes
> 2) Allows for a configurable column separator (';', ',', '\t', etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings, provides a way to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
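
As a rough illustration of points 1 to 4 (not the proposed
CSVEntityProcessor itself), here is a sketch using Python's standard csv
module. The function name read_csv_rows, the sample data and the field
mapping are made up for the example; encoding detection (point 5) is not
covered.

```python
import csv
import io


def read_csv_rows(text, delimiter=",", field_map=None):
    """Parse CSV text with a header row into a list of dicts.

    field_map optionally renames CSV column names to target field names;
    columns not listed keep their original name.
    """
    field_map = field_map or {}
    # DictReader takes the first row as column names; quotechar lets a
    # value be quoted or unquoted on a per-column basis.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    docs = []
    for row in reader:
        docs.append({field_map.get(col, col): val for col, val in row.items()})
    return docs


# Only the second column of the first data row is quoted, and it contains
# the separator character itself.
sample = 'sku;title;price\n1001;"Lamp; brass";19.90\n1002;Chair;45.00\n'
docs = read_csv_rows(sample, delimiter=";", field_map={"sku": "id", "title": "name"})
```

A single regular expression has trouble with the first data row because
the separator appears inside a quoted value, which is the case a real CSV
parser handles.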
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works, I will develop one ... So
> please let me know.
>
> Regards
>
