Hi,

There seems to be no way to index CSV files using the DataImportHandler.

The combination of LineEntityProcessor
<http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor> and
RegexTransformer
<http://wiki.apache.org/solr/DataImportHandler#RegexTransformer> proposed in
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
does not work for real-world CSV files.

For example, many CSV files enclose some but not all columns in double
quotes, and there is no elegant way to split such lines using a simple
regular expression.
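
To make the problem concrete, here is a small Java sketch showing how a plain
separator split breaks as soon as a quoted column contains the separator. The
class name and the sample row are made up for illustration and are not taken
from any real data set or from Solr itself.

```java
import java.util.Arrays;
import java.util.List;

final class NaiveSplitDemo {
    // Splits on every comma and ignores quoting, which is roughly what a
    // simple "split on separator" regular expression does.
    static List<String> naiveSplit(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        // Hypothetical product row: only the second column is quoted.
        String line = "1001,\"Shirt, blue\",19.99";
        // The quoted column is cut in two, so four pieces come back
        // instead of the expected three columns.
        System.out.println(naiveSplit(line).size());
    }
}
```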

As CSV is still very common, especially in e-commerce scenarios, I propose
that Solr provide a CSVEntityProcessor that:
1) Handles CSV files in which all, some, or none of the columns are enclosed
in double quotes
2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, makes it possible to
address columns by their names and map them to Solr fields (similar to the
XPathEntityProcessor)
5) Auto-detects encoding of the file (UTF-8 etc.)
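
As a rough illustration of points 1) to 4), the following Java sketch parses a
single line with optional double quotes and a configurable separator, and maps
the values to the names from a header row. It is only a sketch under stated
assumptions: the class and method names are hypothetical, it does not use the
DataImportHandler API, it does not handle quoted fields that span several
lines, and encoding detection (point 5) is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CsvLineSketch {
    // Splits one CSV line on `sep`, honouring optional double quotes.
    // Inside a quoted column, "" stands for a literal double quote.
    static List<String> parseLine(String line, char sep) {
        List<String> cols = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    cur.append(c); // separators are plain text here
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == sep) {
                cols.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        cols.add(cur.toString()); // last column
        return cols;
    }

    // Pairs header names with values so columns can be addressed by name.
    // Extra values or extra header names are silently ignored.
    static Map<String, String> toRow(List<String> header, List<String> values) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.size() && i < values.size(); i++) {
            row.put(header.get(i), values.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical file content: ';' as separator, one quoted column.
        List<String> header = parseLine("id;name;price", ';');
        List<String> values = parseLine("1001;\"Shirt; blue\";19.99", ';');
        System.out.println(toRow(header, values).get("name"));
    }
}
```

In an actual entity processor, the map returned per line would presumably
become the row handed on to the field mapping, but that wiring is not shown
here.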

This would make it A LOT easier to use Solr for E-Commerce scenarios.

If no such entity processor is in the works, I will develop one, so please
let me know.

Regards
