Re: Processing/Indexing CSV

2011-06-10 Thread Erick Erickson
Well, here's a place to start if you want to patch the code: http://wiki.apache.org/solr/HowToContribute If you do want to take this on, hop on over to the dev list and start a discussion. I'd start with some posts on that list before entering or working on a JIRA issue, just ask for some

Re: Processing/Indexing CSV

2011-06-10 Thread Helmut Hoffer von Ankershoffen
Hi, thanks for the intro, will do next week :-) Greetings from Berlin On Fri, Jun 10, 2011 at 2:49 PM, Erick Erickson erickerick...@gmail.com wrote: Well, here's a place to start if you want to patch the code: http://wiki.apache.org/solr/HowToContribute If you do want to take this on, hop

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi, to make my point clearer: if the CSV has a fixed schema / column layout, using the RegexTransformer is of course a possibility (however awkward). But if you want to implement a (more or less) schema-free shopping search engine ... regards On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von

RE: Processing/Indexing CSV

2011-06-09 Thread Dyer, James
Helmut, I recently submitted SOLR-2549 (https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width and delimited flat files. To be honest, I only needed fixed-width support for my app so this might not support everything you mention for delimited files, but it should be a

Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, there seems to be no way to index CSV using the DataImportHandler. Looking over the features you want, it looks like you're starting from a CSV file (as opposed to CSV stored in a database). Is

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi, just looked at your code. Definitely an improvement :-) The problem with the double quotes is that the delimiter (let's say ',') might be part of the column value. The goal is to process something like this without any tricky configuration: name1,name2,name3 val1,val2,...,val3 ... The user
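
To illustrate the quoting problem described above: a minimal sketch, assuming a header row followed by data rows in which a value may contain the delimiter when wrapped in double quotes. It uses Python's standard `csv` module rather than Solr code; the sample data and the helper name `parse_csv` are hypothetical and not taken from the thread.

```python
import csv
import io

# Hypothetical sample: the data row has a comma inside a double-quoted value.
SAMPLE = 'name1,name2,name3\nval1,"val2, with comma",val3\n'


def parse_csv(text, delimiter=",", quotechar='"'):
    """Parse CSV text into a list of dicts keyed by the header row.

    The csv module treats a delimiter inside a quoted value as part of
    the value, so no regular-expression splitting is needed.
    """
    reader = csv.DictReader(
        io.StringIO(text), delimiter=delimiter, quotechar=quotechar
    )
    return [dict(row) for row in reader]


rows = parse_csv(SAMPLE)
print(rows)
```

A naive `line.split(",")` would break the second column into two pieces here, which is the failure mode the message points at.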

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
s/provide and/provide any/ig ,-) On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, just looked at your code. Definitely an improvement :-) The problem with the double quotes is that the delimiter (let's say ',') might be part of the

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. DIH is flexible enough for building the importing part of such a thing but misses elegant handling of CSV data ... Regards On Thu, Jun 9,

Re: Processing/Indexing CSV

2011-06-09 Thread Yonik Seeley
On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of
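
The reply above is cut off, so what follows is an assumption: if it refers to the `fieldnames` request parameter of Solr's CSV update handler, a request URL could be assembled roughly as below. The base URL, the field names, and the helper `csv_update_url` are hypothetical; `fieldnames`, `header`, `separator`, and `encapsulator` are parameters of the CSV update handler.

```python
from urllib.parse import urlencode


def csv_update_url(base_url, fieldnames, separator=",", encapsulator='"'):
    """Build a CSV update URL that supplies its own list of field names.

    With header=true the first line of the file is treated as a header
    line; the explicit fieldnames list is what names the columns.
    """
    params = {
        "fieldnames": ",".join(fieldnames),
        "header": "true",
        "separator": separator,
        "encapsulator": encapsulator,
    }
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance and field names, for illustration only.
url = csv_update_url(
    "http://localhost:8983/solr/update/csv",
    ["id", "title", "price"],
    separator=";",
)
print(url)
```

This covers one layout per request; each shop with a different column order would need its own `fieldnames` list.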

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario, however, is different CSV files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in
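
The per-shop mapping idea can be sketched as follows. This is only an illustration under stated assumptions, not the approach from the thread or any Solr API: the shop names, column names, target field names, and the `SHOP_LAYOUTS` structure are all hypothetical.

```python
import csv
import io

# Hypothetical per-shop layouts: each shop has its own separator, encoding,
# and a map from its column names to the index's field names.
SHOP_LAYOUTS = {
    "shop_a": {
        "delimiter": ",",
        "encoding": "utf-8",
        "field_map": {"Name": "title", "Preis": "price"},
    },
    "shop_b": {
        "delimiter": ";",
        "encoding": "latin-1",
        "field_map": {"produkt": "title", "price_eur": "price"},
    },
}


def normalize(shop, raw_bytes):
    """Decode one shop's CSV feed and rename known columns.

    Columns without an entry in the shop's field_map are dropped.
    """
    layout = SHOP_LAYOUTS[shop]
    text = raw_bytes.decode(layout["encoding"])
    reader = csv.DictReader(io.StringIO(text), delimiter=layout["delimiter"])
    docs = []
    for row in reader:
        doc = {
            target: row[source]
            for source, target in layout["field_map"].items()
            if source in row
        }
        docs.append(doc)
    return docs


docs_a = normalize("shop_a", b"Name,Preis\nLamp,19.90\n")
docs_b = normalize("shop_b", "produkt;price_eur\nSt\u00fchle;45.00\n".encode("latin-1"))
print(docs_a, docs_b)
```

Keeping the layout in data rather than code means a new shop only needs a new entry in the table, which matches the "no tricky configuration" goal stated earlier in the thread.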

Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler
On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario, however, is different CSV files (from different shops) with individual column layouts (separators, encodings

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler kkrugler_li...@transpac.com wrote: On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different

Re: Processing/Indexing CSV

2011-06-09 Thread Helmut Hoffer von Ankershoffen
Hi, btw: there seems to be somewhat of a mismatch between the efforts to enhance DIH regarding the CSV format (James Dyer) and the effort to maintain the CSVLoader (Ken Krugler). How about merging your efforts and migrating the CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-) Best

Re: Processing/Indexing CSV

2011-06-09 Thread Ken Krugler
On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote: Hi, btw: there seems to be somewhat of a mismatch between the efforts to enhance DIH regarding the CSV format (James Dyer) and the effort to maintain the CSVLoader (Ken Krugler). How about merging your efforts and migrating the