Re: Splitting fields
Hmmm, I wonder if a custom Transformer would help here? It can be inserted into a chain of transformers in DIH. Essentially, you subclass Transformer and implement one method (transformRow), in which you can do anything you want. The input is a Map<String, Object> that is a simple representation of the Solr document. You can add, remove, or change whatever you want in that map and then just return it. By the time transformRow is called, the map already reflects the changes made by any earlier entries in the transform chain, and your changes are passed on to the next transformer in the chain. The only restriction I know of is that the document has to conform to the schema when all is said and done.

Best,
Erick

On Fri, May 27, 2011 at 6:47 AM, Joe Fitzgerald <joe_fitzger...@oxfordcorp.com> wrote:
[...]
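To make the transformRow suggestion concrete, here is a minimal standalone sketch of the splitting logic such a transformer could contain. The class name, the field names (custom1, cust_id, cust_name), and the pipe-delimited value format are all assumptions for illustration - a real DIH transformer would expose this same method and receive the row map from the transform chain:

```java
import java.util.Map;

// Sketch of the splitting logic a custom DIH transformer could apply.
// In a real transformer, this body would live in transformRow, which
// DIH calls once per row as part of the transformer chain.
public class FieldSplitTransformer {

    // Splits a hypothetical "custom1" value like "42|Acme Corp" into
    // separate cust_id and cust_name entries in the row map.
    public Map<String, Object> transformRow(Map<String, Object> row) {
        Object raw = row.get("custom1");
        if (raw == null) {
            return row; // nothing to split; pass the row through unchanged
        }
        String[] parts = raw.toString().split("\\|", 2);
        if (parts.length == 2) {
            row.put("cust_id", parts[0]);
            row.put("cust_name", parts[1]);
        }
        return row; // changes are visible to the next transformer in the chain
    }
}
```

The returned map is the same mutable map that was passed in, matching the add/subtract/return contract described above.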
Re: Splitting fields
Hi,

Write a custom UpdateProcessor, which gives you full control of the SolrInputDocument prior to indexing. Best of all would be to write a generic FieldSplitterProcessor that is configurable: which field to take as input, what delimiter or regex to split on, and which fields to write the results to. That way, others may re-use your code for their own splitting needs.

See http://wiki.apache.org/solr/UpdateRequestProcessor and http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 27. mai 2011, at 15.47, Joe Fitzgerald wrote:
[...]
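A standalone sketch of the configurable splitting core described above - source field, split regex, and target field names are all constructor parameters. The class name FieldSplitter is made up, and a real UpdateRequestProcessor would read these three settings from its init params and apply the same logic to each SolrInputDocument in processAdd:

```java
import java.util.Map;

// Configurable splitting core: which field to read, what regex to split on,
// and which fields to write the pieces to. A real update processor would
// wrap this and invoke it on every incoming document.
public class FieldSplitter {
    private final String sourceField;
    private final String splitRegex;
    private final String[] targetFields;

    public FieldSplitter(String sourceField, String splitRegex, String... targetFields) {
        this.sourceField = sourceField;
        this.splitRegex = splitRegex;
        this.targetFields = targetFields;
    }

    // Splits doc[sourceField] and writes one piece per target field,
    // leaving the document untouched if the source field is missing
    // or does not yield enough pieces.
    public void apply(Map<String, Object> doc) {
        Object raw = doc.get(sourceField);
        if (raw == null) {
            return;
        }
        String[] parts = raw.toString().split(splitRegex, targetFields.length);
        if (parts.length < targetFields.length) {
            return;
        }
        for (int i = 0; i < targetFields.length; i++) {
            doc.put(targetFields[i], parts[i]);
        }
    }
}
```

Keeping the field names and regex out of the code is what makes the processor reusable for other people's splitting needs.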
Re: Splitting fields
I'd go for this option as well. The example update processor couldn't make it any easier, and it's a very flexible approach. Judging from the patch in SOLR-2105, it should still work with the current 3.2 branch. https://issues.apache.org/jira/browse/SOLR-2105

[...]
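For reference, a custom update processor like the one suggested is wired up in solrconfig.xml as an update request processor chain. A sketch, assuming a hypothetical com.example.FieldSplitterProcessorFactory (the two solr.* factories are stock Solr processors that log the update and execute it):

```xml
<!-- solrconfig.xml: register the custom processor in an update chain -->
<updateRequestProcessorChain name="splitfields">
  <!-- hypothetical custom factory wrapping the field-splitting logic -->
  <processor class="com.example.FieldSplitterProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then selected per request with the update.chain parameter, or set as a default on the /update request handler.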
Splitting fields
Hello,

I am in an odd position. The application server I use has built-in integration with Solr. Unfortunately, its native capabilities are fairly limited; specifically, it only supports a standard, pre-defined set of fields which can be indexed. As a result, it has left me kludging how I work with Solr, doing things like putting what I'd like to be multiple, separate fields into a single Solr field. As an example, I may put a customer id and name into a single field called 'custom1'.

Ideally, I'd like this information to be returned in separate fields. Even better would be for them to be indexed as separate fields, but I can live without the latter. Currently, I'm building out a JSON representation of this information, which makes it easy for me to deal with when I extract the results, but it all feels wrong. I do have complete control over the actual Solr installation (just not the indexing call to Solr), so I was hoping there may be a way to configure Solr to take my single field and split it up into a different field for each key in my JSON representation.

I don't see anything native to Solr that would do this for me, but there are a few features that sounded similar, and I was hoping to get some opinions on how I might move forward with this...

Poly fields, such as the spatial location type, might help? Can I build my own poly field that would split the main field into subfields? Do poly fields let me return the subfields? I don't quite have my head around poly fields yet.

Another option, although I suspect this won't be considered a good approach: what about extending the copyField functionality of schema.xml to support my needs? It would seem not entirely unreasonable for copyField to provide a means to extract only a portion of the contents of the source field to place in the destination field, no? I'm sure people more familiar with Solr's architecture could explain why this isn't really an appropriate thing for Solr to handle (just because it could doesn't mean it should)...

The other - and probably best - option would be to use Solr directly, bypassing the native integration of my application server, which we've already done for most cases. I'd love to go this route, but I'm having a hard time figuring out how to easily accomplish the same functionality my app server integration provides. Perhaps someone on the list could help me with this path forward? Here is what I'm trying to accomplish: I'm indexing documents (text, PDF, HTML...), but I need to include fields in my search results which are only available from a db query. I know how to have Solr index the results of a db query, but I'm having trouble getting it to index the documents associated with each record of that query (the full path/filename is one of the fields of that query).

I started to use the DataImportHandler for this, setting up a FileDataSource in addition to my JDBC data source. I tried to use the FileDataSource to populate a sub-entity based on the db field that contains the full path/filename, but I wasn't sure how to reference the db field from the root query/entity. Before I spent too much time on it, I also realized I wasn't sure how to get Solr to deal with binary file types this way; from further reading, it seemed I would need to leverage Tika - can that be done within the confines of DataImportHandler?

Any advice is greatly appreciated.

Thanks in advance,
Joe
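For what it's worth, a DIH sub-entity can reference a column from its parent entity with ${parentEntityName.COLUMN} syntax, and Tika extraction is available inside DIH via TikaEntityProcessor, backed by a BinFileDataSource (Tika needs a binary stream, which plain FileDataSource does not provide). A data-config.xml sketch under those assumptions - the table, column, and field names are made up, and the JDBC details are placeholders:

```xml
<dataConfig>
  <!-- JDBC driver and URL are placeholders -->
  <dataSource name="db" driver="..." url="jdbc:..."/>
  <!-- Tika needs an InputStream, so use BinFileDataSource, not FileDataSource -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="doc" dataSource="db"
            query="SELECT id, cust_id, cust_name, file_path FROM docs">
      <!-- ${doc.file_path} resolves to the file_path column of the parent row -->
      <entity name="file" processor="TikaEntityProcessor" dataSource="bin"
              url="${doc.file_path}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Each parent row's extracted document text lands in the content field alongside the db columns, which is the combined db-plus-file indexing the message describes.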