Re: Splitting fields

2011-05-31 Thread Erick Erickson
Hmmm, I wonder if a custom Transformer would help here? It can be inserted into
a chain of transformers in DIH.

Essentially, you subclass Transformer and implement one method (transformRow)
and do anything you want. The input is a map of String, Object that is a
simple representation of the Solr document. You can add/subtract/whatever
you want to that map and then just return it.

The map passed to transformRow already reflects the changes made by any
earlier entries in the transform chain, and your changes are passed on to
the next transformer in the chain.

The only restriction I know of is that the document has to conform to the
schema when all is said and done.
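
A rough sketch of what such a transformer's row logic might look like. It
is written as a stand-alone class so it compiles without Solr on the
classpath; in a real DIH setup the class would extend
org.apache.solr.handler.dataimport.Transformer, whose transformRow also
receives a Context argument. The class name, the custom1_ prefix for the
new fields, and the assumption that custom1 holds a flat JSON object with
plain string values (no nesting, no escaped quotes) are all illustrative,
not something from the original setup.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of a DIH-style row transformer.
// A real one would extend org.apache.solr.handler.dataimport.Transformer.
class Custom1SplitTransformer {

    // Assumption: flat JSON object, string values only, no escaped quotes.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    public Object transformRow(Map<String, Object> row) {
        Object raw = row.get("custom1");
        if (raw == null) {
            return row; // nothing to split, pass the row on unchanged
        }
        Matcher m = PAIR.matcher(raw.toString());
        while (m.find()) {
            // Each JSON key becomes its own field, e.g. custom1_name.
            row.put("custom1_" + m.group(1), m.group(2));
        }
        return row; // handed to the next transformer in the chain
    }
}
```

The generated field names would still have to exist in the schema, for
example through a dynamic field pattern.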

Best
Erick





Re: Splitting fields

2011-05-31 Thread Jan Høydahl
Hi,

Write a custom UpdateProcessor, which gives you full control of the
SolrDocument prior to indexing. Best would be to write a generic
FieldSplitterProcessor that is configurable as to which field to take as
input, which delimiter or regex to split on, and which fields to write the
results to. That way others can reuse your code for their own splitting needs.

See http://wiki.apache.org/solr/UpdateRequestProcessor and 
http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section
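
A minimal sketch of the splitting logic such a processor could carry,
assuming a delimiter-separated value rather than JSON. It works on a plain
Map so it runs without Solr; in an actual processor this logic would sit in
the processAdd method of an UpdateRequestProcessor subclass created by an
UpdateRequestProcessorFactory and would operate on the incoming document
instead. The class name and the field names in the usage note are
hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits one source field into several target fields.
class FieldSplitter {
    private final String sourceField;
    private final String delimiterRegex;
    private final List<String> targetFields;

    public FieldSplitter(String sourceField, String delimiterRegex,
                         List<String> targetFields) {
        this.sourceField = sourceField;
        this.delimiterRegex = delimiterRegex;
        this.targetFields = targetFields;
    }

    public void split(Map<String, Object> doc) {
        Object raw = doc.get(sourceField);
        if (raw == null || targetFields.isEmpty()) {
            return; // nothing to split or nowhere to put it
        }
        // The limit keeps any surplus delimiters inside the last target field.
        String[] parts = raw.toString().split(delimiterRegex, targetFields.size());
        for (int i = 0; i < parts.length; i++) {
            doc.put(targetFields.get(i), parts[i].trim());
        }
    }
}
```

For example, new FieldSplitter("custom1", "\\|", Arrays.asList("customer_id",
"customer_name")) applied to a document whose custom1 is "42|Acme Corp"
would add customer_id and customer_name alongside the original field.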

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Re: Splitting fields

2011-05-31 Thread Markus Jelsma
I'd go for the custom update processor option as well. The example update
processor couldn't make it any easier, and it's a very flexible approach.
Judging from the patch in SOLR-2105, it should still work with the current
3.2 branch.

https://issues.apache.org/jira/browse/SOLR-2105




Splitting fields

2011-05-27 Thread Joe Fitzgerald
Hello,

I am in an odd position.  The application server I use has built-in
integration with Solr.  Unfortunately, its native capabilities are
fairly limited; specifically, it only supports a standard, pre-defined
set of fields that can be indexed.  As a result, I've been left
kludging how I work with Solr, doing things like putting what I'd
like to be multiple, separate fields into a single Solr field.

As an example, I may put a customer id and name into a single field
called 'custom1'.  Ideally, I'd like this information to be returned in
separate fields... and even better would be for them to be indexed as
separate fields, but I can live without the latter.  Currently, I'm
building out a JSON representation of this information, which makes it
easy for me to deal with when I extract the results... but it all feels
wrong.

I do have complete control over the actual Solr installation (just not
the indexing call to Solr), so I was hoping there might be a way to
configure Solr to take my single field and split it up into a different
field for each key in my JSON representation.

I don't see anything native to Solr that would do this for me, but there
are a few features that sounded similar, and I was hoping to get some
opinions on how I might move forward with this...

Poly fields, such as the spatial location, might help?  Can I build my
own poly-field that would split up the main field into subfields?  Do
poly-fields let me return the subfields?  I don't quite have my head
around polyfields yet.

Another option, although I suspect it won't be considered a good
approach: what about extending the copyField functionality of
schema.xml to support my needs?  It would seem not entirely unreasonable
for copyField to provide a means of extracting only a portion of the
contents of the source field to place in the destination field, no?  I'm
sure people more familiar with Solr's architecture could explain why
this isn't really an appropriate thing for Solr to handle (just because
it could doesn't mean it should)...

The other (and probably best) option would be to use Solr directly,
bypassing the native integration of my application server, which we've
already done for most cases.  I'd love to go this route, but I'm having
a hard time figuring out how to easily accomplish the same functionality
provided by my app server integration... perhaps someone on the list
could help me with this path forward?  Here is what I'm trying to
accomplish:

I'm indexing documents (text, pdf, html...) but I need to include fields
in the results of my searches which are only available from a db query.
I know how to have Solr index results from a db query, but I'm having
trouble getting it to index the documents that are associated with each
record of that query (full path/filename is one of the fields of that
query).

I started to try to use the DataImportHandler to do this, by setting up
a FileDataSource in addition to my JDBC data source.  I tried to
leverage the FileDataSource to populate a sub-entity based on the db
field that contains the full path/filename, but I wasn't sure how to
specify the db field from the root query/entity.  Before spending too
much time on it, I also realized I wasn't sure how to get Solr to deal
with binary file types this way either.  From further reading, it seems
I would need to leverage Tika.  Can that be done within the confines of
the DataImportHandler?

Any advice is greatly appreciated.  Thanks in advance,

Joe