RE: Mapping and Capture in ExtractingRequestHandler
Hi Erick,

Can you please give me a little more information about the SolrJ program and how to use it to construct a Solr document?

Thanks and Regards,
Swapna.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, December 21, 2011 2:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Mapping and Capture in ExtractingRequestHandler

When you start getting into complex HTML extraction, you're probably better off using a SolrJ program with a forgiving HTML parser, extracting the relevant bits yourself, and constructing a SolrDocument.

FWIW,
Erick

On Tue, Dec 20, 2011 at 12:54 AM, Swapna Vuppala swapna.vupp...@arup.com wrote:

Hi,

I understand that we can specify parameters for ExtractingRequestHandler in solrconfig.xml to capture HTML tags of a particular type and map them to the desired Solr fields, like this:

    <str name="capture">div</str>
    <str name="fmap.div">mysolrfield</str>

The above setting captures the content of div tags and copies it to the Solr field mysolrfield.

What I am interested in is capturing div tags with a particular class name into a Solr field. When extracting content from Outlook messages, I would like the content within <div class="message-body"> to go into one Solr field and the content within <div class="attachment-entry"> to go into another.

Can someone please let me know how to achieve this?

Thanks and Regards,
Swapna.

Electronic mail messages entering and leaving Arup business systems are scanned for acceptability of content and viruses
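Erick's suggestion (parse the HTML yourself, pull out the relevant parts, build the document) can be sketched as follows. SolrJ itself is a Java client, where you would typically populate a SolrInputDocument with addField() and send it with the server's add() and commit(). The sketch below only illustrates the extraction step, in Python with the standard library's tolerant html.parser, and produces a plain dict of field names to values. The field names body_text and attachment_names, the document id, and the helper names are assumptions made for illustration; they do not come from the thread.

```python
from html.parser import HTMLParser


class DivClassExtractor(HTMLParser):
    """Collect the text found inside <div> elements with selected class names."""

    def __init__(self, wanted_classes):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted_classes)
        self.captured = {name: [] for name in self.wanted}
        # One entry per currently open <div>: the wanted class it matched, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self._stack.append(match)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to every enclosing div that matched a wanted class.
        for name in {m for m in self._stack if m}:
            self.captured[name].append(text)


def build_document(doc_id, html, field_for_class):
    """Map each CSS class to a (hypothetical) Solr field and return a dict."""
    parser = DivClassExtractor(field_for_class.keys())
    parser.feed(html)
    parser.close()
    doc = {"id": doc_id}
    for css_class, field in field_for_class.items():
        doc[field] = " ".join(parser.captured[css_class])
    return doc


html = (
    '<html><body>'
    '<div class="message-body">Hello team</div>'
    '<div class="attachment-entry">report.pdf</div>'
    '<div class="other">ignored</div>'
    '</body></html>'
)
doc = build_document(
    "msg-1",
    html,
    {"message-body": "body_text", "attachment-entry": "attachment_names"},
)
```

The resulting dict holds one entry per target field and could then be sent to Solr by whatever client you use. The sketch tracks only div nesting, so markup with unclosed div tags may attribute text to the wrong field.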
Mapping and Capture in ExtractingRequestHandler
Hi,

I understand that we can specify parameters for ExtractingRequestHandler in solrconfig.xml to capture HTML tags of a particular type and map them to the desired Solr fields, like this:

    <str name="capture">div</str>
    <str name="fmap.div">mysolrfield</str>

The above setting captures the content of div tags and copies it to the Solr field mysolrfield.

What I am interested in is capturing div tags with a particular class name into a Solr field. When extracting content from Outlook messages, I would like the content within <div class="message-body"> to go into one Solr field and the content within <div class="attachment-entry"> to go into another.

Can someone please let me know how to achieve this?

Thanks and Regards,
Swapna.
RE: Trim and copy a solr field
Hi Juan,

I think UpdateProcessor is what I need. Can you please tell me more about it and how it works?

Thanks and Regards,
Swapna.

-----Original Message-----
From: Juan Grande [mailto:juan.gra...@gmail.com]
Sent: Thursday, December 15, 2011 11:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Trim and copy a solr field

Hi Swapna,

Do you want to modify the *indexed* value or the *stored* value? The analyzers modify the indexed value. To modify the stored value, the only option that I'm aware of is to write an UpdateProcessor that changes the document before it's indexed.

*Juan*

On Tue, Dec 13, 2011 at 2:05 AM, Swapna Vuppala swapna.vupp...@arup.com wrote:

Hi Juan,

Thanks for the reply. I tried using this, but I don't see any effect of the analyzer/filter.

I tried copying my Solr field to another field of the type defined below. Then I indexed a couple of documents with the new schema, but I see that both fields have the same value. I am looking at the indexed data in Luke.

I am assuming that analyzers process the field value (as specified by the various filters etc.) and then store the modified value. Is that true? What else could I be missing here?

Thanks and Regards,
Swapna.

-----Original Message-----
From: Juan Grande [mailto:juan.gra...@gmail.com]
Sent: Monday, December 12, 2011 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Trim and copy a solr field

Hi Swapna,

You could try using a copyField to a field that uses PatternReplaceFilterFactory:

    <fieldType class="solr.TextField" name="path_location">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*" replacement="$1"/>
      </analyzer>
    </fieldType>

The regular expression may not be exactly what you want, but it will give you an idea of how to do it. I'm pretty sure there must be some other ways of doing this, but this is the first that comes to my mind.

*Juan*

On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.com wrote:

Hi,

I have a Solr field that contains the absolute path of the file that is indexed, which will be something like file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.

I am interested in indexing the location in a separate field. I was looking for some way to trim the field value from the last occurrence of the character /, so that I can get the location value, something like file:/myserver/Folder1/SubFol1/Sub-Fol2, and store it in a new field.

Can you please suggest some way to achieve this?

Thanks and Regards,
Swapna.
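Juan's point is that the stored value only changes if the document itself is changed before it is indexed. An UpdateProcessor does that inside Solr and is written in Java; the sketch below is not an UpdateProcessor, only a minimal illustration of the same transformation applied to a document on the client side before it is sent. The field names path and location and the function name are assumptions made for this example.

```python
def add_location_field(doc, source_field="path", target_field="location"):
    """Return a copy of doc with target_field set to source_field cut at its last '/'."""
    enriched = dict(doc)
    path = enriched.get(source_field, "")
    head, sep, _tail = path.rpartition("/")
    # If the value contains no "/" at all, keep it unchanged instead of emptying it.
    enriched[target_field] = head if sep else path
    return enriched


original = {"id": "1", "path": "file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg"}
enriched = add_location_field(original)
```

Because the new field is computed before indexing, both its stored and its indexed value hold the trimmed location, which is the behaviour an analyzer alone does not give you.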
Sorting and searching on a field
Hi,

I have a field in Solr that I want to be sortable. At the same time, I want to be able to search on that field without using wildcards. Is that possible?

For example, if I have a field Subject with the value "This is my first subject", searching in Solr for subject:first should return this document, and the field Subject should also be sortable.

I have read about the option of copying the value to a different field, using one field (tokenized) for searching and the other for sorting. But I am looking for a way to do both on the same field.

Can someone please point me to a way to achieve this?

Thanks and Regards,
Swapna.
RE: Trim and copy a solr field
Hi Juan,

Thanks for the reply. I tried using this, but I don't see any effect of the analyzer/filter.

I tried copying my Solr field to another field of the type defined below. Then I indexed a couple of documents with the new schema, but I see that both fields have the same value. I am looking at the indexed data in Luke.

I am assuming that analyzers process the field value (as specified by the various filters etc.) and then store the modified value. Is that true? What else could I be missing here?

Thanks and Regards,
Swapna.

-----Original Message-----
From: Juan Grande [mailto:juan.gra...@gmail.com]
Sent: Monday, December 12, 2011 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Trim and copy a solr field

Hi Swapna,

You could try using a copyField to a field that uses PatternReplaceFilterFactory:

    <fieldType class="solr.TextField" name="path_location">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*" replacement="$1"/>
      </analyzer>
    </fieldType>

The regular expression may not be exactly what you want, but it will give you an idea of how to do it. I'm pretty sure there must be some other ways of doing this, but this is the first that comes to my mind.

*Juan*

On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.com wrote:

Hi,

I have a Solr field that contains the absolute path of the file that is indexed, which will be something like file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.

I am interested in indexing the location in a separate field. I was looking for some way to trim the field value from the last occurrence of the character /, so that I can get the location value, something like file:/myserver/Folder1/SubFol1/Sub-Fol2, and store it in a new field.

Can you please suggest some way to achieve this?

Thanks and Regards,
Swapna.
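To see what the pattern (.*)/.* in the suggested filter does to a path, the same regular expression can be tried outside Solr. This is a small sketch using Python's re module, not the Solr filter itself; Python writes the group reference as \1 where the filter configuration uses $1, and the function name is made up for this example.

```python
import re

# Same pattern as in the PatternReplaceFilterFactory suggestion.
LAST_SEGMENT = re.compile(r"(.*)/.*")


def strip_last_segment(value):
    # The greedy group runs up to the last "/", so everything after it is dropped.
    return LAST_SEGMENT.sub(r"\1", value)


trimmed = strip_last_segment("file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg")
unchanged = strip_last_segment(r"\\myserver\Folder1\Test.msg")
```

A value with no forward slash, such as a backslash-only Windows path, does not match and is returned unchanged, so such paths would need a different pattern.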
Trim and copy a solr field
Hi,

I have a Solr field that contains the absolute path of the file that is indexed, which will be something like file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.

I am interested in indexing the location in a separate field. I was looking for some way to trim the field value from the last occurrence of the character /, so that I can get the location value, something like file:/myserver/Folder1/SubFol1/Sub-Fol2, and store it in a new field.

Can you please suggest some way to achieve this?

Thanks and Regards,
Swapna.