Re: Multiple Word Facets
Thanks guys, the solr.ShingleFilterFactory did work to get me multiple terms per facet, but now I am seeing some redundancy in the facet counts. See below...

    Highway (62)
    Highway System (59)
    National (59)
    National Highway (59)
    National Highway System (59)
    System (59)

See what's going on here? How can I make my multi-token facets smarter so that the tokens aren't duplicated?

Thanks in advance,
Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan iori...@yahoo.com wrote:

Facets are generated from indexed terms. Depending on your need/use case:

You can use an additional, separate string field (which is not tokenized) for facets and populate it via copyField. Search on the tokenized field; facet on the non-tokenized field.

Or

You can add solr.ShingleFilterFactory to your index analyzer to form multiple-word terms.
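The overlapping counts reported in this reply are what shingling would be expected to produce. Here is a minimal Python sketch of the idea, not Solr's actual implementation. It assumes single words are kept alongside shingles of up to three words, and the sample titles are hypothetical, not taken from the real index.

```python
from collections import Counter


def shingles(tokens, min_size=1, max_size=3):
    """Return every run of min_size..max_size adjacent tokens, joined by spaces."""
    out = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def facet_counts(titles):
    """Count, per term, how many titles contain it (one vote per document)."""
    counts = Counter()
    for title in titles:
        counts.update(set(shingles(title.split())))
    return counts


# Hypothetical sample titles, not taken from the real index.
titles = ["National Highway System", "National Highway System Map", "Highway Safety"]
counts = facet_counts(titles)
for term in ["Highway", "Highway System", "National", "National Highway",
             "National Highway System", "System"]:
    print(term, counts[term])
```

Every title containing "National Highway System" also contributes one count to each shorter run inside it, so the sub-phrases end up with at least the same count as the full phrase.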
Re: Multiple Word Facets
Ahhh...I see! I am doing my testing by crawling a couple of websites using Nutch, and in doing so I am assigning my facets to the title field, which is type=text. Are you saying that I will need to manually generate the content for my facet field? I can see the reason and need for doing it that way, but I really need my faceting to happen dynamically based on the content of the field, which in this case is the title of a URL.

Thanks again for all the tips on getting this working for me.

Adam

On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil jayendra.patil@gmail.com wrote:

The Shingle Filter breaks the words in a sentence into combinations of 2/3 words. For a faceting field you should use:

    <field name="facet_field" type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be string so that it is not tokenised at all.
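To illustrate the difference between faceting on a tokenized field and on an untokenized string field, here is a small Python sketch. It only imitates the counting and does not call Solr. The titles are hypothetical stand-ins for crawled page titles, and whitespace splitting stands in for the real analyzer chain.

```python
from collections import Counter

# Hypothetical titles standing in for crawled page titles.
titles = [
    "Eastern Federal Lands Highway Division",
    "Eastern Federal Lands Highway Division",
    "Federal Highway Administration",
]


def facet_tokenized(values):
    """Facet on a whitespace-tokenized field: each word is its own term."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split()))
    return counts


def facet_string(values):
    """Facet on an untokenized string field: the whole value is one term."""
    return Counter(values)


tokenized = facet_tokenized(titles)
whole = facet_string(titles)
print(tokenized.most_common(3))
print(whole.most_common())
```

Since Ahmet's suggestion was to populate the string field via copyField, the facet values would still follow the title content without being generated by hand.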
Multiple Word Facets
All,

I am new to Solr faceting and stuck on how to get multiple-word facets returned from a standard Solr query. See below for what is currently being returned.

    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="title">
          <int name="Federal">89</int>
          <int name="EFLHD">87</int>
          <int name="Eastern">87</int>
          <int name="Lands">87</int>
          <int name="Highways">84</int>
          <int name="FHWA">60</int>
          <int name="Transportation">32</int>
          <int name="GIS">22</int>
          <int name="Planning">19</int>
          <int name="Asset">15</int>
          <int name="Environment">15</int>
          <int name="Management">14</int>
          <int name="Realty">12</int>
          <int name="Highway">11</int>
          <int name="HEP">10</int>
          <int name="Program">9</int>
          <int name="HEPGIS">7</int>
          <int name="Resources">7</int>
          <int name="Roads">7</int>
          <int name="EEI">6</int>
          <int name="Environmental">6</int>
          <int name="Right">6</int>
          <int name="Way">6</int>
          ...etc...

Many of the terms in there come from 2 or 3 word phrases. For example, "Eastern Federal Lands Highway Division" gets broken down into the individual words that make up the phrase. I've seen quite a few websites that do what I am trying to do here, so any suggestions at this point would be great. See my schema below (copied from the example schema).

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Similar for type=query. Please advise on how to group or cluster document terms so that they can be used as facets.

Many thanks in advance,
Adam Estrada
Importing SlashDot Data
All,

I have a new Windows 7 machine and have been trying to import an RSS feed, as in the SlashDot example that is included with the software. My dataConfig file looks fine.

    <dataConfig>
      <dataSource type="HttpDataSource" />
      <document>
        <entity name="slashdot"
                pk="link"
                url="http://rss.slashdot.org/Slashdot/slashdot"
                processor="XPathEntityProcessor"
                forEach="/RDF/channel | /RDF/item"
                transformer="DateFormatTransformer">

          <field column="source" xpath="/RDF/channel/title" commonField="true" />
          <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
          <field column="subject" xpath="/RDF/channel/subject" commonField="true" />

          <field column="title" xpath="/RDF/item/title" />
          <field column="link" xpath="/RDF/item/link" />
          <field column="description" xpath="/RDF/item/description" />
          <field column="creator" xpath="/RDF/item/creator" />
          <field column="item-subject" xpath="/RDF/item/subject" />
          <field column="date" xpath="/RDF/item/date" dateTimeFormat="-MM-dd'T'hh:mm:ss" />
          <field column="slash-department" xpath="/RDF/item/department" />
          <field column="slash-section" xpath="/RDF/item/section" />
          <field column="slash-comments" xpath="/RDF/item/comments" />
        </entity>
      </document>
    </dataConfig>

When I choose to perform a full import, absolutely nothing happens. Here is the debug output.
    Sep 17, 2010 4:09:04 PM org.apache.solr.core.SolrCore execute
    INFO: [rss] webapp=/solr path=/select params={start=0&dataConfig=dataConfig%0d%0a%09dataSource+type%3DHttpDataSource+/%0d%0a%09document%0d%0a%09%09entity+name%3Dslashdot%0d%0a%09%09%09%09pk%3Dlink%0d%0a%09%09%09%09url%3Dhttp://rss.slashdot.org/Slashdot/slashdot%0d%0a%09%09%09%09processor%3DXPathEntityProcessor%0d%0a%09%09%09%09forEach%3D/RDF/channel+|+/RDF/item%0d%0a%09%09%09%09transformer%3DDateFormatTransformer%0d%0a%09%09%09%09%0d%0a%09%09%09field+column%3Dsource+xpath%3D/RDF/channel/title+commonField%3Dtrue+/%0d%0a%09%09%09field+column%3Dsource-link+xpath%3D/RDF/channel/link+commonField%3Dtrue+/%0d%0a%09%09%09field+column%3Dsubject+xpath%3D/RDF/channel/subject+commonField%3Dtrue+/%0d%0a%09%09%09%0d%0a%09%09%09field+column%3Dtitle+xpath%3D/RDF/item/title+/%0d%0a%09%09%09field+column%3Dlink+xpath%3D/RDF/item/link+/%0d%0a%09%09%09field+column%3Ddescription+xpath%3D/RDF/item/description+/%0d%0a%09%09%09field+column%3Dcreator+xpath%3D/RDF/item/creator+/%0d%0a%09%09%09field+column%3Ditem-subject+xpath%3D/RDF/item/subject+/%0d%0a%09%09%09field+column%3Ddate+xpath%3D/RDF/item/date+dateTimeFormat%3D-MM-dd'T'hh:mm:ss+/%0d%0a%09%09%09field+column%3Dslash-department+xpath%3D/RDF/item/department+/%0d%0a%09%09%09field+column%3Dslash-section+xpath%3D/RDF/item/section+/%0d%0a%09%09%09field+column%3Dslash-comments+xpath%3D/RDF/item/comments+/%0d%0a%09%09/entity%0d%0a%09/document%0d%0a/dataConfig%0d%0a&verbose=on&command=full-import&debug=on&qt=/dataimport&rows=10} status=0 QTime=293

Can someone please explain what might be going on here? What's with all the %0d%0a%09%09's?

Thanks in advance,
Adam
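On the %0d%0a%09 question: those sequences look like percent-encoded control characters, namely carriage return (%0d), line feed (%0a) and tab (%09). In other words, they are the line breaks and indentation of the dataConfig text as it was submitted in the request. A minimal Python sketch that decodes a short fragment copied from the log above (this only illustrates the encoding and does not explain why the import does nothing):

```python
from urllib.parse import unquote_plus

# A short fragment copied from the logged dataConfig parameter.
fragment = "%0d%0a%09%09%09field+column%3Dtitle"

# unquote_plus turns %XX escapes back into characters and "+" into a space.
decoded = unquote_plus(fragment)
print(repr(decoded))  # '\r\n\t\t\tfield column=title'
```

So the log shows the config with its whitespace escaped, plus "+" for spaces and %3D for "=".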