Re: Multiple Word Facets

2010-10-27 Thread Adam Estrada
Thanks guys, the solr.ShingleFilterFactory did work to get me multiple
terms per facet, but now I am seeing some redundancy in the facet
counts. See below...

Highway (62)
Highway System (59)
National (59)
National Highway (59)
National Highway System (59)
System (59)

See what's going on here? How can I make my multi token facets smarter
so that the tokens aren't duplicated?

Thanks in advance,
Adam
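
The overlap in those counts is roughly what shingling would be expected to produce. Here is a minimal Python sketch, assuming shingles of up to three words with the single words kept as well. The `shingles` helper and the sample titles are hypothetical and only chosen to mirror the numbers above; this is not Solr's implementation.

```python
from collections import Counter


def shingles(text, max_size=3):
    """Return every run of 1..max_size consecutive words in text."""
    words = text.split()
    out = []
    for size in range(1, max_size + 1):
        for start in range(len(words) - size + 1):
            out.append(" ".join(words[start:start + size]))
    return out


# Hypothetical titles: 59 contain the full phrase, 3 contain only "Highway".
titles = ["National Highway System"] * 59 + ["Highway Safety"] * 3

counts = Counter()
for title in titles:
    # A document contributes at most once to each distinct term.
    counts.update(set(shingles(title)))

for term in sorted(counts):
    print(term, counts[term])
```

Every sub-phrase of "National Highway System" occurs in at least the same documents as the full phrase, so the counts match unless the shorter term also shows up elsewhere, as "Highway" does in this made-up data.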

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Facets are generated from indexed terms.

 Depending on your need/use-case:

 You can use an additional, separate string field (which is not tokenized) for
 facets and populate it via copyField. Search on the tokenized field and facet
 on the non-tokenized field.

 Or

 You can add solr.ShingleFilterFactory to your index analyzer to form multiple 
 word terms.
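
As a rough illustration of why faceting on a tokenized field and on an untokenized string field give different results, here is a small Python sketch. It is not Solr code: the sample `titles` and the helper names `facet_on_tokens` and `facet_on_strings` are made up for the example, and real Solr analysis does more than a whitespace split. It only mimics the idea that facet values come from indexed terms, with each term counted once per document.

```python
from collections import Counter

# Hypothetical sample data standing in for the indexed "title" field.
titles = [
    "Eastern Federal Lands Highway Division",
    "Eastern Federal Lands Highway Division",
    "National Highway System",
]


def facet_on_tokens(values):
    """Mimic faceting on a tokenized field: one facet term per word."""
    counts = Counter()
    for value in values:
        # A document contributes at most once to each distinct term.
        counts.update(set(value.split()))
    return counts


def facet_on_strings(values):
    """Mimic faceting on an untokenized field: the whole value is the term."""
    return Counter(values)


token_facets = facet_on_tokens(titles)
string_facets = facet_on_strings(titles)

print(token_facets.most_common())
print(string_facets.most_common())
```

In the tokenized case "Highway" becomes its own facet value, while in the string case only the complete titles appear as facet values.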








Re: Multiple Word Facets

2010-10-27 Thread Adam Estrada
Ahhh...I see! I am testing by crawling a couple of websites with
Nutch, and I am faceting on the title field, which is type=text.
Are you saying that I will need to manually generate the content for
my facet field? I can see the reason for doing it that way, but I
really need my faceting to happen dynamically based on the content of
the field, which in this case is the title of a URL.

Thanks again for all the tips on getting this working for me.

Adam

On Wed, Oct 27, 2010 at 9:19 AM, Jayendra Patil
jayendra.patil@gmail.com wrote:
 The Shingle Filter breaks the words in a sentence into combinations of 2 or 3
 words.

 For the faceting field you should use:
 <field name="facet_field" type="string" indexed="true" stored="true"
 multiValued="true"/>

 The type of the field should be string so that it is not tokenized at all.





Multiple Word Facets

2010-10-26 Thread Adam Estrada
All,
I am new to Solr faceting and am stuck on how to get multiple-word
facets returned from a standard Solr query. See below for what is
currently being returned.

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="title">
      <int name="Federal">89</int>
      <int name="EFLHD">87</int>
      <int name="Eastern">87</int>
      <int name="Lands">87</int>
      <int name="Highways">84</int>
      <int name="FHWA">60</int>
      <int name="Transportation">32</int>
      <int name="GIS">22</int>
      <int name="Planning">19</int>
      <int name="Asset">15</int>
      <int name="Environment">15</int>
      <int name="Management">14</int>
      <int name="Realty">12</int>
      <int name="Highway">11</int>
      <int name="HEP">10</int>
      <int name="Program">9</int>
      <int name="HEPGIS">7</int>
      <int name="Resources">7</int>
      <int name="Roads">7</int>
      <int name="EEI">6</int>
      <int name="Environmental">6</int>
      <int name="Right">6</int>
      <int name="Way">6</int>
      ...etc...

There are many terms in there that are 2 or 3 word phrases. For
example, Eastern Federal Lands Highway Division gets broken down
into the individual words that make up the phrase. I've
seen quite a few websites that do what I am trying to do here, so
any suggestions at this point would be great. See my schema below
(copied from the example schema).

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

Similar for type=query. Please advise on how to group or cluster
document terms so that they can be used as facets.

Many thanks in advance,
Adam Estrada


Importing SlashDot Data

2010-09-17 Thread Adam Estrada
All,

I have a new Windows 7 machine and have been trying to import an RSS feed
like in the SlashDot example that is included in the software. My dataConfig
file looks fine.


<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">

      <field column="source" xpath="/RDF/channel/title"
             commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link"
             commonField="true" />
      <field column="subject" xpath="/RDF/channel/subject"
             commonField="true" />

      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="description" xpath="/RDF/item/description" />
      <field column="creator" xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />
      <field column="date" xpath="/RDF/item/date"
             dateTimeFormat="-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section" xpath="/RDF/item/section" />
      <field column="slash-comments" xpath="/RDF/item/comments" />
    </entity>
  </document>
</dataConfig>
==

And when I choose to perform a full import, absolutely nothing happens. Here
is the debug output.

Sep 17, 2010 4:09:04 PM org.apache.solr.core.SolrCore execute
INFO: [rss] webapp=/solr path=/select
params={start=0&dataConfig=dataConfig%0d
%0a%09dataSource+type%3DHttpDataSource+/%0d%0a%09document%0d%0a%09%09enti
ty+name%3Dslashdot%0d%0a%09%09%09%09pk%3Dlink%0d%0a%09%09%09%09url%3Dhttp:/
/rss.slashdot.org/Slashdot/slashdot
%0d%0a%09%09%09%09processor%3DXPathEntityPr
ocessor%0d%0a%09%09%09%09forEach%3D/RDF/channel+|+/RDF/item%0d%0a%09%09%09%09
transformer%3DDateFormatTransformer%0d%0a%09%09%09%09%0d%0a%09%09%09field+co
lumn%3Dsource+xpath%3D/RDF/channel/title+commonField%3Dtrue+/%0d%0a%09%09
%09field+column%3Dsource-link+xpath%3D/RDF/channel/link+commonField%3Dtrue
+/%0d%0a%09%09%09field+column%3Dsubject+xpath%3D/RDF/channel/subject+comm
onField%3Dtrue+/%0d%0a%09%09%09%0d%0a%09%09%09field+column%3Dtitle+xpath%3
D/RDF/item/title+/%0d%0a%09%09%09field+column%3Dlink+xpath%3D/RDF/item/li
nk+/%0d%0a%09%09%09field+column%3Ddescription+xpath%3D/RDF/item/descriptio
n+/%0d%0a%09%09%09field+column%3Dcreator+xpath%3D/RDF/item/creator+/%0d%
0a%09%09%09field+column%3Ditem-subject+xpath%3D/RDF/item/subject+/%0d%0a%0
9%09%09field+column%3Ddate+xpath%3D/RDF/item/date+dateTimeFormat%3D-MM
-dd'T'hh:mm:ss+/%0d%0a%09%09%09field+column%3Dslash-department+xpath%3D/RD
F/item/department+/%0d%0a%09%09%09field+column%3Dslash-section+xpath%3D/RD
F/item/section+/%0d%0a%09%09%09field+column%3Dslash-comments+xpath%3D/RDF/
item/comments+/%0d%0a%09%09/entity%0d%0a%09/document%0d%0a/dataConfig%0d
%0a&verbose=on&command=full-import&debug=on&qt=/dataimport&rows=10} status=0
QTime=293

Can someone please explain what might be going on here? What's with all the
%0d%0a%09%09's?

Thanks in advance,
Adam
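
On the last question: the `%0d%0a%09` runs are standard URL percent-encoding, where `%0d%0a` is a carriage return plus line feed and `%09` is a tab, and `+` stands for a space in a query string. The config text appears to have been URL-encoded when it was sent as the `dataConfig` request parameter. Here is a small Python sketch that decodes one fragment taken from the log above (only the variable names are invented):

```python
from urllib.parse import unquote_plus

# Fragment copied from the logged request parameters (line wrap removed).
encoded = "%0d%0a%09%09%09field+column%3Dtitle+xpath%3D/RDF/item/title+/%0d%0a"

# unquote_plus turns %XX escapes back into characters and "+" into a space.
decoded = unquote_plus(encoded)
print(repr(decoded))
```

This only accounts for the encoding seen in the log. It does not by itself explain why the import did nothing.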

