Spellchecking - Is there a way to do this?

2009-12-06 Thread Germán Biozzoli
Hello everybody

1. I have tons of digitized text with the usual errors from the OCR process.
2. I have indexed it with Solr and that is working OK.
3. I have added an index-based spellchecker for words and phrases, hoping
to offer suggestions for alternate or related query expressions. The
intention is to find documents that contain the original expression
with OCR errors (for example, the user searches for "state and
democracy" and the interface offers "stete and demcraci" as an
alternate query expression).

My first problem is that I need suggestions even when the expression
has returned results. It seems that suggestions only appear when
there are no results. Is there a way to do this?

The second question is: for the purposes I've mentioned, is it best
to use the spellchecker or the mlt (MoreLikeThis) component? Or
something else (such as a fuzzy query)?

Thanks a lot
German
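
A minimal sketch of the two options raised above, written in Python purely for illustration. The host, port and the helper names (spellcheck_params, fuzzy_query) are assumptions, not part of the original post. The spellcheck.* parameter names are those of Solr's SpellCheckComponent, and term~0.7 is the Lucene fuzzy-query syntax with a minimum similarity. As far as I know, spellcheck.onlyMorePopular only returns suggestions that are more frequent than the original term, so for rare OCR variants such as "stete" the fuzzy query may be the closer fit.

```python
from urllib.parse import urlencode, parse_qs

# Assumed location of the Solr select handler; adjust to the real deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"

def spellcheck_params(query, count=5):
    """Request parameters that ask for suggestions alongside normal results."""
    return {
        "q": query,
        "spellcheck": "true",
        "spellcheck.q": query,
        "spellcheck.count": str(count),
        # Ask for suggestions even when the term exists in the index.
        "spellcheck.onlyMorePopular": "true",
        "spellcheck.collate": "true",
    }

def fuzzy_query(field, words, min_similarity=0.7):
    """Build a Lucene fuzzy query so near-miss spellings also match."""
    return " AND ".join(f"{field}:{w}~{min_similarity}" for w in words)

url = SOLR_SELECT + "?" + urlencode(spellcheck_params("text:state"))
fuzzy = fuzzy_query("text", ["state", "democracy"])
print(url)
print(fuzzy)
```

The fuzzy form matches documents directly, so no second query is needed to reach the OCR-damaged variants.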


Re: Problem with Query Parser

2009-10-18 Thread Germán Biozzoli
Thanks Ahmet. Using the analysis page, the English Porter filter
definitely appears to be the killer ;)
Regards
German

On Sun, Oct 18, 2009 at 7:30 AM, AHMET ARSLAN iori...@yahoo.com wrote:

 Hi everybody

 I have a simple but (for me) annoying problem. I'm a happy user of Solr
 1.4 with a small collection of documents. Today one of the users
 reported that a query returns documents that are not relevant to the
 expression. I have Spanish, Portuguese and English text in the
 collection. Using the Solr administration interface I found that
 she was right: if I search for the Spanish term "represion", Solr
 matches only the word root, I mean it returns every document with the
 term "repres". Using the admin debug search I found this:


 <lst name="debug">
 <str name="rawquerystring">description:represion</str>
 <str name="querystring">description:represion</str>
 <str name="parsedquery">description:repres</str>
 <str name="parsedquery_toString">description:repres</str>

 The "ion" part of the term was deleted by the query parser. The first
 question is: I don't know where I should look to correct this, in
 schema.xml or in solrconfig.xml.

 The only thing that looks suspicious to me is the EnglishPorter.

 Yes, you are right. The "ion" part of the term was deleted by it. You can
 verify this using the /admin/analysis.jsp page. It will tell you which
 TokenFilterFactory removes it.

 I've deleted it from the configuration but nothing changes. Should
 I reindex the collection to see the changes?

 Yes, re-indexing is necessary.

 Should I also delete it from the index section?

 You should remove the English Porter filter from both the query and index
 analyzers.

 What will I lose by deleting the English Porter filter?

 You will lose stemming functionality. But since you have Spanish, Portuguese
 and English documents, using the English Porter filter for all of them is not
 meaningful.
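
 To see why an English stemmer turns "represion" into "repres", here is a minimal Python sketch of just one rule from the Porter algorithm: in step 4, a trailing "ion" is removed when the remaining stem ends in "s" or "t" and has a measure greater than 1. This is not the full stemmer and not Solr's code; the function names (is_consonant, measure, strip_ion) are made up for illustration.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; 'y' is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-consonant sequences (m) in the form [C](VC)^m[V]."""
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind  # collapse runs of the same kind
    return pattern.count("VC")

def strip_ion(word):
    """Apply only the step-4 'ion' rule of the Porter stemmer."""
    if word.endswith("ion"):
        stem = word[:-3]
        if stem and stem[-1] in "st" and measure(stem) > 1:
            return stem
    return word

for w in ["represion", "adoption", "nation"]:
    print(w, "->", strip_ion(w))
```

 The rule was designed for English words such as "adoption", but it fires equally on the Spanish "represion" once the accent filter has removed the accent.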


Problem with Query Parser

2009-10-17 Thread Germán Biozzoli
Hi everybody

I have a simple but (for me) annoying problem. I'm a happy user of Solr
1.4 with a small collection of documents. Today one of the users
reported that a query returns documents that are not relevant to the
expression. I have Spanish, Portuguese and English text in the
collection. Using the Solr administration interface I found that
she was right: if I search for the Spanish term "represion", Solr
matches only the word root, I mean it returns every document with the
term "repres". Using the admin debug search I found this:


<lst name="debug">
<str name="rawquerystring">description:represion</str>
<str name="querystring">description:represion</str>
<str name="parsedquery">description:repres</str>
<str name="parsedquery_toString">description:repres</str>

The "ion" part of the term was deleted by the query parser. The first
question is: I don't know where I should look to correct this, in
schema.xml or in solrconfig.xml.

In the schema, description is:

<field name="description" type="text" indexed="true"
       multiValued="true" stored="true"/>

and text is:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

</fieldtype>

The only thing that looks suspicious to me is the EnglishPorter. I've
deleted it from the configuration but nothing changes. Should I reindex
the collection to see the changes? Should I also delete it from the index
section? What will I lose by deleting the English Porter filter?

Thanks a lot for the help
German
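
A toy Python model of why changing only the query analyzer "changes nothing" until the collection is re-indexed: the terms already stored in the index were produced by the old analyzer. Everything here (toy_stem, analyze, build_index, search, the sample documents) is a hypothetical stand-in for illustration, not Solr code.

```python
from collections import defaultdict

def toy_stem(token):
    # Stand-in for the English Porter filter: only the behaviour seen in
    # this thread (a trailing "ion" dropped after "s").
    return token[:-3] if token.endswith("sion") else token

def analyze(text, stem):
    """Lowercase, split on whitespace, optionally stem."""
    tokens = text.lower().split()
    return [toy_stem(t) if stem else t for t in tokens]

def build_index(docs, stem):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, stem):
            index[term].add(doc_id)
    return index

def search(index, query, stem):
    """Return ids of documents that contain every analyzed query term."""
    hits = None
    for term in analyze(query, stem):
        found = index.get(term, set())
        hits = found if hits is None else hits & found
    return hits or set()

docs = {1: "represion politica", 2: "informe anual"}

old_index = build_index(docs, stem=True)        # indexed with the stemmer
mismatch = search(old_index, "represion", stem=False)

new_index = build_index(docs, stem=False)       # re-indexed without it
match = search(new_index, "represion", stem=False)
print(mismatch, match)
```

With the stemmer removed only at query time, the query term "represion" no longer lines up with the stored term "repres"; after re-indexing, both sides agree again.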


Re: Newbie problem ordering results

2009-08-11 Thread Germán Biozzoli
Sure

<fieldtype name="string" class="solr.StrField" sortMissingLast="true"
           omitNorms="true"/>

The strange thing is that I can sort by another field that is
defined as string, but not by this one, which is copied as string
from a tokenized field.

I attach the schema.xml in case there is another error there. The
error log says the following:

INFO: UnInverted multi-valued field
{field=date,memSize=70356,tindexSize=40,time=381,phase1=381,nTerms=99,bigTerms=5,termInstances=4330,uses=0}
11/08/2009 12:42:31 org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field
{field=local_medium,memSize=70088,tindexSize=56,time=10,phase1=10,nTerms=30,bigTerms=2,termInstances=2461,uses=0}
11/08/2009 12:42:31 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={rows=20&wt=json&facet.field=contributorfacet&facet.field=subjectfacet&facet.field=provenance&facet.field=local_state&facet.field=date&facet.field=local_medium&facet.limit=15&q=text:fisica&start=0&facet.mincount=1&fl=id,title,contributor,subject,provenance,date,coverage,publisher,score,local_state,local_url&sort=score+desc&facet=true}
hits=312 status=0 QTime=4963
11/08/2009 12:51:46 org.apache.solr.core.SolrCore execute

*** This is the sort by date desc, which works OK; the field is defined as string

INFO: [] webapp=/solr path=/select
params={rows=20&wt=json&q=text:fisica&start=0&sort=date+desc&fl=id,title,contributor,subject,provenance,date,coverage,publisher,score,local_state,local_url}
hits=312 status=0 QTime=174


11/08/2009 12:52:38 org.apache.solr.common.SolrException log
GRAVE: java.lang.RuntimeException: there are more terms than documents
in field contributororder, but it's impossible to sort on tokenized
fields
at org.apache.lucene.search.FieldCacheImpl$8.createValue(FieldCacheImpl.java:518)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:81)
at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:491)
at org.apache.solr.search.MissingLastOrdComparator.setNextReader(MissingStringLastComparatorSource.java:181)
at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:92)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:242)
at org.apache.lucene.search.Searcher.search(Searcher.java:173)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at

Thanks a lot
German
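
A simplified Python model of the check behind the "more terms than documents" error, as I understand it: the sort cache keeps one term slot per document, so it fails once the field holds more distinct terms than there are documents, which can happen when a field is tokenized or multiValued (as contributororder is declared here). The function name and sample data are hypothetical; this is a sketch, not Lucene's implementation.

```python
def build_sort_index(docs):
    """docs: one list per document, holding that document's values in the sort field.

    Returns one sort ordinal per document, or raises when the field has
    more distinct terms than documents.
    """
    terms = sorted({value for values in docs for value in values})
    if len(terms) > len(docs):
        raise RuntimeError(
            "there are more terms than documents in the field; "
            "cannot sort on a multi-valued or tokenized field"
        )
    order = {term: i for i, term in enumerate(terms)}
    # Only one ordinal fits per document; with several values the
    # largest term wins in this model, and -1 marks a missing value.
    return [max((order[v] for v in values), default=-1) for values in docs]

single_valued = [["Borges"], ["Cortazar"], ["Arlt"]]
multi_valued = [["Borges", "Bioy Casares"], ["Cortazar"]]

print(build_sort_index(single_valued))
try:
    build_sort_index(multi_valued)
except RuntimeError as err:
    print("error:", err)
```

Under this model a multi-valued field with few distinct terms would not trigger the error, which may be why sorting by date works here even though the log reports it as multi-valued (nTerms=99). Copying into a single-valued string field for sorting would avoid the problem.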


On Tue, Aug 11, 2009 at 1:56 AM, Avlesh Singh avl...@gmail.com wrote:
 Can you please post the fieldType definition for the string field in your
 schema.xml?

 Cheers
 Avlesh

 On Tue, Aug 11, 2009 at 9:52 AM, Germán Biozzoli
 germanbiozz...@gmail.com wrote:

 Hello everybody

 I have the following (abridged) schema:

 
    <field name="title" type="text" indexed="true" stored="true"
           multiValued="true"/>
    <field name="titleorder" type="string" indexed="true" stored="true"
           multiValued="true"/>
    <field name="contributor" type="text" indexed="true" stored="true"
           multiValued="true"/>
    <field name="contributorfacet" type="textFacetN" indexed="true"
           stored="true" multiValued="true"/>
    <field name="contributororder" type="string" indexed="true"
           stored="true" multiValued="true"/>
 .
 
 <copyField source="title" dest="text" />
 <copyField source="title" dest="titleorder" />
 <copyField source="contributor" dest="text" />
 <copyField source="contributor" dest="contributorfacet" />
 <copyField source="contributor" dest="contributororder" />
 ...

 For instance, I use contributor for searching, contributorfacet for
 faceting and contributororder for ordering results, but when I try to
 sort by contributororder, Solr says that it cannot sort on a tokenized
 field...(?)

 I'm using a Solr 1.4 nightly. Is this a bug? I believe I had this
 working in previous versions...

 Regards and thanks
 Germán


<?xml version="1.0" encoding="UTF-8" ?>
<schema name="Test" version="1.1">
  <types>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldtype name="long" class="solr.LongField" omitNorms="true"/>
    <fieldtype name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldtype name="double" class="solr.DoubleField" omitNorms="true"/>
    <fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
    <fieldtype name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <fieldtype name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldtype>

Newbie problem ordering results

2009-08-10 Thread Germán Biozzoli
Hello everybody

I have the following (abridged) schema:


<field name="title" type="text" indexed="true" stored="true"
       multiValued="true"/>
<field name="titleorder" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="contributor" type="text" indexed="true" stored="true"
       multiValued="true"/>
<field name="contributorfacet" type="textFacetN" indexed="true"
       stored="true" multiValued="true"/>
<field name="contributororder" type="string" indexed="true"
       stored="true" multiValued="true"/>
.

<copyField source="title" dest="text" />
<copyField source="title" dest="titleorder" />
<copyField source="contributor" dest="text" />
<copyField source="contributor" dest="contributorfacet" />
<copyField source="contributor" dest="contributororder" />
...

For instance, I use contributor for searching, contributorfacet for
faceting and contributororder for ordering results, but when I try to
sort by contributororder, Solr says that it cannot sort on a tokenized
field...(?)

I'm using a Solr 1.4 nightly. Is this a bug? I believe I had this
working in previous versions...

Regards and thanks
Germán