Re: Better highlighting fragmenter

2007-01-03 Thread Michael Imbeault
I for one would be interested in such a fragmenter, as the default one 
is lacking and doesn't produce acceptable results for most applications.


Michael

Mike Klaas wrote:

I've written an unpolished custom fragmenter for highlighting which is
more expensive than the BasicFragmenter that ships with Lucene, but
generates more natural candidate fragments (it will tend to produce
beginnings/ends of sentences).
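
For context, a sentence-aware fragmenter is a fairly small amount of code
against the Lucene highlighter's Fragmenter interface. Below is a minimal,
untested sketch of the general idea (this is not Mike's code; the
TARGET_SIZE constant and the punctuation heuristic are illustrative
assumptions):

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.search.highlight.Fragmenter;

    public class SentenceFragmenter implements Fragmenter {
        private static final int TARGET_SIZE = 100; // rough fragment length, in chars
        private String text;
        private int fragmentStart;

        public void start(String originalText) {
            text = originalText;
            fragmentStart = 0;
        }

        public boolean isNewFragment(Token token) {
            // Keep growing the current fragment until it reaches the target size...
            if (token.endOffset() - fragmentStart < TARGET_SIZE) {
                return false;
            }
            // ...then break only where the text just before this token ends a sentence.
            int i = token.startOffset() - 1;
            while (i > 0 && Character.isWhitespace(text.charAt(i))) {
                i--;
            }
            if (i < 0) {
                return false;
            }
            char c = text.charAt(i);
            boolean sentenceEnd = (c == '.' || c == '!' || c == '?');
            if (sentenceEnd) {
                fragmentStart = token.startOffset();
            }
            return sentenceEnd;
        }
    }

(A real implementation would also force a break at some maximum size so a
fragment can't grow unboundedly when no punctuation is found.)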

Would there be interest in the community in releasing it and/or
including it in Solr?

-Mike
   




Re: Spellchecker in Solr

2006-12-07 Thread Michael Imbeault
I was at the origin of the thread you mentioned; I still haven't made any 
progress toward integrating a spell suggestion function in Solr; but 
then again, I'm a Java and Lucene novice (though I'm learning fast 
thanks to all the help on the mailing list!). By all means, if you think 
you can do this, share with the community; to me it's the last 'must 
have' feature that would make Solr perfect out of the box (it's still 
awesome without this, mind you!).


I think the option you describe is the easiest / best one to implement.

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Otis Gospodnetic wrote:

Hi,

A month ago, the topic of a spell checker in Solr came up (c.f. 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html ).

Has anyone made any progress with that?  If not, I'll have to do this to 
scratch my own itch.
Because I'm in a hurry with this, I think I will go with the "chop terms into 
n-grams in the client, and send the term + the n-grams to Solr for indexing", as 
described here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg01264.html .
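
For reference, the client-side chopping is only a few lines; here is an
untested sketch (the word/gram3 field naming follows the didyoumean article
referenced in that earlier thread and is an assumption, not an existing Solr
convention):

    import java.util.ArrayList;
    import java.util.List;

    public class NGrams {
        // Chop a term into letter n-grams: ngrams("hepatitis", 3)
        // -> [hep, epa, pat, ati, tit, iti, tis]
        public static List<String> ngrams(String term, int n) {
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + n <= term.length(); i++) {
                grams.add(term.substring(i, i + n));
            }
            return grams;
        }
    }

Each term would then be posted to the spelling index as a document along the
lines of <doc><field name="word">hepatitis</field><field
name="gram3">hep</field>...</doc>, and misspelled inputs are n-grammed the
same way at query time so they match words sharing many grams.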

I will then query this index for alternative spelling suggestions just like I'd 
query any other Solr instance (the idea being I'd search this index in parallel 
with the search of the index with the actual data I want to find).  I will not, 
at this time, modify or write any spell checker request handlers that add 
spelling suggestions to the response.

If anyone has any comments not covered in that thread above, I'm all eyes.

Otis




  


Re: Solr and Oracle

2006-11-23 Thread Michael Imbeault
I index documents I have in a MySQL database via XML. You can build your 
XML documents on the fly with the data from your database and index 
that, no problem at all.
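
For anyone wondering what "build your XML documents on the fly" amounts to,
here is a minimal, untested Java sketch of the round trip (the JDBC URL,
table and column names are made up for illustration; the update URL assumes
the example Jetty port):

    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.sql.*;

    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mydb", "user", "pass");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles");
            StringBuffer xml = new StringBuffer("<add>");
            while (rs.next()) {
                xml.append("<doc>");
                xml.append("<field name=\"id\">").append(escape(rs.getString("id"))).append("</field>");
                xml.append("<field name=\"title\">").append(escape(rs.getString("title"))).append("</field>");
                xml.append("<field name=\"body\">").append(escape(rs.getString("body"))).append("</field>");
                xml.append("</doc>");
            }
            xml.append("</add>");
            post(xml.toString(), new URL("http://localhost:8983/solr/update"));
            post("<commit/>", new URL("http://localhost:8983/solr/update"));
            con.close();
        }

        // Minimal XML escaping: & first, then the angle brackets.
        static String escape(String s) {
            return s.replaceAll("&", "&amp;").replaceAll("<", "&lt;").replaceAll(">", "&gt;");
        }

        static void post(String body, URL url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStreamWriter w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
            w.write(body);
            w.close();
            if (conn.getResponseCode() != 200) {
                throw new RuntimeException("update failed");
            }
        }
    }

In practice you would batch the rows (say, 1000 docs per <add>) rather than
building one giant buffer for a huge table.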


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Nicolas St-Laurent wrote:

Hi,

Does anyone use Solr to index/search a database instead of XML 
documents? I've searched for information about this and can't find any. 
Currently I index huge Oracle tables with Lucene using a custom-made 
indexer/search engine, but I would prefer to use Solr instead. If 
someone can give me a hint on how to do this, I would appreciate it.


Thanks,

Nicolas St-Laurent



Re: Index & search questions; special cases

2006-11-18 Thread Michael Imbeault


CommonGrams itself seems to have some other dependencies on Nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies.  If you
extract that class into some more specifically named "CommonGramsFilter",
all you need after that to use it in Solr is a simple little
"FilterFactory" so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...  
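
A rough, untested sketch of the kind of factory Hoss describes might look
like the following; the BaseTokenFilterFactory base class and the
CommonGramsFilter(input, words) constructor are assumptions here, and a real
version would load the word list the same way StopFilterFactory does rather
than straight from the filesystem:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    public class CommonGramsFilterFactory extends BaseTokenFilterFactory {
        private Set commonWords;

        public void init(Map<String, String> args) {
            super.init(args);
            boolean ignoreCase = "true".equals(args.get("ignoreCase"));
            List<String> words = new ArrayList<String>();
            try {
                // Read the word list named by the "words" init param, one word per line.
                BufferedReader in = new BufferedReader(new FileReader(args.get("words")));
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.length() > 0) {
                        words.add(line);
                    }
                }
                in.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            commonWords = StopFilter.makeStopSet(words.toArray(new String[0]), ignoreCase);
        }

        public TokenStream create(TokenStream input) {
            // CommonGramsFilter is the class extracted from Nutch's CommonGrams.
            return new CommonGramsFilter(input, commonWords);
        }
    }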


Chris, thanks for the tips (or should I say, detailed explanation!). I 
actually got it working! It was a pain at first (never did any Java, and 
all this ant, junit, war, jar, .class business is confusing!). I had 
some compile errors that I cleaned up. Playing around with the filter in 
the admin panel analyzer yields the expected results; I can't thank you 
enough for your help. I now use:




<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="0" catenateNumbers="0" 
catenateAll="0"/>
<filter class="solr.CommonGramsFilterFactory" 
words="stopwords-complete.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>


And it works perfectly.

If Solr is interested in the filter, just tell me (and tell me how I 
should go about contributing it).


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the "create" method should return
a new "CommonGramsFilter" instead of a StopFilter.

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.
  


Re: Index & search questions; special cases

2006-11-13 Thread Michael Imbeault

Hello everyone,

Thanks for all your answers; synonym-based approaches won't work 
because the medical / research field is evolving way too fast; the list 
would become unmaintainable very quickly, and it would be huge. Anyway, 
I can't rely on score because I'm sorting by date, so I need to 
eliminate the 'hiv' in one part of the doc and '1' in another part 
problem completely (if I want docs that match HIV-1, or Polymyxin B, or 
hepatitis A, I don't want docs that match 'A patient was cured of 
hepatitis C' when I search for 'hepatitis a').

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.
  
Is this a filter that I could implement easily in Solr? I've never done 
Java, but it can't be that complicated, I guess. Any help would be 
appreciated.



That reminds me ... I seem to remember someone saying once that Nutch also
builds word-based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".
  


This looks like exactly what I'm looking for. Is it related to the above 
'Nutch pre-filtering'? This way, if I stopword single letters and 
numbers, it would still index 'hepatitis_a' as a single token, and match 
a search on 'hepatitis a' (non-phrase search) without hitting 'a patient 
has hepatitis'? I guess I'd have to apply the filter to the query too, 
so it turns the query into hepatitis_a?
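
To make the common-grams scheme concrete, it would index roughly the
following (the exact tokens depend on the common-word list and the
separator; this is an illustration, not Nutch's literal output):

    text:    the dog in the house
    indexed: the_dog  dog  dog_in  in_the  the_house  house

Stop words like 'the' and 'in' never appear as standalone tokens, but every
bigram containing one is indexed; applied at query time too, 'hepatitis a'
likewise reduces to the single token 'hepatitis_a'.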


Basically, it's another way of doing what I proposed as a solution - 
rewrite the query to include phrase queries when you find a stopword, if 
you index them anyway. Still, this solution looks better, as the size of the 
index would probably be smaller than if I didn't stopword single letters 
at all? For reference, what I proposed was:


My thought is to parse the user query and rephrase it to do phrase 
searches on nearby terms containing single letters / numbers. If a 
user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND 
hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?
Any chance at all this kind of filter gets implemented in Solr? If 
not, indications on how to do it myself would be appreciated - I can't 
say I have a clue right now (I never did Java; the only Lucene programming 
I've done was via a PHP bridge).


Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212





Re: Sentence level searching

2006-11-12 Thread Michael Imbeault
So basically it's just as I thought it was, thanks for the help :) I had 
checked the wiki before asking, but it lacks details and is often vague, 
or presupposes knowledge of some specific terms without 
explaining them. It's all clear now, thanks to you ;)


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Chris Hostetter wrote:

: Thanks for the answer Yonik; I forgot about multivalued fields! I'm not
: exactly sure of how to add multiple values to a single field (aside from
: copyField). The code I'm thinking of using:

If you look at the exampledocs, "features" and "cat" are both multivalued
fields... you just list multiple <field>s with the same name in your <doc>.

: Field in schema.xml : <field name="abstract" type="text" indexed="true"
: stored="false" multivalued="true" />
:
: Where am I supposed to configure the value of the gap?
: positionIncrementGap in the fieldtype definition is my guess, but I'm

correct.

: not sure. Also, am I supposed to put multivalued in the fieldtype
: definition? Alternatively, could I put positionIncrementGap in the
: <field> that I posted just above?

I *think* positionIncrementGap has to be set on the fieldtype ... but
I'm not 100% certain of that.

multiValued and the other field attributes (indexed, stored,
compressed, omitNorms) can be set on the field or inherited from the
fieldtype.

More info can be found in the comments of the example schema.xml, as well
as these wiki pages...

http://wiki.apache.org/solr/SchemaXml
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


-Hoss

  


Re: Index & search questions; special cases

2006-11-12 Thread Michael Imbeault

Chris Hostetter wrote:

A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or if you do
use it in both cases, but configure it differently)
  

For clarification, I'm using the filter both at index and query time.

Have you by any chance played with the "Analysis" page on your Solr index?

http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
"debugQuery=on" option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.
  
Will try that, played with it in the past, but not for this particular 
problem, good idea :)

: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If an user
: search for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart, and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)
  
Sadly I can't rely on users' smartness for this :) I have concerns that 
for stuff like Hepatitis A, it will match just about every document 
containing hepatitis and the very common 'a' word, anywhere in the 
document. I can't stopword single letters, because then there would be no 
way to find documents about 'hepatitis c' and not about 'hepatitis b', 
for example. I will test my solution and report; if you have any other 
ideas, just tell me.


And thanks for the help! :)



Re: Sentence level searching

2006-11-12 Thread Michael Imbeault

Hello everyone,

Solr puts a configurable gap between values of the same field, so you
could index every sentence as a separate value of a multi-valued
field.
Thanks for the answer Yonik; I forgot about multivalued fields! I'm not 
exactly sure of how to add multiple values to a single field (aside from 
copyField). The code I'm thinking of using:


   PHP code to build the XML:

   // $sentences: the abstract split into sentences beforehand
   foreach ($sentences as $sentence) {
       $abstract_element = $dom->createElement('field');
       $abstract_element->setAttribute('name', 'abstract');
       $abstract_text = $dom->createTextNode($sentence);
       $abstract_element->appendChild($abstract_text);
       $doc->appendChild($abstract_element);   // one <field> per sentence
   }

Field in schema.xml : <field name="abstract" type="text" indexed="true"
stored="false" multivalued="true" />


Where am I supposed to configure the value of the gap? 
positionIncrementGap in the fieldtype definition is my guess, but I'm 
not sure. Also, am I supposed to put multivalued in the fieldtype 
definition? Alternatively, could I put positionIncrementGap in the 
<field> that I posted just above?


Thanks for the help,
Michael





Index & search questions; special cases

2006-11-12 Thread Michael Imbeault

Hello again,

- Let's say I index "HIV-1" with <filter 
class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which 
after parsing by the above filter would yield HIV1 or HIV 1) also find 
documents which have HIV and the number "1" somewhere in the document, 
but not directly after HIV? If so, how should I fix this? I could boost 
score by proximity, but I'm doing a sort on date anyway, so I guess it 
would be pointless to do so.


- Somewhat related : Let's say I index "Polymyxin B". If I stopword 
single letters, would a phrase search ("Polymyxin B") still find the 
right documents (I don't think so, but still)? If not, I'll have to 
index single letters; how do I prevent the same problem as in the first 
question (i.e., a search on Polymyxin B yielding documents with 
Polymyxin and B, but not close to one another).


My thought is to parse the user query and rephrase it to do phrase 
searches on nearby terms containing single letters / numbers. If a user 
searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR 
("1 hepatitis" AND hiv). Is it a sensible solution?


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Sentence level searching

2006-11-12 Thread Michael Imbeault

Hello everyone,

I'm trying to do some sentence-level searching with Solr; basically, I 
want to find whether two words are in the same sentence. As I read on the 
Lucene mailing list, there are many ways to do this, including but not 
limited to:


-inserting special boundary terms to denote the start and end of a 
sentence. It is unclear to me what kind of query should be used to fetch 
results from within one sentence (something like: start_sentence_token 
word1 word2 end_sentence_token)?
-increasing the token position at a sentence boundary by a large factor 
(1000?) so that "x y"~500 (or more) won't match across sentence boundaries 
(a sketch of this approach follows the list).
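
A rough, untested sketch of that second approach: a TokenFilter that
inflates the position increment whenever sentence-ending punctuation falls
between two tokens. It needs the raw field text, so it would have to be
wired in from a custom Analyzer rather than configured purely in schema.xml;
all names here are made up:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class SentenceGapFilter extends TokenFilter {
        private static final int GAP = 1000; // large position jump at sentence boundaries
        private final String text;           // raw field text, for inspecting gaps
        private int lastEnd = 0;

        public SentenceGapFilter(TokenStream in, String text) {
            super(in);
            this.text = text;
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            // If '.', '!' or '?' occurs between the previous token and this one,
            // bump the increment so "x y"~500 can't match across sentences.
            String between = text.substring(lastEnd, Math.min(t.startOffset(), text.length()));
            if (between.indexOf('.') >= 0 || between.indexOf('!') >= 0
                    || between.indexOf('?') >= 0) {
                t.setPositionIncrement(t.getPositionIncrement() + GAP);
            }
            lastEnd = t.endOffset();
            return t;
        }
    }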


Is there an existing filter class that I could use to do this, or should 
I first parse my text fields with PHP and some NLP tool, and index the 
result (for the first case)? For the second case (increment token 
position), how should I do this within Solr?


Are there any plans to implement such functionality as standard?

Thanks for the help,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Spellchecker in Solr?

2006-10-31 Thread Michael Imbeault
I had #1 in mind. Everything in my mainIndex is supposed to be correctly 
spelled, so I just want to use that as a source for spelling 
suggestions. I'd check for suggestions on low numbers of results (no 
results, or very few for a one-word query).


#2 would be even better, but as you said, it's a lot trickier. For my 
needs, just a spelling suggester would be perfect. Would it require Java 
programming, or could I get away with it with the current Solr (adding 
n-gram fields and querying on them)?


Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Chris Hostetter wrote:

: Has anybody successfully implemented a Lucene spellchecker within Solr?
: If so, could you give details on how one would achieve this?

There are really two ways to interpret that question ...
  1) build a spell-correction suggestion application powered by Solr,
 where you manually feed it the data as documents and the mainIndex is
 the source of suggestion data.
  2) embed spell-correction suggestion in Solr, so that request handlers
 can return suggested alternatives along with the results from your
 mainIndex.

#1 would probably be pretty easy as people have mentioned.

#2 would be a lot trickier...

request handlers can certainly keep state, and could even write to files
if they wanted to, in order to preserve state across JVM instances and maintain a
permanent dictionary store ... and I suppose you could use a newSearcher
Listener to know when documents have been added so you can scan them for
new words to update your dictionary ... but off the top of my head it
sounds like it would get pretty complicated.



-Hoss

  


Re: Spellchecker in Solr?

2006-10-30 Thread Michael Imbeault
I had the very same article in mind - how would it be simpler in Solr 
than in Lucene? A spellchecker is pretty much standard in every major 
search engine nowadays - with one, Solr would be the best, hands down 
(even if it already is :P).


Are your plans to build this anything concrete, or is it just at the 'I 
might do this in the future' stage?

Thanks,
--

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Kevin Lewandowski wrote:
I have not done one but have been planning to do it based on this 
article:

http://today.java.net/pub/a/today/2005/08/09/didyoumean.html

With Solr it would be much simpler than the Java examples they give.

On 10/30/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Hello everyone,

Has anybody successfully implemented a Lucene spellchecker within Solr?
If so, could you give details on how one would achieve this?

If not, is it planned to make it standard within Solr? It's a feature
almost every Solr application would want to use, so I think it would be
a nice idea. Sadly, I'm no Java developer, so I fear I won't be the one
coding that :(

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212






Spellchecker in Solr?

2006-10-30 Thread Michael Imbeault

Hello everyone,

Has anybody successfully implemented a Lucene spellchecker within Solr? 
If so, could you give details on how one would achieve this?


If not, is it planned to make it standard within Solr? It's a feature 
almost every Solr application would want to use, so I think it would be 
a nice idea. Sadly, I'm no Java developer, so I fear I won't be the one 
coding that :(


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Facet performance with heterogeneous 'facets'?

2006-09-22 Thread Michael Imbeault
Excellent news; as you guessed, my schema was (for some reason) set to 
version 1.0. This also caused some of the problems I had with the 
original SolrPHP (parsing the wrong response).


But better yet, the 800-second query is now running in 0.5-2 seconds! 
Amazing optimization! I can now do faceting on journal title (17 000 
different titles) and last author (>400 000 authors), + 12 date range 
queries, in a very reasonable time (considering I'm on a test Windows 
desktop box and not a server).


The only problem is that if I add first author, I get a 
java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will 
go away on a server with more than the current 500 megs I can allocate 
to Tomcat.


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:

On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow: an 800-second query with a single facet on first_author, 15
million documents total, and the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop, not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filterCache
doing millions of lookups (and tons of evictions once it hits the 
maxSize).


The fact that you see all the filtercache usage means that the
optimization didn't kick in for some reason.


Here's the field I'm using in schema.xml:



That looks fine...


This is the query:
q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false



That looks OK too.
I assume that you didn't change the fieldtype definition for "string",
and that the schema has version="1.1"?  Before 1.1, all fields were
assumed to be multiValued (there was no checking or enforcement).

-Yonik



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault
I upgraded to the most recent Solr build (9-22) and sadly it's still 
really slow: an 800-second query with a single facet on first_author, 15 
million documents total, and the query returns 180. Maybe I'm doing 
something wrong? Also, this is on my personal desktop, not on a server. 
Still, I'm getting 0.1-second queries without facets, so I don't think 
that's the cause. In the admin panel I can still see the filterCache 
doing millions of lookups (and tons of evictions once it hits the maxSize).


Here's the field I'm using in schema.xml:


This is the query:
q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false


I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can check out from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik
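
For reference, the warming hook Yonik mentions is configured in
solrconfig.xml; presumably something like the following, where the query
parameters are purely illustrative:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">solr</str>
          <str name="facet">true</str>
          <str name="facet.field">first_author</str>
        </lst>
      </arr>
    </listener>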



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault
Dude, stop being so awesome (and the whole Solr team). Seriously! Every 
problem / request (MoreLikeThis class, changing the AND/OR preference 
programmatically, etc.) I've submitted to this mailing list has received a 
quick, more-than-I-ever-expected answer.


I'll subscribe to the dev list (been reading it off and on), but I'm 
afraid I couldn't code my way out of a paper bag in Java. I'll contribute to 
the Solr wiki (the SolrPHP part in particular) as soon as I can. That's 
the least I can do!


Btw, any plans for a facet cache?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

It turns out that journal_name has 17038 different tokens, which is
manageable, but first_author has > 400 000. I don't think this will ever
yield good performance, so I might only do journal_name facets.


Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html 



-Yonik



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault

Thanks for all the great answers.


Quick question: did you say you are faceting on the first name field
separately from the last name field? ... why?
You misunderstood. I'm doing faceting on the first author and last author 
of the list. Life science papers have author lists, and the first one is 
usually the guy who did most of the work, and the last one is usually 
the boss of the lab. I already have untokenized author fields for that, 
using copyField.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index 
is ...
once you made that change did your filterCache hit rate increase? .. 
do you
have any evictions (you can check on the "Statistics" page)
It was at the default (16000) and it hit the ceiling, so to speak. I did 
maxSize=1600384 (for testing purposes) and now size: 17038 and 0 
evictions. For a single facet field (journal name) with a limit of 5 and 
12 faceted query fields (range on publication date), I now have 
0.5-second searches, which is not too bad. The filterCache size is pretty 
much constant no matter how many queries I do.


However, if I try to add another facet field (such as first_author), 
something strange happens: 99% CPU, the filter cache filling up 
really fast, hit ratio going to hell, no disk activity, and it can stay 
that way for at least 30 minutes (didn't test longer, no point really). 
It turns out that journal_name has 17038 different tokens, which is 
manageable, but first_author has > 400 000. I don't think this will ever 
yield good performance, so I might only do journal_name facets.


Any reason why faceting tries to preload every term in the field?

I have noticed that facets are not cached. Facets off, cached queries take 
0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds. 
Any plans for a facet cache? I know that faceting is still a very early 
feature, but it's already awesome; maybe my application is unrealistic.


Thanks,
Michael



Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Another follow-up: I bumped all the caches in solrconfig.xml to

 <filterCache class="solr.LRUCache"
  size="1600384"
  initialSize="400096"
  autowarmCount="400096"/>

It seemed to fix the problem on a very small index (facets on last and 
first author fields, + 12 date range facets, sub-0.3 seconds for 
queries). I'll check on the full index tomorrow (it's indexing right 
now, 400 docs/sec!). However, I still don't know what these 
values represent, or how I should estimate what values to set 
them to. Originally I thought it was the size of the cache in KB, and 
someone on the list told me it was the number of items, but I don't quite 
get it. Better documentation on that would be welcome :)


Also, are there any plans to add an option not to run a facet search if 
the result set is too big? To avoid 40-second queries if the docset is 
too large...


Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what - if there's one document in the result
set, or if I do a query that returns all 130 000 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filtercache, the effective cache hit
rate will be 0%

-Yonik



Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Yonik Seeley wrote:

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).
Yeah that's what I've been thinking; the index isn't built to handle 
such searches, sadly :( It would be very nice to be able to rapidly 
search by most frequent author, journal, etc.

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.

I have one value per document (I have fields for authors, last_author 
and first_author, and I'm doing faceted search on the first and last 
author fields). How would I use the field cache to fix my problem? Also, 
would it be better to store a unique number (for each possible author) in 
an int field along with the string, and do the faceted searching on the 
int field? Would this be faster / require less memory? I guess yes, and 
I'll test that when I have the time.
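
Roughly, the single-valued field-cache approach Yonik hints at looks like
the following untested sketch (the class and method wiring here is
illustrative, not the patch that was later committed):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.solr.search.DocIterator;
    import org.apache.solr.search.DocSet;

    public class AuthorCounts {
        // One FieldCache array lookup per matching doc,
        // instead of one DocSet intersection per distinct author.
        public static Map<String, Integer> count(IndexReader reader, DocSet queryDocs)
                throws java.io.IOException {
            String[] authors = FieldCache.DEFAULT.getStrings(reader, "first_author");
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (DocIterator it = queryDocs.iterator(); it.hasNext(); ) {
                String author = authors[it.nextDoc()];
                if (author == null) continue;   // doc has no value for the field
                Integer c = counts.get(author);
                counts.put(author, (c == null) ? Integer.valueOf(1)
                                               : Integer.valueOf(c.intValue() + 1));
            }
            return counts;   // sort by count and keep the top 5 for facet output
        }
    }

The cost then tracks the number of matching documents rather than the number
of distinct authors, which should also make the int-field workaround above
unnecessary: the strings are only loaded into the cache once per index reader.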



Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what - if there's one document in the result
set, or if I do a query that returns all 130 000 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filtercache, the effective cache hit
rate will be 0%

-Yonik
So more memory would fix the problem? Also, I was under the impression 
that it was only searching / sorting the authors that it knows are in 
the result set... in the case of only one document (1 result), it seems 
strange that it takes the same time as for 130 000 results. It should 
just check the results, see that there's only one author, and return 
that? And in the case of 2 documents, just sort 2 authors (or 1 if 
they're the same)? I understand your answer (it does intersections), but 
I wonder why it's intersecting over the whole document set at first, and 
not docs_matching_query like you said.


Thanks for the support,

Michael


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault
Just a little follow-up - I did a little more testing, and the query 
takes 20 seconds no matter what - if there's one document in the result 
set, or if I do a query that returns all 130 000 documents.


It seems something isn't right... it looks like Solr is doing a faceted 
search on the whole index, no matter what the result set is, when doing 
facets on a string field. I must be doing something wrong?


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Michael Imbeault wrote:
Been playing around with the new 'facet search' and it works very 
well, but it's really slow for some particular applications. I've been 
trying to use it to display the most frequent authors of articles; 
this is from a huge (15 million articles) database and author names 
are rare and heterogeneous. On a query that takes (without 
facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the 
documents indexed (I've been getting java.lang.OutOfMemoryError with 
the full index); ~40 seconds for a faceted search on 2 (string) 
fields. Range queries on a slong field are more acceptable (even with a 
dozen of them, query time is still in the subsecond range).


Am I trying to do something which isn't what faceted search was made 
for? It would be understandable; after all, I guess the facet engine 
has to check every doc in the index and sort... which shouldn't yield 
good performance no matter what, sadly.


Is there any other way I could achieve what I'm trying to do? Just a 
list of the most frequent (top 5) authors present in the results of a 
query.


Thanks,



Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault
Been playing around with the new 'facet search' and it works very 
well, but it's really slow for some particular applications. I've been 
trying to use it to display the most frequent authors of articles; this 
is from a huge (15 million articles) database and author names are 
rare and heterogeneous. On a query that takes (without facets) 0.1 
seconds, it jumps to ~20 seconds with just 1% of the documents indexed 
(I've been getting java.lang.OutOfMemoryError with the full index); ~40 
seconds for a faceted search on 2 (string) fields. Range queries on a 
slong field are more acceptable (even with a dozen of them, query time is 
still in the subsecond range).


Am I trying to do something which isn't what faceted search was made 
for? It would be understandable; after all, I guess the facet engine 
has to check every doc in the index and sort... which shouldn't yield 
good performance no matter what, sadly.


Is there any other way I could achieve what I'm trying to do? Just a 
list of the most frequent (top 5) authors present in the results of a query.


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: MoreLikeThis class in Lucene within Solr?

2006-09-13 Thread Michael Imbeault
Thanks for the answer; and try to enjoy your vacation / travel! Can't 
wait to be able to interface with MoreLikeThis within Solr!


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Erik Hatcher wrote:


On Sep 12, 2006, at 3:41 PM, Michael Imbeault wrote:
I haven't looked at the specifics of how MoreLikeThis determines which 
items are similar; I'm mainly wondering about performance here. 
Yesterday I tried to code myself a poor man's similarity class (which 
was nothing more than doing a search with OR between words and 
sorting by score), and the performance was abysmal (well, I kinda 
expected it: 1000+ word queries on a 15 million doc collection, 
you don't expect miracles). At first glance I think it searches for 
the most 'relevant' words, am I right? What kind of performance are 
you getting with it?


Performance with MoreLikeThis is not an issue.  It has many parameters 
to tune how many terms are used in the query it builds, and it pulls 
these terms in an extremely efficient manner from the Lucene index.


I'm doing some traveling soon, which is always a good time to hack on 
something tractable like adding MoreLikeThis to Solr.  So your wish 
may be granted in a week :)


Erik



Re: MoreLikeThis class in Lucene within Solr?

2006-09-12 Thread Michael Imbeault
Thanks for that, Erik; it looks like a very good implementation of the 
class. If you ever find time to add it to the query handlers in Solr, 
I'm sure it would be wonderful for tons of users (Solr has tons of 
users, right? It definitely should!).


I haven't looked at the specifics of how MoreLikeThis determines which 
items are similar; I'm mainly wondering about performance here. 
Yesterday I tried to code myself a poor man's similarity class (which 
was nothing more than doing a search with OR between words and sorting 
by score), and the performance was abysmal (well, I kinda expected it: 
1000+ word queries on a 15 million doc collection, you don't expect 
miracles). At first glance I think it searches for the most 'relevant' 
words, am I right? What kind of performance are you getting with it?


Thanks a lot,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Erik Hatcher wrote:
I use MoreLikeThis in a custom request handler for Collex, for example 
the three items shown at the bottom left here:


<http://svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/TermQueryRequestHandler.java?revision=391&view=markup> 



I would like to get MoreLikeThis hooked into the 
StandardRequestHandler just like highlighting and facets are now.  One 
of these days I'll carve out time to do that if no one beats me to 
it.  It would not be difficult to do, it would just take some time to 
iron out how to parameterize it cleanly for general-purpose use.


Erik



MoreLikeThis class in Lucene within Solr?

2006-09-11 Thread Michael Imbeault
Ok, so hopefully I resolved my problems posting to this mailing list and 
this won't show up in some thread, but as a new topic!


Is it possible in any way to use the MoreLikeThis class with Solr 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html)? 
Right now I'm determining similar docs by just querying for the whole 
body with OR between words, and it's not very efficient performance-wise. 
I've never coded in Java, so I really don't know where I should start...


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Got it working! And some questions

2006-09-11 Thread Michael Imbeault

Hello Erik,

Thanks for adding that feature! "do" is fine with me, if "op" is already 
used (not sure about this one).


Erik Hatcher wrote:


On Sep 10, 2006, at 10:47 PM, Michael Imbeault wrote:
 I'm still a little disappointed that I can't change the OR/AND 
parsing by just changing some parameter (like I can do for the number 
of results returned, for example); adding an OR between each word in 
the text I want to compare sounds suboptimal, but I'll probably do it 
that way; it's a very minor nitpick, Solr is awesome, as I said before.


I'm the one that added support for controlling the default operator of 
Solr's query parser, and I hadn't considered the use case of 
controlling that setting from a request parameter.  It should be easy 
enough to add.  I'll take a look at adding that support and commit it 
once I have it working.


What parameter name should be used for this? do=[AND|OR] (for the 
default operator)?  We have df for the default field.


Erik


--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Got it working! And some questions

2006-09-10 Thread Michael Imbeault
First of all, it seems the mailing list is having some trouble? Some of 
my posts end up in the wrong thread (even new threads I post), I don't 
receive them in my mail, and they're present only in the 'date archive' 
of http://www.mail-archive.com, and not in the 'thread' one. I don't 
receive some of the other people's posts in my mail either; the problems 
started last week, I think.


Secondly, Chris, thanks for all the useful answers; everything is much 
clearer now. This info should be added to the wiki, I think; should I do 
it? I'm still a little disappointed that I can't change the OR/AND 
parsing by just changing some parameter (like I can do for the number of 
results returned, for example); adding an OR between each word in the 
text I want to compare sounds suboptimal, but I'll probably do it that 
way; it's a very minor nitpick, Solr is awesome, as I said before.


@ Brian Lucas: Don't worry, solrPHP was still 99.9% functional, great 
work; part of it sending a doc at a time was my fault; I was following 
the exact sequence (add to array, submit) displayed in the docs. The 
only thing that could be added is a big "//TODO: change this code" 
before sections you have to change to make it work for a particular 
schema. I'm pretty sure the custom header curl submit works for everyone 
but me; I'm on a Windows test box with WAMP on it, so it may be 
caused by that. I'll send you the changes I made to the code tomorrow 
anyway; as I said, nothing major.


Chris Hostetter wrote:

: - What is the loadFactor variable of HashDocSet? Should I optimize it too?

this is the same as the loadFactor in a HashMap constructor -- but I don't
think it has much effect on performance since the HashDocSets never
"grow".

I personally have never tuned the loadFactor :)

: - What are the units of the size value of the caches? Megs, number of
: queries, kilobytes? Not described anywhere.

"entries" ... the number of items allowed in the cache.

: - Any way to programmatically change the OR/AND preference of the query
: parser? I set it to AND by default for user queries, but I'd like to set
: it to OR for some server-side queries I must do (find related articles,
: order by score).

you mean using StandardRequestHandler? ... not that I can think of off the
top of my head, but typically it makes sense to just configure what you
want for your "users" in the schema, and then make any machine-generated
queries be explicit.

: - What's the difference between the two commit types? Blocking and
: non-blocking. I didn't see any differences at all; I tried both.

do you mean the waitFlush and waitSearcher options?
if either of those is true, you shouldn't get a response back from the
server until they have finished.  if they are false, then the server
should respond instantly even if it takes several seconds (or maybe even
minutes) to complete the operation (optimizes can take a while in some
cases -- as can opening newSearchers if you have a lot of cache warming
configured)
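
For reference, those options are passed as attributes on the update message
itself; presumably something like:

    <commit waitFlush="false" waitSearcher="false"/>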

: - Every time I do an <optimize> command, I get the following in my
: catalina logs - should I do anything about it?

the optimize command needs to be well formed XML, try "<optimize/>"
instead of just "<optimize>"

: - Any benefits of setting the allowed memory for Tomcat higher? Right
: now I'm allocating 384 megs.

the more memory you've got, the more caching you can support .. but if
your index changes so frequently compared to the rate of *unique*
queries you get that your caches never fill up, it may not matter.




-Hoss
  

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Doc add limit problem, old issue

2006-09-09 Thread Michael Imbeault
Fixed my problem: the implementation of solrPHP was faulty. It was 
sending one doc at a time (one curl call per doc) and the system quickly ran 
out of resources. Now I've modified it to send in batches (1000 at a time) 
and everything is #1!


Michael Imbeault wrote:
Old issue (see 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), 
but I'm experiencing the same exact thing on Windows XP, latest 
Tomcat. I noticed that the Tomcat process gobbles memory (10 megs a 
second, maybe) and then jams at 125 megs. Can't find a fix yet. I'm 
using a PHP interface and curl to post my XML, one document at a time, 
with a commit every 100 documents. Indexing 30 000 docs, it hangs at maybe 
5000. Anyone got an idea on this one? It would be helpful. I may try 
to switch to Jetty tomorrow if nothing works :(



--

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Got it working! And some questions

2006-09-09 Thread Michael Imbeault
First of all, in reference to 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg00808.html , 
I got it working! The problem(s) came from solrPHP; the 
implementation in the wiki isn't really working, to be honest, at least 
for me. I had to modify it significantly in multiple places to get it 
working. Tomcat 5.5, WAMP and Windows XP.


The main problem was that addIndex was sending 1 doc at a time to Solr; 
it would cause a problem after a few thousand docs because I was running 
out of resources. I modified solr_update.php to handle batch queries, 
and I'm now sending batches of 1000 docs at a time. Great indexing speed.


Had a slight problem with the curl function of solr_update.php; the 
custom HTTP header wasn't recognized; I now use curl_setopt($ch, 
CURLOPT_POST, 1); curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string); - 
much simpler, and now everything works!


So far I've indexed 15 000 000 documents (my whole collection, 
basically) and the performance I'm getting is INCREDIBLE (sub-100ms 
query time without warmup and no optimization at all on a 7-gig index - 
and with the cache, it gets stupid fast)! Seriously, Solr amazes me every 
time I use it. I increased the HashDocSet maxSize to 75000, and will 
continue to optimize this value - it helped a great deal. I will try the 
DisMax handler soon too; right now the standard one is great. And I will 
index with a better stopword file; the default one could really use 
improvements.
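
(For reference, the HashDocSet setting mentioned above lives in the <query>
section of solrconfig.xml, presumably along these lines:

    <HashDocSet maxSize="75000" loadFactor="0.75"/>

where loadFactor is the usual hash-table load factor.)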


Some questions (couldn't find the answer in the docs):

- Is the solrPHP code in the wiki working out of the box for anyone? If 
not, we could modify the wiki...


- What is the loadFactor variable of HashDocSet? Should I optimize it too?

- What are the units of the size value of the caches? Megs, number of 
queries, kilobytes? It's not described anywhere.


- Any way to programmatically change the OR/AND preference of the query 
parser? I set it to AND by default for user queries, but I'd like to set 
it to OR for some server-side queries I must do (find related articles, 
order by score).


- What's the difference between the two commit types, blocking and 
non-blocking? I didn't see any differences at all; I tried both.


- Every time I do an <optimize> command, I get the following in my 
catalina logs - should I do anything about it?


9-Sep-2006 2:24:40 PM org.apache.solr.core.SolrException log
SEVERE: Exception during commit/optimize:java.io.EOFException: no more 
data available - expected end tag </optimize> to close start tag 
<optimize> from line 1, parser stopped on START_TAG seen ... @1:10


- Any benefits to setting the allowed memory for Tomcat higher? Right 
now I'm allocating 384 megs.


Can't wait to try the new faceted queries... seriously, Solr is really, 
really awesome so far. Thanks for all your work, and sorry for all 
the questions!


--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Doc add limit problem, old issue

2006-09-05 Thread Michael Imbeault
Old issue (see 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), 
but I'm experiencing the same exact thing on Windows XP, latest Tomcat. 
I noticed that the Tomcat process gobbles memory (10 megs a second, 
maybe) and then jams at 125 megs. Can't find a fix yet. I'm using a PHP 
interface and curl to post my XML, one document at a time, with a commit 
every 100 documents. Indexing 30 000 docs, it hangs at maybe 5000. Anyone 
got an idea on this one? It would be helpful. I may try to switch to 
Jetty tomorrow if nothing works :(


--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Doc add limit, I'm experiencing it too

2006-09-05 Thread Michael Imbeault
Old issue (see 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), 
but I'm experiencing the same exact thing on Windows XP, latest Tomcat. 
I noticed that the Tomcat process gobbles memory (10 megs a second, 
maybe) and then jams at 125 megs. Can't find a fix yet. I'm using a PHP 
interface and curl to post my XML, one document at a time, with a commit 
every 100 documents. Indexing 30 000 docs, it hangs at maybe 5000. Anyone 
got an idea on this one? It would be helpful. I may try to switch to 
Jetty tomorrow if nothing works :(


--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212