Re: Storing info about the index in the index

2005-02-17 Thread Sanyi
 you could use a special document in the index to do this.

I was thinking about doing it this way, but this solution feels very ugly to me :)

 You could also keep a .properties or .xml file alongside the index.

Can I store such a file inside the index directory?
Will Lucene delete my file at some event?
(at optimize, or whatever)
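A minimal sketch of the .properties suggestion, assuming the file is kept next to the index directory rather than inside it, so the question of what Lucene does with unknown files never comes up. The class name IndexMetadata, the sidecar naming scheme, and the property key used in the usage note are hypothetical, and only the JDK is used:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: keeps index metadata in a file NEXT TO the index
// directory (e.g. /data/myindex and /data/myindex.properties).
final class IndexMetadata {

    // Derives the sidecar file path from the index directory path.
    static Path sidecarFor(Path indexDir) {
        return indexDir.resolveSibling(indexDir.getFileName() + ".properties");
    }

    // Writes the properties to the sidecar file, replacing any old content.
    static void save(Path indexDir, Properties props) throws IOException {
        try (OutputStream out = Files.newOutputStream(sidecarFor(indexDir))) {
            props.store(out, "index metadata");
        }
    }

    // Reads the properties back from the sidecar file.
    static Properties load(Path indexDir) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(sidecarFor(indexDir))) {
            props.load(in);
        }
        return props;
    }
}
```

An indexer could call IndexMetadata.save(indexDir, props) after setting something like a "lastIndexed" property, and the searcher could read it back with IndexMetadata.load(indexDir).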

Regards,
Sanyi



__ 
Do you Yahoo!? 
Yahoo! Mail - Easier than ever with enhanced search. Learn more.
http://info.mail.yahoo.com/mail_250

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DateFilter on UnStored field

2005-02-14 Thread Sanyi
 Following up on PA's reply.  Yes, DateFilter works on *indexed* values, 
 so whether a field is stored or not is irrelevant.

Great news, thanx!

 However, DateFilter will not work on fields indexed as 2004-11-05.  
 DateFilter only works on fields that were indexed using the DateField.

Well, could you post a short example here?
Where I currently write xxx.UnStored(.., can I simply write xxx.DateField(..?
Does it accept strings like 2004-11-05?

 One option is to use a QueryFilter instead, filtering with a 
 RangeQuery.

I've read somewhere that classic range filtering can easily exceed the maximum
number of boolean query clauses. I need to filter a very large range of dates
with day accuracy, and I don't want to increase the max. clause count to very
high values. So I decided to use DateFilter, which has no such problem AFAIK.

How much impact does DateFilter have on search times?

Regards,
Sanyi






Re: DateFilter on UnStored field

2005-02-14 Thread Sanyi
 DateField has a utility method to return a String:
 
   DateField.timeToString(file.lastModified())
 
 You'd use that String to pass to Field.UnStored.
 
 I recommend, though, that you use a different format, such as the 
 YYYY-MM-DD format you're using.

Well, I read the dates from a database as YYYY-MM-DD strings.
So I need to know how to convert a YYYY-MM-DD string into the format that
DateField.timeToString() returns.
Or I have to convert YYYY-MM-DD into the kind of value file.lastModified()
returns, which I can then pass to DateField.timeToString().
What is the easiest solution?
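One possible way to do the second conversion, sketched with the JDK only: parse the date string and turn it into milliseconds since the epoch, which is the unit File.lastModified() reports. The class name DateStrings is hypothetical, and treating the date as midnight UTC is an assumption (a local time zone may be more appropriate). On JDKs without java.time, java.text.SimpleDateFormat could play the same role.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DateStrings {
    // Parses an ISO "YYYY-MM-DD" string and returns milliseconds since the
    // epoch at midnight UTC (assumption: UTC is the intended time zone).
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }
}
```

The resulting long is the kind of value that the quoted reply passes to DateField.timeToString().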

 Lucene's latest codebase (though not 1.4.x) includes RangeFilter, 
 which would do the trick for you.  If you want to stick with Lucene 
 1.4.x, that's fine... just grab the code for that filter and use it as 
 a custom filter - it's compatible with 1.4.x.

So, why do you recommend RangeFilter over DateFilter?
Does it require less index data and/or perform better?
(I'm using 1.4.2)

 It depends on whether you instantiate a new filter for each search.  
 Building a filter requires scanning through the terms in the index to 
 build a BitSet for the documents that fall in that range.  Filters are 
 best used over multiple searches.
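The cost model described in the quoted reply can be illustrated with a toy model. This is not Lucene code: ToyRangeFilter and its in-memory term dictionary are hypothetical stand-ins that only show why the term scan is paid once per range and why reusing the resulting BitSet is cheap.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model, not Lucene code: a sorted term dictionary mapping each date
// term to the ids of the documents that contain it.
final class ToyRangeFilter {
    private final SortedMap<String, int[]> termToDocs = new TreeMap<>();
    private final Map<String, BitSet> cache = new HashMap<>();
    int scans = 0; // counts how often the term dictionary was actually scanned

    void addTerm(String term, int... docIds) {
        termToDocs.put(term, docIds);
    }

    // Returns the documents whose term lies in [lower, upper]. The scan over
    // the terms happens once per distinct range; later calls with the same
    // bounds reuse the cached BitSet.
    BitSet bits(String lower, String upper) {
        return cache.computeIfAbsent(lower + ".." + upper, key -> {
            scans++;
            BitSet result = new BitSet();
            // subMap's upper bound is exclusive, so append the smallest
            // possible character to make the range inclusive.
            for (int[] docs : termToDocs.subMap(lower, upper + "\0").values()) {
                for (int doc : docs) {
                    result.set(doc);
                }
            }
            return result;
        });
    }
}
```

A cache like this only pays off inside a long-running process. The sketch also never invalidates cached entries, so terms added after a range was first requested would be missed.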

Put simply:
I let the user enter the search string in an HTML form, then I call my custom
Lucene-based Java class from the command line (the calling method may change
to the PHP-to-Java bridge if it turns out to suit my needs).
So every search is a whole new round: new HTML form post, new command-line
JVM call, new index searcher, etc.

The OS caches the index file pretty well (limited only by the memory size, of
course).

Will my implementation's performance drop a lot if I add DateFilter?

Regards,
Sanyi






DateFilter on UnStored field

2005-02-13 Thread Sanyi
Hi!

Does DateFilter work on fields indexed as UnStored?
Can I filter an UnStored field with values like 2004-11-05 ?

Regards,
Sanyi






Re: PHP-Lucene Integration

2005-02-08 Thread Sanyi
Thanx a lot!

Sanyi

--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Howdy,
 [...]







Re: PHP-Lucene Integration

2005-02-07 Thread Sanyi
Hi!

Could you please explain how you implemented the Java and PHP parts so that
they communicate through this bridge?
The bridge's project summary talks about a Java application server or a
dedicated Java process, and I'm not that much into Java.
Currently I'm using a self-written command-line search program that writes its
results to standard output.
I guess your solution must be better ;)

If the communication parts of your code aren't top secret, can you please 
share them with me/us?

Regards,
Sanyi







Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Hi!

What is the simplest way to add synonyms for the AND/OR/NOT operators?
I'd like to support two sets of operator words, so people can use either the
original English operators or my custom ones for our local language.

Thank you for your attention!
Sanyi






Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Hi!

I think we're talking about different things.
My question is about using synonyms for AND/OR/NOT operators, not about 
synonyms of words in the
index.
For example, in some language: AND = AANNDD; OR = OORR; NOT = NNOOTT

So, the user can enter:
(cat OR kitty) AND black AND tail

or:

(cat OORR kitty) AANNDD black AANNDD tail

Both sets of operators must work.
This must be some kind of query parser modification or parameterization, so it
has nothing to do with the index.

I hope I've been more specific now ;)

Thanx,
Sanyi




--- Erik Hatcher [EMAIL PROTECTED] wrote:

 On Dec 21, 2004, at 3:04 AM, Sanyi wrote:
  What is the simplest way to add synonyms for AND/OR/NOT operators?
  I'd like to support two sets of operator words, so people can use 
  either the original English
  operators or my custom ones for our local language.
 
 There are two options that I know of: 1) add synonyms during indexing 
 and 2) add synonyms during querying.  Generally this would be done 
 using a custom analyzer.
 
 If the synonym mappings are static and you don't mind a larger index, 
 adding them during indexing avoids the complexity of rewriting the 
 query.  Injecting synonyms during querying allows the synonym mappings 
 to change dynamically, though does produce more complex queries.  
 Here's an example you'll find with the source code distribution of 
 Lucene in Action which uses WordNet to look up synonyms.
 
   Erik
 
 p.s. I'm sensitive to over-marketing Lucene in Action in this forum as 
 it would bother me to constantly see an advertisement.  You can be sure 
 that any mentions of it from me will coincide with concrete examples 
 (which are freely available) that are directly related to questions 
 being asked.
 
 
 % ant -emacs SynonymAnalyzerViewer
 Buildfile: build.xml
 
 check-environment:
 
 compile:
 
 build-test-index:
 
 build-perf-index:
 
 prepare:
 
 SynonymAnalyzerViewer:
 
Using a custom SynonymAnalyzer, two fixed strings are
analyzed with the results displayed.  Synonyms, from the
WordNet database, are injected into the same positions
as the original words.
 
See the Analysis chapter for more on synonym injection and
position increments.  The Tools and extensions chapter covers
the WordNet feature found in the Lucene sandbox.
 
 Press return to continue...
 
 Running lia.analysis.synonym.SynonymAnalyzerViewer...
 
 1: [quick] [warm] [straightaway] [spry] [speedy] [ready] [quickly] 
 [promptly] [prompt] [nimble] [immediate] [flying] [fast] [agile]
 2: [brown] [brownness] [brownish]
 3: [fox] [trick] [throw] [slyboots] [fuddle] [fob] [dodger] 
 [discombobulate] [confuse] [confound] [befuddle] [bedevil]
 4: [jumps]
 5: [over] [o] [across]
 6: [lazy] [faineant] [indolent] [otiose] [slothful]
 7: [dogs]
 
 1: [oh]
 2: [we]
 3: [get] [acquire] [aim] [amaze] [arrest] [arrive] [baffle] [beat] 
 [become] [beget] [begin] [bewilder] [bring] [can] [capture] [catch] 
 [cause] [come] [commence] [contract] [convey] [develop] [draw] [drive] 
 [dumbfound] [engender] [experience] [father] [fetch] [find] [fix] 
 [flummox] [generate] [go] [gravel] [grow] [have] [incur] [induce] [let] 
 [make] [may] [mother] [mystify] [nonplus] [obtain] [perplex] [produce] 
 [puzzle] [receive] [scram] [sire] [start] [stimulate] [stupefy] 
 [stupify] [suffer] [sustain] [take] [trounce] [undergo]
 4: [both]
 5: [kinds]
 6: [country] [state] [nationality] [nation] [land] [commonwealth] [area]
 7: [western] [westerly]
 8: [bb]
 
 BUILD SUCCESSFUL
 Total time: 10 seconds
 
 
 
 







Re: Synonyms for AND/OR/NOT operators

2004-12-21 Thread Sanyi
Well, I guess I'd better recognize the operator synonyms and replace them with
the original operators before passing the query to QueryParser. I don't feel
comfortable tampering with
Lucene's source code.
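A rough sketch of that pre-processing step, using only the JDK. The class name OperatorSynonyms is hypothetical, the synonym words are the placeholders from the example earlier in this thread, and the handling of quoted phrases is a simplification (backslash-escaped quotes are not treated specially). Matching is case-sensitive on the assumption that only upper-case words should count as operators.

```java
import java.util.HashMap;
import java.util.Map;

final class OperatorSynonyms {
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        // Placeholder operator words taken from the example in this thread.
        SYNONYMS.put("AANNDD", "AND");
        SYNONYMS.put("OORR", "OR");
        SYNONYMS.put("NNOOTT", "NOT");
    }

    // Replaces localized operator words with AND/OR/NOT. Text inside double
    // quotes is copied unchanged so phrase queries are not altered.
    static String normalize(String query) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                flush(word, out, inQuotes);
                inQuotes = !inQuotes;
                out.append(c);
            } else if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                flush(word, out, inQuotes);
                out.append(c);
            }
        }
        flush(word, out, inQuotes);
        return out.toString();
    }

    // Emits the pending word, translating it only outside quoted phrases.
    private static void flush(StringBuilder word, StringBuilder out, boolean inQuotes) {
        String w = word.toString();
        out.append(inQuotes ? w : SYNONYMS.getOrDefault(w, w));
        word.setLength(0);
    }
}
```

The normalized string would then be handed to the query parser unchanged, so the original English operators keep working alongside the localized ones.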

Anyway, thanx for the answers.

Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote:

 Erik Hatcher writes:
  On Dec 21, 2004, at 3:04 AM, Sanyi wrote:
   What is the simplest way to add synonyms for AND/OR/NOT operators?
   I'd like to support two sets of operator words, so people can use 
   either the original English
   operators or my custom ones for our local language.
  
  There are two options that I know of: 1) add synonyms during indexing 
  and 2) add synonyms during querying.  Generally this would be done 
  using a custom analyzer.
 
 I guess you misunderstood the question.
 
 I think he wants to know how to create a query parser that understands 
 something like 'a UND b' as well as 'a AND b', to support localized 
 operator names (German in this case).
 
 AFAIK that can only be done by copying the query parser's JavaCC source and
 adding the operators there.
 Shouldn't be difficult, though it's a bit ugly since it implies code
 duplication. And there will be no way of choosing the operators dynamically
 at runtime. One will need to have different query parsers for different
 languages.
 
 Morus
 
 
 







What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Hi!

I'm testing Lucene 1.4.2 on two very different configs, but with the same index.
I'm very surprised by the results: both systems search at about the same
speed, but I'd expect (and I really need) Lucene to run a lot faster on my
stronger config.

Config #1 (a notebook):
WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester

Config #2 (a desktop PC):
SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
15000RPM U320 SCSI winchester

As you can see, the hardware of #2 is at least twice as fast as #1.
I'm looking for the reason, and for a way to take advantage of the better
hardware compared to the poor notebook.
Currently #2 doesn't noticeably outperform the notebook (#1).

The question is: what can be worse on #2 than on the poor notebook?

I can only imagine software problems.
Which software parts could it be, then?
1. The OS
Is SuSE 9.1 a LOT slower than WinXP Pro?
2. The file system
Is reiserfs a LOT slower than NTFS?

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
 Interesting, what are your merge settings

Sorry, I didn't mention that I was talking about search performance.
I'm using the same, fully optimized index on both systems.
(I generated each index with the same code from the same database, on the
respective OS.)

 which JDK are you using?

I'm using the same Sun JDK on both systems.
So far I've tried:
j2sdk1.4.2_04, _05 and _06.
I didn't notice any speed differences between these update releases.
Do you know of significant speed differences between them that I should have
noticed?

 Have you tried with hyperthreading turned off on #2?

No, but I will try it if the problem isn't in the file system.
I hope that the cause of the slowness is reiserfs, because that is the easiest
thing to change.

What file systems are you people using Lucene on? And what are your experiences?

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
 Could you try XP on your desktop

Sure, but I'll only do that if I run out of ideas.

 so your desktop is actually using
 a 1.5GHz CPU for the search.

No, this is not true. It uses a 3.0GHz P4 then.
(HT means that you have two 3.0GHz P4s)

So, it is still surprising to me.

Regards,
Sanyi







Re: AW: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
 The notebook is quite good, e.g. the Pentium-M might be faster than
 your Pentium 4. At least it has a similar speed, because of its better
 internal design. Never compare CPUs of different types by their
 frequency. 

Ok, this might be true, but:

All of my other tests that involve the CPU run a LOT faster on the desktop PC
with the 3GHz P4.
Even other Java parts run a LOT faster (nearly twice as fast).
So we can't even say that the Java VM takes no advantage of the 3GHz P4
compared to the 1.8GHz Pentium-M.
Everything is a LOT faster, except searching with Lucene (which is also
faster, but only slightly).

 Maybe your index is small enough to fit into the cache provided by the 
 operating systems. So you wouldn't recognize any difference between your
 hard disks.

It is a 3GByte index and I always reboot between tests, so caching is not the
explanation.

 I don't think so. I'm using Windows 2000 pro and SuSE 9.0 and 
 (from memory) Linux seems to be slightly faster, but I can't
 provide any benchmark now.

Are you using reiserfs with SuSE?

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
 How large is the index?   If it's less than a couple of GByte then it 
 will be entirely in memory

It is 3GBytes in size and it will grow a lot.
I have to search from the HDD, which on the desktop is very fast compared to
the notebook's HDD.

Average seek time:
Notebook: 8-9ms
Desktop: 3.9ms

Data read:
Notebook: max. ~20MBytes/sec
Desktop: 60-80MBytes/sec

So, if the bottleneck is the HDD, the search has to be 2x-3x faster on the
desktop system, unless reiserfs is a lot slower than NTFS.

 For example (and this is only an example) looking up a hostname in the 
 DNS will take about the same time on almost any machine you can get hold of.

Ok, but I have very simple and pure tests and everything is measured
part by part.
..and every part speeds up a lot on the desktop system, except the Lucene
search part.

 You don't say how you're measuring search performance and you don't say 
 what you're seeing.

I call my java program from command line on both systems, like:
search hello
Then it searches for bravo and collects the elapsed milliseconds between every 
call to anything.
Then it displays the results. It is very simple.
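For illustration, a small timing helper of the kind such a test program might use. This is only a sketch: SearchTimer is a hypothetical name, and taking the median over several runs (instead of timing a single call) is a suggestion, not what the program described above does.

```java
import java.util.Arrays;

final class SearchTimer {
    // Runs the task `runs` times (runs must be at least 1) and returns the
    // median elapsed time in milliseconds, which is less sensitive to
    // one-off outliers than a single measurement.
    static double medianMillis(Runnable task, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return median(nanos) / 1_000_000.0;
    }

    // Median of the given values; the input array is left unmodified.
    static double median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

With a fresh JVM per search, the first run also includes class loading and JIT warm-up, so comparing the first measurement with the median of later ones can show how much of the elapsed time is start-up cost rather than search cost.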

 Also, what's the load on the system while you're 
 running the tests?   gkrellm on Linux is very useful as an overall view 
 -- are you CPU bound, are you seeing lots of disk traffic?   Is the 
 system actually more-or-less idle?

Thanx for the hint. Since my search only looks for 30 hits, it completes too
quickly for me to monitor it in real time.
Anyway, if reiserfs proves to be fast enough, I'll look for other reasons
and will run longer tests for real-time monitoring.

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
 simply load your index into a
 RAMDirectory instead of using FSDirectory. 

I have 3GByte RAM and my index is currently 3GByte in size (it'll soon be
about 4GByte).
So, I have to find this out another way.

 First off, 1.8GHz Pentium-M machines are supposed to run at about the
 speed of a 2.4GHz machine.  The clock speeds on the mobile chips are
 lower, but they tend to perform much better than rated.   I recommend
 you take a general benchmark of both machines testing both disk speed
 and cpu speed to get a baseline performance comparison.

I think it is a good general benchmark that almost everything runs at least
twice as fast on the 3.0GHz P4, except Lucene search.

I can give you one more interesting piece of information:
I have a MySQL table with ~20 million records.
When I run a DROP INDEX on that table, MySQL rebuilds the whole huge table
into a tempfile.
It completes in 30 minutes on both systems.
Again, it doesn't matter that the 15kRPM U320 HDD is 2x-3x as fast.
Very surprising again.
Hmm... reiserfs must be very, very slow, or I'm completely lost :)

 I also suggest turning off HT for your benchmarks and performance testing.

I'll try this later and I really hope it won't be the reason.

 Secondly, while the second machine appears to be twice as fast, the
 disk could actually perform slower on the Linux box, especially if the
 notebook drive has a big (8M) cache like most 7200RPM ata disk drives
 do. 

Both drives have 8M cache.

 I imagine that if you hit the index with lots of simultaneous
 searches, that the Linux box would hold its own for much longer than
 the XP box simply due to the random seek performance of the scsi disk
 combined with scsi command queueing.

Are you saying that SCSI command queuing wastes more time than a 15kRPM 3.9ms 
HDD can gain over a
7.2kRPM 8-9ms HDD?
It sounds terrible and I hope it isn't true.

 RAM speed is a factor too.  Is the p4 a xeon processor?  The older HT
 xeons have a much slower bus than the newer p4-m processors.  Memory
 speed will be affected accordingly.

It is not a Xeon, just a P4 3.0GHz HT.

 I haven't heard of a hard disk referred to as a winchester disk in a
 very long time :)

;)

 Once you have an idea of how the two machines actually compare
 performance-wise, you can then judge how they perform index
 operations.

Lucene indexing completes in 13-15 hours on the desktop system, while it takes
about 29-33 hours on the notebook.

Now, combine that with the DROP INDEX tests completing in the same amount of
time on both, and find out why the search is only slightly faster :)

 Until then, all your measurements are subjective and you
 don't gain much by comparing the two indexing processes.

I'm worried about searching. Indexing is a lot faster on the desktop config.

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Thanx to all of you for the replies.
I was looking for someone with the same experience as mine, but it seems
that I'll have to test this myself.
I'll try out my ideas and the most interesting ideas from you guys.

Regards,
Sanyi






Re: WildcardTermEnum skipping terms containing numbers?!

2004-11-20 Thread Sanyi
 why reindex?

Well, since I got different results with the different analyzers I tried, I
thought that this problem must originate from either the indexing or a Lucene
bug.

 As stated at the end of my mail, I'd expect that to skip the
 first term in the enum.

Yes, this must have been my problem, since I took this sentence from the docs
as my starting point:
Returns the current Term in the enumeration. Initially invalid, valid after
next() called for the first time.

So, it seems that it was a bug in the docs, not in the API itself.

 Is that what you are missing, or do you lose
 more than one term?

It seemed to me that it was skipping more than that, but I'd better not claim
this, since I didn't know that the term is valid even before the first
next(), so I could have been misled by my own
chaotic experiences.
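The pitfall can be reproduced with a toy cursor. This is not Lucene code: PrePositionedCursor is a hypothetical class that, like the enumeration as described in this thread, is already positioned on its first element when created. Since '0' sorts before letters, a term like c0ca would plausibly have been the first match in sorted order, and therefore the one a next()-first loop skips.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cursor that is already positioned on its first element
// when it is created.
final class PrePositionedCursor {
    private final List<String> items;
    private int pos = 0;

    PrePositionedCursor(List<String> items) { this.items = items; }

    // Current element, or null once the cursor is exhausted.
    String current() { return pos < items.size() ? items.get(pos) : null; }

    // Advances to the next element; returns false when there is none.
    boolean next() { pos++; return pos < items.size(); }

    // Skips the first element: next() is called before current() is read.
    static List<String> whileLoop(PrePositionedCursor c) {
        List<String> seen = new ArrayList<>();
        while (c.next()) seen.add(c.current());
        return seen;
    }

    // Visits every element: reads current() first, then advances.
    static List<String> doWhileLoop(PrePositionedCursor c) {
        List<String> seen = new ArrayList<>();
        if (c.current() == null) return seen;
        do { seen.add(c.current()); } while (c.next());
        return seen;
    }
}
```

Over the sorted list c0ca, caca, ccca, the while loop returns only caca and ccca, whereas the do/while loop returns all three.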

Since my code has been completely restructured since then, I no longer have
all the surrounding code needed for further testing.

Anyway, thanks to you we've found a docs bug, and my code is cleaner and
better the other way.

Thanx!








Re: WildcardTermEnum skipping terms containing numbers?!

2004-11-17 Thread Sanyi
Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too 
buggy to use.
I'm now reimplementing my code using WildcardTermEnum.wildcardEquals which 
seems to be better so
far.
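For readers unfamiliar with this kind of matching, here is a stand-alone sketch of '?' and '*' wildcard comparison. It is a re-implementation for illustration, not the code behind WildcardTermEnum.wildcardEquals, and the class name SimpleWildcard is hypothetical.

```java
final class SimpleWildcard {
    // Returns true if text matches pattern, where '?' matches exactly one
    // character and '*' matches any run of characters (including none).
    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String p, int pi, String t, int ti) {
        if (pi == p.length()) return ti == t.length();
        char c = p.charAt(pi);
        if (c == '*') {
            // Try letting '*' absorb 0, 1, 2, ... characters of the text.
            for (int skip = ti; skip <= t.length(); skip++) {
                if (matches(p, pi + 1, t, skip)) return true;
            }
            return false;
        }
        if (ti == t.length()) return false;
        if (c != '?' && c != t.charAt(ti)) return false;
        return matches(p, pi + 1, t, ti + 1);
    }
}
```

With this matcher, the pattern c?ca accepts both coca and c0ca, since '?' makes no distinction between letters and digits.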

--- Sanyi [EMAIL PROTECTED] wrote:

 Hi!
 
 I have the following problem with 1.4.2:
 I'm searching for c?ca (using StandardAnalyzer) and one of the hits looks 
 something like this:
 blabla c0ca c0la etc.. etc...
 (those big o-s are zero characters)
 Now, I'm enumerating the terms using WildcardTermEnum and all I get is:
 
 caca
 ccca
 ceca
 cica
 coca
 crca
 csca
 cuca
 cyca
 
 It doesn't know about c0ca at all.
 Is there any way to overcome this problem?
 
 Thanks,
 Sanyi
 
 
   
 
 







Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Sanyi
 - leave the current implementation, raising an exception;
 - handle the exception and limit the boolean query to the first 1024
 (or what ever the limit is) terms;
 - select, between the possible terms, only the first 1024 (or what
 ever the limit is) more meaningful ones, leaving out all the others.

I like this idea, and I would refine it like this:
I'd also add a default rule, so that people who are happy with the default
behavior don't have to handle exceptions:

Keep and search for only the longest 1024 fragments. It will throw away
a, an, at, and, add, etc., but it will automatically keep 1024 variations like
alpha, alfa, advanced, automatical, etc.
So it will automatically lower the search overhead and will still search fine
without throwing exceptions.
(For people who prefer the widest search range and do not care about the huge
overhead, we could
leave a boolean switch for keeping not the longest, but the shortest fragments)
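The proposed rule could look roughly like the following, applied to a list of expanded terms before they are turned into query clauses. This is a sketch of the suggestion only, not an existing Lucene feature; ExpansionLimiter and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ExpansionLimiter {
    // Keeps at most `limit` of the expanded terms, preferring longer ones
    // (or shorter ones when preferLongest is false). Ties are broken
    // alphabetically so the result is deterministic.
    static List<String> limit(List<String> terms, int limit, boolean preferLongest) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        if (preferLongest) byLength = byLength.reversed();
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(byLength.thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

Here `limit` would be the maximum clause count (1024 by default, as discussed in this thread), and `preferLongest` plays the role of the boolean switch mentioned above.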






Anyone implemented custom hit ranking?

2004-11-13 Thread Sanyi
Hi!

I have problems with the ranking of short texts. I've read about the same
ranking problems in the list archives, but found only hints and thoughts
(adjust DefaultSimilarity, Similarity, etc.), not complete solutions with
source code.
Has anyone implemented a good solution for this problem? (Example: my search
application returns about 10-20 pages of 1-2 word hits for hello, and only
then starts to list the longer texts.)
I've implemented a very simple solution: I boost documents shorter than 300
chars with 1/300*doclength at index time. Now it works a lot better. In fact,
I can't see any problems now.
Anyway, I think this is not a real solution; it is a patch or workaround.
So, I'd be interested in some kind of well designed complete solution for this 
problem.
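The workaround described above amounts to the following boost function. The class and constant names are hypothetical, and leaving documents of 300 characters or more at the neutral boost 1.0 is an assumption based on the description.

```java
final class ShortTextBoost {
    static final int FULL_LENGTH = 300; // threshold from the post, in characters

    // Documents of FULL_LENGTH characters or more keep the neutral boost 1.0;
    // shorter ones are scaled down linearly with their length.
    static float boostFor(int docLength) {
        if (docLength >= FULL_LENGTH) return 1.0f;
        return (float) docLength / FULL_LENGTH;
    }
}
```

At index time the returned value would presumably be applied as the document boost, for example through Document.setBoost(float).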

Regards,
Sanyi






Re: Phrase search for more than 4 words throws exception in QueryParser

2004-11-12 Thread Sanyi
It works for me too on linux. Thanks for the test!

--- Morus Walter [EMAIL PROTECTED] wrote:

 Sanyi writes:
  
  How to perform phrase searches for more than four words?
  
  This works well with 1.4.2:
  aa bb cc dd
  I pass the query as a command line parameter on XP: \aa bb cc dd\
  QueryParser translates it to: text:aa text:bb text:cc text:dd
  Runs, searches, finds proper matches.
  
  This throws an exception in QueryParser:
  aa bb cc dd ee
  I pass the query as a command line parameter on XP: \aa bb cc dd ee\
  The exception's text is:
  : org.apache.lucene.queryParser.ParseException: Lexical error at line 1, 
  column
  13.  Encountered: EOF after : \aa bb cc dd
  
 Works for me on linux:
 java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g 
 h i j k l m n o p
 q r s t u v w x y z'
 a b c d e f g h i j k l m n o p q r s t u v w x y z
 
 Must be an XP command line problem.
 
 HTH
   Morus
 
 
 







Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Sanyi
 It is normally possible to reduce the numbers of such complaints a lot 
 by imposing a minimum prefix length

I've already limited it to a minimum of 5 characters (abcde*).
I can still easily find (on the first try) cases where a search runs for
minutes, while other 5-character prefixes are searched within a second.
So, this is not a solution at all.

 and eg. doubling or tripling the max. nr. of clauses.

This is the only useful thing I could do, and the other way I've found is
similar: removing the limit on the number of clauses, but limiting the memory
given to Java.
It will then throw an exception if things are getting too hard for the
searcher.

Anyway, this avoids DoS attacks, but results in a very poor user interface and
search ability.
For example: rareword AND commonfragment* would still refuse to work.
I won't be able to explain it to my users, since they don't care about my
technical reasons. They'll only notice that dodge AND vip* fails to search
instead of returning 1000 documents.

If I remove all limits and don't care about possible DoS attacks, the result
is still poor.
It'll search for dodge AND vip* for two minutes, just because vip* is too
common in the entire document set.
It doesn't matter that dodge is pretty rare and we're AND-ing it with vip*.







Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a limit of 1024 clauses by default, which is good enough for me, but I
still think it behaves strangely.

Example:
I have an index with about 20 million documents.
Let's say that there are about 3000 variants of this word mask in the entire
document set: cab*
Let's say that about 500 documents contain the word: spectrum
Now, when I search for cab* AND spectrum, I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the word
spectrum, then it should collect the variants of cab* within these documents,
which turns out to be two or three variants of cab* (cable, cables, maybe some
more), and the search should return, let's say, 10 documents.

Similar example: When I search for cab* AND nonexistingword it still throws a 
TooManyClauses
exception instead of saying No results, since there is no nonexistingword 
in my document set,
so it doesn't even have to start collecting the variations of cab*.
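The behavior expected here can be sketched on a toy inverted index. This is explicitly not how Lucene 1.4.2 works (the thread makes clear that it expands the wildcard against the whole index); RestrictedExpansion and its data layout are hypothetical and only illustrate the idea of restricting the expansion by the other clause.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeSet;

// Toy inverted index, not Lucene code: term -> ids of documents containing it.
final class RestrictedExpansion {
    // Returns only those terms starting with `prefix` that occur in at least
    // one document also containing `requiredTerm`.
    static Set<String> expand(SortedMap<String, Set<Integer>> index,
                              String prefix, String requiredTerm) {
        Set<String> result = new TreeSet<>();
        Set<Integer> candidates = index.get(requiredTerm);
        if (candidates == null) return result; // required term absent: nothing to expand
        for (Map.Entry<String, Set<Integer>> e : index.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break; // left the prefix range
            for (Integer doc : e.getValue()) {
                if (candidates.contains(doc)) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

Note that even this restricted expansion still walks every term matching the prefix and may touch its document list, so it reduces the number of clauses rather than the work of enumerating the terms.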

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message, I was first sending it as an 
accidental reply to a wrong thread..






RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Yes, I understand all of this, but I don't want to set it to MaxInt, since
that can easily lead to (even accidental) DoS attacks.

What I'm saying is that there is no reason for the optimizer to expand wild*
to more than 1024 variations when I search for somerareword AND wild*, since
somerareword is only present in, let's say, 100 documents. So wild* should
only expand to words beginning with wild in those 100 documents, and then it
would work fine with the default 1024 clause limit.

But it doesn't, so I have to choose between unusable queries and accidental
DoS attacks.

--- Will Allen [EMAIL PROTECTED] wrote:

 Any wildcard search will automatically expand your query to the number of 
 terms it finds in the
 index that suit the wildcard.
 
 For example:
 
 wild*, would become wild OR wilderness OR wildman etc for each of the terms 
 that exist in your
 index.
 
 It is because of this, that you quickly reach the 1024 limit of clauses.  I 
 automatically set it
 to max int with the following line:
 
 BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
 
 
 -Original Message-
 From: Sanyi [mailto:[EMAIL PROTECTED]
 Sent: Thursday, November 11, 2004 6:46 AM
 To: [EMAIL PROTECTED]
 Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
 
 
 Hi!
 
 First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
 has a limit of 1024 clauses by default, which is good enough for me, but I
 still think it behaves strangely.
 
 Example:
 I have an index with about 20 million documents.
 Let's say that there are about 3000 variants of this word mask in the entire
 document set: cab*
 Let's say that about 500 documents contain the word: spectrum
 Now, when I search for cab* AND spectrum, I don't expect it to throw an
 exception.
 It should first restrict the search to the 500 documents containing the word
 spectrum, then it should collect the variants of cab* within these
 documents, which turns out to be two or three variants (cable, cables, maybe
 some more), and the search should return, let's say, 10 documents.
 
 Similar example: when I search for cab* AND nonexistingword, it still throws
 a TooManyClauses exception instead of saying "No results". Since there is no
 nonexistingword in my document set, it doesn't even have to start collecting
 the variations of cab*.
 
 Is there any patch for this issue?
 Thank you for your time!
 
 Sanyi
 (I'm using: lucene 1.4.2)
 
 p.s.: Sorry for re-sending this message, I was first sending it as an 
 accidental reply to a
 wrong thread..
 
 
   
 
 







Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
 That's the point: there is no query optimizer in Lucene.

Sorry, I'm not very familiar with Lucene's internal classes; I'm just giving
you the viewpoint of a user. My users aren't technicians, so answers like
yours won't make them happy.
They will only see that I randomly don't allow them to search (because of the
1024 limit). They won't understand why I am displaying "Please restrict your
search a bit more.." when they've just searched for dodge AND vip* and there
are only a few documents matching these criteria.

So, is setting the max. clause limit to MaxInt the only way to let them search
happily?!
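
One middle ground I can think of, which nobody in this thread has suggested
and which I'm only sketching as an assumption, is to reject wildcard terms
whose literal prefix is very short before the query reaches Lucene at all.
That way the user gets a specific message ("type at least 3 letters before
the *") instead of a random-looking failure. The helper below is hypothetical
plain Java, not part of Lucene, and it does not guarantee that a longer
prefix stays under the clause limit.

```java
// Hypothetical pre-check, not part of Lucene: flags wildcard terms with a too-short prefix.
class WildcardPolicy {

    /**
     * Returns false if any whitespace-separated token contains '*' before
     * minPrefixLength literal characters. Deliberately simplistic: it does not
     * understand field prefixes, quoting, or the '?' wildcard.
     */
    static boolean isAcceptable(String query, int minPrefixLength) {
        for (String token : query.trim().split("\\s+")) {
            int star = token.indexOf('*');
            if (star >= 0 && star < minPrefixLength) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("dodge AND vip*", 3)); // vip has 3 letters before *
        System.out.println(isAcceptable("dodge AND v*", 3));   // only 1 letter before *
    }
}
```

Combined with a raised but still finite clause limit, this would at least make
the refusals predictable for non-technical users.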







Phrase search for more than 4 words throws exception in QueryParser

2004-11-11 Thread Sanyi
Hi!

How can I perform phrase searches for more than four words?

This works well with 1.4.2:
"aa bb cc dd"
I pass the query as a command line parameter on XP: \"aa bb cc dd\"
QueryParser translates it to: text:aa text:bb text:cc text:dd
Runs, searches, finds proper matches.

This throws an exception in QueryParser:
"aa bb cc dd ee"
I pass the query as a command line parameter on XP: \"aa bb cc dd ee\"
The exception's text is:
: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column
13.  Encountered: <EOF> after : "\"aa bb cc dd"

It doesn't matter what words I enter; the only thing that matters is the
number of words, which can be four at most.
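
The error text stops right after dd, so one thing worth ruling out is that
the query gets split or truncated on the command line before QueryParser ever
sees it. That is only a guess on my side. A small diagnostic like the one
below (plain Java, hypothetical class name, no Lucene involved) prints what
the JVM actually received and whether the double quotes around the phrase are
balanced.

```java
// Diagnostic sketch only; QueryArgs is a made-up name, not part of Lucene.
class QueryArgs {

    /** Rebuilds one query string from all arguments, in case the shell split the phrase. */
    static String join(String[] args) {
        return String.join(" ", args);
    }

    /** True if the query contains an even number of double-quote characters. */
    static boolean quotesBalanced(String query) {
        int count = 0;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) == '"') {
                count++;
            }
        }
        return count % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println("argument count: " + args.length);
        String query = join(args);
        System.out.println("query as received: [" + query + "]");
        System.out.println("quotes balanced: " + quotesBalanced(query));
    }
}
```

If the argument count is larger than one, or the quotes are not balanced, the
problem is in how the query is passed in rather than in the phrase length.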

Regards,
Sanyi






stopword AND validword throws exception

2004-11-10 Thread Sanyi
Hi!

I've left custom stopwords out of my index using
StopAnalyzer(customstopwords).
Now, when I try to search the index the same way
(StopAnalyzer(customstopwords)), it seems to act strangely:

This query works as expected:
validword AND stopword
(throws out the stopword part and searches for validword)

This query seems to crash:
stopword AND validword
(java.lang.ArrayIndexOutOfBoundsException: -1)

Maybe it can't handle the case where it has to remove the very first part of
the query?!
Can anyone else test this for me? How can I overcome this problem?

(lucene-1.4-final.jar)

Thanks for your time!

Sanyi






Re: stopword AND validword throws exception

2004-11-10 Thread Sanyi
Thanx for your replies, guys.

Now, I was trying to locate the latest patch for this group of problems, and
the last thread I've read about it is:
http://issues.apache.org/bugzilla/show_bug.cgi?id=25820
It ends with an open question from Morus:
"If you want me to change the patch, let me know. That no big deal."

Did you change the patch since then?

In other words: what is the latest development on this topic?
Can I simply download the latest compiled development version of lucene.jar,
and will it fix my problem?

The latest builds I could find are these:
http://cvs.apache.org/builds/jakarta-lucene/nightly/2003-09-09/

They seem to be quite old, so please help me out!

Thanx,
Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote:

 Sanyi writes:
  
  This query works as expected:
  validword AND stopword
  (throws out the stopword part and searches for validword)
  
  This query seems to crash:
  stopword AND validword
  (java.lang.ArrayIndexOutOfBoundsException: -1)
  
  Maybe it can't handle the case if it had to remove the very first part of 
  the query?!
  Can anyone else test this for me? How can I overcome this problem?
  
 see bug:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=9110
 
 Morus
 
 
 







Re: stopword AND validword throws exception

2004-11-10 Thread Sanyi
 But the fix seems to be included in 1.4.2.
 see 
 http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4
 item 5

Thank you! I'm just downloading 1.4.2.
I hope it'll work ;)

Sanyi







Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-10 Thread Sanyi
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a limit of 1024 clauses by default, which is good enough for me, but I
still think it behaves strangely.

Example:
I have an index with about 20 million documents.
Let's say that there are about 3000 variants of this word mask in the entire
document set: cab*
Let's say that about 500 documents contain the word: spectrum
Now, when I search for cab* AND spectrum, I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the word
spectrum, then it should collect the variants of cab* within these documents,
which turns out to be two or three variants (cable, cables, maybe some more),
and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws
a TooManyClauses exception instead of saying "No results". Since there is no
nonexistingword in my document set, it doesn't even have to start collecting
the variations of cab*.
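
To illustrate the kind of evaluation I have in mind, here is a toy inverted
index in plain Java. It is not Lucene code and all names are made up. Instead
of turning cab* into one clause per matching term, it merges the document
sets of all matching terms into a single bit set and then intersects that
with the documents containing spectrum, so no clause count is involved.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only (hypothetical names): term -> set of document ids.
class PrefixAsDocSetModel {

    /** ORs together the document sets of every term starting with prefix. */
    static BitSet docsForPrefix(SortedMap<String, BitSet> postings, String prefix) {
        BitSet result = new BitSet();
        for (Map.Entry<String, BitSet> entry : postings.tailMap(prefix).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break; // sorted order: no later key can share the prefix
            }
            result.or(entry.getValue());
        }
        return result;
    }

    /** Intersection of two document sets, leaving both inputs untouched. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Convenience builder for a document set. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        SortedMap<String, BitSet> postings = new TreeMap<>();
        postings.put("cabin", docs(3));
        postings.put("cable", docs(1, 2));
        postings.put("cabs", docs(4));
        postings.put("spectrum", docs(2, 5));

        BitSet cab = docsForPrefix(postings, "cab");
        BitSet spectrum = postings.getOrDefault("spectrum", new BitSet());
        System.out.println("cab* AND spectrum -> " + and(cab, spectrum));

        // A term that is absent from the index simply yields an empty result.
        BitSet missing = postings.getOrDefault("nonexistingword", new BitSet());
        System.out.println("cab* AND nonexistingword -> " + and(cab, missing));
    }
}
```

The cost here still grows with the number of terms matching cab*, but the
query never fails just because that number is large.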

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)


