Re: Writing a stemmer

2004-06-03 Thread Leo Galambos
Erik Hatcher [EMAIL PROTECTED] wrote: How proficient must I be in a language for which I wish to write the stemmer? I would venture to say you would need to be an expert in a language to write a decent stemmer. I'm sorry for a self-promo ;), but the stemmer of the egothor project can

Re: Tool for analyzing analyzers

2004-06-02 Thread Leo Galambos
Zilverline [EMAIL PROTECTED] wrote: get more out of lucene, such as incremental indexing, to name one. On Hello, as far as I know, incremental indexing could be a real bottleneck if you implemented your system without some knowledge of Lucene internals. The respective

Re: thanks for your mail

2004-02-16 Thread Leo Galambos
Could an admin filter out hema's e-mails, please? THX Leo [EMAIL PROTECTED] wrote: Received your mail we will get back to you shortly - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL

Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote: Thus I do not know how it could be O(1). ~ O(1) is what I have observed through experiments with indexing of several million documents. What exactly did you measure? Just the time of the insert operation (incl. merge(), of course)? Was it a test on real

Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote: --- Leo Galambos [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: Thus I do not know how it could be O(1). ~ O(1) is what I have observed through experiments with indexing of several million documents. What exactly did you measure

Re: Lucene with Postgres db

2004-02-01 Thread Leo Galambos
Have you tried a special add-on for pgsql - http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ Lucene is faster than tsearch (I hope so), but tsearch need not be synchronized with the main DB... up to you. Cheers, Leo Ankur Goel wrote: Hi, I have to search the documents which are stored

Re: IndexHTML example on Jakarta Site

2004-01-02 Thread Leo Galambos
Colin McGuigan wrote: It creates an index, but when I search using http://localhost:8000/luceneweb/ The page works but I do not get any replies. Can it read your index? See indexLocation in configuration.jsp 1. How do you specify which directory is to be searched snip I agree with Erik,

Re: What about Spindle

2003-12-03 Thread Leo Galambos
You can try Capek (needs JDK1.4, because it uses NIO). It can crawl whatever you like. API: http://www.egothor.org/api/robot/ Console - demo (*.dundee.ac.uk): http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F Leo Zhou, Oliver wrote: I think it is common task

Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
Really? And what model is used/implemented by Lucene? THX Leo Otis Gospodnetic wrote: Lucene does not implement vector space model. Otis --- [EMAIL PROTECTED] wrote: Hi, does Lucene implement a Vector Space Model? If yes, does anybody have an example of how using it? Cheers, Ralf -- NEU

Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
The model implies the quality, thus it does matter. As for the several important models: are any of them implemented in Lucene? Chong, Herb wrote: does it matter? vector space is only one of several important ones. Herb -Original Message- From: Leo Galambos [mailto:[EMAIL PROTECTED] Sent

Re: Document Clustering

2003-11-11 Thread Leo Galambos
Marcel Stör wrote: Hi As everybody seems to be so excited about it, would someone please be so kind to explain what document based clustering is? Hi they are trying to implement what you can see in the right panel here: http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein They may also

Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote: Erik Hatcher wrote: Yes, you're right. Getting the scores of a second query based on the scores of the first query is probably not trivial, but probably possible with Lucene. And that combined with a QueryFilter would do the trick I suspect. Somehow the scores of the

Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote: I have some extensions to Lucene that I've not yet committed which make it possible to easily define synthetic IndexReaders (not currently supported). So you could do things that way, once I check these in. But is this really better than just ANDing the clauses together?

Re: Lucene features

2003-09-07 Thread Leo Galambos
Erik Hatcher wrote: On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote: And for the second time today QueryFilter. It allows narrowing the documents queried to only the documents from a previous Query. I guess, it would not be an ideal solution - the first query does two

Re: Lucene features

2003-09-05 Thread Leo Galambos
But Drill Down searching is very desirable. It's where you're able to search within the results of a previous search. I'm assuming that I'll have to implement that myself, by keeping a copy of the previous Hits list, and only returning results that are in both lists. And for the second time
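The do-it-yourself approach described in the quoted question (keep the previous hit list and return only results that appear in both lists) can be sketched as below. This is a minimal illustration under stated assumptions, not Lucene code: the class DrillDown, the method within, and the use of plain string document IDs are all hypothetical, and no Lucene API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: narrows a new result list to documents that were
// also returned by a previous search. Document IDs are plain strings here.
class DrillDown {

    // Returns the IDs from currentHits that also occur in previousHits,
    // keeping the ranking order of currentHits.
    static List<String> within(List<String> previousHits, List<String> currentHits) {
        Set<String> allowed = new HashSet<>(previousHits);
        List<String> narrowed = new ArrayList<>();
        for (String id : currentHits) {
            if (allowed.contains(id)) {
                narrowed.add(id);
            }
        }
        return narrowed;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("doc1", "doc2", "doc3");
        List<String> second = Arrays.asList("doc3", "doc9", "doc1");
        System.out.println(within(first, second));
    }
}
```

The sketch runs the second query over the whole collection and filters afterwards, so it saves no search work. It only shows the set intersection.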

Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Leo Galambos
Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course). What strategy do you use in nutch? THX -g- Doug Cutting wrote: As the index grows, disk i/o becomes the bottleneck.

Re: How can I index JSP files?

2003-07-27 Thread Leo Galambos
If I understand the Enigma code correctly, they are saying that you must write a crawler ;-) -g- To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour. Once you have such an

Re: High Capacity (Distributed) Crawler

2003-06-10 Thread Leo Galambos
Otis Gospodnetic wrote: What interface do you need for Lucene? Will you use PUSH (=the robot will modify Lucene's index) or PULL (=the engine will get deltas from the robot) mode? Tell me what you need and I will do my best. I'd imagine one would want to use it in the PUSH mode

Re: High Capacity (Distributed) Crawler

2003-06-09 Thread Leo Galambos
? Where is it hosted? It would be nice to see a few alternative implementations of a robust and scalable java web crawler with the ability to index whatever it fetches. Thanks, Otis --- Leo Galambos [EMAIL PROTECTED] wrote: Hi. I would like to write $SUBJ (HCDC), because LARM does not offer many

Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
I see. Are you looking for this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html On the other hand, if n is not fixed, you still have a problem. From what I have read on this list, it seems that Lucene reads a dictionary (of terms) into memory, and it also allocates

Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
know if it ever left the lab and made it into the mainstream. If I have time I will explore this a bit. Frank Burough -Original Message- From: Leo Galambos [mailto:[EMAIL PROTECTED] Sent: Thursday, June 05, 2003 5:55 PM To: Lucene Users List Subject: Re: String similarity search vs

Re: Where to get stopword lists?

2003-06-06 Thread Leo Galambos
Ulrich Mayring wrote: Hello, does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. What does ``good'' mean? It depends on your corpus IMHO. The best way to get a ``good'' stop-list is an analysis based on idf. Thus, index
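An idf-based analysis of the kind suggested in the reply above could look roughly like this. It is only a sketch under assumptions: the class name StopListByIdf, the maxIdf threshold, and the definition idf = ln(N / df) are choices made for illustration, and the documents are assumed to be already tokenized. Nothing here comes from Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: proposes stop-word candidates from a tokenized corpus
// by picking the terms whose inverse document frequency is very low.
class StopListByIdf {

    // Returns all terms with idf = ln(N / df) at or below maxIdf, where N is
    // the number of documents and df the number of documents containing the term.
    static Set<String> candidates(List<List<String>> docs, double maxIdf) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Set<String> stop = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double idf = Math.log((double) n / e.getValue());
            if (idf <= maxIdf) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"),
                Arrays.asList("the", "bird", "cat"));
        System.out.println(candidates(docs, 0.1));
    }
}
```

The threshold depends on the corpus, which is the point of the reply: a term that is noise in one collection may carry meaning in another.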

Re: Lowercasing wildcards - why?

2003-05-31 Thread Leo Galambos
I'm sorry, I did not read the complete thread. Do you mean - analyzer == stemmer? Does it really work? If I were a stemmer, I would leave searche intact. ;-) -g- [EMAIL PROTECTED] wrote: Hi Les, We ended up modifying the QueryParser to pass prefix and suffix queries through the Analyzer. For

Re: Lowercasing wildcards - why?

2003-05-31 Thread Leo Galambos
Leo Galambos [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED

Re: Search for similar terms

2003-05-31 Thread Leo Galambos
. Thanks, Dario - Original Message - From: Leo Galambos [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, May 30, 2003 4:25 PM Subject: Re: Search for similar terms You need DASG+Lev over the dictionary. The boundary could be the highest idf of the terms. It was solved
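The "Lev" part of the suggestion above is the Levenshtein edit distance. The sketch below shows only that distance plus a naive linear scan over a term dictionary. It does not implement the DASG traversal the reply refers to, and the class SimilarTerms, the method names, and the maxDistance parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds dictionary terms within a given edit distance
// of a query term by scanning the whole dictionary.
class SimilarTerms {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary terms whose distance to the query is at most maxDistance.
    static List<String> similar(String query, List<String> dictionary, int maxDistance) {
        List<String> result = new ArrayList<>();
        for (String term : dictionary) {
            if (distance(query, term) <= maxDistance) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("lucene", "lucid", "index", "lucent");
        System.out.println(similar("lucine", dictionary, 1));
    }
}
```

A linear scan costs one distance computation per dictionary term. Walking an automaton over the dictionary, as the reply proposes, avoids that by pruning whole branches.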

Re: I: incremental index

2003-03-28 Thread Leo Galambos
Adding a new document does not immediately modify an index, so the time it takes to add a new document to an existing index is not proportional to the index size. It is constant. The execution time of optimize() is proportional to the index size, so you want to do that only if you really

Re: Regarding Setup Lucine for my site

2003-03-06 Thread Leo Galambos
1. 2 threads per request may improve speed up to 50% Hmm? Could you clarify? During indexing, multithreading may speed things up (splitting docs to index in 2 or more sets, indexing separately, combining indexing). But... isn't that a good thing? Or are you saying that it'd be good to have

Re: Potential Lucene drawbacks

2003-03-06 Thread Leo Galambos
If I understand you correctly, then maybe you are not aware of RemoteSearchable in Lucene. That class cannot be used in Merger. RemoteSearchable is a class that allows you to pass a query to another node, nothing less and nothing more AFAIK. This is the point that's more clear to me now.

Re: Regarding Setup Lucine for my site

2003-03-05 Thread Leo Galambos
On Tue, 4 Mar 2003, Otis Gospodnetic wrote: Even if you could replace C:\. with http:// it wouldn't be a good solution, as directory structures and file paths do not always map directly to URLs. Yes, but it is not the case of Samuel's configuration and 99.99% of others. The fact is,

Re: Regarding Setup Lucine for my site

2003-03-05 Thread Leo Galambos
org.apache.lucene.demo.IndexHTML which was provided with the documentation. Is there any problem using this demo class for a web production site? I'm an application developer and it would be hard to understand the whole lucene code to use it. It would be almost impossible You can use it, but: if

A thought: netique

2003-02-28 Thread Leo Galambos
Hi, I was away and when I read what I missed, well...ehm... have you read http://sustainability.open.ac.uk/gary/papers/netique.htm? i.e., see Caution when quoting other messages while replying to them. BTW: I would also vote for a strict standard, when Re: prefix must be used in replies. Just

Re: Wildchar based search?? |

2003-02-02 Thread Leo Galambos
On Sat, 1 Feb 2003, Rishabh Bajpai wrote: also, i remember reading somewhere that one had to build the index in some special way, but since you say no; i will take that. i anyways dont remember where I read it, so no point asking about something if I am myself not sure I remember only one

Re: Stop-word in phrase (BUG?)

2003-01-27 Thread Leo Galambos
Hi. In this phrase word 'and' occurs which is a stop-word. they may take AND as a keyword in a query. IMHO your query is taken as boolean query. I hope this helps. -g-

Re: Lucene Benchmarks and Information

2002-12-21 Thread Leo Galambos
On Fri, 20 Dec 2002, Doug Cutting wrote: The max a reader will keep open is: mergeFactor * log_base_mergeFactor(N) * files_per_segment A writer will open: (1 + mergeFactor) * files_per_segment I am not sure if you must open all files (i.e. writer would need just 2*f_p_s if you
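The two bounds quoted from Doug Cutting above are easy to evaluate for concrete numbers. The sketch below is plain arithmetic and calls no Lucene code. The class and method names are hypothetical, N is assumed here to be the number of documents in the index (the excerpt does not say), and the sample values (mergeFactor 10, one million documents, 8 files per segment) are made up for illustration.

```java
// Hypothetical calculator for the open-file bounds quoted in the message:
//   reader: mergeFactor * log_base_mergeFactor(N) * files_per_segment
//   writer: (1 + mergeFactor) * files_per_segment
class OpenFileEstimate {

    // Upper bound on files a reader keeps open; n is assumed to be the document count.
    static double readerMaxOpenFiles(int mergeFactor, long n, int filesPerSegment) {
        double levels = Math.log((double) n) / Math.log((double) mergeFactor);
        return mergeFactor * levels * filesPerSegment;
    }

    // Number of files a writer opens while merging.
    static long writerOpenFiles(int mergeFactor, int filesPerSegment) {
        return (long) (1 + mergeFactor) * filesPerSegment;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;      // sample value, assumed
        long docs = 1_000_000L;    // sample value, assumed
        int filesPerSegment = 8;   // sample value, assumed
        System.out.printf("reader bound: about %.0f files%n",
                readerMaxOpenFiles(mergeFactor, docs, filesPerSegment));
        System.out.println("writer: " + writerOpenFiles(mergeFactor, filesPerSegment) + " files");
    }
}
```

With these sample values the reader bound comes to roughly 10 * 6 * 8 = 480 files and the writer figure to 11 * 8 = 88, which is why the thread goes on to ask whether the writer really needs every one of those files open at once.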

HTML saga continues...

2002-12-12 Thread Leo Galambos
So, I have tried this with Lucene: 1) original JavaCC LL(k) HTML parser 2) SWING's HTML parser In case of (1) I could process about 300K of HTML documents. In case of (2) more than 400K. But I cannot process complete collection (5M) and finish my hard stress tests of Lucene. Is there anyone

Re: SV: Indexing HTML

2002-12-07 Thread Leo Galambos
I'm not sure this is a solution to your problem. However, it seems that the HTMLParser used by the IndexHTML class has problems parsing the document (there is a test class included in the jar): java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar

Re: Lucene Speed under diff JVMs

2002-12-05 Thread Leo Galambos
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote: I'm using the class that Otis wrote (see message from about 3 weeks ago) for testing the scalability of lucene (more results on that later) and I May I ask you where one can get the source code? I cannot find it in archive. Thank you -g- --

Performance (figures)

2002-11-30 Thread Leo Galambos
.ms.mff.cuni.cz/draw.png Absolute values If someone is able to say how often I would call optimize(), I can recalculate the results. Now the 2nd round of tests is running (without optimize()). -g- BTW: All figures, (C) 2002 Leo Galambos. Do not copy until I am sure that the testsvalues

optimize()

2002-11-26 Thread Leo Galambos
How does it affect overall performance, when I do not call optimize()? THX -g-

Re: optimize()

2002-11-26 Thread Leo Galambos
2002, Otis Gospodnetic wrote: This was just mentioned a few days ago. Check the archives. Not needed for indexing, good to do after you are done indexing, as the index reader needs to open and search through fewer files. Otis --- Leo Galambos [EMAIL PROTECTED] wrote: How does it affect