Re: Writing a stemmer
Erik Hatcher [EMAIL PROTECTED] wrote:
>> How proficient must I be in a language for which I wish to write the
>> stemmer?
> I would venture to say you would need to be an expert in a language to
> write a decent stemmer.

I'm sorry for a self-promo ;), but the stemmer of the egothor project can
be adapted to any language, and you needn't be a language expert. Moreover,
the stemmer achieves a better F-measure than Porter's stemmers.

Cheers,
Leo
Re: Tool for analyzing analyzers
Zilverline [EMAIL PROTECTED] wrote:
> get more out of lucene, such as incremental indexing, to name one.

Hello,

as far as I know, incremental indexing can be a real bottleneck if you
implement your system without some knowledge of Lucene internals. The
respective test is here:

http://www.egothor.org/twiki/bin/view/Know/LuceneIssue

Cheers,
Leo
Re: thanks for your mail
Could an admin filter out hema's e-mails, please? THX

Leo

[EMAIL PROTECTED] wrote:
> Received your mail we will get back to you shortly
Re: Index advice...
Otis Gospodnetic wrote:
>> Thus I do not know how it could be O(1).
> ~O(1) is what I have observed through experiments with indexing of
> several million documents.

What exactly did you measure? Just the time of the insert operation (incl.
merge(), of course)? Was it a test on real documents? THX

Leo
Re: Index advice...
Otis Gospodnetic wrote:
> --- Leo Galambos [EMAIL PROTECTED] wrote:
>> What exactly did you measure? Just the time of the insert operation
>> (incl. merge(), of course)? Was it a test on real documents?
> I didn't really measure anything, I only observed this, as my focus was
> something else, not performance measurements. It is true that every time
> an insert/add triggers a merge operation, things will slow down, but from
> what I recall (and this was about 1 year ago), the overall performance
> was steady as the index grew.

Try the same test with mergeFactor=2, and you will see the difference.

Leo
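Something like the following would show the effect (a minimal sketch
against the Lucene 1.x API of the time, where mergeFactor was a public
field on IndexWriter; the index path and the synthetic documents are made
up for the test):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MergeFactorTest {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/test-index",
                    new StandardAnalyzer(), true);
            // mergeFactor=2 merges aggressively, so the merge cost is paid
            // on almost every addDocument() instead of being deferred
            writer.mergeFactor = 2;
            long start = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
                Document doc = new Document();
                doc.add(Field.Text("body", "document number " + i));
                writer.addDocument(doc);
            }
            writer.close();
            System.out.println("elapsed: "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

With the default mergeFactor=10 the same loop runs noticeably faster,
because merges are triggered far less often.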
Re: Lucene with Postgres db
Have you tried a special add-on for pgsql?

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Lucene is faster than tsearch (I hope so), but tsearch need not be
synchronized with the main DB... up to you.

Cheers,
Leo

Ankur Goel wrote:
> Hi, I have to search the documents which are stored in a postgres db.
> Can someone give a clue how to go about it?
>
> Thanks,
> Ankur Goel
> Brickred Technologies
> http://www.brickred.com
Re: IndexHTML example on Jakarta Site
Colin McGuigan wrote:
> It creates an index, but when I search using
> http://localhost:8000/luceneweb/ the page works but I do not get any
> replies.

Can it read your index? See indexLocation in configuration.jsp.

> 1. How do you specify which directory is to be searched
<snip>

I agree with Erik that you would rather use an application which is ready
for use in a minute. IMHO Lucene is a library/API, and unless you are a
Java developer, it does not fit your needs. Some applications are listed
here:

http://dmoz.org/Computers/Programming/Languages/Java/Server-Side/Search_Engines/

Omit the Lucene link, or else you will be in an endless loop... ;-)

If you must use Lucene, try to find something for you here:
http://jakarta.apache.org/lucene/docs/powered.html

You may be interested in i2a, but their demo (@24.9.177.111) is dead right
now.

Cheers,
Leo
Re: What about Spindle
You can try Capek (needs JDK 1.4, because it uses NIO). It can crawl
whatever you like.

API: http://www.egothor.org/api/robot/
Console demo (*.dundee.ac.uk):
http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F

Leo

Zhou, Oliver wrote:
> I think it is a common task to index a JSP-based web site. A lot of
> people ask how to do so on this mailing list. However, Lucene does not
> have a ready-to-use web crawler. My question is: has anybody used Spindle
> to index a JSP-based web site, or is there any other tool out there?
>
> Thanks,
> Oliver
>
> -----Original Message-----
> From: Otis Gospodnetic
> Sent: Wednesday, December 03, 2003 11:25 AM
> To: Lucene Users List
> Subject: Re: What about Spindle
>
> You should ask the Spindle author(s). The error doesn't look like
> something that is related to Lucene, really.
>
> Otis
>
> --- Zhou, Oliver [EMAIL PROTECTED] wrote:
>> What about Spindle? Has anybody used it to crawl a jsp-based web site?
>> Do I need to install listlib.jar to do so? I got the error message
>> "Jsp Translate: Unable to find setter method for attribute: class" when
>> I tried to run listlib-example.jsp in wsad.
>>
>> Thanks,
>> Oliver
Re: Vector Space Model in Lucene?
Really? And what model is used/implemented by Lucene? THX

Leo

Otis Gospodnetic wrote:
> Lucene does not implement the vector space model.
>
> Otis
>
> --- [EMAIL PROTECTED] wrote:
>> Hi, does Lucene implement a Vector Space Model? If yes, does anybody
>> have an example of how to use it?
>>
>> Cheers, Ralf
Re: Vector Space Model in Lucene?
Chong, Herb wrote:
> does it matter? vector space is only one of several important ones.

The model implies the quality, thus it does matter.

Ad "several important models": are any of them implemented in Lucene?

Leo

> -----Original Message-----
> From: Leo Galambos
> Sent: Friday, November 14, 2003 4:00 AM
> To: Lucene Users List
> Subject: Re: Vector Space Model in Lucene?
>
> Really? And what model is used/implemented by Lucene? THX Leo
Re: Document Clustering
Marcel Stör wrote:
> Hi. As everybody seems to be so excited about it, would someone please
> be so kind to explain what document-based clustering is?

Hi,

they are trying to implement what you can see in the right panel here:

http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein

They may also analyze identical pages (hits #9 and #10) - this could also
be taken as clustering AFAIK. For instance, Doug wrote some papers about
clustering (if I remember it correctly) - see his bibliography.

Leo
Re: Lucene features
Doug Cutting wrote:
> Erik Hatcher wrote:
>> Yes, you're right. Getting the scores of a second query based on the
>> scores of the first query is probably not trivial, but probably
>> possible with Lucene. And that combined with a QueryFilter would do the
>> trick I suspect. Somehow the scores of the first query could be
>> remembered and used as a boost (or other type of factor) for the scores
>> of the second query.
> Why not just AND together the first and second query? That way they're
> both incorporated in the ranking. Filters are good when you don't want
> it to affect the ranking, and also when the first query is a criterion
> that you'll reuse for many queries (e.g., language=french), since the
> bit vectors can be cached (as by QueryFilter).

You probably missed the start of our discussion - we are talking about
this: "q1 -> q2", which means "NOT q1 OR q2", versus "q2 -> q1", which
means "q1 OR NOT q2". That causes the issue, and it also shows why you
cannot use the simple AND, because q1 AND q2 != NOT q1 OR q2 != q1 OR NOT
q2.

Leo

BTW: I haven't seen the logic formulas for many years, so this is without
any guarantee ;-)
Re: Lucene features
Doug Cutting wrote:
> I have some extensions to Lucene that I've not yet committed which make
> it possible to easily define synthetic IndexReaders (not currently
> supported). So you could do things that way, once I check these in. But
> is this really better than just ANDing the clauses together? It would
> take some big experiments to know, but my guess is that it doesn't make
> much difference to compute a local IDF for such things.

In this case, I think that the operator would be evaluated as an
implication and not as AND (which, in the extended Boolean model, is
1 - (((1-q1)^p + (1-q2)^p)/2)^(1/p)). Obviously, you have to use a filter
to filter out false hits (in the case of q1 -> q2, the formula is true
when q1 is false, so it is not what you really need), but that is not an
issue with the auxiliary index. On the other hand, it is a feeling and it
needs a test, you are right.

Leo
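For reference, this is my restatement of the p-norm (extended Boolean)
operators behind the formula above, with per-document weights q1, q2 in
[0,1]; the implication rewrite is the standard NOT x = 1 - x substitution:

    \mathrm{OR}_p(q_1, q_2)  = \left( \frac{q_1^p + q_2^p}{2} \right)^{1/p}

    \mathrm{AND}_p(q_1, q_2) = 1 - \left( \frac{(1-q_1)^p + (1-q_2)^p}{2} \right)^{1/p}

    q_1 \rightarrow q_2 \;\equiv\; \mathrm{OR}_p(1 - q_1,\; q_2)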
Re: Lucene features
Erik Hatcher wrote:
> On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>>> And for the second time today QueryFilter. It allows narrowing the
>>> documents queried to only the documents from a previous Query.
>> I guess it would not be an ideal solution - the first query does two
>> things: a) it selects a subset from the corpus; b) it assigns a
>> relevance to each document of this subset. Your solution omits the
>> second point. It implies the solution will not return good hit lists,
>> because you will not consider the information value of the first query
>> which was given to you by a user.
> Yes, you're right. Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably possible
> with Lucene. And that combined with a QueryFilter would do the trick I
> suspect. Somehow the scores of the first query could be remembered and
> used as a boost (or other type of factor) for the scores of the second
> query.

Well, I do not want to be a pessimist, but the boost vector is not a good
solution due to CWI statistics. On the other hand, it is much better than
the simple QueryFilter which, in fact, works as a 0/1 boost.

Example. I use this notation: inverted_list_term:{list of W values, "-"
denotes W=0, for 12 documents in a collection}

A:{23[16]--27}
B:{--[38]}
C:{18[2-]45239812}

If your first query is B, the subset of documents (denoted by brackets -
namely, the 3rd and 4th doc) is selected, and if your second query is
"A C", then you cannot use global IDFs, because in the subset the IDF
factors are different. Globally, A is the better discriminator, but in the
subset, C is better. This fact is then reflected in the hit list you
generate, and I guess the quality will also be affected by this.

The example shows that you would rather export the subset to an auxiliary
index (RAMDirectory?) and then use this structure instead of the original
index. Obviously, it will solve the issue of speed you mentioned.
Unfortunately, I am not sure if you can export the inverted lists while
you read them. In egothor, I would use a listener in the Rider class; in
Lucene, I would have to rewrite some classes and it could be a real
problem. Maybe there is a solution I do not see... Your turn ;-)

Cheers,
Leo

> Am I off base here?
>> Thus I think Chris would implement something more complex than
>> QueryFilter. If not, the results will be poorer than with the
>> commercial packages he may get. He could use a different model where
>> AND is not an associative operator (i.e. some modification of the
>> extended Boolean model). It implies he would implement it in
>> Similarity.java (if I remember that class name correctly).
> Right... but you'd still need the filtering capability as well, I would
> think - at least for performance reasons.
>
> Erik
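A sketch of the export idea with the stock Lucene 1.x API. My assumption:
every searched field is stored, because Hits only hands back stored fields
- true inverted-list export, as discussed above, would need new code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    // Build an auxiliary in-memory index from the subset matched by q1,
    // then run q2 against it so IDF is computed locally, over the subset.
    public class SubsetSearch {
        public static Hits searchSubset(IndexSearcher searcher,
                                        Query q1, Query q2) throws Exception {
            Hits subset = searcher.search(q1);
            RAMDirectory ram = new RAMDirectory();
            IndexWriter writer = new IndexWriter(ram,
                    new StandardAnalyzer(), true);
            for (int i = 0; i < subset.length(); i++) {
                // re-adds the stored fields only; tokenized fields are
                // re-analyzed, so this works if all fields were stored
                writer.addDocument(subset.doc(i));
            }
            writer.close();
            IndexSearcher aux = new IndexSearcher(ram);
            return aux.search(q2); // local IDFs come from the subset only
        }
    }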
Re: Lucene features
> But Drill Down searching is very desirable. It's where you're able to
> search within the results of a previous search. I'm assuming that I'll
> have to implement that myself, by keeping a copy of the previous Hits
> list, and only returning results that are in both lists.

>> And for the second time today QueryFilter. It allows narrowing the
>> documents queried to only the documents from a previous Query.

I guess it would not be an ideal solution - the first query does two
things: a) it selects a subset from the corpus; b) it assigns a relevance
to each document of this subset. Your solution omits the second point. It
implies the solution will not return good hit lists, because you will not
consider the information value of the first query which was given to you
by a user.

For instance, "neologism" then "George Bush" (1st & 2nd query) would
return a different order of hits than "George Bush" then "neologism".
Other examples: "Prague Berlin" then "flight" (I must go there, and I
prefer an airplane) versus "flight" then "Prague Berlin" (I must fly, and
I prefer Berlin).

Thus I think Chris would implement something more complex than
QueryFilter. If not, the results will be poorer than with the commercial
packages he may get. He could use a different model where AND is not an
associative operator (i.e. some modification of the extended Boolean
model). It implies he would implement it in Similarity.java (if I remember
that class name correctly).

Leo
Re: Fastest batch indexing with 1.3-rc1
Isn't it better for Dan to skip the optimization phase before merging? I
am not sure, but he could save some time on this (if he has enough file
handles for that, of course). What strategy do you use in nutch? THX

-g-

Doug Cutting wrote:
> As the index grows, disk i/o becomes the bottleneck. The default
> indexing parameters do a pretty good job of optimizing this. But if you
> have lots of CPUs and lots of disks, you might try building several
> indexes in parallel, each containing a subset of the documents, optimize
> each index and finally merge them all into a single index at the end.
> But you need lots of i/o capacity for this to pay off.
>
> Doug
>
> Dan Quaroni wrote:
>> Looks like I spoke too soon... As the index gets larger, time to merge
>> becomes prohibitively high. It appears to increase linearly. Oh well. I
>> guess I'll just have to go with about 3ms/doc.
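The merge step Doug describes can be as small as this (a sketch with the
Lucene 1.x API; the first argument is the target index, the rest are the
partial indexes built in parallel):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Merge several independently built partial indexes into one.
    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            Directory[] parts = new Directory[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                parts[i - 1] = FSDirectory.getDirectory(args[i], false);
            }
            IndexWriter writer = new IndexWriter(args[0],
                    new StandardAnalyzer(), true);
            writer.addIndexes(parts); // optimizes the result as a side effect
            writer.close();
        }
    }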
Re: How can I index JSP files?
If I understand the Enigma code well, they say that you must write a
crawler ;-)

-g-

> To index the content of JSPs that a user would see using a Web browser,
> you would need to write an application that acts as a Web client, in
> order to mimic the Web browser behaviour. Once you have such an
> application, you should be able to point it to the desired JSP, retrieve
> the contents that the JSP generates, parse it, and feed it to Lucene.
>
>> I am a newbie to lucene and I would like to add searching capability to
>> my website, which is written entirely with JSP and servlets. Does
>> anyone have any experience parsing JSP files in order to create an
>> index for/by Lucene? I would greatly appreciate any help with the
>> matter.
>>
>> Thanks,
>> Russ
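A minimal sketch of such a Web-client indexer (the URL and index path are
hypothetical, and a real crawler would strip the HTML tags before
indexing):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Fetch the HTML that a JSP renders and index it like any other page.
    public class JspFetcher {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8080/mysite/page.jsp");
            StringBuffer html = new StringBuffer();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            for (String line; (line = in.readLine()) != null; ) {
                html.append(line).append('\n');
            }
            in.close();

            IndexWriter writer = new IndexWriter("/tmp/index",
                    new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.UnIndexed("url", url.toString()));
            // a real indexer would run an HTML parser over this first
            doc.add(Field.Text("contents", html.toString()));
            writer.addDocument(doc);
            writer.close();
        }
    }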
Re: High Capacity (Distributed) Crawler
Otis Gospodnetic wrote:
>> What interface do you need for Lucene? Will you use PUSH (=the robot
>> will modify Lucene's index) or PULL (=the engine will get deltas from
>> the robot) mode? Tell me what you need and I will try to do all my
>> best.
> I'd imagine one would want to use it in the PUSH mode (e.g. the crawler
> fetches a web page and adds it to the searchable index). How does PULL
> mode work? I've never heard of web crawlers being used in the PULL mode.
> What exactly does that mean, could you please describe it?

It is a long story, so I will assume that everything runs on a single box
- it is the simplest case. [x] will denote points where Lucene may have
problems with a fast implementation, I guess.

Crawler: the crawler stores meta and body of all documents. If you want to
retrieve the document meta or body (knowing its URI), it costs O(1) (2
seeks and 2 read requests in auxiliary data structures). After this
retrieval you also get a direct handle to meta and body - then the price
of retrieval stays O(1), but with no extra seeks in any structures. The
handle is persistent and is related to the URI. The meta and body are
updated as soon as the crawler fetches a new, fresh copy.

Engine: the engine stores the handle for each document. Moreover, it knows
the last (highest) handle which is stored in the main index. So the trick
is this:

1) Build up an auxiliary index from new documents. The new documents are
documents which have their handle greater than the last handle known to
the engine, thus you can iterate over them easily - this process can run
in a separate thread (see the sketch after this description).

2) Consult the changes. You read the meta which are stored in the index
and test if they are obsolete (note: you have already got the handle, so
it smokes). If so, you mark the respective document as deleted, and its
new version (if any) is appended to another index - the index of changes.
The insertion to the index runs in a separate thread, so the main thread
is not blocked. BTW: [x] the documents which are not modified may still
change their ranks (depthrank, pagerank, frequencyrank etc.) in this
round.

[x] The two auxiliary indices are then merged with the main index.

Obviously, the weak point is the test whether anything has changed. This
can be easily solved with the index dynamization I use. Unlike Lucene, I
order barrels (segments in your terminology) by their size. I do not want
to describe all the details - I hate long e-mails ;-) - but the
dynamization guarantees that:

a) the query time is never worse than 8x compared with a fully-optimized
index (if you buy 8x faster HW, you overcome this easily);

b) the documents which are often modified are stored in small barrels of
the main index. It means that their actualization is fast. So I process
only the small barrels once a day, and the larger ones less often.

If we say that 5M docs are updated daily, PULL mode can handle this load
in a few minutes. Unfortunately, the slowest point is the HTML parser,
which may run a few hours :-(. If you want to actualize another 10^10 crap
pages once a month, it can be done too, but that is outside my first
assumption above ;-).

-g-
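To make step 1 concrete, here is a toy sketch in Java. Beware: the
Crawler/Doc/AuxIndex types and the numeric handle scheme are invented
stand-ins for this e-mail, not a real egothor or Lucene API:

    // hypothetical PULL-mode contract between crawler and engine
    interface Crawler {
        long lastHandle();       // highest handle the crawler has assigned
        Doc fetch(long handle);  // O(1) retrieval of meta+body by handle
    }

    class Doc {
        long handle;
        String uri, meta, body;
    }

    interface AuxIndex { void add(Doc d); }

    class PullIndexer {
        long lastIndexed; // highest handle already in the main index

        // step 1: new documents are exactly those with handle > lastIndexed
        void pullNewDocuments(Crawler crawler, AuxIndex aux) {
            long top = crawler.lastHandle();
            for (long h = lastIndexed + 1; h <= top; h++) {
                aux.add(crawler.fetch(h)); // in practice: a separate thread
            }
            lastIndexed = top;
        }
    }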
Re: High Capacity (Distributed) Crawler
Hi Otis.

The first beta is done (without NIO). It needs, however, further testing.
Unfortunately, I could not find enough servers which I may hit. I wanted
to commit the robot as a part of egothor (it will use it in PULL mode),
but we have nice weather here, so I lost any motivation to play with the
PC ;-).

What interface do you need for Lucene? Will you use PUSH (=the robot will
modify Lucene's index) or PULL (=the engine will get deltas from the
robot) mode? Tell me what you need and I will try to do all my best.

-g-

Otis Gospodnetic wrote:
> Leo,
>
> Have you started this project? Where is it hosted? It would be nice to
> see a few alternative implementations of a robust and scalable java web
> crawler with the ability to index whatever it fetches.
>
> Thanks,
> Otis
>
> --- Leo Galambos [EMAIL PROTECTED] wrote:
>> Hi. I would like to write $SUBJ (HCDC), because LARM does not offer
>> many options which are required by web/http crawling IMHO. Here is my
>> list:
>> 1. I would like to manage the decision what will be gathered first -
>>    this would be based on pageRank, number of errors, connection speed,
>>    etc.
>> 2. pure JAVA solution without any DBMS/JDBC
>> 3. better configuration in case of an error
>> 4. NIO style as it is suggested by the LARM specification
>> 5. egothor's filters for automatic processing of various data formats
>> 6. management of Expires HTTP-meta headers, heuristic rules which
>>    describe how fast a page can expire (.php often expires faster than
>>    .html)
>> 7. reindexing without any data exports from a full-text index
>> 8. open protocol between the crawler and a full-text engine
>> If anyone wants to join (or just extend the wish list), let me know,
>> please.
>> -g-
Re: String similarity search vs. typcial IR application...
I see. Are you looking for this:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

On the other hand, if n is not fixed, you still have a problem. As far as
I read this list, it seems that Lucene reads a dictionary (of terms) into
memory, and it also allocates one file handle for each of the acting
terms. It implies you would not break the terms up into n-grams and, as a
result, you would use a slow look-up over the dictionary. I do not know if
I express it correctly, but my personal feeling is that you would rather
write your application from scratch.

BTW: If you have nice terms, you could find all their n-gram occurrences
in the dictionary and compute a boost factor for each of the inverted
lists. I.e., "bbc" is a term in a query; for the i-list of "abba", the
factor is 1 (bigram "bb" is there); for the i-list of "bbb", the factor is
2 ("bb" 2x). Then you use the Similarity class, and it is solved.
Nevertheless, if the n-grams are not nice and the query is long, you will
lose a lot of time in the dictionary look-up phase.

-g-

PS: I'm sorry for my English, just learning...

Jim Hargrave wrote:
> Probably shouldn't have added that last bit. Our app isn't a DNA
> searcher. But DASG+Lev does look interesting. Our app is a linguistic
> application. We want to search for sentences which have many n-grams in
> common and rank them based on the score below. Similar to the TELLTALE
> system (do a google search for TELLTALE + ngrams) - but we are not
> interested in IR per se - we want to compute a score based on pure
> string similarity. Sentences are docs, n-grams are terms.
>
> Jim
>
> [EMAIL PROTECTED] 06/05/03 03:55PM wrote:
>> AFAIK Lucene is not able to look DNA strings up effectively. You would
>> use DASG+Lev (see my previous post - 05/30/2003 1916 CEST).
>> -g-
>
> Jim Hargrave wrote:
>> Our application is a string similarity searcher where the query is an
>> input string and we want to find all fuzzy variants of the input string
>> in the DB. The score is basically Dice's coefficient: 2C/(Q+D), where C
>> is the number of terms (n-grams) in common, Q is the number of unique
>> query terms and D is the number of unique document terms. Our documents
>> will be sentences. I know Lucene has a fuzzy search capability - but I
>> assume this would be very slow since it must search through the entire
>> term list to find candidates. In order to do the calculation I will
>> need to have 'C' - the number of terms in common between query and
>> document. Is there an API that I can call to get this info? Any hints
>> on what it will take to modify Lucene to handle these kinds of queries?
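For completeness, Dice's coefficient over character n-grams is easy to
compute directly (a self-contained sketch; C is the number of shared
n-grams, Q and D the unique counts, as Jim defined them):

    import java.util.HashSet;
    import java.util.Set;

    // Dice's coefficient over character n-grams: 2C / (Q + D)
    public class DiceNgram {
        static Set ngrams(String s, int n) {
            Set grams = new HashSet();
            for (int i = 0; i + n <= s.length(); i++) {
                grams.add(s.substring(i, i + n));
            }
            return grams;
        }

        public static double dice(String query, String doc, int n) {
            Set q = ngrams(query, n);
            Set d = ngrams(doc, n);
            int qSize = q.size(), dSize = d.size();
            q.retainAll(d); // q now holds the common n-grams, i.e. C
            return 2.0 * q.size() / (qSize + dSize);
        }

        public static void main(String[] args) {
            // "notebook" vs "notebok": 6 shared bigrams of 7 and 6 -> ~0.92
            System.out.println(dice("notebook", "notebok", 2));
        }
    }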
Re: String similarity search vs. typcial IR application...
Exact matches are not ideal for DNA applications, I guess. I am not a DNA
expert, but those guys often need a feature that is termed "fuzzy"[*] in
Lucene. They need Levenshtein's and Hamming's metrics, and I think that
Lucene has many drawbacks which disallow effective implementations.

On the other hand, I am very interested in the method you mentioned. Could
you give me a reference, please? Thank you.

-g-

[*] Why do you use the label "fuzzy"? It has nothing to do with fuzzy
logic or fuzzy IR, I guess.

Frank Burough wrote:
> I have seen some interesting work done on storing DNA sequence as a set
> of common patterns with unique sequence between them. If one uses an
> analyzer to break sequence into its set of patterns and unique sequence
> then Lucene could be used to search for exact pattern matches. I know of
> only one sequence search tool that was based on this approach. I don't
> know if it ever left the lab and made it into the mainstream. If I have
> time I will explore this a bit.
>
> Frank Burough
>
> -----Original Message-----
> From: Leo Galambos
> Sent: Thursday, June 05, 2003 5:55 PM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
>
> AFAIK Lucene is not able to look DNA strings up effectively. You would
> use DASG+Lev (see my previous post - 05/30/2003 1916 CEST).
> -g-
>
> Jim Hargrave wrote:
>> Our application is a string similarity searcher where the query is an
>> input string and we want to find all fuzzy variants of the input string
>> in the DB.
<snip - same message as quoted above>
Re: Where to get stopword lists?
Ulrich Mayring wrote:
> Hello, does anyone know of good stopword lists for use with Lucene? I'm
> interested in English and German lists.

What does "good" mean? It depends on your corpus IMHO. The best way one
can get a "good" stop-list is an analysis based on idf. Thus: index your
documents, list all the terms with low idf, save them in a file, and use
them in the next indexing round. Just a thought...

-g-
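With Lucene 1.x this analysis is a short program (a sketch; the idf cutoff
of 1.0 is an arbitrary assumption - tune it for your corpus):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    // List candidate stopwords: terms whose idf = log(N/df) falls below a
    // cutoff, i.e. terms occurring in a large fraction of the documents.
    public class StopwordCandidates {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(args[0]);
            int numDocs = reader.numDocs();
            TermEnum terms = reader.terms();
            while (terms.next()) {
                int df = terms.docFreq();
                double idf = Math.log((double) numDocs / df);
                if (idf < 1.0) {
                    System.out.println(terms.term().text() + "\t" + df);
                }
            }
            terms.close();
            reader.close();
        }
    }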
Re: Lowercasing wildcards - why?
I'm sorry, I did not read the complete thread. Do you mean - analyzer ==
stemmer? Does it really work? If I were a stemmer, I would leave
"searche" intact. ;-)

-g-

[EMAIL PROTECTED] wrote:
> Hi Les,
>
> We ended up modifying the QueryParser to pass prefix and suffix queries
> through the Analyzer. For us, it was about stemming. If you decide to
> use an analyzer that incorporates stemming, there are cases where
> wildcard queries will not return the expected results. Example:
> "searcher" will probably get stemmed to "search". A search on "searche*"
> should hit the term "searcher", but it won't, all instances of
> "searcher" having been stemmed to "search" at index time. Our solution
> was to remove the trailing wildcard and send "searche" to the analyzer,
> then tack the wildcard character back on there and create the
> PrefixQuery object with the new search string "search*".
>
> DaveB
>
> Leslie Hughes wrote:
>> Hi, I was just wondering what the rationale is behind lowercasing
>> wildcard queries produced by QueryParser? It's just that my data is all
>> upper case and my analyser doesn't lowercase, so it seems a bit odd
>> that I have to call setLowercaseWildcardTerms(false). Couldn't
>> QueryParser leave the terms unnormalised or, better still, pass them
>> through the analyser? I'm sure there's a good reason for it though.
>>
>> Les
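DaveB's trick would look roughly like this (a sketch assuming a Lucene
1.x-era Analyzer/TokenStream API, a single-token prefix, and a query that
ends with one trailing '*'):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;

    // Strip the trailing '*', run the prefix through the (stemming)
    // analyzer, then rebuild the PrefixQuery from the stemmed form.
    public class StemmedPrefix {
        public static PrefixQuery prefixQuery(String field, String text,
                Analyzer analyzer) throws Exception {
            String prefix = text.substring(0, text.length() - 1);
            TokenStream ts = analyzer.tokenStream(field,
                    new StringReader(prefix));
            Token token = ts.next(); // e.g. "searche" -> "search"
            ts.close();
            String stemmed = (token != null) ? token.termText() : prefix;
            return new PrefixQuery(new Term(field, stemmed));
        }
    }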
Re: Lowercasing wildcards - why?
Ah, I got it. THX.

In the good old days, the wildcards were used as a fix for a missing
stemming module. I am not sure if you can combine these two opposite
approaches successfully. I see the following drawbacks in your solution.

Example: "built*" (-> built...) could be changed to "build*" (no "built",
but -> builder, building, etc.), and precision will go down drastically.

You probably use a stemmer with one important bug (a.k.a. feature) -
overstemming - so here is another example: "political*" (-> political,
politically) is transformed to "polic*" (-> policer, policy, policies,
policement etc.) by the Porter algorithm, and the precision is again
affected drastically.

-g-

[EMAIL PROTECTED] wrote:
> Your analyzers can optionally incorporate stemming, along with the other
> things that analyzers do (lowercasing, etc...). The stemming algorithms
> are all different. This "searcher" example was made up, but there are
> instances where stemming at index time and not stemming wildcard
> searches will result in lost hits. Specifically, we encountered this
> situation using the optional Snowball analyzers (which work great, by
> the way).
>
> DaveB
>
<snip - rest of the thread quoted in full above>
Re: Search for similar terms
http://cs.felk.cvut.cz/psc/members.html
http://cs.felk.cvut.cz/psc/event/1998/p13.html

or contact prof. Melichar for more details:
http://webis.felk.cvut.cz/people/melichar.html

-g-

Dario Dentale wrote:
> Hi, can you suggest a link to an overview document of this method? I
> couldn't find one.
>
> Thanks,
> Dario
>
> ----- Original Message -----
> From: Leo Galambos
> To: Lucene Users List
> Sent: Friday, May 30, 2003 4:25 PM
> Subject: Re: Search for similar terms
>
> You need DASG+Lev over the dictionary. The boundary could be the highest
> idf of the terms. It was solved by prof. Melichar; you can find the
> construction of the automaton in his papers.
> -g-
>
> Dario Dentale wrote:
>> Hi, does anybody know the best way to implement in Lucene a
>> functionality (that Google has) like this:
>> Search text -> notebok
>> Answer -> Did you mean: notebook?
>> Thanks,
>> Dario
Re: I: incremental index
> Adding a new document does not immediately modify an index, so the time
> it takes to add a new document to an existing index is not proportional
> to the index size. It is constant. The execution time of optimize() is
> proportional to the index size, so you want to do that only if you
> really need it. The Lucene article on http://www.onjava.com/ from March
> 5th describes this in more detail.

Otis, I am not sure if anything about constants is constant in
non-constant IR systems :-)

I think that the correct answer is O(t/k * (1 + log_m(k))), where t is the
time you need to create and write one monolithic segment of k documents, m
is the merge factor you use, and k is the number of documents which are
already in the index. As you can see, the function grows with k.

Can you explain to me why the addition of one document takes constant
time? Thank you

-g-
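Spelled out (my reading of the formula above): a document is written once
when it first enters a segment and then rewritten in each of the
log_m(k) merge levels it passes through, and each (re)write costs t/k per
document, hence

    T_{\mathrm{insert}}(k) \in O\!\left( \frac{t}{k} \bigl( 1 + \log_m k \bigr) \right)

which is amortized per-insert cost, not a constant in k.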
Re: Regarding Setup Lucine for my site
>> 1. 2 threads per request may improve speed up to 50%
> Hmm? Could you clarify? During indexing, multithreading may speed things
> up (splitting docs to index in 2 or more sets, indexing separately,
> combining indexes). But... isn't that a good thing? Or are you saying
> that it'd be good to have multi-threaded search functionality for a
> single search? (in my experience searching is seldom the slow part)

You may improve indexing and searching. Indexing, because the merge
operation will lock just one thread and a smaller part of an index while
other threads are still working; searching, because you can distribute the
query to more barrels. In both cases you save up to 50% of time (I assume
mergeFactor=2).

>> 2. Merger is hard coded
> In a way that is bad because... ? (i.e. what is the specific problem...
> I assume you mean index merging functionality?)

Because you cannot process local and/or remote barrels -- all must be
local in Lucene's object model. That is a serious bug IMHO.

>> 4. you cannot implement dissemination + wrappers for internet servers
>> which would serve as static barrels.
> Could you explain this bit more thoroughly (or give pointers to a longer
> explanation)?

Read more about dissemination, metasearch engines (i.e. SavvySearch), and
dDIRs (i.e. Harvest). BTW, let's go to a pub and we can talk till morning
:) (it is a serious offer, because I would like to know more about IR).

This example is about metasearch (the simplest case of dDIRs): can Lucene
allow that a barrel (index segment?) is static and a query is solved via a
wrapper that sends the query ${QUERY} to
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=${QUERY}
and then reads the HTML output as a result?

>> 5. Document metadata cannot be stored as a programmer wants, he must
>> translate the object to a set of fields
> Yes? I'd think that the possibility of doing separate fields is a good
> thing; after all, all a plain text search engine needs to provide (to be
> considered one) is indexing of plain text data, right?

I talked about metadata. When a metadata object knows how to achieve its
persistence, why would one translate anything to fields and then back? Why
would you touch the user's metadata at all? You need flat fields for
indexing, and what's around them is not your problem :). Lucene is
something between a CMS and a CIS; you say that it's closer to CIS, but
when you need metadata in fields, you are closer to CMS IMHO.

>> 6. Lucene cannot implement your own dynamization
> (sorry, I must sound real thick here). Could you elaborate on this...
> what do you mean by dynamization?

Read more about dynamization of decomposable searching problems.

-g-
Re: Potential Lucene drawbacks
> If I understand you correctly, then maybe you are not aware of
> RemoteSearchable in Lucene.

That class cannot be used in the Merger. RemoteSearchable is a class that
allows you to pass a query to another node, nothing less and nothing more
AFAIK.

> This is the point that's more clear to me now. There is confusion about
> what Lucene is and what it is not. Lucene does not even try to be what
> those services you mentioned are. Their goals are different, they are a
> different set of tools. Lucene's focus is on indexing text and searching
> it. It is not a tool to query other existing search

I do not think so. It is all about the object model you use. If you are
not able to solve the simplest case, how can you distribute the engine
across the network? I do not mean the simple RMI gateways which marshall
parameters and send them through a network pipe, I mean a true system that
could beat Google (and that is another topic...). Moreover, I think that
Lucene can do much more than you think, Otis :). Egothor can do that, so
why not Lucene?

-g-
Re: Regarding Setup Lucine for my site
On Tue, 4 Mar 2003, Otis Gospodnetic wrote:
> Even if you could replace C:\. with http:// it wouldn't be a good
> solution, as directory structures and file paths do not always map
> directly to URLs.

Yes, but that is not the case with Samuel's configuration, nor with 99.99%
of others. The fact is that Lucene is only a library, plus sandbox
utilities of varying quality. :-)

-g-
Re: Regarding Setup Lucine for my site
> org.apache.lucene.demo.IndexHTML which was provided with the
> documentation. Is there any problem using this demo class for a web
> production site? I'm an application developer and it would be hard to
> understand the whole Lucene code to use it. It would be almost
> impossible within my development-phase timings to try to do this.

You can use it, but: if you need something special (snippets, coloring,
different URL mapping, handling of your local charset, etc. etc.) you must
include code from the sandbox or write it from scratch AFAIK.

> * Regarding your comment: "Lucene does not index web pages." I thought
> Lucene's main goal was to index web pages, and that as an afterthought
> it should be able to index text files or some other information (for
> example mail databases).

Lucene *can* index HTML pages, if you use programs which build a Lucene
index from HTML documents. The programs exist. On the other hand, if you
extend Lucene with your hacks, you will find out that the model of Lucene
is unknown and many parts are hard-coded. It boosts speed, but it
disallows future enhancements (I could name the parts; I hope we do not
start a flamewar here).

> and thanks for your comments!!! I'm considering the egothor search
> engine. I successfully set up a web application for searching my web
> site, but I didn't see a mailing list or a forum with the level of...

I had a PhD exam, and many questions went through ICQ, you know - it is
faster for me than e-mails...

-g-
A thought: netique
Hi,

I was away, and when I read what I missed, well... ehm... have you read
http://sustainability.open.ac.uk/gary/papers/netique.htm? E.g., see
"Caution when quoting other messages while replying to them."

BTW: I would also vote for a strict standard where the "Re:" prefix must
be used in replies. Just a thought.

-g-
Re: Wildchar based search??
On Sat, 1 Feb 2003, Rishabh Bajpai wrote:
> also, I remember reading somewhere that one had to build the index in
> some special way, but since you say no, I will take that. I don't
> remember where I read it anyway, so no point asking about something if I
> am myself not sure

I remember only one problem that is related to the indexing phase - the
optimize() function. If you update your index, no one can tell you whether
you must also call optimize() or not. If you do not call it, it may slow
down queries (I do not know by how much, but Otis said so). If you call
it, it slows down the indexing phase (I have tested it, and it is
significant). AFAIK Lucene cannot tell you when the index becomes dirty
enough that you must call optimize(). On the other hand, it does not
affect small indexes, where optimize() costs nothing.

Otis, I think that this still holds. Right?

-g-
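For reference, the batch-then-optimize pattern is just this (a sketch
against the Lucene 1.x API; args[0] is the index directory, and
create=false opens it for appending rather than rebuilding):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Optimize an existing index once a batch of updates is finished.
    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(args[0],
                    new StandardAnalyzer(), false);
            writer.optimize(); // merges all segments; costly on big indexes
            writer.close();
        }
    }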
Re: Stop-word in phrase (BUG?)
> Hi. In this phrase the word 'and' occurs, which is a stop-word.

They may take AND as a keyword in a query. IMHO your query is parsed as a
boolean query. I hope this helps.

-g-
Re: Lucene Benchmarks and Information
On Fri, 20 Dec 2002, Doug Cutting wrote:
> The max a reader will keep open is:
>   mergeFactor * log_base_mergeFactor(N) * files_per_segment
> A writer will open:
>   (1 + mergeFactor) * files_per_segment

I am not sure that you must open all files (i.e. the writer would need
just 2*files_per_segment if you keep A-Z order in DocUIDs??). IMHO it is a
bug, and the point why Lucene does not scale well on huge collections of
documents. I am talking about my previous tests, when I used a live index
and concurrent query+insert+delete (I wanted to simulate a real
application).

BTW, your mail also answers the previous topic of how often one could call
optimize(): the method would be called before the index goes to the
production state. And it also means that the tests are irrelevant until
they are made with a lower mergeFactor.

...but it is possible that I missed something (I do not know Lucene as
well as you do).

-g-
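A worked instance of Doug's bound (my arithmetic, with mergeFactor = 10,
N = 10^6 documents, and an assumed ~7 files per segment, which in practice
depends on the number of fields):

    \text{reader: } 10 \cdot \log_{10}(10^6) \cdot 7 = 10 \cdot 6 \cdot 7 = 420 \text{ open files}

    \text{writer: } (1 + 10) \cdot 7 = 77 \text{ open files}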
HTML saga continues...
So, I have tried this with Lucene:

1) the original JavaCC LL(k) HTML parser
2) Swing's HTML parser

In case of (1) I could process about 300K HTML documents. In case of (2),
more than 400K. But I cannot process the complete collection (5M) and
finish my hard stress tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :)
If you think that you have one, please let me know. I wanted to try Neko,
but it looks complicated and I do not want to affect the results by a
``robust'' parser. THX

-g-
Re: SV: Indexing HTML
> I'm not sure this is a solution to your problem. However, it seems that
> the HTMLParser used by the IndexHTML class has problems parsing the
> document (there is a test class included in the jar):
>
>   java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
>       org.apache.lucene.demo.html.Test f01529.txt
>   Title: Webcz.cz - Power of search
>   Parse Aborted: Encountered "\'" at line 106, column 27.
>   Was expecting one of: <ArgName> ... <TagEnd> ...
>
> /Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log below). I
have found a better example than 1529; try this:

http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt

This file cannot go through the Lucene HTML parser (I have tried 1.2 and
IBM JDK 1.3.1r3). The file is specific, i.e. it has two titles, two base
tags, etc. I have no debugger here, so I cannot find the line where the
bug is. If you try your magic, please let me know about the patch. :) THX

-g-

  adding save/d00320/f01516.html
  Parse Aborted: Lexical error at line 68, column 11. Encountered:
  "\u0178" (376), after : ""
  adding save/d00320/f01527.html
  Parse Aborted: Encountered "=" at line 83, column 48. Was expecting one
  of: <ArgName> ... <TagEnd> ...
  adding save/d00320/f01528.html
Re: Lucene Speed under diff JVMs
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote:
> I'm using the class that Otis wrote (see message from about 3 weeks ago)
> for testing the scalability of Lucene (more results on that later) and I
<snip>

May I ask where one can get the source code? I cannot find it in the
archive. Thank you

-g-
Performance (figures)
The first round of tests is presented here (more will come later):

1) http://com-os2.ms.mff.cuni.cz/proof.png - price per insert (time,
space)

Doc base: 5M HTML *.CZ
Collection size: 300K docs were processed; then Lucene crashed (it may be
my fault, but I haven't time to debug it now)
optimize() after every 2000 docs (IMHO this simulates a dynamic IR
environment, i.e. indexing e-mails, news groups etc.)

For instance (see Fig. 1), collection size / time per insert():

  2000/25ms
  160K/33ms
  300K/48ms

It means that for a collection of 160K docs you need 160000*33ms = 5280s.

2) http://com-os2.ms.mff.cuni.cz/draw.png - absolute values

If someone is able to say how often I would call optimize(), I can
recalculate the results. Now the 2nd round of tests is running (without
optimize()).

-g-

BTW: All figures, (C) 2002 Leo Galambos. Do not copy until I am sure that
the tests & values are correct.
optimize()
How does it affect overall performance when I do not call optimize()? THX

-g-
Re: optimize()
Did you try any tests in this area? (figures, charts...) AFAIK the reader
reads an identical number of (giga)bytes. BTW, it could read segments in
many threads. I do not see why it would be slower (unless you do many
delete()-s). Whether the reader opens 1 or 50 files, it is still nothing.

-g-

On Tue, 26 Nov 2002, Otis Gospodnetic wrote:
> This was just mentioned a few days ago. Check the archives. Not needed
> for indexing; good to do after you are done indexing, as the index
> reader needs to open and search through fewer files.
>
> Otis
>
> --- Leo Galambos [EMAIL PROTECTED] wrote:
>> How does it affect overall performance when I do not call optimize()?
>> THX
>> -g-