date range suggestion anyone?
Newbie here. Or at least, it has been a couple of years. I have date ranges working, and they seem to work well, but I have a question about how to form a query.

I have a publication with a dateAvailable and a dateExpired. It is viewable any time between these dates. I want to supply a date range and find the publications that are available at some point within that range. I thought I could do:

+((dateAvailable:[02/01/2004 TO 03/01/2004]) OR (dateExpired:[02/01/2004 TO 03/01/2004]))

But this does not work with this combination of dates:

dateAvailable=1/1/2004
searchStartDate=2/1/2004
searchEndDate=3/1/2004
dateExpired=6/1/2004

Neither dateAvailable nor dateExpired falls within the user-specified range, even though the publication is available during the entire specified range, and longer.

Has anyone figured out a way to do this without enumerating all the dates? Or do I just need more sleep?

Thanks for any help.

Frank

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
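One way to look at the question above: a publication is visible during the search window whenever it becomes available on or before the window ends and expires on or after the window starts. The following is a minimal sketch of that test, not a confirmed answer from the list. It assumes the dates are indexed as sortable yyyyMMdd strings (the post itself uses MM/dd/yyyy), and the class name, method names and the two sentinel bounds are made up for the illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper; class and method names are illustrative only. */
final class DateOverlap {
    // Sortable form, so that string order matches date order (assumed index format).
    private static final DateTimeFormatter SORTABLE = DateTimeFormatter.BASIC_ISO_DATE;
    // Sentinel bounds standing in for "no lower limit" and "no upper limit" (assumptions).
    private static final String MIN = "00000101";
    private static final String MAX = "99991231";

    /** True when [available, expired] shares at least one day with [start, end]. */
    static boolean overlaps(LocalDate available, LocalDate expired,
                            LocalDate start, LocalDate end) {
        return !available.isAfter(end) && !expired.isBefore(start);
    }

    /** Builds the matching query text; both range clauses are required. */
    static String query(LocalDate start, LocalDate end) {
        return "+dateAvailable:[" + MIN + " TO " + SORTABLE.format(end) + "] "
             + "+dateExpired:[" + SORTABLE.format(start) + " TO " + MAX + "]";
    }
}
```

For the dates in the post (available 1/1/2004, expired 6/1/2004, window 2/1/2004 to 3/1/2004) both clauses hold, so the publication would be found even though neither of its dates lies inside the window.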
Re: what web crawler work best with Lucene?
Sebastian,

Would you be able to show me your code?

Thank you.

TJ

>>> [EMAIL PROTECTED] 22/Apr/2004 03:21:50 pm >>>
You can also use the HTML parser API available from Java 1.4.2. I tried it last week with a simple program which retrieves HTML files and displays all the HREF links in them. I got it done within a day.

sebastian

On Thu, 2004-04-22 at 12:18, Stephane James Vaucher wrote:
> How big is the site?
>
> I mostly use an in-house solution, but I've used HttpUnit for web scraping
> small sites (because of its high-level API).
>
> Here is a hello world example:
> http://wiki.apache.org/jakarta-lucene/HttpUnitExample
>
> For a small/simple site, small modifications to this class could suffice.
> IT WILL NOT function on large sites because of memory problems.
>
> For larger sites, there are questions like:
>
> - memory:
> For example, spidering all links on every page can lead to visiting too
> many links. Keeping all visited links in memory can be problematic.
>
> - noise:
> If you get every page on your web site, you might be adding noise to the
> search engine. Spider navigation rules can help out, like saying that you
> should only follow links/index documents of a specific form like
> www.mysite.com/news/article.jsp?articleid=xxx
>
> - speed:
> Too much speed can be bad: doing 100 hits/sec on a site could hurt it
> (especially if you are not its webmaster).
> Too little speed can be bad if you want to make sure you quickly get new
> pages.
>
> - categorisation:
> You might want to separate information in your index. For example, you
> might want a user to do a search in the documentation section or in the
> press release section. This categorisation can be done by specifying
> sections of the site, or by a subsequent analysis of the available docs.
>
> - up-to-date information:
> You'll want to think about your update schedule, so that if you add a new
> page, it gets indexed quickly. This problem also occurs when you modify an
> existing page: you might want the modification to be detected rapidly.
>
> HTH,
> sv
>
> On Thu, 22 Apr 2004, Tuan Jean Tee wrote:
>
> > Has anyone implemented any open source web crawler with Lucene? I have
> > a dynamic website and am looking at putting in a search tool. Your
> > advice is very much appreciated.
> >
> > Thank you.
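The memory and noise points in the list above can be sketched as a small crawl frontier: a navigation rule decides which links are queued at all, and a cap on the visited set keeps memory bounded. This is only an illustration under stated assumptions. The class name, the URL pattern and the cap are hypothetical and are not part of HttpUnit or Lucene.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical crawl frontier; names and limits are illustrative. */
final class CrawlFrontier {
    private final Pattern allowed;   // navigation rule: only matching URLs are queued
    private final int maxVisited;    // hard cap so the visited set cannot grow without bound
    private final Set<String> seen = new HashSet<>();
    private final Deque<String> queue = new ArrayDeque<>();

    CrawlFrontier(String allowedRegex, int maxVisited) {
        this.allowed = Pattern.compile(allowedRegex);
        this.maxVisited = maxVisited;
    }

    /** Queues a link if it matches the rule, is new, and the cap is not reached. */
    boolean offer(String url) {
        if (seen.size() >= maxVisited || !allowed.matcher(url).matches()) {
            return false;
        }
        if (!seen.add(url)) {
            return false;            // already queued or visited
        }
        queue.addLast(url);
        return true;
    }

    /** Returns the next URL to fetch, or null when the queue is empty. */
    String next() {
        return queue.pollFirst();
    }
}
```

A politeness delay between calls to next() would address the speed point; it is left out here to keep the sketch short.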
Fwd: Wiki write access
Just thought I'd update on the wiki. I take this to mean we need to upgrade to get better control.

Erik

Begin forwarded message:

From: "Noel J. Bergman" <[EMAIL PROTECTED]>
Date: April 22, 2004 10:01:10 AM EDT
To: "Jakarta Project Management Committee List" <[EMAIL PROTECTED]>
Subject: RE: Wiki write access
Reply-To: "Jakarta Project Management Committee List" <[EMAIL PROTECTED]>

> Do any of the Apache wikis lock down write access for only those with a
> registered profile? Is this a reasonable requirement to have made?

No, they don't and yes it is. I believe there is more control in the next version of Moin Moin.

--- Noel
Re: Stemmer Benefits/Costs
Andrzej,

Sorry for misspelling your name. My Polish sucks.

Terry
Re: Stemmer Benefits/Costs
So, Andrez - Thank you for your comments - what you say makes a good deal of sense. When you have lots of different inflections that all share the same root, stemming can clearly provide significant (recall) benefits (in terms of catching hidden words and/or simplifying the query).

However, would you say that "from the perspective of English" ("with its minimal inflection") the points I raise are correct? (You seem to say so with the statement that stemming "usually improves recall, but lowers precision.")

And, would you expect significant benefits from the Egothor project code (versus Snowball/Porter) when the text is in English (as opposed to a highly inflectional language like Polish)?

Regards,

Terry
Re: Stemmer Benefits/Costs
Terry Steichen wrote:

> I've been experimenting with the Porter and Snowball stemmers. It
> seems to me that one of the most valuable benefits these provide is
> the capability to generalize phrase terms. As a very simple example,
> without the stemmer, I might need to include three phrase terms in my
> query: "north korea", "north korean", "north koreans". But with the
> stemmer only one will suffice. To me, that's a huge advantage. (For
> non-phrases, the advantage doesn't seem to be so great, because much
> the same effect can be achieved with wildcards.)

That's because you look at it from the perspective of English language with its minimal inflection... My mother tongue is Polish - a highly inflectional language from the Slavic family of languages. It is normal for a single Polish word to have as many as 20+ different inflected forms (plural/singular/dual, tense, gender, mood, case, infinitive... enough? ;-) ). For this type of language studies show that stemming (or rather lemmatization - bringing words to their base grammatical forms) significantly improves recall in IR systems.

> But there seems to be a price that you also pay, in that
> discrimination may be adversely affected. If you want to
> discriminate between two terms that the stemmer views as derived from
> the same root, you're out of luck (I think). The problem with this

Stemming usually improves recall, but lowers precision. For some systems it is more desirable to provide any results, even if they are not quite correct, than to provide none.

> is that you may start with a set of terms that don't have this
> problem, but over time as new content is added to the index, such
> problems may gradually get introduced - often unpredictably. And to
> the best of my (admittedly limited) knowledge, once you've indexed
> using a stemmer, there's no way to override it in specific instances.

You can always store in your index stemmed/non-stemmed terms alongside.

> Appreciate any comments, thoughts on the above.

For highly-inflectional languages I had _very_ good results with stemmers built using the code from Egothor project (http://www.egothor.org) - much more sophisticated than simple rule-based stemmers like Snowball or Porter. In fact, after proper training on a large corpus I was getting ~70% of correct lemmas for previously unseen words, and over 90% of correct (unique) stems.

--
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
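The suggestion to store stemmed and non-stemmed terms alongside each other can be sketched as two parallel fields built from the same text. The suffix stripper below is a deliberately naive stand-in, not Porter, Snowball or Egothor, and the field names body_exact and body_stem are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a toy stemmer and a two-field token layout. */
final class DualFieldTokens {
    /** Naive suffix stripping; real stemmers apply far more careful rules. */
    static String toyStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 4 && t.endsWith("ns")) {
            return t.substring(0, t.length() - 2);
        }
        if (t.length() > 4 && (t.endsWith("n") || t.endsWith("s"))) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    /** Returns exact tokens under "body_exact" and stemmed tokens under "body_stem". */
    static Map<String, List<String>> analyze(String text) {
        List<String> exact = new ArrayList<>();
        List<String> stemmed = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String lower = raw.toLowerCase(Locale.ROOT);
            exact.add(lower);
            stemmed.add(toyStem(lower));
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body_exact", exact);
        fields.put("body_stem", stemmed);
        return fields;
    }
}
```

A query that wants recall would search the stemmed field, while one that must tell "korean" from "koreans" would search the exact field. The cost is a larger index, since every term is stored twice.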
Stemmer Benefits/Costs
I've been experimenting with the Porter and Snowball stemmers. It seems to me that one of the most valuable benefits these provide is the capability to generalize phrase terms. As a very simple example, without the stemmer, I might need to include three phrase terms in my query: "north korea", "north korean", "north koreans". But with the stemmer only one will suffice. To me, that's a huge advantage. (For non-phrases, the advantage doesn't seem to be so great, because much the same effect can be achieved with wildcards.)

But there seems to be a price that you also pay, in that discrimination may be adversely affected. If you want to discriminate between two terms that the stemmer views as derived from the same root, you're out of luck (I think). The problem with this is that you may start with a set of terms that don't have this problem, but over time as new content is added to the index, such problems may gradually get introduced - often unpredictably. And to the best of my (admittedly limited) knowledge, once you've indexed using a stemmer, there's no way to override it in specific instances.

Appreciate any comments, thoughts on the above.

Regards,

Terry
Doing a join?
Is it possible to do a join on two fields when searching a Lucene index? For example, I have an index of documents that have a "StudentName" and a "StudentId" field, and another kind of document that has "ClassId", "ClassName" and "StudentId". I want to search on "ClassId" or "ClassName" and get a list of "StudentName" values. Both kinds of document are in one index, but they are loaded from separate files, so I can't join at creation time.

Any help is greatly appreciated.

Rob
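No answer to this question appears in this part of the archive. Assuming the index offers no built-in join, one possible workaround is a two-pass lookup in application code: first collect the StudentId values from the class documents that match, then look up the student documents carrying those ids. The sketch below uses plain maps in place of index documents, so the map-based "index", the class name and the method name are all stand-ins and not Lucene API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative two-pass "join"; each Map stands in for one indexed document. */
final class TwoPassJoin {
    static List<String> studentNamesForClass(List<Map<String, String>> docs, String classId) {
        // Pass 1: class documents matching the ClassId yield the StudentId values.
        Set<String> studentIds = new HashSet<>();
        for (Map<String, String> doc : docs) {
            if (classId.equals(doc.get("ClassId"))) {
                studentIds.add(doc.get("StudentId"));
            }
        }
        // Pass 2: student documents (those with a StudentName) whose id was collected.
        List<String> names = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String name = doc.get("StudentName");
            if (name != null && studentIds.contains(doc.get("StudentId"))) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Against a real index each pass would be a separate search, with the ids from the first search turned into the query for the second. That can become expensive when a class has many students.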
Re: more rigid stopword list ?
p.s. There is no need to create a new Analyzer to tweak the stop word list. The analyzers that do stop word removal accept the list as an argument to an overloaded constructor.

Erik
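The point above can be sketched as follows: build a longer stop word array once and hand it to the analyzer's constructor. The helper name and the extra words are assumptions for illustration. The analyzer call itself appears only in a comment, so that the snippet runs without Lucene on the classpath, and the constructor shown there should be checked against the Lucene version in use.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for assembling a stricter stop word list. */
final class StopWords {
    /** Merges a base list with extra words, lowercased and without duplicates. */
    static String[] extend(String[] base, String... extra) {
        Set<String> merged = new LinkedHashSet<>();
        for (String word : base) {
            merged.add(word.toLowerCase(Locale.ROOT));
        }
        for (String word : extra) {
            merged.add(word.toLowerCase(Locale.ROOT));
        }
        return merged.toArray(new String[0]);
    }
}

// With Lucene available, the merged array would be passed along these lines
// (an assumption about the constructor, to be verified for your version):
//   new StandardAnalyzer(StopWords.extend(StopAnalyzer.ENGLISH_STOP_WORDS, "we", "need"));
```

Keeping the base list and the additions separate makes it easy to see which words were added on top of the defaults, for example "we" and "need" from the original question.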
Re: more rigid stopword list ?
Moving to lucene-user list.

One of my Lucene articles includes a more comprehensive stop word list for English:
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2#references

Otis

--- [EMAIL PROTECTED] wrote:
> Dear all,
>
> for my taste the stop word list included in Lucene (e.g.
> StopAnalyzer.ENGLISH_STOP_WORDS, which is usually used
> with the SnowballAnalyzer - and I guess also with the
> StandardAnalyzer) is not strict enough:
>
> For example, in a sentence with "we need ..." I would
> consider "we" and "need" as stop words, but they are not
> stripped by SnowballAnalyzer or StandardAnalyzer.
>
> Now:
> Is there a built-in way to use more restrictive
> stripping, or had I better create my own analyzer with
> a more restrictive stop word list?
>
> If so - are you aware of more rigid lists? (A URI
> would be great!)
>
> Thanks,
>
> Holger
Re: [Digester] DigesterMarriesLucene
Hello,

There is no need to include DigesterMarriesLucene.class in that Lucene demos jar. You just need to make sure you add the directory where DigesterMarriesLucene.class is located to your CLASSPATH. Listing 4 in that article shows that DigesterMarriesLucene is not in any particular Java package. Therefore, do not invoke it as

java org.apache.lucene.demo.DigesterMarriesLucene

but rather as

java DigesterMarriesLucene

Otis
[Digester] DigesterMarriesLucene
I have read the article on the IBM website about using Lucene (http://www-106.ibm.com/developerworks/library/j-lucene) and followed the provided 'Listing 4' to build DigesterMarriesLucene.class. I also downloaded the Digester package in order to parse the imaginary address book XML and see if it works.

Unfortunately, I got the error message below:

java.lang.NoClassDefFoundError: DigesterMarriesLucene

My setup is to add the compiled DigesterMarriesLucene.class to the lucene-demos-1.3-final.jar file, so as to run the class in Lucene by typing

# java org.apache.lucene.demo.DigesterMarriesLucene

What should I do to get rid of the error? Is there any documentation available online that shows how to do the setup?