date range suggestion anyone?

2004-04-22 Thread Frank Morton
Newbie here. Or at least it has been a couple of years.

I have date ranges working, and they seem to work well. But I have
a question about how to form a query.

I have a publication with a dateAvailable and a dateExpired. It is
viewable any time between these dates.

I want to supply a date range and find the publications available in
the specified range. I thought I could do:

+((dateAvailable:[02/01/2004 TO 03/01/2004]) OR
(dateExpired:[02/01/2004 TO 03/01/2004]))

But this does not work with this combination of dates:

dateAvailable=1/1/2004
searchStartDate=2/1/2004
searchEndDate=3/1/2004
dateExpired=6/1/2004

Neither dateAvailable nor dateExpired falls within the user-specified
search range, even though the publication is available during the entire
specified range, plus more.

Has anyone figured out a way to do this without enumerating all the dates?
Or do I just need more sleep?

Thanks for any help.

Frank
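
One way to read the requirement: a publication matches when its availability window overlaps the search window, that is, when dateAvailable <= searchEndDate and dateExpired >= searchStartDate. In query terms that suggests two required range clauses, each open on one side, instead of an OR of two closed ranges. How to express an open-ended bound depends on how the dates were indexed, so that part is left out here. The sketch below only checks the overlap logic against the dates in the post; the class and method names are hypothetical.

```java
import java.time.LocalDate;

// Hypothetical helper: checks the interval-overlap rule, not a Lucene API.
final class DateRangeOverlap {
    private DateRangeOverlap() {}

    /**
     * Returns true when [available, expired] shares at least one day with
     * [searchStart, searchEnd]. All four bounds are treated as inclusive.
     */
    static boolean overlaps(LocalDate available, LocalDate expired,
                            LocalDate searchStart, LocalDate searchEnd) {
        // Overlap means: it became available before the search window ended
        // AND it expired after the search window started.
        return !available.isAfter(searchEnd) && !expired.isBefore(searchStart);
    }

    public static void main(String[] args) {
        // The combination of dates from the post.
        LocalDate available = LocalDate.of(2004, 1, 1);
        LocalDate expired = LocalDate.of(2004, 6, 1);
        LocalDate searchStart = LocalDate.of(2004, 2, 1);
        LocalDate searchEnd = LocalDate.of(2004, 3, 1);
        System.out.println(overlaps(available, expired, searchStart, searchEnd)); // prints "true"
    }
}
```

The same test also covers the cases the OR query already handled, because a window that starts or ends inside the search range still satisfies both conditions.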

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: what web crawler work best with Lucene?

2004-04-22 Thread Tuan Jean Tee
Sebastian,

Would you be able to show me your code? Thank you.

TJ

>>> [EMAIL PROTECTED] 22/Apr/2004 03:21:50 pm >>>
You can also use the HTML parser API available from Java 1.4.2. I tried
it last week with a simple program that retrieves HTML files and
displays all the HREF links in them. I got it done within a day.

sebastian
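
Sebastian's program is not shown in the thread. The following is a minimal sketch of the approach he describes, assuming that "the html parser API" means the javax.swing.text.html parser classes that ship with the JDK. The class name HrefLister and its methods are hypothetical, and the input is an in-memory string instead of a fetched page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not the code referred to in the thread.
final class HrefLister {
    private HrefLister() {}

    /** Collects the HREF value of every <a> start tag, in document order. */
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The last argument tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    /** Convenience overload for HTML that is already in memory. */
    static List<String> extractLinks(String html) {
        try {
            return extractLinks(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://example.com/a\">A</a>"
                + " <a href=\"/b\">B</a></body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

To crawl a real page, the Reader would wrap the stream of an opened URL, and relative links such as /b would need to be resolved against the page address before being followed.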


On Thu, 2004-04-22 at 12:18, Stephane James Vaucher wrote:
> How big is the site?
>
> I mostly use an in-house solution, but I've used HttpUnit for web
> scraping small sites (because of its high-level API).
>
> Here is a hello world example:
> http://wiki.apache.org/jakarta-lucene/HttpUnitExample
>
> For a small/simple site, small modifications to this class could
> suffice. IT WILL NOT function on large sites because of memory problems.
>
> For larger sites, there are questions like:
>
> - memory:
> For example, spidering all links on every page can lead to visiting
> too many links. Keeping all visited links in memory can be problematic.
>
> - noise:
> If you get every page on your web site, you might be adding noise to
> the search engine. Spider navigation rules can help out, like saying
> that you should only follow links/index documents of a specific form like
> www.mysite.com/news/article.jsp?articleid=xxx
>
> - speed:
> Too much speed can be bad: doing 100 hits/sec on a site could hurt
> it (especially if you are not the webmaster).
> Too little speed can be bad if you want to make sure you quickly get
> new pages.
>
> - categorisation:
> You might want to separate information in your index. For example,
> you might want a user to do a search in the documentation section or
> in the press release section. This categorisation can be done by
> specifying sections to the site, or by a subsequent analysis of
> available docs.
>
> - up-to-date information:
> You'll want to think about your update schedule, so that if you add a
> new page, it gets indexed quickly. This problem also occurs when you
> modify an existing page: you might want the modification to be
> detected rapidly.
>
> HTH,
> sv
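
Three of the points in Stephane's list (navigation rules, remembering visited links, and limiting crawl speed) can be sketched as a small frontier class. This is an illustration under assumptions, not code from HttpUnit or from any crawler named in the thread: the class name CrawlFrontier, its methods, and the URL pattern are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of a crawl frontier with a navigation rule and a politeness delay.
final class CrawlFrontier {
    private final Pattern allowed;            // navigation rule: only URLs of this form are followed
    private final long minDelayMillis;        // minimum pause between two fetches
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean fetchedBefore = false;
    private long lastFetchMillis = 0L;

    CrawlFrontier(String allowedRegex, long minDelayMillis) {
        this.allowed = Pattern.compile(allowedRegex);
        this.minDelayMillis = minDelayMillis;
    }

    /** Queues the URL if it matches the rule and was not seen before. */
    boolean offer(String url) {
        if (!allowed.matcher(url).matches()) {
            return false; // would only add noise to the index
        }
        if (!seen.add(url)) {
            return false; // already queued or visited
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to fetch, or null when the queue is empty. */
    String next() {
        return pending.poll();
    }

    /** Records that a fetch happened at the given time. */
    void recordFetch(long nowMillis) {
        fetchedBefore = true;
        lastFetchMillis = nowMillis;
    }

    /** How long the caller should still wait before the next fetch. */
    long millisToWait(long nowMillis) {
        if (!fetchedBefore) {
            return 0L;
        }
        return Math.max(0L, lastFetchMillis + minDelayMillis - nowMillis);
    }
}
```

The seen set here still lives entirely in memory, which is exactly the limitation raised under "memory"; a large crawl would need to move it to disk or to a more compact structure.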
> 
> On Thu, 22 Apr 2004, Tuan Jean Tee wrote:
> 
> > Has anyone implemented an open source web crawler with Lucene? I have
> > a dynamic website and am looking at putting in a search tool. Your
> > advice is very much appreciated.
> >
> > Thank you.



Fwd: Wiki write access

2004-04-22 Thread Erik Hatcher
Just thought I'd update on the wiki.  I take this to mean we need to 
upgrade to get better control.

	Erik

Begin forwarded message:

From: "Noel J. Bergman" <[EMAIL PROTECTED]>
Date: April 22, 2004 10:01:10 AM EDT
To: "Jakarta Project Management Committee List" 
<[EMAIL PROTECTED]>
Subject: RE: Wiki write access
Reply-To: "Jakarta Project Management Committee List" 
<[EMAIL PROTECTED]>

> Do any of the Apache wikis lock down write access for only those with
> a registered profile?  Is this a reasonable requirement to have made?

No, they don't and yes it is.  I believe there is more control in the
next version of Moin Moin.

	--- Noel




Re: Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
Andrzej,

Sorry for misspelling your name.  My Polish sucks.

Terry




Re: Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
So, Andrez - Thank you for your comments - what you say makes a good deal of
sense.  When you have lots of different inflections that all share the same
root, stemming can clearly provide significant (recall) benefits (in terms
of catching hidden words and/or simplifying the query).

However, would you say that "from the perspective of English" ("with its
minimal inflection") the points I raise are correct?  (You seem to say so
with the statement that stemming "usually improves recall, but lowers
precision.")

And, would you expect significant benefits from the Egothor project code
(versus Snowball/Porter) when the text is in English (as opposed to a highly
inflectional language like Polish)?

Regards,

Terry




Re: Stemmer Benefits/Costs

2004-04-22 Thread Andrzej Bialecki
Terry Steichen wrote:

> I've been experimenting with the Porter and Snowball stemmers.  It
> seems to me that one of the most valuable benefits these provide is
> the capability to generalize phrase terms.  As a very simple example,
> without the stemmer, I might need to include three phrase terms in my
> query: "north korea", "north korean", "north koreans".  But with the
> stemmer only one will suffice.  To me, that's a huge advantage.  (For
> non-phrases, the advantage doesn't seem to be so great, because much
> the same effect can be achieved with wildcards.)

That's because you look at it from the perspective of English language
with its minimal inflection... My mother tongue is Polish - a highly
inflectional language from the Slavic family of languages. It is normal
for a single Polish word to have as many as 20+ different inflected
forms (plural/singular/dual, tense, gender, mood, case, infinitive...
enough? ;-) ). For this type of language studies show that stemming (or
rather lemmatization - bringing words to their base grammatical forms)
significantly improves recall in IR systems.

> But there seems to be a price that you also pay, in that
> discrimination may be adversely affected.  If you want to
> discriminate between two terms that the stemmer views as derived from
> the same root, you're out of luck (I think).  The problem with this

Stemming usually improves recall, but lowers precision. For some systems
it is more desirable to provide any results, even if they are not quite
correct, than to provide none.

> is that you may start with a set of terms that don't have this
> problem, but over time as new content is added to the index, such
> problems may gradually get introduced - often unpredictably.  And to
> the best of my (admittedly limited) knowledge, once you've indexed
> using a stemmer, there's no way to override it in specific instances.

You can always store in your index stemmed/non-stemmed terms alongside.

> Appreciate any comments, thoughts on the above.

For highly-inflectional languages I had _very_ good results with
stemmers built using the code from Egothor project
(http://www.egothor.org) - much more sophisticated than simple
rule-based stemmers like Snowball or Porter. In fact, after proper
training on a large corpus I was getting ~70% of correct lemmas for
previously unseen words, and over 90% of correct (unique) stems.
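
The suggestion to keep stemmed and non-stemmed terms side by side can be sketched without Lucene as two small inverted indexes over the same documents. Everything here is an assumption made for illustration: the class DualFieldIndex is hypothetical, and toyStem is a deliberately crude suffix stripper standing in for a real stemmer, not Porter, Snowball, or Egothor. It works on single terms only and does not model phrase queries.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: one index keyed by exact terms, one keyed by stems.
final class DualFieldIndex {
    private final Map<String, Set<Integer>> exact = new HashMap<>();
    private final Map<String, Set<Integer>> stemmed = new HashMap<>();

    /** Toy stand-in for a stemmer; only good enough for the example terms. */
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        if (t.endsWith("ans")) return t.substring(0, t.length() - 2); // koreans -> korea
        if (t.endsWith("an")) return t.substring(0, t.length() - 1);  // korean  -> korea
        if (t.length() > 3 && t.endsWith("s")) return t.substring(0, t.length() - 1);
        return t;
    }

    /** Indexes every token of the text under both its exact and its stemmed form. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            exact.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            stemmed.computeIfAbsent(toyStem(token), k -> new TreeSet<>()).add(docId);
        }
    }

    /** High-precision lookup: only documents containing this exact form. */
    Set<Integer> searchExact(String term) {
        return exact.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }

    /** High-recall lookup: documents containing any form with the same stem. */
    Set<Integer> searchStemmed(String term) {
        return stemmed.getOrDefault(toyStem(term), Collections.<Integer>emptySet());
    }

    /** Small sample built from the phrases used in the thread. */
    static DualFieldIndex sample() {
        DualFieldIndex index = new DualFieldIndex();
        index.add(1, "North Korea");
        index.add(2, "North Korean officials");
        index.add(3, "North Koreans");
        return index;
    }
}
```

With both forms indexed, the query side chooses per search whether to favour recall (the stemmed lookup) or precision (the exact lookup), which addresses the concern that a stemmer cannot be overridden once the index is built.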

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Stemmer Benefits/Costs

2004-04-22 Thread Terry Steichen
I've been experimenting with the Porter and Snowball stemmers.  It seems to me that 
one of the most valuable benefits these provide is the capability to generalize phrase 
terms.  As a very simple example, without the stemmer, I might need to include three 
phrase terms in my query: "north korea", "north korean", "north koreans".  But with 
the stemmer only one will suffice.  To me, that's a huge advantage.  (For non-phrases, 
the advantage doesn't seem to be so great, because much the same effect can be 
achieved with wildcards.)

But there seems to be a price that you also pay, in that discrimination may be 
adversely affected.  If you want to discriminate between two terms that the stemmer 
views as derived from the same root, you're out of luck (I think).  The problem with 
this is that you may start with a set of terms that don't have this problem, but over 
time as new content is added to the index, such problems may gradually get introduced 
- often unpredictably.  And to the best of my (admittedly limited) knowledge, once 
you've indexed using a stemmer, there's no way to override it in specific instances.

Appreciate any comments, thoughts on the above.

Regards,

Terry
 

Doing a join?

2004-04-22 Thread Rob Jose
Is it possible to do a join on two fields when searching a Lucene index?
For example, I have an index of documents that have a "StudentName" and a
"StudentId" field and another document that has "ClassId", "ClassName" and
"StudentId".  I want to do a search on "ClassId" or "ClassName" and get a
list of "StudentName".  Both of these documents are in one index, but are
loaded from separate files, so I can't join at creation time.  Any help is
greatly appreciated.

Rob
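
No answer to this question appears in the thread. One common workaround, offered here as an assumption and not as a Lucene feature, is to run two searches in application code: first find the class documents and collect their "StudentId" values, then look up the student documents with those ids. The sketch below models documents as plain maps so that it runs on its own; the class TwoStepJoin and its methods are hypothetical, and against a real index each loop would be a search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an application-side join over documents in one index.
final class TwoStepJoin {
    private TwoStepJoin() {}

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    /** Returns the names of all students enrolled in the given class. */
    static List<String> studentNamesForClass(List<Map<String, String>> index, String classId) {
        // Step 1: "search" the class documents and collect the join keys.
        Set<String> studentIds = new HashSet<>();
        for (Map<String, String> d : index) {
            String id = d.get("StudentId");
            if (classId.equals(d.get("ClassId")) && id != null) {
                studentIds.add(id);
            }
        }
        // Step 2: "search" the student documents whose key is in that set.
        List<String> names = new ArrayList<>();
        for (Map<String, String> d : index) {
            String name = d.get("StudentName");
            if (name != null && studentIds.contains(d.get("StudentId"))) {
                names.add(name);
            }
        }
        return names;
    }

    /** Sample data using the field names from the question. */
    static List<Map<String, String>> sampleIndex() {
        List<Map<String, String>> index = new ArrayList<>();
        index.add(doc("StudentName", "Ann", "StudentId", "s1"));
        index.add(doc("StudentName", "Bob", "StudentId", "s2"));
        index.add(doc("ClassId", "c1", "ClassName", "Algebra", "StudentId", "s1"));
        index.add(doc("ClassId", "c1", "ClassName", "Algebra", "StudentId", "s2"));
        index.add(doc("ClassId", "c2", "ClassName", "History", "StudentId", "s2"));
        return index;
    }
}
```

The cost of this approach grows with the number of ids returned by the first step, so it suits classes with a modest number of students better than very large result sets.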





Re: more rigid stopword list ?

2004-04-22 Thread Erik Hatcher
p.s. there is no need to create a new Analyzer to tweak the stop word  
list.  The analyzers that do stop word removal accept the list as an  
argument to an overloaded constructor.

	Erik
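
As a rough illustration of the point above, the sketch below builds a stricter stop word array by extending a base list, then shows its effect on a sentence. It is plain Java and makes several assumptions: the class StopWordLists and its methods are hypothetical, BASE is a short illustrative subset and not a copy of StopAnalyzer.ENGLISH_STOP_WORDS, and the filtering loop only imitates what an analyzer would do. In a real setup the extended array would be handed to the analyzer's overloaded constructor; the exact signature depends on the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: building and applying a custom stop word list.
final class StopWordLists {
    private StopWordLists() {}

    // Illustrative subset only; the list shipped with Lucene is longer.
    static final String[] BASE = {"a", "an", "and", "the", "of", "to"};

    /** Returns the base list plus the extra words, without duplicates. */
    static String[] extend(String[] base, String... extra) {
        Set<String> merged = new LinkedHashSet<>(Arrays.asList(base));
        merged.addAll(Arrays.asList(extra));
        return merged.toArray(new String[0]);
    }

    /** Lowercases and splits the text, then drops every token in the stop list. */
    static List<String> removeStopWords(String text, String[] stopWords) {
        Set<String> stop = new HashSet<>(Arrays.asList(stopWords));
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] strict = extend(BASE, "we", "need");
        System.out.println(removeStopWords("We need a faster index", strict));
    }
}
```

Keeping the extra words separate from the base list makes it easy to see which removals are a local decision, such as treating "we" and "need" as stop words.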



Re: more rigid stopword list ?

2004-04-22 Thread Otis Gospodnetic
Moving to lucene-user list.

One of my Lucene articles includes a more comprehensive stop word list
for English:

http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2#references

Otis

--- [EMAIL PROTECTED] wrote:
> Dear all,
> 
> for my taste the stopwords included in Lucene (e.g.
> StopAnalyzer.ENGLISH_STOP_WORDS, which is usually used
> with the SnowballAnalyzer - and I guess also with the
> StandardAnalyzer) are not strict enough:
> 
> For example in a sentence with "we need ..." I would
> consider "we" and "need" as stopwords but they are not
> stripped by SnowballAnalyzer or StandardAnalyzer.
> 
> Now:
> Is there a built-in way to use more restrictive
> stripping, or should I create my own analyzer
> with a more restrictive stopword list?
> 
> If so - are you aware of more rigid lists? (a URI
> would be great!)
> 
> Thanks,
> 
> Holger
> 



Re: [Digester] DigesterMarriesLucene

2004-04-22 Thread Otis Gospodnetic
Hello,

There is no need to include DigesterMarriesLucene.class in that
Lucene-demos Jar.  You just need to make sure you add the directory
where DigesterMarriesLucene.class is to your CLASSPATH.  Listing 4 in
that article shows that DigesterMarriesLucene is not in any particular
Java package.  Therefore, do not invoke it as java
org.apache.lucene.demo.DigesterMarriesLucene, but rather as java
DigesterMarriesLucene.

Otis




[Digester] DigesterMarriesLucene

2004-04-22 Thread Samuel Tang
I have read the article on the IBM website about using Lucene
(http://www-106.ibm.com/developerworks/library/j-lucene) and followed
the provided 'Listing 4' to make the DigesterMarriesLucene.class. I
also downloaded the Digester package in order to parse the imaginary
address book XML to see if it works.
 
Unfortunately, I got the error message below:
 
  java.lang.NoClassDefFoundError: DigesterMarriesLucene
 
My setup adds the compiled DigesterMarriesLucene.class to the
lucene-demos-1.3-final.jar file so that I can run the class in Lucene by
typing
 
  # java org.apache.lucene.demo.DigesterMarriesLucene
 
What should I do to get rid of the error? Is there any documentation
available online that shows how to do the setup?
 

