sitegeist

2006-05-25 Thread karl wettin
Has anyone written a neat tool for statistical analysis of hits over
time? I need one, and it must be fast. I was thinking of something like this:

List timeFrames;

class TimeFrame {
  Date from;
  Date to;

  void add(Hits hits) {
    int score = 10;
    for (int d = 0; score<0 && d
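The snippet above is cut off, but the idea can be sketched as a small self-contained time-frame counter. All names below are hypothetical, and raw timestamps stand in for reading a stored date field from each Lucene hit:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// One bucket of the histogram: counts hits whose date falls in [from, to).
class TimeFrame {
    final Date from;
    final Date to;
    int count;

    TimeFrame(Date from, Date to) {
        this.from = from;
        this.to = to;
    }

    boolean contains(Date t) {
        return !t.before(from) && t.before(to);
    }
}

// Buckets hit timestamps into consecutive fixed-width time frames.
class HitHistogram {
    final List<TimeFrame> frames = new ArrayList<TimeFrame>();

    HitHistogram(Date start, long frameMillis, int numFrames) {
        for (int i = 0; i < numFrames; i++) {
            frames.add(new TimeFrame(
                new Date(start.getTime() + i * frameMillis),
                new Date(start.getTime() + (i + 1) * frameMillis)));
        }
    }

    // In a real add(Hits), you would iterate the hits and read a
    // stored date field per document; here we take the date directly.
    void add(Date hitDate) {
        for (TimeFrame f : frames) {
            if (f.contains(hitDate)) {
                f.count++;
                return;
            }
        }
    }
}
```

Since the frames are sorted and non-overlapping, a binary search over the boundaries would make each add O(log n) instead of the linear scan shown here.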

Re: Making SpanQuery more efficient

2006-05-25 Thread Michael Chan

After some more research, it seems that one of the bottlenecks is
Spans.next(). Can I drop anything in order to improve performance?
Most of the queries are SpanNearQuery with SpanOrQuery clauses.

Any help would be much appreciated.

Regards,

Michael

On 5/25/06, Michael Chan <[EMAIL PROTECTED]> wrote:

I see.

Also, as I'm only interested in the number of results returned and not
in the ranking of the documents returned, is there any component I can
simplify in order to improve search performance? Perhaps Scorer or
Similarity?

Thanks.

Michael

On 5/24/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Unfortunately, I want to have subqueries inside my query (e.g. (t1 AND
> : t2) NEAR (t3 OR t4)), and PhraseQuery seems to allow only Terms inside
> : it.
>
> In that case, you aren't just using SpanQuery for the use of slop -- you
> are using the Span information, you just don't realize it. That's how all
> of the SpanQueries work -- they get the Span information from the sub
> queries and propagate it up (which is also why you can't use just any
> old Query as a clause in a SpanNearQuery).
>
> : > > As I use SpanQuery purely for the use of slop, I was wondering how to
> : > > make SpanQuery more efficient. Since I don't need any span
> : > > information, is there a way to disable the computation for span and
> : > > other unneeded overhead?
>
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>






Search oddities

2006-05-25 Thread Tim.Wright
It appears that I was confused about the way analyzers work. I
assumed that a typical analyzer would just remove hyphens and treat them
as spaces. We're just using StandardAnalyzer.

When we search (using QueryParser) for the phrase "t-mobile" (including
quotes), we get results back which only have the word "mobile"
in them. I would assume the analyzer would convert this to "t
mobile" (again, in quotes) and would only return documents containing
that phrase.

Oddly, however, if we search for "pay-tv", we only get back documents
that actually have the phrase "pay tv" or "pay-tv" - nothing which just
has "pay" or "tv". 

I'm not quite sure why "t-mobile" is behaving differently to "pay-tv".
If anyone could point me in the right direction I'd be very grateful!

Many thanks, 

Tim.





best way to get specific results

2006-05-25 Thread Omar Didi
Hi all,

I need to be able to get specific documents out of the returned documents 
without the need to retrieve all the other documents.
Just to describe my case: the user is allowed to specify in the query string the
page number and the number of results to return. For example, if a query returns
1000 results, the user may be interested only in the results between 500 and 550.
The way I implemented it is to run a normal query using IndexSearcher.search(Query)
and then get the specified documents out of the Hits object.
I am wondering if there is a more efficient way than this. Is using TopDocs
better than the Hits object, knowing that some users may need more than 1000
docs back in one query?

thank you for your help,

Omar.




Re: Search oddities

2006-05-25 Thread Daniel Naber
On Donnerstag 25 Mai 2006 16:18, [EMAIL PROTECTED] wrote:

> When we search (using QueryParser) for the phrase "t-mobile" (including
> quotes)

t-mobile becomes "t mobile", but "t" is a stopword by default. Why? Maybe 
the person who added it has a dislike for German Telekom :-) But 
seriously, you should probably file a bug report. Workaround for now is to 
use your own stopwords.
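A tiny stand-alone illustration of that workaround (plain Java, no Lucene; the class and the stop sets here are hypothetical -- with Lucene itself you would pass your own stop word array to StandardAnalyzer's constructor):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Mimics what a stop filter does: drop any token found in the stop set.
class StopDemo {
    static String[] filter(String[] tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept.toArray(new String[kept.size()]);
    }
}
```

With the default English stop list, the "t" token produced from "t-mobile" is silently dropped; a custom stop set that omits "t" keeps the phrase intact as [t, mobile].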

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Search oddities

2006-05-25 Thread Erik Hatcher


On May 25, 2006, at 11:01 AM, Daniel Naber wrote:


> On Donnerstag 25 Mai 2006 16:18, [EMAIL PROTECTED] wrote:
>
>> When we search (using QueryParser) for the phrase "t-mobile" (including
>> quotes)
>
> t-mobile becomes "t mobile", but "t" is a stopword by default. Why? Maybe
> the person who added it has a dislike for German Telekom :-) But
> seriously, you should probably file a bug report. Workaround for now is to
> use your own stopwords.

"t" is a stop word because words like "don't" get analyzed into [don] [t].


In the short term, it's not really a bug, just how it was meant to
behave. Changing the default stop words in the 1.9/2.0 releases isn't
going to happen... but lobbying for this to be more sensible in the
future is certainly worthwhile.


Erik





Re: Search oddities

2006-05-25 Thread Daniel Naber
On Donnerstag 25 Mai 2006 17:48, Erik Hatcher wrote:

> "t" is a stop word because words like "don't" get analyzed into [don] [t].

Maybe it should, but it seems it doesn't: don't gets parsed as field:don't
using StandardAnalyzer and QueryParser. Hmm, maybe this is because people
use different characters:

don't
don´t
don`t

Only the first (and correct?) one is not split up using StandardAnalyzer.

Regards
 Daniel

-- 
http://www.danielnaber.de




Integrating a J2EE Application into "Generic" Enterprise Search

2006-05-25 Thread Nicholas Van Weerdenburg
Hi,

 

I am planning to integrate Lucene into our application. However, I also
want to support the general enterprise search market and what our
customers have installed.

 

Ideally, we would develop:

 

1.  generic search support services
    a.  index records into logical "document-centric" records
        with URLs for access.
    b.  triggers to generate updated records in real time.
    c.  a batch indexer for generation of initial data for the
        search engine.

2.  specific search engine support - e.g. Lucene, RetrievalWare,
    etc.

 

Are there any enterprise search integration standards (e.g. an XML
schema)?

 

Any recommendations for best approaching this?

 

Thanks,

Nick

 



Re: Question about special characters

2006-05-25 Thread Dan Wiggin

My own solution until I have another one better, I use FuzzyQuery for every
term in the phrase.
For example "My work is the worst" ->> My~ work~ is~ the~ worst
What do you think of this ugly solution? I don't have any other
ideas.
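The rewrite described above can be sketched as a simple string transformation (a hypothetical helper, shown without Lucene; in real code you would likely build a FuzzyQuery per term rather than rely on QueryParser syntax, and the example output in the mail appears to omit the trailing '~' on the last term by accident, so it is appended uniformly here):

```java
// Appends the fuzzy operator '~' to every whitespace-separated term.
class FuzzyRewrite {
    static String fuzzify(String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(terms[i]).append('~');
        }
        return sb.toString();
    }
}
```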

2006/5/24, Dan Wiggin <[EMAIL PROTECTED]>:


I need some functionality and I don't know how to implement it.
The problem is special characters like à, ä, ç or ñ (Latin characters) in
the text.
Now I use the ISO Latin filter, but the problem is when I want to obtain the
most-used terms. These terms are stored without ` ´ ^ or any other "character
attribute".
For example, "plàntïuç" (it isn't a real word) is stored as the term
"plantiuc".
How can I keep the word "plàntïuç" in the term vector?

Thanks for all replies.
PS: excuse me if this question is answered somewhere already, but I didn't see it.



Re: best way to get specific results

2006-05-25 Thread Chris Hostetter

: if a query returns 1000 results, the user is interested only in the
: results between 500&550. the way I implemented it is run a normal query
: using IndexSercher.search(Query()) and then get the specified documents
: out of the hits object. I am wondering if there is a more efficient way
: than this, is using TopDocs better than the hits object, knowing that
: some users may need more than a 1000 docs back in one query?.

Generally speaking, yes, TopDocs (or TopFieldDocs) is better than Hits if
you plan on accessing more than the first 100 or so results. Hits will
re-execute your search over and over as you ask for higher-numbered
results, while with TopDocs your search is executed once, and you are given
only the Doc IDs of the first N docs you asked for, with no other
processing done behind the scenes (in your case, it sounds like N would
be 550, and you'd start accessing the ScoreDoc[] at 500).
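The paging arithmetic above, sketched without Lucene (the int[] of doc ids stands in for the ScoreDoc[] in TopDocs.scoreDocs; Pager is a hypothetical helper):

```java
// Returns the window [pageNum*pageSize, pageNum*pageSize+pageSize)
// of doc ids, clipped to the number of results actually available.
class Pager {
    static int[] page(int[] topDocs, int pageNum, int pageSize) {
        int start = pageNum * pageSize;
        int end = Math.min(start + pageSize, topDocs.length);
        if (start >= end) {
            return new int[0]; // page past the end of the results
        }
        int[] window = new int[end - start];
        System.arraycopy(topDocs, start, window, 0, end - start);
        return window;
    }
}
```

So for page 10 with a page size of 50, you would ask Lucene for the top 550 docs once and read entries 500-549 of the returned array.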




-Hoss





Re: Question about special characters

2006-05-25 Thread Chris Hostetter

I think I'm missing something here.  The whole point of the
ISOLatin1AccentFilter is to replace accented characters with their
unaccented equivalents -- it sounds like that's working just fine. If you
want the words in the term vector to contain the accents, why don't you
stop using that filter?

If the problem is that you need to be able to match on both the accented
form and the non-accented form, perhaps you should have two fields, or
modify the ISOLatin1AccentFilter so it puts both versions of the token in
the TokenStream at the same position?
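The two-versions idea can be illustrated without Lucene. Here java.text.Normalizer stands in for the accent folding (ISOLatin1AccentFilter uses its own mapping table), and a real modified TokenFilter would emit the second token with a position increment of 0; the class and method names are hypothetical:

```java
import java.text.Normalizer;

class AccentDemo {
    // Decompose and strip combining marks: "plàntïuç" -> "plantiuc".
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    // Emit both forms, as a modified filter would at the same position.
    static String[] bothForms(String token) {
        String folded = fold(token);
        return folded.equals(token)
                ? new String[] { token }
                : new String[] { token, folded };
    }
}
```

Indexing both forms lets a query match either "plàntïuç" or "plantiuc", while the term vector still contains the accented original.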


: > The problem is special characters like à, ä , ç or ñ latin characters in
: > the text.
: > Now I use iso latin filter, but the problem is when I want to obtain most
: > term used. These term are stored without ` ´ ^ or another "character
: > attribute".
: > For example "plàntïuç" (it isn't a real word) is stored like the term
: > "plantiuc".
: > How can I do to have in term vector the word "plàntïuç".
: >
: > thks for all replies.
: > PD: excuse if this question is solved somewhere, but I don't saw it.



-Hoss





Boolean query term match count

2006-05-25 Thread Crump, Michael
Hello,

 

I'm working on a search application and I need to know if it is possible
to get the number of terms that actually matched a Boolean query.  For
example, let's say I have a field "test" with values aaa bbb ccc d e f, and I
constructed a Boolean query like this:  test:aaa OR test:bbb OR test:e.
Is it possible to get the count -- 3 -- for the terms (aaa, bbb, e) that
matched from the test field?  If not, would it be possible to modify the
BooleanScorer to accomplish this?

 

TIA,

 

Michael 



Re: Boolean query term match count

2006-05-25 Thread Paul Elschot
On Thursday 25 May 2006 21:08, Crump, Michael wrote:
> Hello,
> 
>  
> 
> I'm working on a search application and I need to know if it is possible
> to get the number of terms that actually matched a Boolean query.  For
> example let's say I have field test with values aaa bbb ccc d e f and I
> constructed a Boolean query like this:  test:aaa OR test:bbb OR test:e
> is it possible to get the count - 3 for the the terms (aaa, bbb, e) that
> matched from the test field.  If not would it be possible to modify the
> BooleanScorer to accomplish this?

Have a look at Similarity and DefaultSimilarity. It looks like
you need your own Similarity with a non-constant coord()
implementation, and some constant value for the rest of the
methods in there.
The first argument to coord() is the number of matching
clauses for a document for a boolean query.
The value returned by coord() will be used in the calculation
of the score value for the document.
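The overlap count can be illustrated without Lucene: coord()'s first argument is simply how many of the query's clauses matched the document, and DefaultSimilarity's coord() returns overlap / maxOverlap. CoordDemo below is a hypothetical stand-alone version of that counting, not Lucene's actual scoring path:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CoordDemo {
    // Number of query terms present in the document: what Lucene passes
    // as the first (overlap) argument to Similarity.coord().
    static int overlap(String[] docTerms, String[] queryTerms) {
        Set<String> doc = new HashSet<String>(Arrays.asList(docTerms));
        int n = 0;
        for (String q : queryTerms) {
            if (doc.contains(q)) n++;
        }
        return n;
    }

    // DefaultSimilarity's coordination factor.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

For the field value aaa bbb ccc d e f and the query test:aaa OR test:bbb OR test:e, the overlap is 3. With a Similarity whose other methods return constants, the document's score becomes a direct function of that count.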

Regards,
Paul Elschot




Re: Integrating a J2EE Application into "Generic" Enterprise Search

2006-05-25 Thread Yonik Seeley

On 5/25/06, Nicholas Van Weerdenburg <[EMAIL PROTECTED]> wrote:

Are there any enterprise search intergration standards (e.g. xml
schema)?


It may or may not be what you are looking for, but there is Solr, a
Lucene-based search server with XML/HTTP interfaces.  It's primarily
meant to be a standalone server (think database), but it is possible to
embed.  See my sig for the link.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Integrating a J2EE Application into "Generic" Enterprise Search

2006-05-25 Thread Chris Lu

Solr is nice when you can change the existing enterprise applications,
extract content, and post XML content to the server. But that's definitely
still a lot of coding.

I would say DBSight is another alternative here. It has a similar
architecture to Solr, but it crawls databases via configurable SQL queries.
It only needs to plug into an existing database via JDBC, and it can fit
any schema. No XML or XSLT effort. Usually within 15 minutes you can have a
Google-like search.

Chris
--
Lucene Search on Any Databases/Applications
http://www.dbsight.net

On 5/25/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 5/25/06, Nicholas Van Weerdenburg <[EMAIL PROTECTED]> wrote:
> Are there any enterprise search intergration standards (e.g. xml
> schema)?

It may or may not be what you are looking for, but there is Solr, a
lucene-based search server with XML/HTTP interfaces.  It's primarily
meant to be a standalone server (think database), but it is possible to
embed.  See my sig for the link.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server








RE: Integrating a J2EE Application into "Generic" Enterprise Search

2006-05-25 Thread Nicholas Van Weerdenburg
Both sound interesting, but what I want is to be able to generate the
intermediate XML that most enterprise search servers could use to quickly
integrate with them.

e.g.
customer 1 uses RetrievalWare for enterprise search
customer 2 uses Solr
customer 3 uses yyy.

How do I build out my functionality to support RetrievalWare, Solr, and YYY?

One thought is to build a directory view that invites crawlers in to auto-index.

DBSight sounds interesting, but it doesn't help me with other enterprise
search tools, from the sounds of it. I'm also thinking that our database
schema is resistant to SQL-oriented queries.

Thanks,
Nick


-Original Message-
From:   Chris Lu [mailto:[EMAIL PROTECTED]
Sent:   Thu 5/25/2006 10:15 PM
To: java-user@lucene.apache.org
Cc: 
Subject:Re: Integrating a J2EE Application into "Generic" Enterprise 
Search

Solr is nice when you can change the existing enterprise applications,
extract content, and post XML content to the server. But that's definitely
still a lot of coding.

I would say DBSight is another alternative here. It has a similar
architecture to Solr, but it crawls databases via configurable SQL queries.
It only needs to plug into an existing database via JDBC, and it can fit
any schema. No XML or XSLT effort. Usually within 15 minutes you can have a
Google-like search.

Chris
--
Lucene Search on Any Databases/Applications
http://www.dbsight.net

On 5/25/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 5/25/06, Nicholas Van Weerdenburg <[EMAIL PROTECTED]> wrote:
> > Are there any enterprise search intergration standards (e.g. xml
> > schema)?
>
> It may or may not be what you are looking for, but there is Solr, a
> lucene-based search server with XML/HTTP interfaces.  It's primarily
> meant to be a standalone server (think database), but it is possible to
> embed.  See my sig for the link.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>










Search in Jasper Report

2006-05-25 Thread Chandrakant Singh

Hi All,
I'm using JasperReports as a reporting tool. In my application, a report has
300 pages. How can I use Lucene to search in a .jasper file?

Reg.
Chandrakant S Chouhan