Re: IN Query for NumericFields

2009-12-10 Thread Matthew Hall
I suspect he's running the query through an analyzer that is dropping 
out single digit numerics, which would basically be a query that pulls 
back everything from the indexes.. or at least I think so.


Uwe Schindler wrote:

Sorry, if you have an IN query, it must be BooleanClause.Occur.SHOULD, as
the CategoryID can be 1, or 3 or 7. You query should not match any doc (I
verified this).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Thursday, December 10, 2009 7:03 PM
To: java-user@lucene.apache.org
Subject: RE: IN Query for NumericFields

Cannot be :-) Is the precstep identical?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




-Original Message-
From: comparis.ch - Roman Baeriswyl [mailto:roman.baeris...@comparis.ch]
Sent: Thursday, December 10, 2009 5:24 PM
To: 'java-user@lucene.apache.org'
Subject: RE: IN Query for NumericFields

I tried

Query q = new BooleanQuery();
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange(CategoryID, 1, 1,
true, true), BooleanClause.Occur.MUST);
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange(CategoryID, 3, 3,
true, true), BooleanClause.Occur.MUST);
((BooleanQuery)q).Add(NumericRangeQuery.NewLongRange(CategoryID, 7, 7,
true, true), BooleanClause.Occur.MUST);

But that seems to mach all Documents in my Index.

-Original Message-
From: shashi@gmail.com [mailto:shashi@gmail.com] On Behalf Of
Shashi Kant
Sent: Donnerstag, 10. Dezember 2009 16:40
To: java-user@lucene.apache.org
Subject: Re: IN Query for NumericFields

Have you looked at BooleanQuery? Create individual TermQuery and OR them
using BooleanQuery.

On Thu, Dec 10, 2009 at 10:34 AM, comparis.ch - Roman Baeriswyl 
roman.baeris...@comparis.ch wrote:

  

Hi,

I do have some indices where I need to get results based on a fixed


number
  

list (not a range)
Let's say I have a field named CategoryID and I now need all results
where CategoryID is 1,3 or 7.

In Lucene 2.4 I created a QueryParser which looked like:


CategoryID:(1


3
  

7). But the Query Parser won't work with NumericFields...

How can I achieve the same for NumericFields?

Btw I'm using Lucene.net.

Thanks for Help
//Roman

comparis.ch auf Twitter folgen: http://twitter.com/comparis

Ein Freund auf Facebook werden: http://www.facebook.com/comparis.ch

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




comparis.ch auf Twitter folgen: http://twitter.com/comparis

Ein Freund auf Facebook werden: http://www.facebook.com/comparis.ch

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
  


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: singular and plural search

2009-10-21 Thread Matthew Hall
If I recall correctly the highlighter also has an analyzer passed to 
it.  Ensure that this is the same one as well.


Matt

m.harig wrote:

Thanks erick ,

It works fine , if i use the (code snippet found from nabble) same
analyzer for both indexing  querying .

But the highlighter has gone for plural words. Hope i need to search more ,
i'll come back to you once if i can't find out.  Thanks again erick.
  




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why does this search succeed with web app, but not Luke?

2009-08-07 Thread Matthew Hall
Luke defaults to KeywordAnalyzer when you do a search on it.  You have 
to specifically choose StandardAnalyzer.  You are probably already doing 
this, but I figure its worth a check.


Matt

Andrzej Bialecki wrote:

oh...@cox.net wrote:

Hi Phil,

Well, kind of... but...

Then, why, when I do the search in Luke, do I get the results I cited:

  == succeeds

.yyy  == fails (no results)

I guess that I've been assuming that the search in Luke is correct 
and I've been using that to test my understanding, but maybe that's 
an invalid assumption?


Luke has some bugs, that's for sure, but not as many as one would 
think ;) I recommend the following exercise:


* first, check what the Rewritten query looks like, in both cases. 
This could be enlightening, because depending on the choice of default 
field and query analyzer results could differ dramatically.


* then, if a query succeeds in matching one or more documents, open 
this document and view its fields using Reconstruct  edit, 
especially the Tokenized version of the field. At this point any 
potential mismatch in query terms vs. analyzed tokens in the field 
should become apparent.





--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way for me to handle a multiword synonym correctly?

2009-08-07 Thread Matthew Hall

Create a field that is specifically for this type of matches.

What you could then do is at indexing time manipulate your data in such 
a way that it can be matched in a punctuation irrelevant way.


So in this field you would convert all non letter characters into 
spaces, and reduce all white space instances to single ones (  
becomes  ) , you could also likely lowercase it at the same time.


Then at search time perform a special search against this field that 
does the same thing to the query string.  At this point plain old phrase 
queries should work for you.


Our corpus contains remarkably obnoxious items in it like: Rara^tm3.1Ipc

So we need to be able to do very similar things as you are describing, 
the above mentioned technique worked like a charm.


Matt

Donna L Gresh wrote:
I saw some discussion on the board but I'm not sure I've got quite the 
same problem. As an example, I have a query that might be a technical 
skill:


SAP EM FIN AM

I would like that to match a document that has *either* SAP.EM.FIN.AM or 
SAP EM FIN AM (in that order and all together, not spread out through 
the document).


The approach I had tried was at index time if I saw SAP.EM.FIN.AM I would 
consider SAP EM FIN AM a synonym for it, using the Lucene in Action 
example. Luke shows me that I have two terms in the index for this 
document: SAP.EM.FIN.AM and SAP EM FIN AM (one term). Thus it appears 
differently in the index than if it had been organically found as just the 
string of tokens, in which case there would be separate terms for SAP, EM, 
and so on. 

At query time if I look for SAP EM FIN AM it is formed as a phrase query 
with a slop of 0 which does *not* match the one term version SAP EM FIN 
AM. (For that matter a simple boolean query doesn't find it either) Luke 
confirms the fact that the phrase query does not find my synonym term. The 
query SAP EM FIN AM finds *only* documents that originally had those 
separated tokens in them.


Is there a way to handle this situation such that at index time I can turn 
SAP.EM.FIN.AM into something that will be found with a query for SAP EM 
FIN AM?


Thanks for any pointers

Donna 

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching doubt

2009-08-04 Thread Matthew Hall

Well.. search on both anyhow.

about us OR aboutus should hit the spot I think.

Matt

Ian Lea wrote:

The question was, how given a string aboutus in a document, you can return
that document as a result to the query about us (note the space). So we're
mostly discussing how to detect and then break the word aboutus to two
words.



I haven't really been following this thread so apologies if way off
target, but reading the above makes me wonder if it can simply be
reversed: remove the space from about us and search on that.


--
Ian.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing multiple email addresses in one field

2009-07-31 Thread Matthew Hall
And to address the stop word issue, you can override the stop word list 
that it uses.


Most analyzers that use stop words, (Standard included) has an option to 
pass it an arbitrary list of StopWords which will override the defaults.


You could also just roll your own (which is what you are going to end up 
doing here anyhow)  When you do, just don't include stop word removal in 
the processing of your token stream.


Matt

Phil Whelan wrote:

Hi Matthew / Paul,

On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowanco...@aconex.com wrote:
  

Matthew Hall wrote:


Place a delimiter between the email addresses that doesn't get removed in
your analyzer.  (preferably something you know will never be searched on)
  

Or add them separately (rather than:
 doc.add(new Field(email, f...@bar.com b...@foo.com c...@bar.foo ...);
use
 doc.add(new Field(email, f...@bar.com);
 doc.add(new Field(email, b...@foo.com);
 doc.add(new Field(email, c...@bar.foo);
), using an Analyzer that overrides getPositionIncrementGap(). This inserts
a 'gap' between each set of Tokens for the same Field, which stops phrase
queries from 'crossing the boundaries' between subsequent values.



I like the sound of that! I think I understand it.
getPositionIncrementGap() returns 0 by default which keeps the email
field tokens sequential. Overriding with 1, will add an effective
blank token between the email addresses (overriding with 2 would leave
2). Similar to Matthew's delimiter token, but a bit neater.

So the token (with positions in brackets) would look something like this.

foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)

Up until now I've only been using the WhiteSpaceAnalyzer, as I've been
keeping quite a tight control over the fields going into the index
(not making best use of Lucene).

What Analyzer would you recommend I use for this. I'll also be
indexing IPs, and other things, but that's pretty much the same story.
It seems I have to use the same Analyzer for the all the fields in the
index?

I've been looking at StandardAnalyzer, but I do not want to remove
stop words. I want to keep letters and numbers mainly, and also
override getPositionIncrementGap? Is there anything that does these
things already, or close to it? Overriding getPositionIncrementGap
shouldn't be difficult though.

Cheers,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to index IP addresses?

2009-07-30 Thread Matthew Hall
I'm a little unclear on how you could be getting both aa.bb.cc.dd as a 
term, and then also the octets.


Are you adding the contents field into the index multiple times, 
possibly with separate analyzers?


Could you possibly try a test, very simple case?

Just create an index with a single lucene document, with that documents 
contents being aa.bb.cc.dd and then take a look at the index via Luke 
again.


When you look at the terms section (Its what comes up by default) you 
SHOULD see only aa, bb, cc, and dd as the top (and thusly ONLY 
terms in the index).  This could vary depending on your analyzer, as 
some will show an index containing only a single term aa.bb.cc.dd.  
What I would not expect is an index that would contain both.


Furthermore by making the field not analyzed you will now have a 
trickier time searching for it.  As you will need to use a keyword 
analyzer or something similar to search, which if I'm understanding the 
spirit of your problem isn't really something that you want to do.


So, if you could run that test scenario that I've outlined for you I 
think you should be able to have a nice test bed to see what the results 
of swapping to different analyzers will have on the data that you are 
trying to index.  Then, after you have played with that a bit you should 
be able to re-expand your corpus again, and see if the analyzer you have 
chosen continues to stand up. 

I.. had thought that StandardAnalyzer already kept IP addresses together 
as a single token, but maybe its doing something... special and 
interesting and thusly you are seeing the behavior that you are describing.


Matt

oh...@cox.net wrote:

Hi,

Oh.  Ok, thanks!  I'll give that a try.

Jim


 Armasu wrote: 
  

Keyword: Field.Index.NOT_ANALYZED

-Original Message-
From: oh...@cox.net [mailto:oh...@cox.net] 
Sent: Thursday, July 30, 2009 4:36 PM

To: java-user@lucene.apache.org
Subject: How to index IP addresses?

Hi,

I am trying to index information in some proprietary-formatted files.  


In particular, these files contain some IP addresses in dotted notation, e.g., 
aa.bb.cc.dd.

For my initial test, I have a Document implementation, and after I extract what I need 
into a String named Info, I do:

doc.add(new Field(contents, Info, Field.Store.YES, Field.Index.ANALYZED));

From looking at the resulting index using Luke, it appears that I am getting terms for 
the full IP address string (e.g., aa.bb.cc.dd), but I am also getting terms 
for each octet of each IP address string, e.g.:

aa
bb
cc
dd

I'm still just getting started with Lucene, but from the research that I've done, it seems like 
Lucene is treating the . in the dotted notation strings as noise.  Is that 
correct?

If so, is there a way to get it not to do that?

Thanks,
Jim

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Amazon Development Center (Romania) S.R.L. registered office: 37 Lazar Street, 
floor 5, Iasi, Iasi County, Iasi 700049, Romania. Registered in Romania. 
Registration number J40/12967/2005.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing multiple email addresses in one field

2009-07-30 Thread Matthew Hall

1. Sure, just have an analyzer that splits on all non letter characters.
2. Phrase queries keep the order intact.  (And yes, the positional 
information for the terms is kept, which is what allows span queries to 
work)


So searching on the following foo bar com will match f...@bar.com but 
not b...@foo.com


Matt

Phil Whelan wrote:

Hi,

We have a very large lucene index that we're developing that has a
field of email addresses. (Actually mulitple fields with multiple
emails addresses, but I'll simplify here)

Each document will have one email field containing multiple email addresses.

I am indexing email addresses only using WhitespaceAnalyzer, so to
preserve the exact adresses and store multiple emails for one
document.

Example...
doc.add(new Field(email, f...@bar.com b...@foo.com c...@bar.foo,
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:f...@bar.com
email:b...@foo.com
email:c...@bar.foo

The problem I having is that these terms are rarely re-used in other
documents. There is little overlap with email usage, and there is a
lot of very long emails addresses. Because of this, the number of
terms in my index is very big and I think it's is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms foo, bar, and
com, is Lucene able to find email:f...@bar.com without matching
email:c...@foo.bar?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me anwer question 1.

Thanks,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing multiple email addresses in one field

2009-07-30 Thread Matthew Hall
Place a delimiter between the email addresses that doesn't get removed 
in your analyzer.  (preferably something you know will never be searched on)


That way you can ensure that each email matches independently of each other.

So something like

f...@bar.com DELIM123 b...@foo.com DELIM123 c...@bar.foo

Matt


Phil Whelan wrote:

On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall
mh...@informatics.jax.org wrote:
  

1. Sure, just have an analyzer that splits on all non letter characters.
2. Phrase queries keep the order intact.  (And yes, the positional information 
for the terms is kept, which is what allows span queries to work)

So searching on the following foo bar com will match f...@bar.com but not 
b...@foo.com



Thanks, I really appreciate your help with this. That's great to know.
Can I take this a little further...

If I have f...@bar.com b...@foo.com c...@bar.foo and analyze it I get
foo bar com bar foo com com bar foo, so perhaps I need a different
way of delimiting the emails, as it will match some other combinations
here, eg. f...@com.com which is not one of the emails.

Has anyone done anything similar? I can imagine that one option would
be to filter the returned docs based on the original content of the
string I'm analyzing. Does Lucene do this for me?

Thanks,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New to Lucene - some questions about demo

2009-07-28 Thread Matthew Hall

Restart tomcat.

When the indexes are read in at initialization time they are a snapshot 
of what the indexes contained at that moment.


Unless the demo specifically either closes its IndexReader and creates a 
new one, or calls IndexReader.reopen periodically (Which I don't 
remember it doing) you will not see updates in the web app until you 
restart.


Matt

Ohaya wrote:

Hi,

I'm just starting to work with Lucene, and I guess that I learn best 
by working with code, so I've started with the demos in the Lucene 
distribution.


I got the IndexFiles.java and IndexHTML.java working, and also the 
luceneweb.war is deployed to Tomcat.


I used IndexFiles.java to index some text files, and then used both 
the SearchFiles.java and the luceneweb web app to do some testing.


One of the things that I noticed with the luceneweb web app is that 
when I searched, the search results returned Summary of null, so I 
added:


doc.add(new Field(summary, FooFoo, Field.Store.YES, 
Field.Index.NOT_ANALYZED));


to the IndexFiles.java, and ran it again.

I had expected that I'd then be able to do a search for something like 
summary:foofoo, but when I did that, I got no results.


I also tried SearchFiles.java, and again got no results.

I tried using Luke, and that is showing that the summary field is in 
the indexes, so I'm wondering why I am not able to search on other 
fields such as summary, path, etc.?


Can anyone explain what else I need to do, esp. in the luceneweb web 
app, to be able to search these other fields?


Thanks!

Jim


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New to Lucene - some questions about demo

2009-07-28 Thread Matthew Hall
Oh, also check to see which Analyzer the demo webapp/indexer is using.  
Its entirely possible the analyzer that has been chosen isn't 
lowercasing input, which could also cause you issues.


I'd be willing to bet your issue lies in one of these two problems I've 
mentioned ^^


Matt

Matthew Hall wrote:

Restart tomcat.

When the indexes are read in at initialization time they are a 
snapshot of what the indexes contained at that moment.


Unless the demo specifically either closes its IndexReader and creates 
a new one, or calls IndexReader.reopen periodically (Which I don't 
remember it doing) you will not see updates in the web app until you 
restart.


Matt

Ohaya wrote:

Hi,

I'm just starting to work with Lucene, and I guess that I learn best 
by working with code, so I've started with the demos in the Lucene 
distribution.


I got the IndexFiles.java and IndexHTML.java working, and also the 
luceneweb.war is deployed to Tomcat.


I used IndexFiles.java to index some text files, and then used both 
the SearchFiles.java and the luceneweb web app to do some testing.


One of the things that I noticed with the luceneweb web app is that 
when I searched, the search results returned Summary of null, so 
I added:


doc.add(new Field(summary, FooFoo, Field.Store.YES, 
Field.Index.NOT_ANALYZED));


to the IndexFiles.java, and ran it again.

I had expected that I'd then be able to do a search for something 
like summary:foofoo, but when I did that, I got no results.


I also tried SearchFiles.java, and again got no results.

I tried using Luke, and that is showing that the summary field is 
in the indexes, so I'm wondering why I am not able to search on other 
fields such as summary, path, etc.?


Can anyone explain what else I need to do, esp. in the luceneweb web 
app, to be able to search these other fields?


Thanks!

Jim


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New to Lucene - some questions about demo

2009-07-28 Thread Matthew Hall

Yeah, Ian has it nailed on the head here.

Can't believe I missed it in the initial writeup.

Matt

Ian Lea wrote:

Jim


Glancing at SearchFiles.java I can see

Analyzer analyzer = new StandardAnalyzer();
...
QueryParser parser = new QueryParser(field, analyzer);
...
Query query = parser.parse(line);

so any query term you enter will be run through StandardAnalyzer which
will, amongst other things, convert it to lowercase and will not match
the indexed value of FooFoo.  If you're just playing, it would
probably be easiest to tell lucene to analyze the summary field e.g.

doc.add(new Field(summary, FooFoo, Field.Store.YES, Field.Index.ANALYZED));

That will cause FooFoo to be indexed as foofoo and thus should be
matched on search.


--
Ian.


On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote:
  

Ian and Matthew,

I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo.  No 
results returned for any of those :(.

Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think 
that's the problem either :(...

I looked at the SearchFiles.java code, and it looks like it's literally using whatever 
query string I'm entering (ditto for luceneweb).  Is there something with the query 
itself that needs to be modified to support searching on the fields other than the 
contents field (recall, I'm pretty sure that all those other fields are in 
the index, via Luke)?

Jim



 Ian Lea ian@gmail.com wrote:


Hi


Field.Index.NOT_ANALYZED means it will be stored as is i.e. FooFoo
in your example, and if you search for foofoo it won't match.  A
search for FooFoo would, assuming that your search terms are not
being lowercased.



--
Ian.


On Tue, Jul 28, 2009 at 1:56 PM, Ohayaoh...@cox.net wrote:
  

Hi,

I'm just starting to work with Lucene, and I guess that I learn best by
working with code, so I've started with the demos in the Lucene
distribution.

I got the IndexFiles.java and IndexHTML.java working, and also the
luceneweb.war is deployed to Tomcat.

I used IndexFiles.java to index some text files, and then used both the
SearchFiles.java and the luceneweb web app to do some testing.

One of the things that I noticed with the luceneweb web app is that when I
searched, the search results returned Summary of null, so I added:

doc.add(new Field(summary, FooFoo, Field.Store.YES,
Field.Index.NOT_ANALYZED));

to the IndexFiles.java, and ran it again.

I had expected that I'd then be able to do a search for something like
summary:foofoo, but when I did that, I got no results.

I also tried SearchFiles.java, and again got no results.

I tried using Luke, and that is showing that the summary field is in the
indexes, so I'm wondering why I am not able to search on other fields such
as summary, path, etc.?

Can anyone explain what else I need to do, esp. in the luceneweb web app, to
be able to search these other fields?

Thanks!

Jim


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New to Lucene - some questions about demo

2009-07-28 Thread Matthew Hall

You can choose to do either,

Having items in multiple fields allows you to apply field specific 
boosts, thusly making matches to certain fields more important to others.


But, if that's not something that you care about the second technique is 
useful in that it vastly simplifies your index structure (And thusly 
your query structure)


So, it depends on what you want to be able to do in the end.  Do you 
envision doing something like being able to search by the summary and 
the contents at the same time, but weighing hits to the summary as a 
higher priority?
If so, use multiple fields.  If not, keep this first iteration in lucene 
simple, and compress everything down.  Also please note that the +   + 
in the example cited is important.  That space will ensure that your 
contents and summary fields will be tokenized properly. (Just in case 
they are single words lets say).


Matt



oh...@cox.net wrote:

Hi Matthew and Ian,

Thanks, I'll try that, but, in the meantime, I've been doing some reading (Lucene in Action), and on pg. 159, section 5.3, it discusses Querying on multiple fields.  


I was just about to try to what's described in that section, i.e., using 
MultiFieldQueryParser.parse(), or, as another note on pg. 161 mentions, doing 
something like:

doc.add(Field.Unstored(contents, contents +   + summary);

So, I guess I'm a little confused (happens a lot :)!):  In the situation I'm talking about 
(starting with the Lucene demo and demo webapp, and trying to be able to index and search more than 
just the contents field), do I not need to use the MultiFieldQueryParser.parse() or do 
what they call create a synthentic content?

Thanks,
Jim


 Matthew Hall mh...@informatics.jax.org wrote: 
  

Yeah, Ian has it nailed on the head here.

Can't believe I missed it in the initial writeup.

Matt

Ian Lea wrote:


Jim


Glancing at SearchFiles.java I can see

Analyzer analyzer = new StandardAnalyzer();
...
QueryParser parser = new QueryParser(field, analyzer);
...
Query query = parser.parse(line);

so any query term you enter will be run through StandardAnalyzer which
will, amongst other things, convert it to lowercase and will not match
the indexed value of FooFoo.  If you're just playing, it would
probably be easiest to tell lucene to analyze the summary field e.g.

doc.add(new Field(summary, FooFoo, Field.Store.YES, Field.Index.ANALYZED));

That will cause FooFoo to be indexed as foofoo and thus should be
matched on search.


--
Ian.


On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote:
  
  

Ian and Matthew,

I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo.  No 
results returned for any of those :(.

Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think 
that's the problem either :(...

I looked at the SearchFiles.java code, and it looks like it's literally using whatever 
query string I'm entering (ditto for luceneweb).  Is there something with the query 
itself that needs to be modified to support searching on the fields other than the 
contents field (recall, I'm pretty sure that all those other fields are in 
the index, via Luke)?

Jim



 Ian Lea ian@gmail.com wrote:



Hi


Field.Index.NOT_ANALYZED means it will be stored as is i.e. FooFoo
in your example, and if you search for foofoo it won't match.  A
search for FooFoo would, assuming that your search terms are not
being lowercased.



--
Ian.


On Tue, Jul 28, 2009 at 1:56 PM, Ohayaoh...@cox.net wrote:
  
  

Hi,

I'm just starting to work with Lucene, and I guess that I learn best by
working with code, so I've started with the demos in the Lucene
distribution.

I got the IndexFiles.java and IndexHTML.java working, and also the
luceneweb.war is deployed to Tomcat.

I used IndexFiles.java to index some text files, and then used both the
SearchFiles.java and the luceneweb web app to do some testing.

One of the things that I noticed with the luceneweb web app is that when I
searched, the search results returned Summary of null, so I added:

doc.add(new Field(summary, FooFoo, Field.Store.YES,
Field.Index.NOT_ANALYZED));

to the IndexFiles.java, and ran it again.

I had expected that I'd then be able to do a search for something like
summary:foofoo, but when I did that, I got no results.

I also tried SearchFiles.java, and again got no results.

I tried using Luke, and that is showing that the summary field is in the
indexes, so I'm wondering why I am not able to search on other fields such
as summary, path, etc.?

Can anyone explain what else I need to do, esp. in the luceneweb web app, to
be able to search these other fields?

Thanks!

Jim


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java

Re: New to Lucene - some questions about demo

2009-07-28 Thread Matthew Hall

Oh.. no.

If you specifically include a fieldname: blah in your clause, you don't 
need a MultiFieldQueryParser.


The purpose of the MFQP is to turn queries like this blah 
automatically into this field1: blah AND field2: blah AND field3: 
blah (Or OR if you set it up properly)


When you setup the MFQP you specify what fields you want to have this 
behavior apply to, and can even give each field its own specific analyzer.


So if in your index you have multiple fields, each of which was created 
with a different analyzer, you could search these effortlessly in your 
webapp using the MFQP.


(If for example you have an exact_contents and a contents field, one 
where punctuation and capitalization matters, one where it does not)


Hope that clears things up for you.

Matt



oh...@cox.net wrote:

Matthew,

I'll keep your comments in mind, but I'm still confused about something.

I currently haven't changed much in the demo, other than adding that doc.add for 
summary.

With JUST that doc.add, having done my reading, I kind of expected NOT to be able to search on the summary at all, but it kind of seems like SOMETIMES, I am still getting responses when I search on something in summary.  


Does that mean that Lucene will automatically do multi-field searching?

Maybe I've been up too long, but it seems like, for example, when I search on 
summary:foofoo I am not getting a response, but, for example, if I search on:

summary:foofoo AND contents:test1

I get results in the search response.

Since I haven't yet added the MultiField query, shouldn't it ONLY be searching on the 
contents field (because the summary:foofo should have been false, and 
because I am using an AND)?

Like I said, maybe I've been staring at this too long, and need to do some more 
structured testing :)...

Sorry.

Later,
Jim




 Matthew Hall mh...@informatics.jax.org wrote: 
  

You can choose to do either,

Having items in multiple fields allows you to apply field specific 
boosts, thusly making matches to certain fields more important to others.


But, if that's not something that you care about the second technique is 
useful in that it vastly simplifies your index structure (And thusly 
your query structure)


So, it depends on what you want to be able to do in the end.  Do you 
envision doing something like being able to search by the summary and 
the contents at the same time, but weighing hits to the summary as a 
higher priority?
If so, use multiple fields.  If not, keep this first iteration in lucene 
simple, and compress everything down.  Also please note that the +   + 
in the example cited is important.  That space will ensure that your 
contents and summary fields will be tokenized properly. (Just in case 
they are single words lets say).


Matt



oh...@cox.net wrote:


Hi Matthew and Ian,

Thanks, I'll try that, but, in the meantime, I've been doing some reading (Lucene in Action), and on pg. 159, section 5.3, it discusses Querying on multiple fields.  


I was just about to try to what's described in that section, i.e., using 
MultiFieldQueryParser.parse(), or, as another note on pg. 161 mentions, doing 
something like:

doc.add(Field.Unstored(contents, contents +   + summary);

So, I guess I'm a little confused (happens a lot :)!):  In the situation I'm talking about 
(starting with the Lucene demo and demo webapp, and trying to be able to index and search more than 
just the contents field), do I not need to use the MultiFieldQueryParser.parse() or do 
what they call create a synthentic content?

Thanks,
Jim


 Matthew Hall mh...@informatics.jax.org wrote: 
  
  

Yeah, Ian has it nailed on the head here.

Can't believe I missed it in the initial writeup.

Matt

Ian Lea wrote:



Jim


Glancing at SearchFiles.java I can see

Analyzer analyzer = new StandardAnalyzer();
...
QueryParser parser = new QueryParser(field, analyzer);
...
Query query = parser.parse(line);

so any query term you enter will be run through StandardAnalyzer which
will, amongst other things, convert it to lowercase and will not match
the indexed value of FooFoo.  If you're just playing, it would
probably be easiest to tell lucene to analyze the summary field e.g.

doc.add(new Field(summary, FooFoo, Field.Store.YES, Field.Index.ANALYZED));

That will cause FooFoo to be indexed as foofoo and thus should be
matched on search.


--
Ian.


On Tue, Jul 28, 2009 at 2:22 PM, oh...@cox.net wrote:
  
  
  

Ian and Matthew,

I've tried foofoo, summary:foofoo, FooFoo, and summary:FooFoo.  No 
results returned for any of those :(.

Also, Matthew, I bounced Tomcat after running IndexFiles, so I don't think 
that's the problem either :(...

I looked at the SearchFiles.java code, and it looks like it's literally using whatever 
query string I'm entering (ditto for luceneweb).  Is there something with the query 
itself that needs to be modified to support searching on the fields other than the 
contents field (recall, I'm

Re: Batch searching

2009-07-23 Thread Matthew Hall
This was at least one of the threads that was bouncing around... I'm 
fairly sure there were others as well.


Hopefully its worth the read to you ^^

http://www.opensubscriber.com/message/java-...@lucene.apache.org/11079539.html

Phil Whelan wrote:

On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hallmh...@informatics.jax.org wrote:
  

Not sure if this helps you, but some of the issue you are facing seem
similar to those in the real time search threads.



Hi Matthew,

Do you have a pointer of where to go to see the real time threads?

Thanks,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combining hits

2009-07-23 Thread Matthew Hall
Erm.. I have to be missing something here, wouldn't you be able just do 
the following:


do a search on Term 1 AND Term 2
do a search on Term 2 AND Term2 AND Term 3

This would ensure that you have two objects back, one of which is 
guaranteed to be a subset of the other.


Then, when you are iterating on your documents to do your highlighting 
over the results from the first search (At least I think that's what you 
are doing here) check to see if the current document exists in the hits 
or topDocs object that came from the second search.  If it does, use the 
three term highlighter, if it doesn't use the two term highlighter.


But, what sort of reordering are you trying to do here anyhow?

Doing just a normal search against Term 1 OR Term 2 OR Term 3 with 
a standard highlighter would most likely get you ... well exactly the 
same results as what you are describing.  The only real difference I 
could see is the order that the documents are returned to you.


Matt

Max Lynch wrote:

Hi,
I am doing a search on my index for a query like this:

query = \Term 1\ \Term 2\ \Term 3\

Where I want to find Term 1, Term 2 and Term 3 in the index.  However, I
only want to search for Term 3 if I find Term 1 and Term 2 first, to
avoid doing processing on hits that only contain Term 3.  To do this, I
was thinking of doing a search for \Term 1\ \Term 2\ and then if there
are hits for these terms, I would do another search for Term 3 on these
resulting documents.  I am running a background search so I am not too
worried performance issues caused by searching twice.

Is there a way to search on subset of documents and then combining the hits
for the document?  For example, if Term 1 and Term 2 are found in Document1,
and Term3 is also later found in Document1, I want to be able to process the
hits on my highlighter as containing all three terms.

Sorry if it's confusing.

Thanks,
Max

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combining hits

2009-07-23 Thread Matthew Hall

Looking at what you wrote:

I am doing a weighting system where I rank documents that have Term 1 AND
Term 2 AND Term 3 more highly than documents that have just Term 1 AND Term
2, and more highly than documents that just have Term 1 OR Term 2 but not
both.

Couldn't you maybe get the same effect using some clever term boosting?

I.. think something like

Term 1 OR Term 2 OR Term 3 ^ .25

would return in almost the exact order that you are asking for here, 
with the only real difference being that you would have some matches for 
only Term 3 way way at the bottom of your list score wise.


It might be worth investigating something like this, where you cut off 
displaying documents that don't match a certain score thresh hold.  Thus 
cutting out the matches that you don't want (The term3 only ones)


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Batch searching

2009-07-22 Thread Matthew Hall
If you did this, wouldn't you be binding the processing of the results 
of all queries to that of the slowest performing one within the collection?


I'm guessing you are trying for some sort of performance benefit by 
batch processing, but I question whether or not you will actually get 
more performance by performing your queries in a threaded type 
environment, and then processing their results as they come in.


Could you give a bit more description about what you are actually trying 
to accomplish, I'm sure this list could help better if we had more 
information.


Matt

tsuraan wrote:

If I understand lucene correctly, when doing multiple simultaneous
searches on the same IndexSearcher, they will basically all do their
own index scans and collect results independently.  If that's correct,
is there a way to batch searches together, so only one index scan is
done?  What I'd like is a Searcher.search(Query[], Collector[]) type
function, where the search only scans over the index once for each
collection of (basically unrelated) searches.  Is that possible, or
does that even make sense?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Batch searching

2009-07-22 Thread Matthew Hall
Out of curiosity, what is the size of your corpus?  How much and how 
quickly do you expect it to grow?


I'm just trying to make sure that we are all on the same page here ^^

I can see the benefits of doing what you are describing with a very 
large corpus that is expected to grow at quick rate, but if that's not 
really your use case, then perhaps it might be worth investigating if a 
simpler solution would serve you just as well.


In the example you provided, you are only talking about searching 
against 1M documents, which I can guarantee will search with VERY good 
performance in a single properly setup lucene index.


Now if we are talking more on the order of... 100M or more documents you 
may be onto something.


Well, that's my thoughts anyhow

Matt

tsuraan wrote:

If you did this, wouldn't you be binding the processing of the results
of all queries to that of the slowest performing one within the collection?



I would imagine it would, but I haven't seen too much variance between
lucene query speeds in our data.

  

I'm guessing you are trying for some sort of performance benefit by
batch processing, but I question whether or not you will actually get
more performance by performing your queries in a threaded type
environment, and then processing their results as they come in.

Could you give a bit more description about what you are actually trying
to accomplish, I'm sure this list could help better if we had more
information.



What I'd like to do is build lots of small indices (a few thousand
documents per index) and put them into HDFS for search distribution.
We already have our own map-reduce framework for searching, but HDFS
seems to be a really good fit for an actual storage mechanism.

My concern is that when we have one searcher using thousands of
HDFS-backed indices, the seeking might get a bit nasty. HDFS
apparently has pretty good seeking performance, but it really looks
like it was designed for streaming, so if I could make my searches use
sequential index access, I would expect better performance than having
a ton of simultaneous searches making HDFS seek all over the place.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Batch searching

2009-07-22 Thread Matthew Hall
Not sure if this helps you, but some of the issue you are facing seem 
similar to those in the real time search threads.


Basically their problem involves indexing twitter and the blogosphere, 
and making lucene work for super large data sets like that.


Perhaps some of the discussion in those threads could help.  I'd imagine 
they went over things like massively distributed searches and such, and 
other things that might be of interest to you.


Sorry I can't be of more help than that.

Matt

tsuraan wrote:

Out of curiosity, what is the size of your corpus?  How much and how
quickly do you expect it to grow?



in terms of lucene documents, we tend to have in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, but merging takes a
long time so I'm trying to test out just using a ton of tiny indices
to see if the search penalty from doing that is worth the time savings
from not having to build and optimize large indices.

  

I'm just trying to make sure that we are all on the same page here ^^

I can see the benefits of doing what you are describing with a very
large corpus that is expected to grow at quick rate, but if that's not
really your use case, then perhaps it might be worth investigating if a
simpler solution would serve you just as well.



The indices also grow pretty quickly.  We have some cases where we get
nearly a million new documents per day.  I haven't looked at those
machines for quite a while, but I guess they'd probably have well over
a hundred million documents, and still are growing.  We also don't
have a lot of simultaneous searches yet, but that's changing, so I'm
getting concerned about how well that's being handled.  We expect that
we will soon be dealing with tens to hundreds of searches being
executed simultaneously.

  

In the example you provided, you are only talking about searching
against 1M documents, which I can guarantee will search with VERY good
performance in a single properly setup lucene index.

Now if we are talking more on the order of... 100M or more documents you
may be onto something.



But in any case, there isn't currently any framework for making
multiple searches simultaneously use an index in a coordinated
fashion.  I was pretty much just planning on adding it to my tests if
such a thing existed.  Since it doesn't, I guess I'll stuck to
searching in parallel and hoping that the Linux VFS layer is smart
enough to keep things fast until I have time to try putting something
together myself.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tokenizer queston: how can I force ? and ! to be separate tokens?

2009-07-17 Thread Matthew Hall

I'd think extending WhiteSpaceTokenizer would be a good place to start.

Then create a new Analyzer that exactly mirrors your current Analyzer, 
with the exception that it uses your new tokenizer instead of 
WhiteSpaceTokenizer (Well.. there is of course my assumption that you 
are using an Analyzer that already uses WhiteSpaceTokenizer... but you 
likely are)


OBender wrote:

Hi All,

 


I need to make ? and ! characters to be a separate token e.g. to split [how
are you?] in to 4 tokens [how], [are], [you] and [?] what would be the best
way to do this?

 


Thanks


  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Search in non-linguistic text

2009-07-16 Thread Matthew Hall
Assuming your dataset isn't incredibly large, I think you could.. cheat 
here, and optimize your data for searching.


Am I correct in assuming that BC, should also match on ABCD?

If so, then yes your current thoughts on the problems that you face are 
correct, and everything you do will be turning into a contains search, 
which is yes.. not the best performance you have ever seen.


However, knowing this, you can manipulate your data in such a way, that 
you can get around that limitation, and turn everything into a prefix 
(or postfix) search if you so prefer.


So here's what you do:

When you are indexing the term ABCD, you are actually going to add 
several documents into the index (or into various special purpose 
indexes, if you so prefer.. but more on that later on)


Lets say you want to turn everything into a prefix search under the covers.

In the index you would store the following values, all of which point at 
the document ABCD


'ABCD'
'BCD'
'CD'
'D'

Then, when you do your search for the terms BC you will really be 
searching on BC*, which will produce a match to the second document. 

Now Lucene documents can be considered as giant data holding object, you 
can and SHOULD have fields in the document that are not used at search 
time, but ARE used at display generation time (or whatever layer feeds 
your display, if you are going in a more OO fashion).


Now this technique isn't without its drawbacks of course, you will see 
an increase in your index size, but unless you are playing around with 
some VERY large datasets that really shouldn't matter.


Now, if I was the one implementing this, I would probably make at least 
two indexes, one for exact punctuation relevant data.  The other index 
would contain the data that I've described above, with one important 
difference, any and all punctuation (including whitespace) has been 
removed, and all of the letters in your codes were collapsed down into a 
single word.  That way you can perform two searches, and ensure that 
exact punctuation relevant matches will appear higher in your results 
list than non punctuation relevant ones.


Anyhow, that's pretty much it in a nutshell.  I think this technique 
should work for you, after you have decided


JesL wrote:

Hello,
Are there any suggestions / best practices for using Lucene for searching
non-linguistic text?  What I mean by non-linguistic is that it's not English
or any other language, but rather product codes.  This is presenting some
interesting challenges.  Among them are the need for pretty lax wildcard
searches.  For example, ABC should match on ABCD, but so should BCD.  Also,
it needs to be agnostic to special characters.  So, ABC/D should match ABCD
as well as ABC-D or ABC D.

As I write an analyzer to handle these cases, I seem to be pretty quickly
degrading into a like '%blah%' search, with rules to treat all special
characters as single-character, optional wildcards.  I'm concerned that the
performance of this will be disappointing, though.

Any help would be much appreciated.  Thanks!

- Jes
  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ugh

2009-07-16 Thread Matthew Hall
 
They are upgrading our mail servers here, so if you are seeing.. many 
MANY duplicates of things I posted.. I'm really sorry about that. T_T


Matt

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Highligheter fails using JapaneseAnalyzer

2009-07-02 Thread Matthew Hall
Out of curiosity, when you try your other test string aaa _bbb ccc 
what do the token byte offsets show?


Matt

Mark Harwood wrote:


On 1 Jul 2009, at 17:39, k.sayama wrote:


I could verify Token byte offsets

The sytsem outputs
aaa:0:3
bbb:0:3
ccc:4:7



That explains the highlighter behaviour. Clearly BBB is not at 
position 0-3 in the String you supplied



String CONTENTS = AAA :BBB CCC;


Looks like the Tokenizer needs fixing. Is this yours or a standard 
Lucene class? If the latter, raising a JIRA bug with a Junit test 
would be the best way to get things moving.



Cheers
Mark

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Highligheter fails using JapaneseAnalyzer

2009-06-30 Thread Matthew Hall
Does the same thing happen when you use SimpleAnalyzer, or StandardAnalyzer?

I have a sneaking suspicion that the : in your contents string is what's
causing your issue here, as : is a reserved character that denotes a
field specification. But I could be wrong.

Try swapping analyzers, if you no longer have the same issue with
Simple, try Standard. Assuming the same problem shows up there, I think
you might need to do something about the :.

Matt

k.sayama wrote:
 hello.

 i've tried to highlight string using Highligheter(2.4.1) and
 JapaneseAnalyzer
 but the following code extract show the problem

 String F = f;
 String CONTENTS = AAA :BBB CCC;
 JapaneseAnalyzer analyzer = new JapaneseAnalyzer();
 QueryParser qp = new QueryParser( F, analyzer );
 Query query = qp.parse( BBB );
 Highlighter h = new Highlighter( new QueryScorer( query, F ) );

 System.out.println( h.getBestFragment( analyzer, F, CONTENTS ) );

 The sytsem outputs
 BAAA/B :BBB CCC

 When you change CONTENTS to AAA _BBB CCC
 the system outputs

 AAA _/B CCC

 Are there any problems?
 Thanks in advance

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query which gives high score proportional to 'distinct term matches'

2009-06-30 Thread Matthew Hall
Well, we have a very similar requirement here, but for us its for every 
single field that we wanted this kind of behavior.


We got this in by eliminating the TF (Term Frequency) contribution to 
score via a custom Similarity. (Which is very easy to do.)


I... think in the newer versions of lucene you can omit TF more 
programatically at query time, but I don't recall if you could do it on 
a per field basis.  Anyone else want to speak on this a bit better?


Matt

chandrakant k wrote:
I have a index which has got fields like 


title :
content :

If I search for, lets say  obama fly ,  then the documents having obama and
fly should be given high scores irrespective of the number of times they may
occur. This requirement is for fields -  title and content.

The implementation which I did with a simple OR query will score high the
documents for e.g.
 having more occurrence of 'obama'  even if it has no occurrence  'fly' word
in it. The tf for 'obama' here in this case is more; so even if 'fly' word
is not present the document is scored higher.

Expected behaviour is that - 
(a)  documents having 'obama' and 'fly' both should be scored higher in

order of their tf .
(b)  documents having either of terms should be given scores but less than
those matched in (a)

I tried by overiding the the coord() in a Custom Similarity implementation
and boosting it if multiple terms match, but what I see is that coord() is
gets boosted even if same word matches in multiple fields (say obama is
present in title: and content: ).

Searching for solutions, I have not got any results which talk about similar
requirement... I guess I am not using right keywords

Thanks
Chandrakant K.




  
  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: No hits while searching!

2009-06-01 Thread Matthew Hall
(

 PhysicianDocumentBuilder.PhysicianFieldInfo.FIRST_NAME
   .toString(), new
MetaphoneReplacementAnalyzer());

   wrapper.addAnalyzer(

 PhysicianDocumentBuilder.PhysicianFieldInfo.LAST_NAME
   .toString(), new
MetaphoneReplacementAnalyzer());

   }

   /**
* @see PerFieldAnalyzerWrapper#tokenStream(String, Reader)
*/
   @Override
   public TokenStream tokenStream(String fieldName, Reader
  

reader) {
  

   return wrapper.tokenStream(fieldName, reader);
   }

}

//lastly the query builder
if(physicianQuery.getExactNameSearch()){

 if(StringUtils.isNotEmpty(physicianQuery.getFirstNameStartsWith())){
   TermQuery term = new TermQuery(new
Term(FIRST_NAME_EXACT.toString(),
physicianQuery.getFirstNameStartsWith()));
   query.add(term,MUST);

   }

 if(StringUtils.isNotEmpty(physicianQuery.getLastNameStartsWith())){
   TermQuery term = new TermQuery(new
Term(LAST_NAME_EXACT.toString(),
physicianQuery.getLastNameStartsWith()));
   query.add(term,MUST);

   }
else{
//we want metaphone search
if (StringUtils.isNotEmpty(physicianQuery.getFirstNameStartsWith()))
  

{
  

 query.add(buildMultiTermPrefixQuery(FIRST_NAME.toString(),

 physicianQuery.getFirstNameStartsWith()), MUST);
   }

   if
(StringUtils.isNotEmpty(physicianQuery.getLastNameStartsWith())) {

 query.add(buildMultiTermPrefixQuery(LAST_NAME.toString(),

 physicianQuery.getLastNameStartsWith()), MUST);
   }
}


--
View this message in context:

  

http://www.nabble.com/No-hits-while-searching%21-tp23735920p23735920.htm
l
  

Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  



  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: No hits while searching!

2009-06-01 Thread Matthew Hall

Just build your own.

Here's exactly what you are looking for:

(Mind you I just whipped this out, and didn't compile it... so there 
could be minor syntax errors here.)


You will also obviously have to make your own package declaration, and 
your own imports.


So anyhow, the really neat thing about lucene, is being able to do 
exactly what we just did here, you can chain these tokenizers and 
filters together in almost any way you want, and create custom analyzers 
outta them.


Its a good thing to become familiar with, because I will nearly promise 
you that this analyzer here will ALSO probably be insufficient for your 
needs.


Anyhow, hope this helps.

Matt

/**
* Custom Lowercase Analyzer
*
* @author mhall
*
* This analyzer tokenizes on whitespace, and then lowercases the token.
*
*/

public class LowerCaseAnalyzer extends Analyzer {

   public LowerCaseAnalyzer() {
  super();
   }

   /**
* Worker for this Analyzer.
*
* Specifically this analyzer chains together WhitespaceTokenizer -
* LowerCaseFilter together to form customized Tokens
*/

   public TokenStream tokenStream(String fieldName, Reader reader) {
   return new LowerCaseFilter(new WhitespaceTokenizer(reader));
   }

}

vanshi wrote:

Thanks Matt  sithu. Yes, It was due to stop word analyzer...now i'm using a
simple analyzer temporarily, as I know even simple analyzer cannot handle
quotes in names. However, can somebody plz direct me towards how to handle
quotes with the name in query using lowercase analyzer?

thanks,
Vanshi

Matthew Hall-7 wrote:
  

Yeah, he's gotta be.

You might be better of using something like a lowercase analyzer here, 
since punctuation in a name is likely important.


Matt

Sudarsan, Sithu D. wrote:

 


Do you use stopword filtering?

Sincerely,
Sithu D Sudarsan

-Original Message-
From: vanshi [mailto:nilu.tha...@gmail.com] 
Sent: Monday, June 01, 2009 11:39 AM

To: java-user@lucene.apache.org
Subject: Re: No hits while searching!


Thanks Erick, I was able to get this work...as you said ..Luke is a
great
tool to look in to what gets stored as indexes though in my case I was
searching before the indexes were created so i was getting zero hits.

On side note, I'm running a strange output with prefix query...it only
works
when i have 3 or more than 3 letters in the first name/last name. Any
idea
what is going on here? Please see the output from log here.

02:05:20,996 INFO  [PhysicianQueryBuilder] Entered addTypeSpecificTerms
in
PhysicianQuerybuilder with exactName=true
02:05:20,996 INFO  [PhysicianQueryBuilder] Before running Prefix query,
First name: ang
02:05:20,996 INFO  [PhysicianQueryBuilder] Before running  Prefix query,
Last name: john
02:05:21,012 INFO  [LuceneIndexService] the query is:
+(FIRST_NAME_EXACT:ang*) +(LAST_NAME_EXACT:john*)
02:05:21,012 INFO  [LuceneIndexService] Result Size: 1

02:06:03,578 INFO  [PhysicianQueryBuilder] Entered addTypeSpecificTerms
in
PhysicianQuerybuilder with exactName=true
02:06:03,578 INFO  [PhysicianQueryBuilder] Before running term query,
First
name: a
02:06:03,578 INFO  [PhysicianQueryBuilder] Before running term query,
Last
name: johns
02:06:03,578 INFO  [LuceneIndexService] the query is: +()
+(LAST_NAME_EXACT:johns*)
02:06:03,578 INFO  [LuceneIndexService] Result Size: 0

02:08:01,548 INFO  [PhysicianQueryBuilder] Entered addTypeSpecificTerms
in
PhysicianQuerybuilder with exactName=true
02:08:01,548 INFO  [PhysicianQueryBuilder] Before running term query,
First
name: an
02:08:01,548 INFO  [PhysicianQueryBuilder] Before running term query,
Last
name: johns
02:08:01,548 INFO  [LuceneIndexService] the query is: +()
+(LAST_NAME_EXACT:johns*)
02:08:01,580 INFO  [LuceneIndexService] Result Size: 0

As one can see the query works with first name=ang but not with first
name=a
or an.

Appreciate all your inputs.

Vanshi

Erick Erickson wrote:
  
  

The most common issue with this kind of thing is that



UN_TOKENIZEDimplies
  
  

no
case folding. So if your case differs you won't get a match.

That aside, the very first thing I'd do is get a copy of Luke (google
Lucene
Luke)
and examine the index to see if what's in your index is what you



*think*
  
  

is
in there.


The second thing I'd do is look at query.toString() to see what the



actual
  
  

query is. You can even paste the output of toString() into Luke and
  
  

see
  
  

what happens.
  
  

I'm not sure what buildMultiTermPrefixQuery is all about, but I assume
you have a good reason for using that. But the other strategy I use



for
  
  

this kind of what happened? question is to peel back to simpler



cases
  
  

until I get what I expect, then build back up until it breaks.

But really get a copy of Luke, it's a wonderful tool that'll give you



lots
  
  

of
insight about what's *really* going on...

Best
Erick

On Wed

Re: Parsing large xml files

2009-05-22 Thread Matthew Hall

2g... should not be a maximum for any Jvm that I know of.

Assuming you are running a 32 bit Jvm you are actually able to address a 
bit under 4G of memory, I've always used around 3.6G when trying to max 
out a 32 bit jvm.  Technically speaking it should be able to address 4g 
under a 32 bit or, however a certain percentage of the memory is set 
aside for overhead, so you can only really use a bit less than the max.


If you have a 64 bit os/jvm (which you likely might), you can use the 
-d64 setting for your runtime environment to set your maximum memory 
much.. MUCH higher, for example we regularly use 6G of memory on our 
application servers here at the lab.


Hope this helps you a bit,

Matt

crack...@comcast.net wrote:
http://vtd-xml.sf.net 



- Original Message - 
From: Sithu D. Sudarsan sithu.sudar...@fda.hhs.gov 
To: java-user@lucene.apache.org 
Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific 
Subject: Parsing large xml files 



Hi, 

While trying to parse xml documents of about 50MB size, we run into 
OutOfMemoryError due to java heap space. Increasing JVM to use close 2GB 
(that is the max), does not help. Is there any API that could be used to 
handle such large single xml files? 

If Lucene is not the right place, please let me know alternate places to 
look for, 

Thanks in advance, 
Sithu D Sudarsan 
sithu.sudar...@fda.hhs.gov 
sdsudar...@ualr.edu 





  




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Parsing large xml files

2009-05-22 Thread Matthew Hall
Yeah, there's a setting on windows that allows you to use up to .. erm 
3G I think it was.  The limitation there is due to the silly windows 
file system.  I'm don't remember off hand exactly what that setting was, 
but I'm 100% certain that its there.


If you do a google search for jvm maximum memory settings on windows you 
should be able to find a few articles about it.


(At least that's certainly my recollection)

Secondly, if you have a linux machine available you should likely just 
use that, particularly if its a 64 bit processor because then a whole 
ton more memory becomes available to you.


When I'm developing my indexes I do it via eclipse on my windows 
platform, but with the actual directories themselves mounted from a 
solaris machine.  When I go to actually MAKE the indexes I simply login 
to the machine do a quick ant compile, and run them.  Sure its an extra 
step, but the gains are more than worth it in our case.


Matt

Sudarsan, Sithu D. wrote:
 
Hi Matt,


We use 32 bit JVM. Though it is supposed to have upto 4GB, any
assignment above 2GB in Windows XP fails. The machine has  quad-core
dual processor.

On Linux we're able to use 4GB though!

If there is any setting that will let us use 4GB do let me know.

Thanks,
Sithu D Sudarsan

-Original Message-
From: Matthew Hall [mailto:mh...@informatics.jax.org] 
Sent: Friday, May 22, 2009 8:59 AM

To: java-user@lucene.apache.org
Subject: Re: Parsing large xml files

2g... should not be a maximum for any Jvm that I know of.

Assuming you are running a 32 bit Jvm you are actually able to address a

bit under 4G of memory, I've always used around 3.6G when trying to max 
out a 32 bit jvm.  Technically speaking it should be able to address 4g 
under a 32 bit or, however a certain percentage of the memory is set 
aside for overhead, so you can only really use a bit less than the max.


If you have a 64 bit os/jvm (which you likely might), you can use the 
-d64 setting for your runtime environment to set your maximum memory 
much.. MUCH higher, for example we regularly use 6G of memory on our 
application servers here at the lab.


Hope this helps you a bit,

Matt

crack...@comcast.net wrote:
  
http://vtd-xml.sf.net 



- Original Message - 
From: Sithu D. Sudarsan sithu.sudar...@fda.hhs.gov 
To: java-user@lucene.apache.org 
Sent: Thursday, May 21, 2009 7:42:59 AM GMT -08:00 US/Canada Pacific 
Subject: Parsing large xml files 



Hi, 

While trying to parse xml documents of about 50MB size, we run into 
OutOfMemoryError due to java heap space. Increasing JVM to use close

2GB 
  

(that is the max), does not help. Is there any API that could be used

to 
  
handle such large single xml files? 


If Lucene is not the right place, please let me know alternate places

to 
  
look for, 

Thanks in advance, 
Sithu D Sudarsan 
sithu.sudar...@fda.hhs.gov 
sdsudar...@ualr.edu 





  





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall

For writing indexes?

Well I guess it depends on what you want.. but I personally use this:

(2.3.2 API)

File INDEX_DIR = /data/searchtool/thisismyindexdirectory
Analyzer analyzer = new WhateverConcreteAnalyzerYouWant();

writer = new IndexWriter(/INDEX_DIR/, /analyzer/, true);

Your best bet would be to peruse the API docs of whatever lucene version 
you are using.


However, I'm still pretty sure this ISN'T your actual issue here.

Looking at your full path example those still seem to be by reference 
to me. Let me be more specific and tell you EXACTLY what I mean by that,


Lets say you are running your program in the following directory:

/home/test/app/

Trying to open an index like you have below will effectively be trying 
to open an index in the following location:


/home/test/app/home/marco/RdfIndexLucene

What I think you MEAN to be doing is:

/home/marco/RdfIndexLucene

That leading slash is VERY VERY important, as its the entire difference 
between an relative path and an absolute one.


Matt

Marco Lazzara wrote:

I was talking with my teacher.
Is it correct to use FSDirectory?Could you please look again at the code
I've posted here??
Should I choose a different way to Indexing ??

Marco Lazzara




2009/5/22 Ian Lea ian@gmail.com

  

OK.  I'd still like to see some evidence, but never mind.

Next suggestion is the old standby - cut the code down to the absolute
minimum to demonstrate the problem and post it here.  I know you've
already posted some code, but maybe not all of it, and definitely not
cut down to the absolute minimum.


--
Ian.


On Thu, May 21, 2009 at 10:48 PM, Marco Lazzara marco.lazz...@gmail.com
wrote:


_I strongly suggest that you use a full path name and/or provide some
evidence that your readers and writers are using the same directory
and thus lucene index.
_
I try a full path like home/marco/RdfIndexLucene,even
media/disk/users/fratelli/RDFIndexLucene.But nothing is changed.

MARCOLAZZARA
_

_
  

Its been a few days, and we haven't heard back about this issue, can
we assume that you fixed it via using fully qualified paths then?

Matt

Ian Lea wrote:


Marco


You haven't answered Matt's question about where you are running it
from.  Tomcat's default directory may well not be the same as yours.
I strongly suggest that you use a full path name and/or provide some
evidence that your readers and writers are using the same directory
and thus lucene index.


--
Ian.


On Wed, May 20, 2009 at 9:59 AM, Marco Lazzara
marco.lazz...@gmail.com wrote:

  

I've posted the indexing part,but I don't use this in my app.After I
create the index,I put that in a folder like


/home/marco/RDFIndexLucece


and when I run the query I'm only searching (and not indexing).

String[] fieldsearch = new String[] {name, synonyms, propIn};
   //RDFinder rdfind = new RDFinder(RDFIndexLucene/,fieldsearch);
TreeMapInteger, ArrayListString paths;
try {
   this.paths = this.rdfind.Search(text, path);
   } catch (ParseException e1) {
   e1.printStackTrace();
   } catch (IOException e1) {
   e1.printStackTrace();
   }

Marco Lazzara



Sorry, anyhow looking over this quickly here's a summarization of
  

what


I see:

You have documents in your index that look like the following:

name which is indexed and stored.
synonyms which are indexed and stored
path, which is stored but not indexed
propin, which is stored and indexed
propinnum, which is stored but not indexed
and ... vicinity I guess which is stored but not indexed

For an analyzer you are using Standard analyzer (which considering
  

all


the Italian? is an interesting choice.)

And you are opening your index using FSDirectory, in what appears to
be a by reference fashion (You don't have a fully qualified path to
where your index is, you are ASSUMING that its in the same directory
as this code, unless FSDirectory is not implemented as I think it
  

is.)


Now can I see the consumer code?  Specifically the part where you are
opening the index/constructing your queries?

I'm betting what's going on here is you are deploying this as a war
file into tomcat, and its just not really finding the index as a
result of how the war file is getting deployed, but looking more
closely at the source code should reveal if my suspicion is correct
here.

Also runtime wise, when you run your standalone app, where
specifically in your directory structure are you running it from?
Cause if you are opening your index reader/searcher in the same way
  

as


you are creating your writer here, I'm pretty darn certain that will
cause you problems.

Matt



Marco Lazzara wrote:

  

_Could you further post your Analyzer Setup/Query Building code from
BOTH apps. _

there is only one code.It is the same for web and for standalone.
And it is exactly the real problem!!the code is 

Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall

because that's the default index write behavior.

It will create any directory that you ask it to.

Matt

Marco Lazzara wrote:

ok.I understand what you really mean but It doesn't work.
I understand one thing.For example When i try to open an index in the
following location : RDFIndexLucene/ but the folder doesn't exist,*Lucene
create an empty folder named RDFIndexLucene* in my home folder...WHY???

MARCO LAZZARA

2009/5/22 Matthew Hall mh...@informatics.jax.org

  

For writing indexes?

Well I guess it depends on what you want.. but I personally use this:

(2.3.2 API)

File INDEX_DIR = /data/searchtool/thisismyindexdirectory
Analyzer analyzer = new WhateverConcreteAnalyzerYouWant();

writer = new IndexWriter(/INDEX_DIR/, /analyzer/, true);

Your best bet would be to peruse the API docs of whatever lucene version
you are using.

However, I'm still pretty sure this ISN'T your actual issue here.

Looking at your full path example those still seem to be by reference to
me. Let me be more specific and tell you EXACTLY what I mean by that,

Lets say you are running your program in the following directory:

/home/test/app/

Trying to open an index like you have below will effectively be trying to
open an index in the following location:

/home/test/app/home/marco/RdfIndexLucene

What I think you MEAN to be doing is:

/home/marco/RdfIndexLucene

That leading slash is VERY VERY important, as its the entire difference
between an relative path and an absolute one.

Matt


Marco Lazzara wrote:



I was talking with my teacher.
Is it correct to use FSDirectory?Could you please look again at the code
I've posted here??
Should I choose a different way to Indexing ??

Marco Lazzara




2009/5/22 Ian Lea ian@gmail.com



  

OK.  I'd still like to see some evidence, but never mind.

Next suggestion is the old standby - cut the code down to the absolute
minimum to demonstrate the problem and post it here.  I know you've
already posted some code, but maybe not all of it, and definitely not
cut down to the absolute minimum.


--
Ian.


On Thu, May 21, 2009 at 10:48 PM, Marco Lazzara marco.lazz...@gmail.com

wrote:





_I strongly suggest that you use a full path name and/or provide some
evidence that your readers and writers are using the same directory
and thus lucene index.
_
I try a full path like home/marco/RdfIndexLucene,even
media/disk/users/fratelli/RDFIndexLucene.But nothing is changed.

MARCOLAZZARA
_

_


  

Its been a few days, and we haven't heard back about this issue, can
we assume that you fixed it via using fully qualified paths then?

Matt

Ian Lea wrote:




Marco


You haven't answered Matt's question about where you are running it
from.  Tomcat's default directory may well not be the same as yours.
I strongly suggest that you use a full path name and/or provide some
evidence that your readers and writers are using the same directory
and thus lucene index.


--
Ian.


On Wed, May 20, 2009 at 9:59 AM, Marco Lazzara
marco.lazz...@gmail.com wrote:



  

I've posted the indexing part,but I don't use this in my app.After I
create the index,I put that in a folder like




/home/marco/RDFIndexLucece
  


and when I run the query I'm only searching (and not indexing).
  

String[] fieldsearch = new String[] {name, synonyms, propIn};
  //RDFinder rdfind = new RDFinder(RDFIndexLucene/,fieldsearch);
TreeMapInteger, ArrayListString paths;
try {
  this.paths = this.rdfind.Search(text, path);
  } catch (ParseException e1) {
  e1.printStackTrace();
  } catch (IOException e1) {
  e1.printStackTrace();
  }

Marco Lazzara





Sorry, anyhow looking over this quickly here's a summarization of


  

what



I see:
  

You have documents in your index that look like the following:

name which is indexed and stored.
synonyms which are indexed and stored
path, which is stored but not indexed
propin, which is stored and indexed
propinnum, which is stored but not indexed
and ... vicinity I guess which is stored but not indexed

For an analyzer you are using Standard analyzer (which considering


  

all



the Italian? is an interesting choice.)
  

And you are opening your index using FSDirectory, in what appears to
be a by reference fashion (You don't have a fully qualified path to
where your index is, you are ASSUMING that its in the same directory
as this code, unless FSDirectory is not implemented as I think it


  

is.)



Now can I see the consumer code?  Specifically the part where you are
  

opening the index/constructing your queries?

I'm betting what's going on here is you are deploying this as a war
file into tomcat, and its just not really finding the index as a
result of how the war file

Re: Searching index problems with tomcat

2009-05-22 Thread Matthew Hall

humor me.

Open up your indexing software package.

Step 1: In all places where you reference your index, replace whatever 
the heck you have there with the following EXACT STRING:


/home/marco/testIndex

Do not leave off the leading slash.

After you have made these changes to the indexing software, recompile 
and create your indexes.


Step 2: After your indexing process completes do the following:

cd /home/marco/testIndex/index

You should see files in there, they will look something like this:

drwxrwxr-x   3 mhallprogs   4.0K May 18 11:19 ..
-rw-rw-r--   1 mhallprogs 80 May 21 16:47 _9j7.fnm
-rw-rw-r--   1 mhallprogs   4.1G May 21 16:50 _9j7.fdt
-rw-rw-r--   1 mhallprogs   434M May 21 16:50 _9j7.fdx
-rw-rw-r--   1 mhallprogs   280M May 21 16:52 _9j7.frq
-rw-rw-r--   1 mhallprogs   108M May 21 16:52 _9j7.prx
-rw-rw-r--   1 mhallprogs   329M May 21 16:52 _9j7.tis
-rw-rw-r--   1 mhallprogs   4.7M May 21 16:52 _9j7.tii
-rw-rw-r--   1 mhallprogs   108M May 21 16:52 _9j7.nrm
-rw-rw-r--   1 mhallprogs 47 May 21 16:52 segments_9je
-rw-rw-r--   1 mhallprogs 20 May 21 16:52 segments.gen

You have now confirmed that you are actually creating indexes.  And the 
indexes you are creating exist at EXACTLY the place you have asked them to.


Step 3: Then.. go download luke, and open these indexes.  Perform a 
query on them, confirm that the data you want is actually IN the indexes.


Step 4: Now, open up your standalone application, and replace whatever 
you are using in the to open the index with the SAME string I have 
listed above.


Perform a search, verify that the indexes are there, and actually return 
values.


Step 5: Lastly, go into your web application and again replace the path 
with the one I have above, recompile, and perform a search.  Verify that 
the indexes are actually THERE and searchable.


This.. damn well SHOULD work, if it doesn't it is likely pointing to 
some other issues in what you have setup.  For example your tomcat 
instance could perhaps not have permission to read the lucene indexes 
directory.  You should be able to tell this in the tomcat logs, BUT 
don't do this yet.  Carefully and fully follow the steps I have outlined 
for you, and then you have chased down the full debugging path for this.


If this yields nothing for you, I'd be happy to take a closer look at 
your source code, but until then give this a shot.


Oh.. if it fails, please post back EXACTLY which steps in the above 
outlined process failed for you, as that will be really really helpful.


Matt



Marco Lazzara wrote:

I dont't know hot to solve the problem..I've tried all rationals
things.Maybe the last thing is to try to index not with FSDirectory but with
something else.I have to peruse the api documentation.
But.IF IT WAS A LUCENE'S BUG???

2009/5/22 Matthew Hall mh...@informatics.jax.org

  

because that's the default index write behavior.

It will create any directory that you ask it to.

Matt


Marco Lazzara wrote:



ok.I understand what you really mean but It doesn't work.
I understand one thing.For example When i try to open an index in the
following location : RDFIndexLucene/ but the folder doesn't
exist,*Lucene
create an empty folder named RDFIndexLucene* in my home folder...WHY???

MARCO LAZZARA

2009/5/22 Matthew Hall mh...@informatics.jax.org



  

For writing indexes?

Well I guess it depends on what you want.. but I personally use this:

(2.3.2 API)

File INDEX_DIR = /data/searchtool/thisismyindexdirectory
Analyzer analyzer = new WhateverConcreteAnalyzerYouWant();

writer = new IndexWriter(/INDEX_DIR/, /analyzer/, true);

Your best bet would be to peruse the API docs of whatever lucene version
you are using.

However, I'm still pretty sure this ISN'T your actual issue here.

Looking at your full path example those still seem to be by reference
to
me. Let me be more specific and tell you EXACTLY what I mean by that,

Lets say you are running your program in the following directory:

/home/test/app/

Trying to open an index like you have below will effectively be trying to
open an index in the following location:

/home/test/app/home/marco/RdfIndexLucene

What I think you MEAN to be doing is:

/home/marco/RdfIndexLucene

That leading slash is VERY VERY important, as its the entire difference
between an relative path and an absolute one.

Matt


Marco Lazzara wrote:





I was talking with my teacher.
Is it correct to use FSDirectory?Could you please look again at the code
I've posted here??
Should I choose a different way to Indexing ??

Marco Lazzara




2009/5/22 Ian Lea ian@gmail.com





  

OK.  I'd still like to see some evidence, but never mind.

Next suggestion is the old standby - cut the code down to the absolute
minimum to demonstrate the problem and post it here.  I know you've
already posted some code, but maybe not all of it, and definitely not
cut

Re: Searching index problems with tomcat

2009-05-20 Thread Matthew Hall
Right, so again, you are opening your index by reference there.  You 
application has to assume that the index that its looking for exists in 
the same directory as the application itself lives.  Since you are 
deploying this application as a deployable war file that's not going to 
work really well.


Well.. on the other hand this seems to be commented out in this snippet, 
but wherever you actually DO initialize the directory you are using to 
help your index, try doing it with the full path.  In your example below:


RDFinder rdfind = new RDFinder(/home/marco/RDFIndexLucene/,fieldsearch);

instead of what you have written here.

Matt



Marco Lazzara wrote:

I've posted the indexing part,but I don't use this in my app.After I
create the index,I put that in a folder like /home/marco/RDFIndexLucece
and when I run the query I'm only searching (and not indexing).

String[] fieldsearch = new String[] {name, synonyms, propIn};
//RDFinder rdfind = new RDFinder(RDFIndexLucene/,fieldsearch);
TreeMapInteger, ArrayListString paths;
try {
this.paths = this.rdfind.Search(text, path);
} catch (ParseException e1) {
e1.printStackTrace();
} catch (IOException e1) {
e1.printStackTrace();
}

Marco Lazzara
  

Sorry, anyhow looking over this quickly here's a summarization of what
I see:

You have documents in your index that look like the following:

name which is indexed and stored.
synonyms which are indexed and stored
path, which is stored but not indexed
propin, which is stored and indexed
propinnum, which is stored but not indexed
and ... vicinity I guess which is stored but not indexed

For an analyzer you are using Standard analyzer (which considering all
the Italian? is an interesting choice.)

And you are opening your index using FSDirectory, in what appears to
be a by reference fashion (You don't have a fully qualified path to
where your index is, you are ASSUMING that its in the same directory
as this code, unless FSDirectory is not implemented as I think it is.)

Now can I see the consumer code?  Specifically the part where you are
opening the index/constructing your queries?

I'm betting what's going on here is you are deploying this as a war
file into tomcat, and its just not really finding the index as a
result of how the war file is getting deployed, but looking more
closely at the source code should reveal if my suspicion is correct here.

Also runtime wise, when you run your standalone app, where
specifically in your directory structure are you running it from? 
Cause if you are opening your index reader/searcher in the same way as

you are creating your writer here, I'm pretty darn certain that will
cause you problems.

Matt



Marco Lazzara wrote:


_Could you further post your Analyzer Setup/Query Building code from
BOTH apps. _

there is only one code.It is the same for web and for standalone.
And it is exactly the real problem!!the code is the same,libraries are
the same,query index etc etc. are the same.

This is the class that create index


public class AlternativeRDFIndexing {
   private Analyzer analyzer;
private Directory directory;
private IndexWriter iwriter;
private WordNetSynonymEngine wns;
private AlternativeResourceAnalysis rs;
public ArrayListString commonnodes;
   //private RDFinder rdfind = new RDFinder(RDFIndexLucene/,new
String[] {name});
   //public boolean Exists(String node) throws ParseException,
IOException{
//   //return rdfind.Exists(node);
//}
   public AlternativeRDFIndexing(String inputfilename) throws
IOException, ParseException{
 commonnodes = new ArrayListString();
   // bisogna istanziare un oggetto per fare analisi sul
documento rdf
rs = new AlternativeResourceAnalysis(inputfilename);

   ArrayListString nodelist = rs.getResources();
int nodesize = nodelist.size();
ArrayListString sourcelist = rs.getsource();
int sourcesize = sourcelist.size();
   //sinonimi
wns = new WordNetSynonymEngine(sinonimi/);
   //creazione di un analyzer standard
analyzer = new StandardAnalyzer();

//Memorizza l'indice in RAM:
   //Directory directory = new RAMDirector();
   //Memorizza l'indice su file
   directory = FSDirectory.getDirectory(RDFIndexLucene/);
   //Creazione istanza per la scrittura dell'indice
//Tale istanza viene fornita di analyzer, di un boolean per
indicare se ricreare o meno da zero
//la struttura e di una dimensione massima (o infinita
IndexWriter.MaxFieldLength.UNLIMITED)
iwriter = new IndexWriter(directory, analyzer, true, new
IndexWriter.MaxFieldLength(25000));
  //costruiamo un indice con solo n documenti: un
documento per nodo
   for (int i = 0; i  nodesize; i++){
   Document doc 

Re: Searching index problems with tomcat

2009-05-19 Thread Matthew Hall

Things that could help us immensely here.

Can you post your indexReader/Searcher initialization code from your 
standalone app, as well as your webapp.


Could you further post your Analyzer Setup/Query Building code from both 
apps.


Could you further post the document creation code used at indexing time? 
(Which analyzer, and which fields are indexed/stored)


Give us this, and I'm pretty darn sure we can nail down your issue.

Matt

Ian Lea wrote:

...
There are no exceptions.When I run the query a new shell is displayed but
 with no result.



New shell?

  

_*Are you sure the index is the same - what do IndexReader.maxDoc(),
numDocs() and getVersion() say, standalone
and in tomcat?

*_What do you mean with this question??



IndexReader ir = ...
System.out.printf(maxDoc=%s, ..., ir.maxDoc(), ...);

and run in tomcat and standalone.  To absolutely confirm you're
looking at the same index, and it has documents, etc.


--
Ian.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching index problems with tomcat

2009-05-19 Thread Matthew Hall
Sorry, anyhow looking over this quickly here's a summarization of what I 
see:


You have documents in your index that look like the following:

name which is indexed and stored.
synonyms which are indexed and stored
path, which is stored but not indexed
propin, which is stored and indexed
propinnum, which is stored but not indexed
and ... vicinity I guess which is stored but not indexed

For an analyzer you are using Standard analyzer (which considering all 
the Italian? is an interesting choice.)


And you are opening your index using FSDirectory, in what appears to be 
a by reference fashion (You don't have a fully qualified path to where 
your index is, you are ASSUMING that its in the same directory as this 
code, unless FSDirectory is not implemented as I think it is.)


Now can I see the consumer code?  Specifically the part where you are 
opening the index/constructing your queries?


I'm betting what's going on here is you are deploying this as a war file 
into tomcat, and its just not really finding the index as a result of 
how the war file is getting deployed, but looking more closely at the 
source code should reveal if my suspicion is correct here.


Also runtime wise, when you run your standalone app, where specifically 
in your directory structure are you running it from?  Cause if you are 
opening your index reader/searcher in the same way as you are creating 
your writer here, I'm pretty darn certain that will cause you problems.


Matt



Marco Lazzara wrote:

_Could you further post your Analyzer Setup/Query Building code from
BOTH apps. _

there is only one code.It is the same for web and for standalone.
And it is exactly the real problem!!the code is the same,libraries are
the same,query index etc etc. are the same.

This is the class that create index


public class AlternativeRDFIndexing {
   
private Analyzer analyzer;

private Directory directory;
private IndexWriter iwriter;
private WordNetSynonymEngine wns;
private AlternativeResourceAnalysis rs;
public ArrayListString commonnodes;
   
//private RDFinder rdfind = new RDFinder(RDFIndexLucene/,new

String[] {name});
   
//public boolean Exists(String node) throws ParseException, IOException{
//   
//return rdfind.Exists(node);

//}
   
public AlternativeRDFIndexing(String inputfilename) throws

IOException, ParseException{
 
commonnodes = new ArrayListString();
   
// bisogna istanziare un oggetto per fare analisi sul documento rdf

rs = new AlternativeResourceAnalysis(inputfilename);

   
ArrayListString nodelist = rs.getResources();

int nodesize = nodelist.size();
ArrayListString sourcelist = rs.getsource();
int sourcesize = sourcelist.size();
   
//sinonimi

wns = new WordNetSynonymEngine(sinonimi/);
   
//creazione di un analyzer standard

analyzer = new StandardAnalyzer();

//Memorizza l'indice in RAM:
   
//Directory directory = new RAMDirector();
   
//Memorizza l'indice su file
   
directory = FSDirectory.getDirectory(RDFIndexLucene/);
   
//Creazione istanza per la scrittura dell'indice

//Tale istanza viene fornita di analyzer, di un boolean per
indicare se ricreare o meno da zero
//la struttura e di una dimensione massima (o infinita
IndexWriter.MaxFieldLength.UNLIMITED)
iwriter = new IndexWriter(directory, analyzer, true, new
IndexWriter.MaxFieldLength(25000));
   
   
//costruiamo un indice con solo n documenti: un documento per nodo
   
for (int i = 0; i  nodesize; i++){
   
Document doc = new Document();
   
//creazione dei vari campi
   
// ogni documento avrˆ

// un campo name: nome del nodo
// indicazione di memorizzazione(Store.YES) e indicizzazione
con analyzer(ANALYZED)
   
String node = nodelist.get(i);
   
//if (sourcelist.contains(node)) break;
   
//if (rdfind.Exists(node)) commonnodes.add(node);
   
Field field = new Field(name, node,

Field.Store.YES,Field.Index.ANALYZED);
//Aggiunta campo al documento
doc.add(field);
   
//Aggiungo i sinonimi

String[] nodesynonyms = wns.getSynonyms(node);
for (int is = 0; is  nodesynonyms.length; is++) {
   
field = new Field(synonyms, nodesynonyms[is],

Field.Store.YES,Field.Index.ANALYZED);
//Aggiunta campo al documento
doc.add(field);
}
   
// uno o piu campi path_i: path minimali dalle sorgenti al nodo

// non indicizzati
   
for (int j = 0; j  sourcesize; j++) {
String source = sourcelist.get(j);   

NE Lucene User's Interest Group

2009-05-19 Thread Matthew Hall
Since everyone else seems to be trying to start these up I figured I 
would poll the community and see if there is any interest in the greater 
new england ares for a Lucene users group.  Searching over on Google 
leads me to believe that such a group doesn't currently exist, and I 
think it would certainly be something interesting to attempt.


Sadly. I'm betting that for any such group to succeed it would likely 
have to be Boston based, but perhaps there is a secret pocket of Lucene 
user's in Maine that I am not aware of.


So, would anyone be interested in a such a thing, with location/topics 
of discussion TBD based on interest?  Assuming there is enough interest 
in such a thing I would be willing to help organize/plan it, and I think 
I could convince my group to discuss our practical application of Lucene 
when searching Genomic information.


-Matt


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to get the word before and the word after the matched Term?

2009-05-18 Thread Matthew Hall
Well, when you get the Document object, you have access to the fields in 
that document, including the text that was searched against.


You could simply retrieve this string, and then use simple java String 
manipulation to get what you want.


Matt

Kamal Najib wrote:

Hi all,
I want to  get the word before and the word after  the matched Term.For Example if i have the Text  The drug was freshly prepared at 4-hour intervals . Eleven courses were administered to seven patients at this dose level and no patient experienced nausea or vomiting and the matched Term for example patient i want to get the word level and the word experienced(and and no are stop words, therefore i d'ont want to get them.).I have looked at the Class Termposition but in this Class i can only get the position of the matched Term, how can i get the word before and after it, any suggestion?. 
Thank you in advance.

Kamal
  




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Stemming

2009-05-08 Thread Matthew Hall

Ganesh wrote:
My opinion is Stemming process is to get the base word. Here it is not 
doing so.


Unfortunately this is where your problem lies, stemming doesn't do this, 
it breaks words that are almost lexically equivalent down into a similar 
root word. thus cat = cats.


From the wiki: *Stemming* is the process for reducing inflected (or 
sometimes derived) words to their stem 
http://en.wikipedia.org/wiki/Word_stem, base or root 
http://en.wikipedia.org/wiki/Root_%28linguistics%29 form – generally a 
written word form. The stem need not be identical to the morphological 
root http://en.wikipedia.org/wiki/Morphological_root of the word; it 
is usually sufficient that related words map to the same stem, even if 
this stem is not in itself a valid root. The algorithm 
http://en.wikipedia.org/wiki/Algorithm has been a long-standing 
problem in computer science 
http://en.wikipedia.org/wiki/Computer_science; the first paper on the 
subject was published in 1968. The process of stemming, often called 
*conflation http://en.wikipedia.org/wiki/Conflation*, is useful in 
search engines http://en.wikipedia.org/wiki/Search_engine for query 
expansion http://en.wikipedia.org/wiki/Query_expansion or indexing 
http://en.wikipedia.org/wiki/Index_%28search_engine%29 and other 
natural language processing 
http://en.wikipedia.org/wiki/Natural_language_processing problems.


But the words hard, and harder mean different things (In the opinion of 
those who developed the Snowball algorithm), and as such shouldn't be 
stemming down to a single word.


Now, I find it to be an arguable point about hard and harder not being 
close enough to stem to the same root, but in order to get this effect 
you will need to either change the snowball algorithm, or process your 
words into a more base form before they go into the stemmed, which is a 
hairy road indeed ^^


Hope this helps.

Matt

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 'problem with indexformat and luke

2009-05-08 Thread Matthew Hall

Which version of luke are you using?

Timon Roth wrote:

hello list

i am using lucene 2.9. when i try to open the index with luke i got an error:

unknown format version: -8

any hints?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Matthew Hall
Same here, sadly there isn't much call for Lucene user groups in Maine.  
It would be nice though ^^


Matt

Amin Mohammed-Coleman wrote:

I would love to come but I'm afraid I'm stuck in rainy old England :(

Amin

On 18 Apr 2009, at 01:08, Bradford Stephens 
bradfordsteph...@gmail.com wrote:



OK, we've got 3 people... that's enough for a party? :)

Surely there must be dozens more of you guys out there... c'mon,
accelerate your knowledge! Join us in Seattle!



On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens
bradfordsteph...@gmail.com wrote:

Greetings,

Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
with me in the Seattle area? I can donate some facilities, etc. -- I
also always have topics to speak about :)

Cheers,
Bradford



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ebook resources - including lucene in action

2009-04-20 Thread Matthew Hall
Strange.. as far as I can tell I never even got this email at all, was 
it not originally sent to the lucene lists?


Matt

Grant Ingersoll wrote:

Lest you think silence equals acceptance...

This is not appropriate use of these lists.

-Grant

On Apr 19, 2009, at 11:58 PM, wu fuheng wrote:


welcome to download
http://www.ultraie.com/admin/flist.php




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A Challenge!: Combining 2 searches into a single resultset?

2009-04-17 Thread Matthew Hall
-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  


--
View this message in context:
http://www.nabble.com/A-Challenge%21%3A-Combining-2-searches-into-a-single-resultset--tp23085506p23099744.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  



  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A Challenge!: Combining 2 searches into a single resultset?

2009-04-17 Thread Matthew Hall
Erm, I likely should have mentioned that this technique requires the use 
of a MultiFieldQueryParser.


Matt

Matthew Hall wrote:
If you can build an analyzer that tokenizes the second field so that 
it filters out the words you don't want, you can then take advantage 
of more intelligent queries as well.



So for the example that pjaol wrote, the query would become something 
like this:


Query= body:(game OR redskins) keyword:(redskins)^10

Depending on your corpus this may or may not be possible, the 
determining factor being whether or not the list of words being 
removed from each document to create the second field varies.


(more specifically I mean, in some documents you remove the word game, 
and in others you don't, if this is the case this technique won't work 
for you.)


Matt



theDude_2 wrote:

Ah, Interesting... I didnt think of that!  I will try it and report back


pjaol wrote:
 

Why not put the keywords into the same document as another field? and
search
both fields
at once, you can then use lucene syntax to give a boosting to the 
keyword

fields.
e.g.
body:A good game last night by the redskins
keyword: redskins

Query= body:(game OR redskins) keyword:(game OR redskins)^10

And adjust the boosting until you're happy.
Check out for querying multiple fields
http://wiki.apache.org/lucene-java/LuceneFAQ#head-300f0756fdaa71f522c96a868351f716573f2d77 



You might even want to consider Solr and it's dismax search component
http://wiki.apache.org/solr/DisMaxRequestHandler
to make it easier




On Fri, Apr 17, 2009 at 11:19 AM, theDude_2 aornst...@webmd.net 
wrote:


   

I appreciate your response, and read the wiki article concerning the
Federated search
and

I'm not sure that my project falls into the Federated Search 
bucket...


What I've done is created 2 indexes created with the same documents.
One index, contains the full documents - great for pure relevancy 
search
The second index: contains all of the same documents, but a small 
subset

of
each documents contents - only allowing words to be indexed that we 
deem

as
good words -

(for example) if this was a football article database
Index 1: would index 100% of the article about the Redskins and the 
New

York
Giants
Index 2: would index the same article by only the good words in the
document like Redskins, Giants, Quarterback, Linebacker, etc.

What I'm trying to do, if it's even possible! is run the search on 
both

indexes containing references to the same article, and multiple the
scores
together to get a final score that would represent something like a
relative AND good word score

Figuring that if a user searches on Who is the Quarterback for the
Giants
this will get the user an article that is both related to the 
query, and

deemed important to the query...

I will look further into federated search and related items, but I 
think

that lucene probably wont be able to help me with this, am I right?










pjaol wrote:
 
I'd start by doing some research on the question rather than 
asking for


a
 

solution..
What your asking for can be considered 'Federated Search'
http://en.wikipedia.org/wiki/Federated_search

And it can be conceived in as many ways as you have document 
types. Any

answer will probably end up
customized and weighted by your document silo value, usually 
companies

weight those by business rules
rather than head down the path of federated search, as it's just


quicker
 

and
cheaper, and you can accomplish more.
e.g
Medication = score *2  (as higher advertising incentives)
Diseases = score
Books = score * 0.75  ( thousands of books, which nobody buys etc..)

You might also want to try consolidating your data into 1 schema, and
consider layering or collapsing results
based on type.

P

On Fri, Apr 17, 2009 at 10:39 AM, theDude_2 aornst...@webmd.net


wrote:
 

(bump) - any thoughts?




theDude_2 wrote:
 

hi!

I am trying to do something a little unique...

I have a 90k text documents that I am trying to search
Search A: indexes and searches the documents using regular 
relevancy

search
Search B: indexes and searches the documents using a smaller subset


of
 

key words that I have chosen

This gives me 2 seperate scores: Score A, and Score B...

I am trying to show the top 10 results of the scores combined 
so


FinalScoretextDoc = (scoreA_of_td1 * 0.5) * (scoreB_of_td1 * 0.5)

While it seems straightforward, I do not want to calculate the


scores
 

of
 

all the documents outside of lucene.  How can I integrate this


better
 

into
 

the lucene search engine?  Is this possible to do by any simple


means?
 

Thanks guys + gals!




--
View this message in context:

  
http://www.nabble.com/A-Challenge%21%3A-Combining-2-searches-into-a-single-resultset--tp23085506p23098961.html 

 
Sent from the Lucene - Java Users

Re: Query any data

2009-04-10 Thread Matthew Hall
I think I would tackle this in a slightly different manner.

When you are creating this index, make sure that that field has a
default value. Make sure this value is something that could never appear
in the index otherwise. Then, when you goto place this field into the
index, either write out your actual value, or the default one.

Then when you get the document back, you can look at that field, and
solve your question. You can also craft queries that specifically avoid
entries that don't have a value in this field with a not clause.

Hope this helps,

Matt

Erick Erickson wrote:
 searching for fieldname:* will be *extremely* expensive as it will, by
 default,
 build a giant OR clause consisting of every term in the field. You'll throw
 MaxClauses exceptions right and left. I'd follow Tim's thread lead first

 Best
 Erick

 2009/4/8 王巍巍 ww.wang...@gmail.com

   
 first you should change your querypaser to accept wildcard query by calling
 method of QueryParser
  setAllowLeadingWildcard
 then you can query like this:  fieldname:*

 2009/4/9 Tim Williams william...@gmail.com

 
 On Wed, Apr 8, 2009 at 11:45 AM, addman addiek...@yahoo.com wrote:
   
 Hi,
   Is it possible to create a query to search a field for any value?  I
 
 just
   
 need to know if the optional field contain any data at all.
 
 google for:  lucene field existence

 There's no way built in, one strategy[1] is to have a 'meta field'
 that contains the names of the fields the document contains.

 --tim

 [1] -
 http://www.mail-archive.com/lucene-u...@jakarta.apache.org/msg07703.html

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


   
 --
 王巍巍(Weiwei Wang)
 Department of Computer Science
 Gulou Campus of Nanjing University
 Nanjing, P.R.China, 210093

 Mobile: 86-13913310569
 MSN: ww.wang...@gmail.com
 Homepage: http://cs.nju.edu.cn/rl/weiweiwang

 

   


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How Can I make an analyzer that ignore the numbers o the texts ???

2009-04-08 Thread Matthew Hall
You can define your own STOP_LIST and pass it in as a constructor to 
most analyzers.


For example from the Lucene Javadocs:


 StandardAnalyzer

public *StandardAnalyzer*(String 
http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html[] stopWords)

Builds an analyzer with the given stop words.

The only thing that you need to be careful of is to make sure that the 
analyzer isn't doing some sort of conversion of the tokens before the 
stoplist is checked, but otherwise that should work out just fine.


Matt

Ariel wrote:

Hi everybody:

I would want to know how Can I make an analyzer that ignore the numbers o
the texts like the stop words are ignored ??? For example that the terms :
3.8, 100, 4.15, 4,33 don't be added to the index.
How can I do that ???

Regards
Ariel

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is the right query syntax for matching some field's substring?

2009-04-01 Thread Matthew Hall
Which analyzer are you using here?  Depending on your choice the comma 
separated values might be being kept together in your index, rather than 
tokenized as you expected.


Secondly, you should get Luke, and take a look into your index, this 
should give you a much better idea of what's going on in your index.


Anyhow, closely examine your analyzer choice, and then your query type 
choice and see if that's where the problem lies.


Matt

Bon wrote:

Hi all,

I've a question about the query syntax statement,
There is a lucene text field and the value of the field like
,11,12,15,16,
if I want to query some data and the value of the field has included
some number what I like(11 or 15),
how can I do?
I try to give a query like (filed_name:,11,) but it can not get the
matching.

or I must reformat the field value with some other symbol not the symbol
comma ','

Bon
  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Memory Leak?

2009-03-23 Thread Matthew Hall
Perhaps this is a simple question, but looking at your stack trace, I'm 
not seeing where it was set during the tomcat initialization, so here goes:


Are you setting up the jvm's heap size during your Tomcat initialization 
somewhere?


If not, that very well could be part of your issue, as the standard JVM 
heapsize varies from platform to platform, so your windows based 
installation of tomcat simply might not have enough JVM Heap available 
to completely instantiate your RAMDirectory.


So, to start what is your heap currently set at for tomcat?

Secondly, if you try to increase it to a more reasonable value (say 512M 
or 1G) do you still run into this issue?


Matt

Chetan Shah wrote:

The stack trace is attached.
http://www.nabble.com/file/p22667542/dump dump 



The file size of 
_30.cfx - 1462KB

_32.cfs - 3432KB
_30.cfs - 645KB


The source code of WatchListHTMLUtilities.getHTMLTitle is as follows :

File f = new File(htmlFileName);
FileInputStream fis = new FileInputStream(f);
org.apache.lucene.demo.html.HTMLParser parser = new 
HTMLParser(fis);
String title = parser.getTitle();
fis.close();
fis = null;
f = null;
return title;





Michael McCandless-2 wrote:
  

Hmm... after how many queries do you see the crash?

Can you post the full OOME stack trace?

You're using a RAMDirectory to hold the entire index... how large is  
your index?


Mike

Chetan Shah wrote:



After reading this forum post :
http://www.nabble.com/Lucene-Memory-Leak-tt19276999.html#a19364866

I created a Singleton For Standard Analyzer too. But the problem still
persists.

I have 2 singletons now. 1 for Standard Analyzer and other for
IndexSearcher.

The code is as follows :

package watchlistsearch.core;

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import watchlistsearch.utils.Constants;

public class IndexSearcherFactory {

private static IndexSearcherFactory instance = null;

private IndexSearcher indexSearcher;

private IndexSearcherFactory() {

}

public static IndexSearcherFactory getInstance() {

if (IndexSearcherFactory.instance == null) {
IndexSearcherFactory.instance = new 
IndexSearcherFactory(); 
}

return IndexSearcherFactory.instance;   

}

public IndexSearcher getIndexSearcher() throws IOException {

if (this.indexSearcher == null) {   
Directory directory = new 
RAMDirectory(Constants.INDEX_DIRECTORY);
indexSearcher = new IndexSearcher(directory);   

}

return this.indexSearcher;  
}

}



package watchlistsearch.core;

import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;


---

public class AnalyzerFactory {

private static AnalyzerFactory instance = null;

private StandardAnalyzer standardAnalyzer;

Logger logger = Logger.getLogger(AnalyzerFactory.class);

private AnalyzerFactory() {

}

public static AnalyzerFactory getInstance() {

if (AnalyzerFactory.instance == null) { 
AnalyzerFactory.instance = new AnalyzerFactory();   

}

return AnalyzerFactory.instance;

}

public StandardAnalyzer getStandardAnalyzer() throws IOException {

if (this.standardAnalyzer == null) {
this.standardAnalyzer = new StandardAnalyzer();
logger.debug(StandardAnalyzer Initialized..);

}

return this.standardAnalyzer;   
}

}

--
View this message in context:
http://www.nabble.com/Memory-Leak--tp22663917p22666121.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  

-
To unsubscribe, e-mail: 

Re: newbie seeking explanation of semantics of Field class

2009-02-17 Thread Matthew Hall

Comments inline:

rolaren...@earthlink.net wrote:
R2.4 

I have been looking through the soon-to-be-superseded (by its 2nd ed.) book Lucene In Action (hope it's ok on this newsgroup to say I like that book); also at these two tutorials: http://darksleep.com/lucene/ and http://www.informit.com/articles/article.aspx?p=461633seqNum=3 and also at the Lucene online docco (http://lucene.apache.org/java/2_4_0/index.html) the last of which has nothing on the topic at all! I've also tried to search http://www.nabble.com/Lucene---Java-Users-f45.html -- but there are almost 10,000 docs there on Field. so that is too much data. 

The book is consistent with the two tutorials, but all three seem to be out of date (and the design less clear) compared to the code: http://lucene.apache.org/java/2_4_0/api/index.html 

I have copied some code and it is working for me, but I am a little uncertain how to decide what value of Field.Index and Field.Store to choose in order to get the behavior I'd like. If I read the javadocs, and decide to ignore all the expert items, it looks like this: 


Field.Store.NO = I'll never see that data again; I wonder why I'd do this?
This is useful in the cases where you have data you want to be able to 
search by, but never need to display it.


For example in my application we have complex data like:

kitGsfco1
^ 
http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=alleleDetailid=MGI:3530308
In one of our searchable indexes we do quite a bit of transformation to 
this data, and remove all of the punctuation, etc etc.


so it turns into: kit gsfcol

This is great for searching, cause it allows us to have punctuation 
irrelevant search results, but the user simply doesn't care whatsoever.  
So at display time, we show them the unmodified, case sensitive version 
of this data, which is stored in another field.
 

Field.Store.YES = good, the data will be stored 

  
Storage takes up space, so if you are ONLY going to search on a piece of 
data, and never display it, you should not store it.
Field.Store.COMPRESS = even better, stored and compressed; why would anyone do anything else? 

  

I agree.



Field.Index.NO = I cannot search that data, but if I need its value for a given document (e.g., to decorate a result), I can retrieve it (use-case: maybe, the date the document was created -- but why not just make that searchable? I am having a hard time thinking of an actually useful piece of data that could go here and would not want to be one of ANALYZED or NOT_ANALYZED) 

  
Correct, you use this type of data as additional information about the 
data you matched on. 

Field.Index.ANALYZED = the normal value, I would guess, except in the special 
case of stuff not searchable but used to decorate results (Field.Index.NO)

  

Correct.
Field.Index.NOT_ANALYZED = I can search for this value, but it won't get analyzed, so it is searched for as the very same value I put in (the docco suggests product numbers: any other interesting use-cases anyone can suggest?) 

  

Its highly useful for exact match searching.
= 


thanks in advance for helping me get clearer on this!

-Paul 







-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: waaaay too many files in the index!

2009-02-03 Thread Matthew Hall

Did you optimize your index?

If not, depending on your merge factor, this could be a very normal 
index for you.


-Matt

John Byrne wrote:

Hi,

I've got a weird problem with a lucene index, using 2.3.1. The index 
contains 6660 files. I don't know how this happened.Maybe somone can 
tell me something about the files themselves? (examples below)


On one day, between 10 and 40 of these files were being created every 
minute. The index updates are triggered by updates to an SVN 
repository, but I can't find any corresponding activity in the SVN logs.


The lucene files all have names like this:

_1qsw.cfs
_1qsx.cfs
_1qsy.cfs
_1qsz.cfs
_1qt0.cfs

and are mostly  5K in size.

My application uses just one instance each of 
IndexReader/IndexWriter/IndexSearcher. From looking at


Can anyone shed any light on these files? I'm not too hopeful about 
fixing this index because we are getting too many open files, even 
with an unlimited ulimit, but any info/suggestions would be great. 
Thanks.


-John




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance issue

2009-02-02 Thread Matthew Hall

Do you NEED to be using 7 fields here?

Like Erick said, if you could give us an example of the types of data 
you are trying to search against, it would be quite helpful.


Its possible that you might be able to say collapse your 7 fields down 
to a single field, which would likely reduce the overall number of or 
clauses in your searches, speeding things up nicely.


At my project we search two letter prefix searches in sub seconds, for 
much larger datasets.  Alot of this however is directly due to how our 
indexes are structured.


-Matt

Erick Erickson wrote:

Prefix queries are expensive here. The problem is
that each one forms a very large OR clause on all
the terms that start with those two letters. For instance,
if a field in your index contained
mine
milanta
mica

a prefix search on mi would form
mine OR milanta OR mica.

Doing this across seven fields could get expensive.

Two things:
1 what is the problem you are trying to solve? Perhaps some
of the folks on the list can give you some suggestions. You can
think about many strategies depending upon what you want
to accomplish. A 300M index isn't very big, so you could, for
instance, think about indexing a separate field that contains only
the two beginning letters and search *that* in this case. I'll
assume that three letter prefix queries are OK.

2 How are you measuring query time? If you're measuring the
time it takes when you first start a searcher, be aware that the
first few queries are usually slow because the caches haven't
been filled. Further, are you measuring total response time or
are you measuring *just* the query time? It's possible that the
time is being spent assembling the response in your code
rather than actual searching. You might insert some timers
to determine that.

Best
Erick

On Mon, Feb 2, 2009 at 2:58 AM, Mittal, Sourabh (IDEAS) 
sourabh-931.mit...@morganstanley.com wrote:

  

Hi All,

We face serious performance issues when users do 2 letter search e.g ho,
jo, pa ma, um ar, ma fi etc. time taken between 10 - 15 secs.
Below is our implementation details:

1. Search performs on 7 fields.
2. PrefixQuery implementation on all fields
3. AND search.
4. Our indexer size is 300 MB.
5. We show only 100 top documents only on the basis of score.
6. We user StandardAnalyzer  StandardTokenizer for indexing 
searching.
7. Lucene 2.4
8. JDK1 .6

Please suggest me how can we improve the performance.

Regards,
Sourabh Mittal
Morgan Stanley | IDEAS Practice Areas
Manikchand Ikon | South Wing 18 | Dhole Patil Road
Pune, 411001
Phone: +91 20 2620-7053
sourabh-931.mit...@morganstanley.com



--
NOTICE: If received in error, please destroy and notify sender. Sender does
not intend to waive confidentiality or privilege. Use of this email is
prohibited when received in error.




  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filtering accents

2008-12-30 Thread Matthew Hall
If you are constrained in such a way as to not use the French Analyzer 
you might instead consider transforming the input as an additional step 
at both search/indexing time.


Use something like a regex that looks for é and always replaces it with 
e in the index, and at search time.  (expand this transformation step as 
needed)


You likely also need to store the original word somewhere, so I would 
suggest adding a second stored, but unindexed field that stores the 
original value of the word, so when you match on your search criteria, 
you will also get the original form of the word in your hits object.


Hope this helps,

Matt

egrand thomas wrote:

Dear all,

I'd like my lucene searches to be insensitive to (French) accents. For example, considering a indexed term 
métal, I want to get it when searching for metal or métal . I use lucene-2.3.2 and 
the searches are performed with: IndexSearcher.search(query,filter,sorter), Another filter is already used together 
with a Sort object. Futrhermore, I cannot use the FrenchAnalyzer as my index does not only contain French 
words.

Can anybody help ?
Thanks in advance,
Tom



  
  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IDF scoring issue

2008-12-17 Thread Matthew Hall
Well, you could also do a simple test of removing IDF from the scoring 
equation and seeing if the query then reacts the way you want it to.


Simply write your own custom similarity that does this, and test out to 
see how it works.


Handily enough, I've already done this, so here's some code you can try:


Fix the package declaration to something that works for you, and then 
simply use the custom similarity at the appropriate times.


==
package org.jax.mgi.shr.searchtool;

import org.apache.lucene.search.DefaultSimilarity;

/**
* This is our custom similarity class, which removes document frequency 
from

* the calculation of score.
*
* It extends the DefaultSimilarity class, and thusly inherits most of its
* methods from it.
*
* @author mhall
*
*/

public class MGISimilarity extends DefaultSimilarity {

   /**
* If we have any doc frequency at all in the index, normalize it to 
1 (The

* document exists)
*
* Otherwise, return 0 (Does not exist)
*
* @param docFreq
* This items doc frequency
* @param numDocs
* How many documents this item appears in.
*
* This API is enforced by the DefaultSimilarity class.
*
*/

   public float idf(int docFreq, int numDocs) {
   if (docFreq  0) {
   return 1.0f;
   } else {
   return 0.0f;
   }
   }

}

===
Rajiv2 wrote:

Because, the search term is provided by a user, and that user would explicity
have to put quotes around marietta ga when I beleive the search text as it
is : fleming roofing inc., marietta ga  -- should score higher for marietta
ga

rajiv


Grant Ingersoll-6 wrote:
  

On Dec 16, 2008, at 8:19 PM, Rajiv2 wrote:



Hello,

I'm using the default lucene Queryparser on the search text : fleming
roofing inc., marietta ga

Also, I don't want to modify the search text by putting quotes around
marietta ga which forces the query parser to make a phrase query.
  

Why not?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






  



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to search for -2 in field?

2008-12-12 Thread Matthew Hall
Are you absolutely, 100% sure that the -2 token has actually made it 
into your index?


As a VERY basic way to check this try something like this:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;


public class IndexTerms {
  
  
   public static void main(String[] args) {

   try {
   IndexReader ir = IndexReader.open(C:/Search/index/index);

   TermEnum te = ir.terms();

   while (te.next()) {
   System.out.println(te.term().text());
   }
   }
   catch (Exception e) {;}
   }
}

Then look through the output, verifying that the tokens you are 
expecting to exist in your index, actually do.


I have a feeling that whatever analyzer you are using is dropping the 
- from the front of your -2 at indexing time, and if so it can 
sometimes be pretty hard to tell via Luke.


Hope this helps,

-Matt

Darren Govoni wrote:

Tried them all, with quotes, without. Doesn't work. At least in Luke it
doesn't.

On Fri, 2008-12-12 at 07:03 +0530, prabin meitei wrote:
  

whitespace analyzer will tokenize on white space irrespective of quotes. Use
standard analyzer or keyword analyzer.
Prabin meitei
toostep.com

On Thu, Dec 11, 2008 at 11:28 PM, Darren Govoni dar...@ontrenet.com wrote:



I'm using Luke to find the right combination of quotes,\'s and
analyzers.

No combination can produce a positive result for -2 String for the
field 'type'. (any -number String)

type: 0 -2 Word

analyzer:
query - rewritten = result

default field is 'type'.

WhitespaceAnalyzer:
\-2 ConfigurationFile\  - type:-2 type:ConfigurationFile = NO
-2 ConfigurationFile - -type:2 type:ConfigurationFile = NO
\-2 ConfigurationFile - type:-2 type:ConfigurationFile = NO
\-2 ConfigurationFile - type:-2 ConfigurationFile = NO (thought
this one would work).

Same results for the other analyzers more or less.

Weird.

Darren



On Thu, 2008-12-11 at 23:02 +0530, prabin meitei wrote:
  

Hi,  While constructing the query give the query string in quotes.
eg: query = queryparser.parse(\-2 word\);

Prabin meitei
toostep.com

On Thu, Dec 11, 2008 at 10:37 PM, Darren Govoni dar...@ontrenet.com


wrote:
  

I'm hoping to do this with a simple query string, but not sure if its
possible. I'll try your suggestion though as a workaround.

Thanks!!

On Thu, 2008-12-11 at 16:48 +, Robert Young wrote:
  

You could do it with a TermQuery but I'm not quite sure if that's the


answer
  

you're looking for.

Cheers
Rob

On Thu, Dec 11, 2008 at 3:59 PM, Darren Govoni dar...@ontrenet.com


wrote:
  

Hi,
 This might be a dumb question, but I have a simple field like this

field: 0 -2 Word

that is indexed,tokenized and stored. I've tried various ways in
  

Lucene
  

(using Luke) to search for -2 Word and none of them work, the
  

query
  

is
  

re-written improperly. I escaped the -2 to \-2 Word and it still
doesn't work. I've used all the analyzers.


What's the trick here?

Thanks,
Darren



  

-
  

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using AND with MultiFieldQueryParser

2008-11-13 Thread Matthew Hall

Which Analyzer have you assigned per field?

The PerFieldAnalyzerWrapper uses a default analyzer (the one you passed 
during its construction), and then you assign specific analyzers to each 
field that you want to have special treatment.


For example:

   PerFieldAnalyzerWrapper aWrapper = new PerFieldAnalyzerWrapper(
   new StandardAnalyzer());
   aWrapper.addAnalyzer(data, new MGIAnalyzer());
   aWrapper.addAnalyzer(sdata, new StemmedMGIAnalyzer());

Now, for the fields in question, have you assigned an Analyzer that 
doesn't actually use stopwords? (there are several available in core)  
Or are you perchance using a custom Analyzer that doesn't process stop 
words?


Could you possibly post your Initialization code for this?  If so I 
think we could be of more help to you.


Matt

Rafael Cunha de Almeida wrote:

On Thu, 13 Nov 2008 14:53:59 +0530
prabin meitei [EMAIL PROTECTED] wrote:

  

Hi,
From whatever you have written you are trying to write a query
*word1 AND stopword AND word2
*this means that the result should contain all of word1, word2 and the
stopword.

Since you have already removed the stopword during index time you will never
find any document matching your query. (this is expected behaviour)
you can possibly use word1 OR stopword OR word2 (depends on what you want in
the result)
 If you can clarify more about what you want in the result we can discuss on
what can be done.



I wanted MultiFieldQueryParser to ignore any stopword the user may type
in. In that particular case I'd like the result to be word1 AND word2. I
thought that was what would happen because I pass the Analyzer to
MultiFieldQueryParser, so I expected the parser to ignore stopwords for
fields which the analyzer drops stopwords (I use PerFieldAnalyzerWrapper
analyzer).

  

On Thu, Nov 13, 2008 at 10:30 AM, Rafael Cunha de Almeida 
[EMAIL PROTECTED] wrote:



Hello,

I used an Analyzer which removes stopwords when indexing, then I wanted
to do an AND search using MultiFieldQueryParser. So I did this:
   word1 AND stopword AND word2
I thought the stopword would be ignored by the searcher (I use the same
Analyzer to index and search). But instead, I get no results whenever I
have a stopword like that. If I remove the stopword, giving me:
   word1 AND word2
then the search is sucessful. Is that the expected behaviour? Am I
doing something wrong?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene vs. Database

2008-10-01 Thread Matthew Hall
Another thing you could consider is that rather than meshing all this 
data into a single index, logically break out the data you need for 
searching into one index, and the data you need for display into another 
index.


This is the technique we use here and its been wildly successful for us, 
as compared to going directly into the database.


Our DB is structured for ease of data entry/annotation rather than for 
the ease is display.  So we use our display index to have standard 
realized datafields that are pulled from various tables in the database.


So, when we search we include the unique key that each matched term 
points to, and then use this unique key to pull out our display time 
information. 


Its worked pretty well for us, so its certianly a viable approach.

- Matt

agatone wrote:
Hi, 
I asked this question already on lucene-general list but also got advised

to ask here too.

I'm working on a project that has big database in the background (some
tables have about 150 rows). We decided to use Lucene for faster
search. Our search works similar as all searches: you write search string,
get list of hits with detail link. But there is dilemma if we should store
more data into index than it's needed. 


One side of developing team insists that we should use lucene index as
somekind of storage for data so when you get hit, you go onto details and
then again use lucene to find document that matches the selected ID and take
the data from Lucene index. So in the end you end with copying complete
database tables into the lucene index.

Other side insists on storing to index only data that is displayed directly
to the user when showing the search results list and needed for search
criteria. When you go onto details, you have the matching ID so you can
pickup that row from database by that ID rather than search it inside Lucene
index. 


Can someone please describe drawbacks and advantages of both approaches.
Actually can someone write down what's the actual profit, where and when of
the Lucene itself in real production env. 


IT would be great if there is anyone who could write his experience with
indexing and searching large amount of data.


Thank you
  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query attached words

2008-09-23 Thread Matthew Hall

We have a similar requirement here at our work.

In order to get around it we create two indexes, one of which 
punctuation is relevant, and one in which all punctuation is treated as 
a place to break tokens.


We then do a search against both indexes and merge the results, it seems 
that such a technique might be able to help you here as well.  (Though 
upon rereading it seems like perhaps you want SOME punctuation to be 
relevant, but others not, the technique itself though could still be 
applied with these rules used instead)


- Matt

Jean-Claude Antonio wrote:

Thanks Erick, you are right about the various combinations.
Cheers,

Erick Erickson wrote:

Yes you can query *method. But you have to turn leading wildcards
(which I don't have right on the tips of my fingers, but know it's been
an option for some time now).

But your solution doesn't scale well. If you had
a.b.c.d.e.f.g.h you'd have to store many combinations in order
to do what you want, quickly becoming really, really ugly.

But you could store the tokens
a
.
b
.
c
.
e
.
f
.
g
.
h
by using the appropriate analyzer (or perhaps rolling
your own). Then you could use either PhraseQuerys
or SpanQuerys to do what you want

Best
Erick

On Mon, Sep 22, 2008 at 5:40 PM, Jean-Claude Antonio
[EMAIL PROTECTED]wrote:

 

Hello,

If I had a file with the following content:
...
object.method();
...
I would like to be able to query for
object
method
object.method

My guess is that I should store not only object.method, but also 
object

and method as I cannot query *method.
Any other suggestion?

Kind regards,

JClaude




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Front-end match

2008-09-19 Thread Matthew Hall
The reason the wildcard is being dropped is because you have wrapped it 
in a phrase query.  Wildcards are not supported in Pharse Queries.  At 
least not in any Analyzers that I'm aware of.


A really good tool to see the transformations that happen to a query is 
Luke, open it up against your index, go into the search section, choose 
the analyzer you use and start playing around.


This has helped me countless times when creating I'm my own queries and 
not getting the results that I expect.


-Matt

叶双明 wrote:

I am sorry, just put the string to QueryParser.
But what make me confusing the code:

Query query = qp.parse(bbb:\b*\ AND ccc:\cc*\);

doesn't work as i have expected. It drop the Wildcard *.


2008/9/19, 叶双明 [EMAIL PROTECTED]:
  

Thanks!

Now, I just use Query query = qp.parse(a*); and  meet the my
requirements.

Another question: how to parser query string like:   title:The
Right Way AND text:go
please show me in java code. thanks.

2008/9/19 Karl Wettin [EMAIL PROTECTED]



19 sep 2008 kl. 11.05 skrev 叶双明:

Documentstored/uncompressed,indexedfield:abc
  

Documentstored/uncompressed,indexedfield:bcd

How can I get the first Document buy some query string like a , ab or
abc but no b and bc?



You would create an ngram filter that create grams from the first position
only. Take a look at EdgeNGramTokenFilter in contrib/analyzers.


karl



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

--
Sorry for my English!! 明
Please help me correct my English expression and error in syntax






  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Front-end match

2008-09-19 Thread Matthew Hall

To be more specific (just in case you are new to lucene)

Your Query:

Query query = qp.parse(bbb:\b*\ AND ccc:\cc*\);

What I think you actually want here:

Query query = qp.parse(bbb:b* AND ccc:cc*);

Give it a shot, and then like I said, go get Luke, it will help you 
tremendously ^^


Matthew Hall wrote:
The reason the wildcard is being dropped is because you have wrapped 
it in a phrase query.  Wildcards are not supported in Pharse Queries.  
At least not in any Analyzers that I'm aware of.


A really good tool to see the transformations that happen to a query 
is Luke, open it up against your index, go into the search section, 
choose the analyzer you use and start playing around.


This has helped me countless times when creating I'm my own queries 
and not getting the results that I expect.


-Matt

叶双明 wrote:

I am sorry, just put the string to QueryParser.
But what make me confusing the code:

Query query = qp.parse(bbb:\b*\ AND ccc:\cc*\);

doesn't work as i have expected. It drop the Wildcard *.


2008/9/19, 叶双明 [EMAIL PROTECTED]:
 

Thanks!

Now, I just use Query query = qp.parse(a*); and  meet the my
requirements.

Another question: how to parser query string like:   title:The
Right Way AND text:go
please show me in java code. thanks.

2008/9/19 Karl Wettin [EMAIL PROTECTED]

   

19 sep 2008 kl. 11.05 skrev 叶双明:

Documentstored/uncompressed,indexedfield:abc
 

Documentstored/uncompressed,indexedfield:bcd

How can I get the first Document buy some query string like a , 
ab or

abc but no b and bc?


You would create an ngram filter that create grams from the first 
position

only. Take a look at EdgeNGramTokenFilter in contrib/analyzers.


karl



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

--
Sorry for my English!! 明
Please help me correct my English expression and error in syntax






  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: AW: Search with multiple wildcards

2008-09-11 Thread Matthew Hall
Well, you could certainly manipulate your search string, removing the 
wildcard punctuations, and then use that for what you pass to the 
highlighter.


That should give you the functionality you are looking for.


-Matt
mark harwood wrote:

Is this possible?
  


Not currently, the highlighter works with a list of words (or words AND phrases 
using the new span support) and highlights those.
To do anything else would require the higlighter to faithfully re-implement 
much of the logic in all of the different query types (fuzzy, wildcard, regex 
etc etc) which is much more challenging/difficult to maintain.



- Original Message 
From: Sertic Mirko, Bedag [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, 11 September, 2008 12:07:36
Subject: AW: AW: Search with multiple wildcards

Ok, one final question:

If i query for *ll*, the query is expanded to (hallo or alle or ...), so 
the
Highligter will highlight the words hallo or alle. But how can i highlight 
only
the original query, so only the ll? Is this possible?

Thanks a lot
Mirko

-Ursprüngliche Nachricht-
Von: mark harwood [mailto:[EMAIL PROTECTED] 
Gesendet: Donnerstag, 11. September 2008 11:20

An: java-user@lucene.apache.org
Betreff: Re: AW: Search with multiple wildcards

You need to call rewrite on the query to expand it then give that version to 
the highlighter - see the package javadocs.
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description


Cheers
Mark




- Original Message 
From: Sertic Mirko, Bedag [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, 11 September, 2008 9:34:13
Subject: AW: Search with multiple wildcards

Ok, i gave it a try, but i ran into this TooManyClauses Exception. I see that
3ildcard queries are expanded before they are processed, and I see that i can
set the clauses count to Integer.MAXVALUE, and queries can consume a lot of memory, 
but one final thing is still open: does a wildcard query work together with 
the Lucene Highlighter? I tried it, but I only got an empty result. Without 
wildcards, the highlighter works pretty smooth!


Regards
Mirko

-Ursprüngliche Nachricht-
Von: Erick Erickson [mailto:[EMAIL PROTECTED] 
Gesendet: Mittwoch, 10. September 2008 18:15

An: java-user@lucene.apache.org
Betreff: Re: Search with multiple wildcards

Of course you can construct your own BooleanQuery
programmatically.

It's relatively easy, just try it.

On Wed, Sep 10, 2008 at 11:52 AM, Sertic Mirko, Bedag [EMAIL PROTECTED]
  

wrote:



  

Jep, this is what i have read.

do I need to use the query parser, or can I create a query by the api?
Is there an example available?

Thanks a lot
Mirko

-Ursprüngliche Nachricht-
Von: Erick Erickson [mailto:[EMAIL PROTECTED]
Gesendet: Mittwoch, 10. September 2008 16:45
An: java-user@lucene.apache.org
Betreff: Re: Search with multiple wildcards

Is this what you're referring to?

Lucene supports single and multiple character wildcard searches within
single terms (not within phrase queries).
(from http://lucene.apache.org/java/docs/queryparsersyntax.html)

I'm pretty sure you can have multiple *terms* with wildcards. Luke is your
friend here, download a copy and try it G. Be sure on the search tab to
specify StandardAnalyzer or some such, rather than keywordanalyzer.

The phrase is trying to point out that a phrase query does NOT respect
wildcards. That is, submitting
ab* bc* cd* AS A PHRASE QUERY won't do what you expect. But I'm pretty
sure that

+field:ab* +field:bc* +field:cd*

will work just fine. The key here is within single terms, which I think
of
as
within a single term query. You can add as many TermQuerys as you want.
See the query documentation for how to submit phrase queries.

Best
Erick

On Wed, Sep 10, 2008 at 10:11 AM, Sertic Mirko, Bedag
[EMAIL PROTECTED]


wrote:
  
Hi


Thank you for your quick response:-)

Of course I need to use the * character :-) But I have read somewhere in
the documentation that leading wildcards are not supported, and only one
wildcard term per query. Is this limitation resolved in the current
  

version?


Regards
Mirko

-Ursprüngliche Nachricht-
Von: Erick Erickson [mailto:[EMAIL PROTECTED]
Gesendet: Mittwoch, 10. September 2008 15:47
An: java-user@lucene.apache.org
Betreff: Re: Search with multiple wildcards

Sure, but you'll have to set the leading wildcard option,
which I've forgotten the exact call, but it's in the docs.

And use * rather than % G.

But wildcards are tricky, especially the TooManyClauses
exception. You might want to peruse the archive for wildcard
posts...

Best
Erick

On Wed, Sep 10, 2008 at 9:06 AM, Sertic Mirko, Bedag
[EMAIL PROTECTED]wrote:

  

[EMAIL PROTECTED]



Is it possible to do a search with multiple wildcards in one query, for
instance %MANAGE% AND CORE%? Is there a code example available?



Thanks a lot

Mirko







Re: AW: AW: Search with multiple wildcards

2008-09-11 Thread Matthew Hall

Ah.. that's a darn good point..

Though, that second bit of code you have there could be used at display 
time for him to get the functionality that he wants.  You could also 
modify it somewhat, and apply it against the displayable part of the hit 
he's getting back rather than the individual tokens.


This if of course only assuming that this functionality would be used 
after a contains type search was detected.


Considering that's the only real usecase for a technique like this, I'm 
thinking its probably more trouble than its worth for his case.




mark harwood wrote:

That should give you the functionality you are looking for.
  


If I understand your suggestion correctly, It won't. The Highlighter uses a 
tokenized version of the document text.

Simplistically it does the following psuedo code:

for all tokens in documentTokenStream,
   if(queryTermsSet.contains(token))
output b+token+/b
   else
output token

NOT

for all tokens in query string
 fullDocumentString.replaceAll(queryStringToken, 
b+queryStringToken+/b

So in the given example while you suggest manipulating ll to be in the query string,  
you cannot make ll appear as a token in documentTokenStream.

Actually the Highlighter logic is a fair bit more involved than this 
(especially when using SpanQueryScorer) but the basis of it is there in the 
above pseudo code.





- Original Message 
From: Matthew Hall [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, 11 September, 2008 14:40:26
Subject: Re: AW: AW: Search with multiple wildcards

Well, you could certainly manipulate your search string, removing the 
wildcard punctuations, and then use that for what you pass to the 
highlighter.


That should give you the functionality you are looking for.


-Matt
mark harwood wrote:
  

Is this possible?
 


Not currently, the highlighter works with a list of words (or words AND phrases 
using the new span support) and highlights those.
To do anything else would require the higlighter to faithfully re-implement 
much of the logic in all of the different query types (fuzzy, wildcard, regex 
etc etc) which is much more challenging/difficult to maintain.



- Original Message 
From: Sertic Mirko, Bedag [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, 11 September, 2008 12:07:36
Subject: AW: AW: Search with multiple wildcards

Ok, one final question:

If i query for *ll*, the query is expanded to (hallo or alle or ...), so 
the
Highligter will highlight the words hallo or alle. But how can i highlight 
only
the original query, so only the ll? Is this possible?

Thanks a lot
Mirko

-Ursprüngliche Nachricht-
Von: mark harwood [mailto:[EMAIL PROTECTED] 
Gesendet: Donnerstag, 11. September 2008 11:20

An: java-user@lucene.apache.org
Betreff: Re: AW: Search with multiple wildcards

You need to call rewrite on the query to expand it then give that version to 
the highlighter - see the package javadocs.
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html#package_description


Cheers
Mark




- Original Message 
From: Sertic Mirko, Bedag [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, 11 September, 2008 9:34:13
Subject: AW: Search with multiple wildcards

Ok, i gave it a try, but i ran into this TooManyClauses Exception. I see that
3ildcard queries are expanded before they are processed, and I see that i can
set the clauses count to Integer.MAXVALUE, and queries can consume a lot of memory, 
but one final thing is still open: does a wildcard query work together with 
the Lucene Highlighter? I tried it, but I only got an empty result. Without 
wildcards, the highlighter works pretty smooth!


Regards
Mirko

-Ursprüngliche Nachricht-
Von: Erick Erickson [mailto:[EMAIL PROTECTED] 
Gesendet: Mittwoch, 10. September 2008 18:15

An: java-user@lucene.apache.org
Betreff: Re: Search with multiple wildcards

Of course you can construct your own BooleanQuery
programmatically.

It's relatively easy, just try it.

On Wed, Sep 10, 2008 at 11:52 AM, Sertic Mirko, Bedag [EMAIL PROTECTED]
 


wrote:
   
  
 


Jep, this is what i have read.

do I need to use the query parser, or can I create a query by the api?
Is there an example available?

Thanks a lot
Mirko

-Ursprüngliche Nachricht-
Von: Erick Erickson [mailto:[EMAIL PROTECTED]
Gesendet: Mittwoch, 10. September 2008 16:45
An: java-user@lucene.apache.org
Betreff: Re: Search with multiple wildcards

Is this what you're referring to?

Lucene supports single and multiple character wildcard searches within
single terms (not within phrase queries).
(from http://lucene.apache.org/java/docs/queryparsersyntax.html)

I'm pretty sure you can have multiple *terms* with wildcards. Luke is your
friend here, download a copy and try it G. Be sure on the search tab to
specify StandardAnalyzer or some such, rather than keywordanalyzer.

The phrase

Re: escaping special characters

2008-08-11 Thread Matthew Hall
You can simply change your input string to lowercase before passing it 
to the analyzers, which will give you the effect of escaping the boolean 
operators.  (I.E you will now search on and or and not)  Remember 
however that these are extremely common words, and chances are high that 
you are removing them via your stop words list in your analyzer.  This 
is also assuming you are using an analyzer that does lowercasing as part 
of its normal processing, which many do.


Matt

Steven A Rowe wrote:

On 08/11/2008 at 2:14 PM, Chris Hostetter wrote:
  

Aravind R Yarram wrote:


can i escape built in lucene keywords like OR, AND aswell?
  

as of the last time i checked: no, they're baked into the grammer.



I have not tested this, but I've read somewhere on this list that enclosing OR 
and AND in double quotes effectively escapes them.

  

(that may have changed when it switchedfrom a javac to a flex grammer
though, so i'm not 100% positive)



Although the StandardTokenizer was switched about a year ago from a JavaCC to a 
JFlex grammar, QueryParser's grammar remains in the JavaCC camp.

Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using lucene as a database... good idea or bad idea?

2008-07-29 Thread Matthew Hall
Yeah.. we do the same thing here for indexes of up to  57M documents 
(rows), and that's just one part of our implementation.


It takes quite a bit of.. wrangling to use lucene in this manner.. but 
we've found it to be utterly worthwhile.


Matt

Ian Lea wrote:

John


I think it's a great idea, and do exactly this to store 5 million+
documents with info that it takes way too long to get out of our
Oracle database (think days).  Not as many docs as you are talking
about, and less data for each doc, but I wouldn't have any concerns
about scaling.  There are certainly lucene indexes out there bigger
than what you propose.  You can compress the stored data to save some
space.  Run times for optimization might get interesting but see
recent threads for suggestions on that.  And since you are not too
concerned about performance you may not need to optimize much, or even
at all.

Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc. but I'm sure you are already
aware of that.


--
Ian.


On Tue, Jul 29, 2008 at 2:53 AM, John Evans [EMAIL PROTECTED] wrote:
  

Hi All,

I have successfully used Lucene in the tradtiional way to provide
full-text search for various websites.  Now I am tasked with developing a
data-store to back a web crawler.  The crawler can be configured to retrieve
arbitrary fields from arbitrary pages, so the result is that each document
may have a random assortment of fields.  It seems like Lucene may be a
natural fit for this scenario since you can obviously add arbitrary fields
to each document and you can store the actually data in the database. I've
done some research to make sure that it would meet all of our individual
requirements (that we can iterate over documents, update (delete/replace)
documents, etc.) and everything looks good.  I've also seen a couple of
references around the net to other people trying similar things... however,
I know it's not meant to be used this way, so I thought I would post here
and ask for guidance?  Has anyone done something similar?  Is there any
specific reason to think this is a bad idea?

The one thing that I am least certain about his how well it will scale.  We
may reach the point where we have tens of millions of documents and a high
percentage of those documents may be relatively large (10k-50k each).  We
actually would NOT be expecting/needing Lucene's normal extreme fast text
search times for this, but we would need reasonable times for adding new
documents to the index, retrieving documents by ID (for iterating over all
documents), optimizing the index after a series of changes, etc.

Any advice/input/theories anyone can contribute would be greatly
appreciated.

Thanks,
-
John




  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Luke shows in top terms but no search results??

2008-07-24 Thread Matthew Hall

Erm.. if its not tokenized that's your problem.

You are setting up an Analyzer when indexing.. but then not actually 
USING it.


Whereas when you are searching you are running your query through the 
analyzer, which transforms your text in such a way that it no longer 
matches against your untokenized form.


So, rerun your index, changing untokenized to tokenized, and I think you 
will see the results you are looking for.


Matt

samd wrote:

Oh and the field is not tokenized and stored.
  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the percent of size of lucene's index ?

2008-07-23 Thread Matthew Hall
You can also use Luke after you've created your indexes to get their 
exact size, and other interesting data points.


Like Ian said though, the decisions you make on a field by field basis 
will make your index size vary quite a bit, so probably the best thing 
you could do is simply try it out, and then examine it.


Matt

Ian Lea wrote:

I think there are too many variables to give a simple answer.

How much of your data are you storing?  Indexing? Compressing?

Get a representative sample of your data and try it out.


--
Ian.


On Wed, Jul 23, 2008 at 5:00 PM, Ariel [EMAIL PROTECTED] wrote:
  

I need to know what is the percent of size of lucene's index respect the
information I'm going to index, I have read some articles that say if a I
index 120 Gb of information the index will grow until 40 Gb, that means the
percent is 30 %, Could somebody tell me how can be proved that ?
Is there any official document of apache lucene where says that ?
I hope somebody can help me.
Thanks.
Ariel




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Matthew Hall

Did you try to open the index using Luke?

Luke will be able to tell you whether or not the index is in  fact 
corrupted, but looking at your stack trace, it almost looks like the 
file.. simply isn't there?


Matt

Jamie wrote:

Hi Everyone

I am getting the the following error when executing  Hits hits = 
searchers.search(query, queryFilter, sort):


18007414-java.io.IOException: Bad file descriptor
18007455-   at java.io.RandomAccessFile.seek(Native Method)
18007504-   at 
org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:545) 

18007592-   at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131) 

18007678-   at 
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240) 


--
18009148-   at 
org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168) 

18009247-   at 
org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56) 

18009332-   at 
org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:43) 

18009419-   at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:122)
18009493-   at 
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250) 



Does this mean the index is corrupted? Any idea why it would be 
corrupted?


Jamie



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Matthew Hall
I'm not sure which file in particular would be the one 
corrupter/missing, which is why I suggested looking at the index with luke.


As for the jave 1.6 lucene 2.3.2 index corruption issue, I'm not 100% 
familiar with the details on that one, but as a quick test, you should 
be able to swap to a 1.5 version of java, reindex and see if that fixes 
things.


Well.. unless your code uses something jave 6 specific I suppose.

Matt

Jamie wrote:

Hi Matthew

Thanks in advance for the suggestion.

Which file do you think does not exist?

This is what we have:

_15zw.cfs  _19od.cfs  _1a5d.cfs  _1a7n.cfs  _1ahf.cfs  _1ahh.cfs  
_qzl.cfs   segments.gen
_1993.cfs  _1a0w.cfs  _1a7c.cfs  _1a9m.cfs  _1ahg.cfs  _1ahi.cfs  
segments_158j


Aside from Luke (which requires a GUI), it is there a command line 
utility that can check the integrity of the index?


Jamie

Matthew Hall wrote:

Did you try to open the index using Luke?

Luke will be able to tell you whether or not the index is in  fact 
corrupted, but looking at your stack trace, it almost looks like the 
file.. simply isn't there?


Matt

Jamie wrote:

Hi Everyone

I am getting the the following error when executing  Hits hits = 
searchers.search(query, queryFilter, sort):


18007414-java.io.IOException: Bad file descriptor
18007455-   at java.io.RandomAccessFile.seek(Native Method)
18007504-   at 
org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:545) 

18007592-   at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131) 

18007678-   at 
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240) 


--
18009148-   at 
org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168) 

18009247-   at 
org.apache.lucene.search.FieldSortedHitQueue.init(FieldSortedHitQueue.java:56) 

18009332-   at 
org.apache.lucene.search.TopFieldDocCollector.init(TopFieldDocCollector.java:43) 

18009419-   at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:122)
18009493-   at 
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250) 



Does this mean the index is corrupted? Any idea why it would be 
corrupted?


Jamie



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Term Frequency for more complex terms

2008-07-03 Thread Matthew Hall
I have a quick question, could someone point me towards where in the API 
I'll have to investigate in order to figure out the term frequencies of 
more complex terms?


For example I want to know the tf of kit ligand treated as a phrase.  
I see that luke has access to this information in its explain method, 
but the api call is currently eluding me.


Thanks,

Matt

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can you create a Field that is a copy of another Field?

2008-06-30 Thread Matthew Hall
Hrm, sorry then I'm not sure how much more help I'm going to be able to 
be on this on.  I have to index things that have a DAG Structure 
(Treelike), but in order to get that functionality into my search I 
simply flatten out my dag, so any single term knows all of its children, 
but loses the structure of those children beyond that.  This approach 
works for my data, but it doesn't sound like it will for yours.


So, while I think you can still use the general technique that I showed 
you on this on, I have a feeling you are going to need to customize it 
some for your domain.


Best of luck, and if there's anything else I can help with let me know.

Matt

[EMAIL PROTECTED] wrote:

Matthew,

It has to do with the fact that we're trying to represent these Property entitities hierarchically.  We are displaying them in a tree structure, similar to the way Windows Explorer displays directories and files your file system.  E.g. all the states would be at the root level.  If you expanded a particular state you would see all the cities in that state, etc.  


If the user does a search we want to filter or reduce the tree.  E.g. imagine 
you search on the term 'Smith'.  Well since it's a safe bet to assume that there's 
somebody with the last name of Smith in all fifty states, then all fifty states would 
show up at the root level.  On the other hand, suppose there's one guy in the whole 
country named with the last name of 'Fleebleflabble' and he lives in Michigan.  If I 
search on that term I would expect only one state, namely Michigan to show up at the root 
level.  Each level in the heirarchy is filtered by the search specified terms in this way.

Searches are not limited to people's names though.  We want to reduce the tree 
by matches on ANY field in the Properties from 'State' to 'Name'.  So for 
example, a seach on 'Smith' would return matches for everybody that lived in a 
city named 'Smith City' or on a street named 'Smith Avenue', etc.

This doesn't make a lot of sense for people and addresses, I admit.  I just 
used that as an easy follow example.  But it does make sense for the data we're 
storing.  And BTW, maybe you can see a few holes in this approach.  There's a 
bit more to it than I've described above.  We have had to get a little creative 
with other documents and fields in order for it work correctly.  I'd be happy 
to elaborate if anybody is interested.  There may be better ways to do it.  
Like I said I'm fairly new to Lucene.  Was just trying to keep it simple.

--
Bill 


-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 30, 2008 8:26 AM

To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

Sorry, didn't get this until this morning.

Yes, both fields should be indexed and searchable, though the data_type 
one should likely be untokenized. 

Data should be indexed and tokenized with whatever appropriate Analyzer 
works for your data.


As for what your indexing, may I ask why you are doing it like that?

I would have thought indexing each property seperately (a seperate doc) 
would have been sufficient for your needs, but if you can explain a bit 
more about your situation perhaps I can be more helpful on this matter?


Matt

[EMAIL PROTECTED] wrote:
  
Hmmm, I think maybe I am missing something.  In your design is the 'data' field indexed, i.e. searchable?  Or is it an unindexed, stored field?  

I was thinking that both 'data' and 'data_type' were indexed and searchable.  


Maybe the confusion stems from the fact that for the Document corresponding to 
State=California, we're not just indexing on the token 'California'.  We're 
indexing on all the tokens from all the Properties in the set of Properties corresponding 
to a person's address.  In my original example this would be: California, Sacremento, 
94203, South, Main, 1234, Joe and Smith.

For the 'data_type' field I was thinking you were saying we'd index on a single 
token, namely 'State' (or whatever the left-hand side is).

Does that make sense?
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 



-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 3:33 PM

To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

Yup, you're pretty much there.

The only part I'm a bit confused about is what you've said in your data 
field there,


I'm thinking you mean that for the data_type: State, you would have 
the data entry of California, right?


If so, then yup, you are spot on ^^

We use this technique all the time on our side, and its helped 
considerably.  We then use the db_key to reference into a display time 
cache that holds all of the display information for the underlying 
object that we would ever want

Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Matthew Hall
));
propertyIndexWriter.addDocument(doc);
tokenStream.close();
}

Hope that clears it up.  


BTW, in case this seems like a strange way to index things, I will also add 
that we are doing it this way in order to impose a heirarchical structure on 
Properties.  So my example above should really look like this:

State=California
City=Sacremento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Use your imagination to visualize what the tree might look like with millions of peoples' addresses.  Now 
imagine trying to tokenize the Document corresponding to State=California.  Each path thru the 
tree from root (State) to leaf (Name) represents a set of Properties that is used to index the 
keywords field in the State=California document.  In other words it takes a long time 
to index.  This is why I'm looking for a way to just copy one field to another.

There is a lot more to our design to facilitate this hierarchical structure but 
this is probably more than you wanted to know. :)

thanks in advance,
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 



-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 7:26 AM

To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?


On Jun 27, 2008, at 12:01 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] 
  wrote:


  

Hello Lucene Gurus,



I'm new to Lucene so sorry if this question basic or naïve.



I have a Document to which I want to add a Field named, say, foo  
that is tokenized, indexed and unstored.  I am using the  
Field(String name, TokenStream tokenStream) constructor to create  
it.  The TokenStream may take a fairly long time to return all its  
tokens.





Can you share some code here?  What's the reasoning behind using it  
(not saying it's wrong, just wondering what led you to it)?  Are you  
just loading it up from a file, string or something or do you have  
another reason?



  
Now for querying reasons I want to add another Field named, say,  
bar, that is tokenized and indexed in exactly the same way as  
foo.  I could just pass it the same TokenStream that I used to  
create foo but since it takes so long to return all its tokens, I  
was wondering if there is a way to say, create bar as a copy of  
foo.  I looked thru the javadoc but didn't see anything.






By exactly the same, do you really mean exactly the same?  What's the  
point of that?  What are the querying reasons?


You may want to look at the TeeTokenFilter and the SinkTokenizer, but  
I guess I'd like to know more about what's going on before fully  
recommending anything.



  
Is this possible in Lucene or do I just have to bite the bullet  
build the new Field using the same TokenStream again?


--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194  
Oak Valley Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 

www.sungard.com/energy blocked::http://www.sungard.com/energy







--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Matthew Hall

Yup, you're pretty much there.

The only part I'm a bit confused about is what you've said in your data 
field there,


I'm thinking you mean that for the data_type: State, you would have 
the data entry of California, right?


If so, then yup, you are spot on ^^

We use this technique all the time on our side, and its helped 
considerably.  We then use the db_key to reference into a display time 
cache that holds all of the display information for the underlying 
object that we would ever want to present to the user.  This allows our 
search time index to be very concise, and as a result nearly every 
search we hit it with is subsecond, which is a nice place to be ^^


Matt

[EMAIL PROTECTED] wrote:

Matthew,

Thanks for the reply.  This looks very interesting.  If I'm understanding 
correctly your db_key, data and data_type are Fields within the Document, 
correct?  So is this how you envision it?

Document: State=California
   Field: 'db_key'='1395' (primary key into relational table, correct?)
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
   Field: 'data_type' indexed by 'State'

Document: City=Sacremento
   Field: 'db_key'='2405' 
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.

   Field: 'data_type' indexed by 'City'

Then my query for all Properties would be:

+data:South

My query for only 'City' Properties would be:

+data:South +data_type:City

Is that right?

I think that would work.  Very nice.  Thank you very much
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 



-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 11:49 AM

To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

I'm not sure if this is helpful, but I do something VERY similar to this 
in my project.


So, for the example you are citing I would design my index as follows:

db_key, data, data_type

Where the data_type is some sort of value representing the thing that's 
on the left hand side of your property relationship there.


So, then in order to satisfy your search, the queries become quite simple:

The search for everything simply searches against the data field in this 
index, wheras the search for a specific data_type + searchterm becomes a 
simple boolean query, that has a MUST clause for the data_type value.


As an even BETTER bonus, this will then mean that all of your searchable 
values will now have relevance to each other at scoring time, which is 
quite useful in the long run.


Hope this helps you out,

Matt

[EMAIL PROTECTED] wrote:
  

Grant,

Thanks for the reply.  What we're trying to do is kind of esoteric and hard to 
explain without going into a lot of gory details so I was trying to keep it 
simple.  But I'll try to summarize.

We're trying to index entities in a relational database.  One of the entities 
we're trying to index is something called a Property.  Think of a Property kind 
of like the java.util.Properties class, i.e. a name/value pair. So some 
examples of Properties might be:

State=California
City=Sacremento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Etc., etc.

(Note: this isn't the type of data we're storing... just trying to keep it 
simple.)

Imagine that the above list represents the the set of Properties that specify 
the address for a single person, Joe Smith.  Each Property in the set will be 
indexed by the values on the right-hand side of all the other name/value pairs 
in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe and 
Smith.

There are two types of queries that we want to do.  
1) retrieve every Property matching the specified search terms, regardless of its left-hand side.  For this we want to create a field in EVERY Document called keywords and index it by the right-hand side values as described above.

2) retrieve every Property with a given left-hand side that matches the 
specified search terms.  For example, find all the 'City' Properties that match 
the term 'South'.  For this we want to create a field with the name of the 
left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents 
that correspond to a Property with that left-hand side.  Again this field will 
be indexed by the right-hand side values as described above.

So a couple of examples from the above list might look something like:

Document: State=California
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

Document: City=Sacremento
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

Now if I'm interested in all the Properties that match the word South, I

Re: lucene wildcard query with stop character

2008-06-12 Thread Matthew Hall

I assume you want all of your queries to function in this way?

If so, you could just translate the * character into a ? at search time, 
which should give you the functionality you are asking for.


Unless I'm missing something.

Matt

Cam Bazz wrote:

Hello,

Imagine I have the following documents having keys

A
AB
ABC
ABD
ABCD

now Imagine a query with keyword analyzer and a wildcard: AB*

which will bring me ABC , ABD and ABCD

but I just want to get ABC and ABD

so can I make a query like AB* but does not have the  character after
AB

Best Regards,
-C.B.

  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene wildcard query with stop character

2008-06-12 Thread Matthew Hall
Hrm.. can we see a more specific example of the type of data you are 
trying to query against here?


Matt

Cam Bazz wrote:

well the ? would work if the length of each token be same.
however, instead of ABC I want tags that change dynamically from 1 to
unlimited length.

I just I could just pad every token to a normalized length such as
...000A but i am hoping there is a better method.

if we could tell lucene to do it like in a regular expression until a  is
there to insert ??'s ...

Another way could be to do the regularexpression outside lucene, but then
still there is need for fetching the hits.

Best.
-C.B.



On Thu, Jun 12, 2008 at 8:47 PM, Matthew Hall [EMAIL PROTECTED]
wrote:

  

I assume you want all of your queries to function in this way?

If so, you could just translate the * character into a ? at search time,
which should give you the functionality you are asking for.

Unless I'm missing something.

Matt


Cam Bazz wrote:



Hello,

Imagine I have the following documents having keys

A
AB
ABC
ABD
ABCD

now Imagine a query with keyword analyzer and a wildcard: AB*

which will bring me ABC , ABD and ABCD

but I just want to get ABC and ABD

so can I make a query like AB* but does not have the  character after
AB

Best Regards,
-C.B.



  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Possible Bug when Querying?

2008-05-16 Thread Matthew Hall

Very very interesting.

I went ahead and turned on the AllowLeadingWildcard toggle and 
everything works just as expected now, which is odd in a way.


I'm still not certain why a search for '\*ache*' would be considered to 
have a  leading wildcard.  I'm searching for the literal * character 
here which I would have assumed would be a completely fine thing to do 
in a search, but somehow its triggering the leading wildcard checking 
logic. 

Well, anyhow thanks much for the suggestion, things are working properly 
now.


Matt

Karl Wettin wrote:


15 maj 2008 kl. 18.33 skrev Matthew Hall:


12:23:05,602 INFO  [STDOUT] 
org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': 
'*' not allowed as first character in PrefixQuery

12:23:05,602 INFO  [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at 
org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown 
Source)



Which looks to me alot like something akin to the 
AllowLeadingWildcard stuff that comes along with wildcardqueries.


But, the odd thing is the leading character in my search string ISN'T 
*, its the escaped star character, which I would have thought would 
work with no problems at all.


Have I stumbled across a bug here?



Did you setAllowLeadingWildcard(true)?

  /**
   * Set to codetrue/code to allow leading wildcard characters.
   * p
   * When set, code*/code or code?/code are allowed as
   * the first character of a PrefixQuery and WildcardQuery.
   * Note that this can produce very slow
   * queries on big indexes.
   * p
   * Default: false.
   */
  public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {



 karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Possible Bug when Querying?

2008-05-15 Thread Matthew Hall

Greetings,

I'm searching against a data set using lucene that contains searches 
such as the following:


*ache*
*aChe*

etc and so forth, sadly this part of the dataset is imported via an 
external client, so we have no real way of controlling how they format it.


Now, to make matters a bit more complex, my clients have decided to turn 
off all wildcard searching, EXCEPT for prefix searches, so when I 
process the query string I go ahead and go through it and escape out all 
lucene special characters, except for the trailing *.


So I end up sending the following string to the query parser:

\*ache* (I'm doing standard things like converting everything to lowercase)

and when I put that into the query parser its throwing the following 
exception:


12:23:05,602 INFO  [STDOUT] 
org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': 
'*' not allowed as first character in PrefixQuery

12:23:05,602 INFO  [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at 
org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown 
Source)



Which looks to me alot like something akin to the AllowLeadingWildcard 
stuff that comes along with wildcardqueries.


But, the odd thing is the leading character in my search string ISN'T *, 
its the escaped star character, which I would have thought would work 
with no problems at all.


Have I stumbled across a bug here?

Matt


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Possible Bug when Querying?

2008-05-15 Thread Matthew Hall
No I did not, because I'm not performing a search with a leading 
wildcard, nor am I intending to allow that behavior.  But what I do want 
to be able to search on is a word that starts with a * by escaping it, 
because sadly our data contains such things.


Matt

Karl Wettin wrote:


15 maj 2008 kl. 18.33 skrev Matthew Hall:


12:23:05,602 INFO  [STDOUT] 
org.apache.lucene.queryParser.ParseException: Cannot parse '\*ache*': 
'*' not allowed as first character in PrefixQuery

12:23:05,602 INFO  [STDOUT] Failure in QS_MarkerSearch.searchMarkerNomen
12:23:05,602 ERROR [STDERR] java.lang.NullPointerException
12:23:05,602 ERROR [STDERR] at 
org.jax.mgi.search.model.QS_MarkerSearch.searchInexactMatches(Unknown 
Source)



Which looks to me alot like something akin to the 
AllowLeadingWildcard stuff that comes along with wildcardqueries.


But, the odd thing is the leading character in my search string ISN'T 
*, its the escaped star character, which I would have thought would 
work with no problems at all.


Have I stumbled across a bug here?



Did you setAllowLeadingWildcard(true)?

  /**
   * Set to codetrue/code to allow leading wildcard characters.
   * p
   * When set, code*/code or code?/code are allowed as
   * the first character of a PrefixQuery and WildcardQuery.
   * Note that this can produce very slow
   * queries on big indexes.
   * p
   * Default: false.
   */
  public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {



 karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Quickie Luke Question

2008-04-25 Thread Matthew Hall

Does anyone know how to set the MaxClauseCount in luke?

I'm in a situation where I've had to override it when searching against 
my indexes, but now I can't use luke to examine what's going on with my 
queries anymore.


Any help would be appreciated.

Matt

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Question About Hits

2008-04-04 Thread Matthew Hall
This is more of a trying to understand the design sort of question, but 
its still something I need to able to succinctly express to my project 
manager.


I know that lucene is by design not allowing us to see which fields were 
hit for a given document in an easy manner.  Instead it presents us with 
a collection of hits, with each hit having the total score for the 
document, given all of the fields that you have searched on, with that 
total score being the score for the matches for each field combined via 
the scoring algorithm.


The question I'm being asked is why is the information about how each 
field matched not easily accessable in lucene?


I know I can go ahead and do a searcher.explain on my hit object, and 
then ... parse out the individual fields with their scores, but couldn't 
this be much more easily accessable from the hits object itself?


The hits object already has a get method that allows you to pass a 
String value for a string name to the object, couldn't another method be 
added such as getScoreByField(String s) that had access to the 
information that was used to build the total score of the document?


I'm sure part of the reason that this wasn't included were performance 
based, I mean it would be a fair amount of extra information for the 
average search to have to carry around, but for my application, and many 
others I'm sure, its a very important thing to be able to find out WHY a 
document was returned.  If for nothing less than for display purposes.


Anyhow, any insight as to why things are the way they are would be most 
appreciated, or if someone else has faced the same problems as I, and 
have went ahead and modified the hits object to include such things (and 
this is no small task) I'd love to hear about it.


-Matt


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Implementing CMS search function using Lucene

2008-04-03 Thread Matthew Hall
You could try something like this, which use when I put my own documents 
together:


   public Document getDocument(){
  
   Document doc = new Document();
  
   doc.add(new Field(db_key, this.getDb_key(), Field.Store.YES, 
Field.Index.UN_TOKENIZED));
   doc.add(new Field(acc_ids, this.getAcc_ids(), Field.Store.YES, 
Field.Index.TOKENIZED));
   doc.add(setBoost(new Field(name, this.getName(), 
Field.Store.YES, Field.Index.TOKENIZED), 1.0f));
   doc.add(setBoost(new Field(symbol, this.getSymbol(), 
Field.Store.YES, Field.Index.TOKENIZED), 1.0f));
   doc.add(setBoost(new Field(synonyms, this.getSynonyms(), 
Field.Store.YES, Field.Index.TOKENIZED), .8f));
   doc.add(setBoost(new Field(allele_nomen, 
this.getAllele_nomen(), Field.Store.YES, Field.Index.TOKENIZED), .6f));
   doc.add(setBoost(new Field(old_nomen, this.getOld_nomen(), 
Field.Store.YES, Field.Index.TOKENIZED), .4f));
   doc.add(setBoost(new Field(orth_nomen, this.getOrth_nomen(), 
Field.Store.YES, Field.Index.TOKENIZED), .2f));

   return doc;
   }
  
   private static Field setBoost (Field workField, float boost)

   {
   workField.setBoost(boost);
   return workField;
   }

It works out pretty well for me anyhow.

Илья Казначеев wrote:

В сообщении от Thursday 03 April 2008 16:24:15 Илья Казначеев написал(а):

  

- Is there a way to set weights for different fields? Let's say, content
have a weight of 1, title have a weight of 5 and picture subscribe have a
weight of 0.5. If no, can I do that by hand?


Already found field.setBoost().
Sorry for asking lame questions :(

By the way, if setBoost() returned this, it would be much easier to assemble 
document, one line instead of three. Chaining rules.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter Hits

2008-03-12 Thread Matthew Hall
I suspect you are using a different analyzer to highlight than you are 
using to search.


A couple of things you can check:

Immediately after your query simply print out hits.length, this should 
conclusively tell you that you query is in fact working, after that 
ensure that you are using the same analyzer for your highlighter that 
you are for your query parser.


If you are not, its entirely possible that the text you are trying to 
highlight with is being transformed differently than how it was in the 
query, and as a result isn't matching against your fields anymore.


Hope that helps,

Matt

JensBurkhardt wrote:

Hello everybody,

I have s slight problem using lucenes highlighter. If i have the highlighter
enabled, a query creates 0 hits, if i disable the highlighter i get the
hits.
It seems like, when i call searcher.search() and pass my Hits hits to the
highlighter function, the program quits. All prints after the highlighter
call also do not appear.
I have no idea what the problem is. 


Thanks in advise

Jens Burkhardt
  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Looking for an example of Using Position Increment Gap

2008-03-04 Thread Matthew Hall

Fellows,

I'm working on a project here where we are trying to use our lucene 
indexes to return concrete objects.  One of the things we want to be 
able to match by is by vocabulary terms annotated to that object, as 
well as all of the child vocabulary terms of that annotated term.


So, what I was thinking about doing is extending my index that returns 
objects of that type to include a new field say sub_term.  In this 
field I would put all of the text of these vocabulary sub terms 
together, and introduce phrase boundries using some of the techniques 
that are described in the Javadoc in the analysis section.  (Basically 
writing a custom analyzer that introduces a position increment gap 
between phrases)  I am however curious if an example of a usage like 
that exists somewhere that I could use as a basis for the analyzer that 
I'm going to have to write to handle this case.


Does anyone know of a good example?

Matt


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suffix search

2008-02-22 Thread Matthew Hall

What you need is to set the allow leading wildcard flag.

qp.setAllowLeadingWildcard(true);

(where qp is a query parser instance)

That will let you do it, be warned however there is most definitely a 
significant performance degradation associated with doing this.


Matt

[EMAIL PROTECTED] wrote:

Hi,

using WildcardQuery directly it is possible to search for suffixes like
*foo.

The QueryParser throws an exception that this is not allowed in a
WildcardQuery.

Hm, now I'm confused ;)

How can I configure the QueryParser to allow a wildcard as first character?

Thank you


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem in building Lucene

2007-09-13 Thread Matthew Hall
Also, ensure that you didn't inadvertently add an older version of your 
Jar file somewhere in your classpath.  Eclipse will take the first it 
comes to, and skip any others found later on in the path.


Right Click on your Project - Properties - Java Build Path and ensure 
you don't have an older version in there.


Matt

Koji Sekiguchi wrote:

Try to reread jar file on Eclipse.
To do it, right-click on your project,
then choose refresh.

Thank you,

Koji

sandeep chawla wrote:

I have to change lucene code for some reason.

I changed the source code of the lucene and ran the ant command on 
build.xml.


it created a jar file in build directory then i added the jar file to
my project in eclipse .

I am facing a bizarre problem now. Changes i have made in source code
are not reflected in new jar file..

Any help in this regards..please

Thanks
Sandeep

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]