Frustrated with tokenized listing terms

2005-10-24 Thread JMA

Greetings...
Quick question, perhaps I am missing something.

I have a bunch of documents where one of the indexed fields is "author". For
example:

book1, by "John Smith"
book2, by "Steve Smith"
book3, by "John Smith"

I would like to find all distinct authors in my index.  I want to support
searches for author:smith, so I tokenize the author field during indexing.
However, getTerms() then returns:

John (x2)
Smith (x3)
Steve (x1)

I would like to see:
John Smith (x2)
Steve Smith (x1)

I've solved this by indexing the field twice: once as author (searchable/not
stored/tokenized) and once as author_phrased (not searchable/stored/not
tokenized).

Then I query using the 'author' field while listing terms using the
'author_phrased' field.

This works, but is it the proper way to do it?
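For reference, the two-field setup described above looks roughly like this with the 1.4-era Field factory methods (a sketch, not a verified recipe; note that Field.Keyword does index the value, as a single untokenized term, which is what makes whole names enumerable as terms):

```java
Document doc = new Document();
// Tokenized and indexed but not stored: supports author:smith searches.
doc.add(Field.UnStored("author", "John Smith"));
// Stored and indexed as one untokenized term: enumerating terms on this
// field yields whole names like "John Smith" with their frequencies.
doc.add(Field.Keyword("author_phrased", "John Smith"));
writer.addDocument(doc);
```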

Thanks in advance,

JMA



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread Olena Medelyan
Hi Sam,

to do such matching you first of all need something that keeps semantic
information about words: e.g. a thesaurus, where "red", "blue" and "black"
are all grouped under the same term "colour". Otherwise, how will
your system know that "nike red shoes" should match "nike shoes -black" and
not "nike shoes -(anything else)"?
You would also need rules that define that only certain terms are to
be replaced with alternatives. Otherwise, your query can be mapped to X
alternatives like:
"-adidas red shoes", "nike red -pants" ...

Cheers,
Olena

On Sun, 23 Oct 2005, Sam Lee wrote:

> Hi,
>   Someone suggested that I should use MemoryIndex to
> match content to a large # of queries. e.g. "nike red
> shoes" --match--> "nike shoes -blue"  and --match-->
> "nike shoes -black"...  What if I have 10 of these
> queries for each content?  And there may be 100 of
> these contents.
>
> But how fast is MemoryIndex?  Is it cpu and memory
> intensive?  I read somewhere that it is
> about three orders of magnitude faster than normal operation.  If
> so, why not use it for the normal operation as well?
>
> Many thanks.
>
>
>
>
>





Webfarm and Index Location

2005-10-24 Thread msftblows
Hey-
 
I would like to store my index in one location, and then have all my IIS 
servers on the farm call that one index. Basically, I am looking for the best 
approach here...and any ideas anyone has...
 
Options:
 
1. Store the index on a SAN and have each server read that location... seems this is
an issue because on the SAN I cannot have more than one shared drive per
computer calling it... I would need a LUN for each.
 
2. Store the index on a shared drive (not on a SAN), and then cluster the box that
I store the index on... will this work? What is the overhead of a shared-drive call?
 
3. Make a webservice call
 
4. Make a remoting call
 
Anything else?
 
Regards!
-Joe


Re: Recommendation on Reading or Websites or Examples of How to Use Lucene?

2005-10-24 Thread Grant Ingersoll
I find the unit tests in the actual code to be quite helpful.  Have also 
found various talks and articles sprinkled throughout the web.  I think 
the Wiki lists some of these, see:

http://wiki.apache.org/jakarta-lucene/HowTo
http://wiki.apache.org/jakarta-lucene/Resources


Sam Lee wrote:


Hi,
 Do you guys have good recommendations on websites
that have detailed explanations about how to use Lucene?
If they have source examples too, that would be great.
I already read the book Lucene in Action.


Many thanks.






indexwriter and index searcher

2005-10-24 Thread Dan Adams
If I have a directory open and I open an index writer and add a document
do I have to close the directory and re-open it before I can open a
searcher and have the new document be included in the search?

In general, is it good to keep the directory open, or is it better to
open the directory each time you need a searcher or writer or something?

-- 
Dan Adams
Software Engineer
Interactive Factory





Re: indexwriter and index searcher

2005-10-24 Thread Erik Hatcher


On 24 Oct 2005, at 10:07, Dan Adams wrote:
> If I have a directory open and I open an index writer and add a
> document do I have to close the directory and re-open it before I can
> open a searcher and have the new document be included in the search?

Yes.

> In general, is it good to keep the directory open or is it better to
> open the directory each time you need a searcher or writer or something.


In general it all depends :)

But, it is best to keep IndexSearcher cached over multiple searches  
and only recreate it when the index changes and you need to reflect  
those changes with future searches.
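One common way to implement that caching is sketched below against the 1.4-era API; IndexReader.getCurrentVersion() is used here to detect index changes, so treat this as a sketch and verify the calls against your Lucene version:

```java
// Cache one IndexSearcher and reopen it only when the index has changed.
public class SearcherCache {
    private IndexSearcher searcher;
    private long version = -1;
    private final String indexDir;

    public SearcherCache(String indexDir) {
        this.indexDir = indexDir;
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        // Reads the index version without opening a full reader.
        long current = IndexReader.getCurrentVersion(indexDir);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close();  // NB: unsafe if other threads are mid-search
            }
            searcher = new IndexSearcher(indexDir);
            version = current;
        }
        return searcher;
    }
}
```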


Erik





Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread markharw00d
> If so, why not use it for the normal operation as well?

Because MemoryIndex only allows you to store/query one document.
It is fast, but I would not suggest running 10,000 queries against it.

Why not try storing the queries as documents in a special index and
querying them using the subject document?
The results will be a rough short-list of the queries you now need to
run (i.e. fewer than 10,000!).  Put the subject document, e.g. "i sell
red nike shoes", into a memory index, then run the selected queries
against it.

These queries may have mandatory clauses (e.g. +/- operators) which may
cause them to fail when run as queries against the MemoryIndexed subject
doc, which is why the first "query the queries" search is insufficient
to find the matches.
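The second stage of this two-step scheme might look roughly like the following. MemoryIndex lives in Lucene's contrib area (org.apache.lucene.index.memory); the addField/search calls and the `shortlistedQueries` collection are a sketch of the idea, not a verified recipe:

```java
// Score each shortlisted query against a single subject document
// held entirely in memory.
MemoryIndex subject = new MemoryIndex();
subject.addField("content", "i sell red nike shoes", new StandardAnalyzer());

for (Iterator it = shortlistedQueries.iterator(); it.hasNext(); ) {
    Query query = (Query) it.next();
    float score = subject.search(query);  // 0.0f means the query did not match
    if (score > 0.0f) {
        // This stored query really matches, mandatory clauses included.
    }
}
```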


Cheers,
Mark





Re: indexwriter and index searcher

2005-10-24 Thread MALCOLM CLARK

Hi all,

I am relatively new to (and scared by) Lucene, so please don't flame me. I have
abandoned Digester and am now just using other SAX stuff.

I have used the sandbox stuff to parse an XML file with SAX, which then bungs it
into a document in a Lucene index. The bit I'm stuck on is how an
elementBuffer is split up into several items. I have an elementBuffer with three
'article' documents, but it only shows as one when using Luke to view the index.

Please advise.

Thanks very much.

MC



Re: indexwriter and index searcher

2005-10-24 Thread Erik Hatcher
I think you really need to show us some code.  If your XML documents  
are small enough, then perhaps DOM (via JDOM) would be a much simpler  
way to navigate XML via XPath.


Erik




Non-scoring fields

2005-10-24 Thread Maik Schreiber
Hi,

Just a quick question: How do I add non-scoring fields to a query? Set boost to 
0?

To be more specific, my documents have a "permissions" field containing the
names of groups who are allowed to access the document. When searching, I
search for the particular user's group (a user is in exactly one group).
Searching in the "permissions" field adds to the score, however, so that more
restrictive documents (having fewer groups in the field) tend to get a higher
score, thus showing up more towards the top of the list. I just don't want
that, though...

-- 
Maik Schreiber   *   http://www.blizzy.de

GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713




Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread Sam Lee
How much of a performance impact is there if I store the queries as
documents first?

Actually, I just thought of a way to first select
queries with certain quality before doing the MemoryIndex step,
so it will trim the list to much less than 10,000.

But has anyone used MemoryIndex?  I need some
real-world examples that can tell me how fast
MemoryIndex is before I decide to use it, like # of
queries/sec and the cpu and memory they use, etc.
I searched all over Google but can't find any.




Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread Christophe

Hi, Sam,

Is there a reason you couldn't build a test case and try it, in your  
environment and on your hardware?  That seems to be the only way to  
really answer the question.





Cross-field multi-word and query

2005-10-24 Thread Maxim Patramanskij

I have the following problem:

I need to construct, programmatically, a Boolean query against n fields
given m words in my query.

All possible unique combinations (sub-queries) are disjunctive between
each other, while the boolean clauses within each combination are combined
with the AND operator.

The reason for such complexity is that I have to find the result of an AND
query against several fields, when parts of my query could appear in
different fields, and I can't create just one single combined field because
each field has its own boost level.

Does anyone have experience writing such a query builder?

Best regards,
 Maxim
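As a starting point, the enumeration itself (every assignment of the m words to the n fields, OR'ed together, with AND inside each assignment) can be sketched in plain Java. The sketch emits query-parser syntax with illustrative field names; per-field boosts could be appended as ^boost, or the same loop could build a BooleanQuery directly:

```java
public class CrossFieldSketch {
    // Build "(f:w1 AND f:w2) OR ..." over all n^m word-to-field assignments.
    public static String build(String[] fields, String[] words) {
        StringBuilder q = new StringBuilder();
        int n = fields.length;
        int total = (int) Math.pow(n, words.length);  // n^m combinations
        for (int mask = 0; mask < total; mask++) {
            if (mask > 0) q.append(" OR ");
            q.append('(');
            int x = mask;
            for (int w = 0; w < words.length; w++) {
                if (w > 0) q.append(" AND ");
                // Digit w of mask, in base n, picks the field for word w.
                q.append(fields[x % n]).append(':').append(words[w]);
                x /= n;
            }
            q.append(')');
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new String[] {"title", "body"},
                                 new String[] {"red", "shoes"}));
    }
}
```

Note that the clause count grows as n^m, so for many fields and words a MultiFieldQueryParser-style OR per word (rather than full enumeration) may be a more practical compromise.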





Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread mark harwood
It is fast.
>> so, why not use it for the normal operation as well?

Because it only stores one document.

Given the number of queries you have, I'm not sure I'd
run them all. How about putting them as docs into a
categorisation index, then using the subject document
as a query to select a subset of the queries you need
to run?
This should give you a rough shortlist of queries; then
you can run them all against the one memory-indexed
subject document to see if they *really* match, i.e. if
the mandatory/AND statements are all satisfied.

Cheers,
Mark






lucene and databases

2005-10-24 Thread Rick Hillegas

Thanks to Yonik for replying to my last question about queries and filters.

Now I have another issue. I would appreciate any pointers to attempts to 
integrate Lucene with databases. There's a tantalizing reference to a 
class called JDBCDirectory mentioned at 
http://wiki.apache.org/jakarta-lucene/LatestNews. However, my browser 
times out trying to access the follow-up link 
http://ppinew.mnis.com/jdbcdirectory. An email thread 
(http://www.mail-archive.com/java-user@lucene.apache.org/msg01036.html) 
makes me hope that this class helps an application index a body of 
documents stored in a relational database. But this class, perhaps a 
cousin of FSDirectory and RAMDirectory, doesn't seem to be part of 
Lucene proper.


In any event, I would appreciate pointers to people's experience 
integrating Lucene with relational databases. I realize this is a very 
broad question. It sweeps up topics like the following:


o Indexing content that is stored in a dbms

o Wrapping filters around the results of sql queries

o Integrating Lucene query syntax with sql query syntax

o Practical tips about when to expose information as a Lucene field vs. 
when to expose that  information as a column in a relational table


Thanks,
-Rick




Re: lucene and databases

2005-10-24 Thread Chris Lu
JDBCDirectory doesn't help you index content in an RDBMS.
It just stores the Lucene index in an RDBMS. This approach will be
slower than the file-system-based approach.

For your first question, "Indexing content that is stored in a dbms",
you can take a look at DBSight. It's a generic tool to easily extract
content from a database and build an index, which seems simple, but
behind the scenes it does more than that, including multi-threaded
extraction and search, multi-index support, template-based search
results, scheduled index updating, web-based control and configuration,
remote index replication, etc.

Chris Lu
--
Lucene Full-Text Search on Any Database
http://www.DBSight.net





Re: lucene and databases

2005-10-24 Thread Steven Rowe
Code and examples for embedding Lucene in HSQLDB and Derby relational 
databases:






Re: indexwriter and index searcher

2005-10-24 Thread Otis Gospodnetic
Hi Malcolm,

--- MALCOLM CLARK <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I am relatively new and scared by Lucene so please don't flame me.

I won't flame, but you just hijacked somebody else's thread. Baaad boy!

> I have abandoned Digester and am now just using other SAX stuff.

No need to be afraid of Lucene nor SAX.  Have you got a copy of Lucene
in Action?  If not, consider getting at least a PDF version (cheaper)
and peeking at its code (free) - see http://lucenebook.com/ .  If you
look at chapter 7, you'll get most of the code you need for indexing
XML.

> I have used the sandbox stuff to parse an XML file with SAX which
> then bungs it into a document in a Lucene index.The bit I'm stuck on
> is how is a elementBuffer split up into several items.I have a
> elementBuffer with three 'article' documents but only shows as one
> when using Luke to view the index? 
> please advise.

I'm not sure what you are trying to do, so take a look at the LIA code.  If
that doesn't help, start a new thread, show the code, and somebody may
be able to provide some advice.

Otis




Re: Non-scoring fields

2005-10-24 Thread Daniel Naber
On Monday, 24 October 2005 14:29, Maik Schreiber wrote:

> Just a quick question: How do I add non-scoring fields to a query? Set
> boost to 0?

Yes, just use permissions:blah^0
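Programmatically, the equivalent looks roughly like this (a sketch against the 1.4-era BooleanQuery API, where add() takes required/prohibited flags; `userQuery` and `userGroup` are illustrative names):

```java
BooleanQuery query = new BooleanQuery();
query.add(userQuery, true, false);  // required; contributes to the score

// Required permissions clause with boost 0: it must match,
// but it adds nothing to the score.
TermQuery perms = new TermQuery(new Term("permissions", userGroup));
perms.setBoost(0.0f);
query.add(perms, true, false);
```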

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Non-scoring fields

2005-10-24 Thread Maik Schreiber

>> Just a quick question: How do I add non-scoring fields to a query? Set
>> boost to 0?
>
> Yes, just use permissions:blah^0

Cool, thanks.

--
Maik Schreiber   *   http://www.blizzy.de

GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713




RE: Non-scoring fields

2005-10-24 Thread Mordo, Aviran (EXP N-NANNATEK)
You can also use a Filter to filter your results. As far as I know,
a Filter does not affect the score.

HTH

Aviran

http://www.aviransplace.com

-Original Message-
From: Maik Schreiber
[mailto:[EMAIL PROTECTED] 
Sent: Monday, October 24, 2005 2:24 PM
To: java-user@lucene.apache.org
Subject: Re: Non-scoring fields

>>Just a quick question: How do I add non-scoring fields to a query? Set

>>boost to 0?
> 
> Yes, just use permissions:blah^0

Cool, thanks.

-- 
Maik Schreiber   *   http://www.blizzy.de

GPG public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713




Re: Non-scoring fields

2005-10-24 Thread Andrzej Bialecki

Daniel Naber wrote:

> On Monday, 24 October 2005 14:29, Maik Schreiber wrote:
>> Just a quick question: How do I add non-scoring fields to a query? Set
>> boost to 0?
>
> Yes, just use permissions:blah^0


However, a side effect of this is that Explanations are broken (return 
always "0.0: match required").



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Is there a way to get absolutely exact phrase matching (no stop words, etc)

2005-10-24 Thread Bob Mason

We have a large body of documents that have xml
and ocr embedded within one of the xml fields.

Searches such as "group effect"

are returning hits for docs such as ones that include the following:

 ...group of ~a- The effect...

because, I take it, stop words like 'of' and 'the' and punctuation
are ignored. Is there anything I can do about this other
than write an alternative to the Standard Analyzer?

thanks,

Bob Mason
UCSF Tobacco Industry Digital Library






Index downwards compatible?

2005-10-24 Thread Eva Rissmann
Hi all,
currently we are using Lucene 1.3, but soon we'd like to switch to
Lucene 1.4. Can the old index be used, or does it have to be recreated?
And what about Lucene 1.9/2.0? Is the index backward compatible?

Thanks
Eva





Re: Is there a way to get absolutely exact phrase matching (no stop words, etc)

2005-10-24 Thread Steven Rowe

Hi Bob,

StandardAnalyzer filters the token stream created by StandardTokenizer 
through StandardFilter, LowercaseFilter, and then StopFilter.  Unless 
you supply a stoplist to the StandardAnalyzer constructor, you get the 
default set of English stopwords, from StopAnalyzer:


  public static final String[] ENGLISH_STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
  };

One approach to the problem you're seeing is to advance the token 
position in StopFilter with each stopword encountered, so that phrase 
queries like


   "group effect"

will fail to match against

   "...group of ~a- The effect..."

because the positions for tokens "group" and "effect" would not be adjacent.

(My naive reading of StandardTokenizer.jj, the JavaCC grammar used to 
create StandardTokenizer.java, is that "~a-" will generate a single 
token "a", which will then be filtered out by StopFilter.)


A patch implementing this approach was actually applied to 
StopFilter.java in late 2003, but was reverted shortly afterward, 
because this approach conflicts with the QueryParser and PhraseQuery 
implementations.


See Doug Cutting's description of the problem with the position 
increment modification approach here:



See a colored diff of StopFilter.java, just before and after the 
position increment modification patch was reverted, here:



This modification is simple and straightforward.  You could make the 
same changes to a local copy of StopFilter (call it PosIncrStopFilter), 
then create and use a StandardAnalyzer clone that uses PosIncrStopFilter 
instead of StopFilter.
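A minimal sketch of that local copy, written against the 1.4-era TokenStream API (filters override next() and read from the protected `input` stream; treat the exact signatures as approximate and compare with the reverted patch):

```java
public class PosIncrStopFilter extends TokenFilter {
    private final Set stopWords;

    public PosIncrStopFilter(TokenStream in, Set stopWords) {
        super(in);
        this.stopWords = stopWords;
    }

    public Token next() throws IOException {
        int skipped = 0;
        for (Token t = input.next(); t != null; t = input.next()) {
            if (!stopWords.contains(t.termText())) {
                // Carry the gap left by removed stopwords, so phrase
                // queries no longer see "group" and "effect" as adjacent
                // in "group of ~a- The effect".
                t.setPositionIncrement(t.getPositionIncrement() + skipped);
                return t;
            }
            skipped++;
        }
        return null;
    }
}
```

As the message notes, the query side (QueryParser/PhraseQuery) must also account for the position gaps, which is why the original patch was reverted.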


Good luck,
Steve Rowe




Re: lucene and databases

2005-10-24 Thread Chris Lu
Also, you can try Compass. I remember it stores the index in the database
when you use Hibernate.

Chris Lu
--
Lucene Full-Text Search on Any Database
http://www.DBSight.net




Delete doesn't delete?

2005-10-24 Thread Dan Quaroni
I know there's a little bit of trickery when it comes to deletes (i.e. a
deleted document is still in the index until optimize, still visible to open
readers, etc.); however, I'm having this problem:

I've implemented a call to delete by term.  It tells me that it deleted 1
document, but then I open a new reader on the index, search for the document,
and I find it.  Confused, I run the delete again, and it once again tells me 1
document was deleted.  And still, I can open a new searcher and find it.

I've tried closing the reader that I used to delete it, still no luck.

Any suggestions?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Indexing problem - empty index files!

2005-10-24 Thread Samuel Jackson
Hi to all!

I'm new to Lucene and wanted to create a sample
application to index certain database fields.

But there seems to be some problem because the created
files in the index target directory are only 1kb -->
So I don't get any results of course.


Here is what I did - can anyone give me a hint whats
wrong?

for (Iterator iter = someData.iterator();
iter.hasNext(); ) { 
Item item = (Item) iter.next();
Document doc = new Document();
doc.add(Field.Text("id",
Long.toString(item.getId())));
doc.add(Field.Text("title", item.getTitle()));
doc.add(Field.Text("description", item.getTitle()));
try {
final IndexWriter writer = new
IndexWriter(indexLocation, new StandardAnalyzer(),
true);
writer.addDocument(doc);
writer.optimize();
writer.close();
indexCreated = true;
}
catch (IOException e) {
e.printStackTrace();
}
}
___ 
Gesendet von Yahoo! Mail - Jetzt mit 1GB Speicher kostenlos - Hier anmelden: 
http://mail.yahoo.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene and databases

2005-10-24 Thread Rick Hillegas

Thanks, Chris. I have a couple follow-on questions:

1) Thanks for the pointer to DBSight.net. It seems that DBSight has 
built some integration support for MySQL. Do you know if there are any 
plans to build integration support for Derby, the Apache open source 
database (http://db.apache.org/derby/)?


2) I'm afraid I don't understand the reference to Compass. Can you point 
me to a URL?


3) Thanks for clarifying what JDBCDirectory does. Is this a dead 
technology? All of my attempts to access the JDBCDirectory url timeout.


Regards,
-Rick

Chris Lu wrote:


Also, you can try Compass. I remember it stores the index when you use
hibernate.

Chris Lu
--
Lucene Full-Text Search on Any Database
http://www.DBSight.net

On 10/24/05, Chris Lu <[EMAIL PROTECTED]> wrote:
 


JDBCDirectory doesn't help you index content that lives in an RDBMS.
It just stores the Lucene index in the RDBMS. This approach will be
slower than the file-system-based approach.

For your first question, "Indexing content that is stored in a dbms",
you can take a look at DBSight. It's a generic tool to easily extract
content from database and build an index, which seems simple, but
behind the scene, it does more than that, including, multi-threaded
extraction and search, multi index support, template-based search
result, scheduled index updating, web-based control and configuration,
remote index replication, etc.

Chris Lu
--
Lucene Full-Text Search on Any Database
http://www.DBSight.net


On 10/24/05, Rick Hillegas <[EMAIL PROTECTED]> wrote:
   


Thanks to Yonik for replying to my last question about queries and filters.

Now I have another issue. I would appreciate any pointers to attempts to
integrate Lucene with databases. There's a tantalizing reference to a
class called JDBCDirectory mentioned at
http://wiki.apache.org/jakarta-lucene/LatestNews. However, my browser
times out trying to access the follow-up link
http://ppinew.mnis.com/jdbcdirectory. An email thread
(http://www.mail-archive.com/java-user@lucene.apache.org/msg01036.html)
makes me hope that this class helps an application index a body of
documents stored in a relational database. But this class, perhaps a
cousin of FSDirectory and RAMDirectory, doesn't seem to be part of
Lucene proper.

In any event, I would appreciate pointers to people's experience
integrating Lucene with relational databases. I realize this is a very
broad question. It sweeps up topics like the following:

o Indexing content that is stored in a dbms

o Wrapping filters around the results of sql queries

o Integrating Lucene query syntax with sql query syntax

o Practical tips about when to expose information as a Lucene field vs.
when to expose that  information as a column in a relational table

Thanks,
-Rick

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene and databases

2005-10-24 Thread Rick Hillegas
Thanks, Steven. This is an interesting approach which looks like it gets 
the user up and running pretty fast.


Cheers,
-Rick

Steven Rowe wrote:

Code and examples for embedding Lucene in HSQLDB and Derby relational 
databases:



Rick Hillegas wrote:

Thanks to Yonik for replying to my last question about queries and 
filters.


Now I have another issue. I would appreciate any pointers to attempts 
to integrate Lucene with databases. There's a tantalizing reference 
to a class called JDBCDirectory mentioned at 
http://wiki.apache.org/jakarta-lucene/LatestNews. However, my browser 
times out trying to access the follow-up link 
http://ppinew.mnis.com/jdbcdirectory. An email thread 
(http://www.mail-archive.com/java-user@lucene.apache.org/msg01036.html) 
makes me hope that this class helps an application index a body of 
documents stored in a relational database. But this class, perhaps a 
cousin of FSDirectory and RAMDirectory, doesn't seem to be part of 
Lucene proper.


In any event, I would appreciate pointers to people's experience 
integrating Lucene with relational databases. I realize this is a 
very broad question. It sweeps up topics like the following:


o Indexing content that is stored in a dbms

o Wrapping filters around the results of sql queries

o Integrating Lucene query syntax with sql query syntax

o Practical tips about when to expose information as a Lucene field 
vs. when to expose that  information as a column in a relational table


Thanks,
-Rick




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing problem - empty index files!

2005-10-24 Thread Koji Sekiguchi
Samuel,

The IndexWriter should be opened once and kept open
until all documents have been added, then closed.

modified sample code:

final IndexWriter writer = new IndexWriter(indexLocation, new
StandardAnalyzer(),true);

for (Iterator iter = someData.iterator(); iter.hasNext(); ) {
Item item = (Item) iter.next();
Document doc = new Document();
doc.add(Field.Text("id", Long.toString(item.getId())));
doc.add(Field.Text("title", item.getTitle()));
doc.add(Field.Text("description", item.getTitle()));
writer.addDocument(doc);
}
writer.optimize();
writer.close();

Koji

> -Original Message-
> From: Samuel Jackson [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 25, 2005 8:34 AM
> To: java-user@lucene.apache.org
> Subject: Indexing problem - empty index files!
>
>
> Hi to all!
>
> I'm new to Lucene and wanted to create a sample
> application to index certain database fields.
>
> But there seems to be some problem because the created
> files in the index target directory are only 1kb -->
> So I don't get any results of course.
>
>
> Here is what I did - can anyone give me a hint whats
> wrong?
>
> for (Iterator iter = someData.iterator();
> iter.hasNext(); ) {
> Item item = (Item) iter.next();
> Document doc = new Document();
> doc.add(Field.Text("id",
> Long.toString(item.getId())));
> doc.add(Field.Text("title", item.getTitle()));
> doc.add(Field.Text("description", item.getTitle()));
> try {
> final IndexWriter writer = new
> IndexWriter(indexLocation, new StandardAnalyzer(),
> true);
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
> indexCreated = true;
> }
> catch (IOException e) {
> e.printStackTrace();
> }
> }
>
>
>
>
>
>
>
>
>
>
> ___
> Gesendet von Yahoo! Mail - Jetzt mit 1GB Speicher kostenlos -
> Hier anmelden: http://mail.yahoo.de
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index downwards compatible?

2005-10-24 Thread Otis Gospodnetic
Eva,

Please see the CHANGES file (you can see it directly in the
repository), where we record all important changes to the code,
including index compatibility changes.

Otis


--- Eva Rissmann <[EMAIL PROTECTED]> wrote:

> Hi all,
> currently we are using Lucene 1.3, but soon we'd like to switch to
> Lucene 1.4. Can the old index be used or does it have to be
> recreated?
> And what about Lucene 1.9/2.0. Is the index downwards compatible?
> 
> Thanks
> Eva
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Another index corruption problem

2005-10-24 Thread Bill Tschumy
Many months ago I wrote this list about a corrupted index that one of  
my customers had.  It was a mystery that was never really solved.   
Well, it has happened again and the stack trace looks almost  
identical.  Here is the exception:


java.io.FileNotFoundException: /Users/samegan/Library/Preferences/Parsnips/IndexData/_1d.fnm (No such file or directory)

at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:122)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:106)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:43)
at com.otherwise.parsnips.MySearcher.getSearcher(MySearcher.java:96)
at com.otherwise.parsnips.IndexUpdater.checkIndexVersion(IndexUpdater.java:35)
at com.otherwise.parsnips.Parsnips.initIndex(Parsnips.java:1043)
at com.otherwise.parsnips.Parsnips.<init>(Parsnips.java:148)
at com.otherwise.parsnips.Parsnips.mainInEventThread(Parsnips.java:1158)
at com.otherwise.parsnips.Parsnips$4.run(Parsnips.java:1115)
at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:189)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:478)
at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:234)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:184)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:178)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:170)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:100)

This is a compound index, so I guess the .fnm file it is looking for  
is internal to it or temporary in some way.  The customer thinks the  
problem was caused by accidentally pasting an entire document into  
the "title" field and saving.  I kind of doubt this caused the  
problem, but you never know.  I treat the "title" and the "body"  
identically for indexing.


The person is pretty panicked about his lost data.  Does anyone have  
any hints as to how to edit the file to get it back functioning  
again?  I've heard of people using hex editors for this.


After solving his immediate problem, I need to figure out why this is  
happening.  I haven't followed this list for a couple of months.  Has  
anything like this come up recently?  I am using Lucene-1.4.3.

--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Frustrated with tokenized listing terms

2005-10-24 Thread Chris Hostetter

: I've solved this by indexing the field twice, once as author:(searchable/not
: stored/tokenized)
: and once as author_phrased:(not searchable/stored/not tokenized).

: This works, but is it the proper way to do it?

It's the most effective/efficient method i can think of.
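
For readers following along, a minimal sketch of the dual-field approach using
the Lucene 1.4-era static Field factories (the `author_phrased` name comes from
the original post; note Field.Keyword both stores and indexes the value
untokenized, which is close to, though not identical with, the poster's
"not searchable" variant):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Tokenized copy for word-level search (author:smith), plus an
// untokenized copy whose index terms are the complete author names.
Document doc = new Document();
doc.add(Field.Text("author", "John Smith"));            // analyzed, searchable
doc.add(Field.Keyword("author_phrased", "John Smith")); // single term: "John Smith"
```

Enumerating the terms of `author_phrased` with IndexReader.terms() then yields
the distinct full names rather than individual words.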



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Webfarm and Index Location

2005-10-24 Thread Chris Hostetter

This thread proved very useful for several of us when discussing this
topic in the past...

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200503.mbox/[EMAIL 
PROTECTED]



: Date: Mon, 24 Oct 2005 07:06:49 -0400
: From: [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Webfarm and Index Location
:
: Hey-
:
: I would like to store my index in one location, and then have all my IIS 
servers on the farm call that one index. Basically, I am looking for the best 
approach here...and any ideas anyone has...
:
: Options:
:
: 1. Store index on SAN and have each server call that location...seems this is 
an issue because on the SAN I can not have more than one shared drive per 
computer calling it...would need LUN for each.
:
: 2. Store index on a shared drive (not on a SAN), and then cluster that box 
that I store the index on...will this work? What over-head for a shared drive 
call?
:
: 3. Make a webservice call
:
: 4. Make a remoting call
:
: Anything else?
:
: Regards!
: -Joe
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Delete doesn't delete?

2005-10-24 Thread Chris Hostetter

can you provide some sample code that demonstrates this problem? ...
preferably something that uses hardcoded data to build up an index in a
RAMDirectory so anyone can run the test without needing any external data?

it would make it a lot easier for other people to help you diagnose what's
going on.
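
As a starting point, a self-contained sketch of the delete/reopen sequence that
should behave as expected (Lucene 1.4-era API; the "id"/"42" values are made up
for illustration). Two common pitfalls: deletions are only committed when the
deleting reader is closed, and the Term value must exactly match the *indexed*
token, so an analyzed field may carry a transformed value:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(Field.Keyword("id", "42"));   // untokenized, so the indexed term is exactly "42"
writer.addDocument(doc);
writer.close();

IndexReader reader = IndexReader.open(dir);
int deleted = reader.delete(new Term("id", "42"));
reader.close();                        // deletions are committed on close

IndexSearcher searcher = new IndexSearcher(dir);
Hits hits = searcher.search(new TermQuery(new Term("id", "42")));
// with the deleting reader closed first, hits.length() should be 0
```

If the delete keeps reporting 1 but the document keeps turning up, it's worth
checking whether the term actually identifies the document you think it does.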


: Date: Mon, 24 Oct 2005 17:41:25 -0400
: From: Dan Quaroni <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Delete doesn't delete?
:
: I know there's a little bit of trickery when it comes to deletes (i.e. a deleted 
document is still in the index until optimize, still available to open readers, etc.), 
however I'm having this problem:
:
: I've implemented a call to delete by term.  It tells me that it deleted 1 
item, but then I go and open a new reader on the index, search for this 
document, and I find it.  Confused, I run the delete again, and it once again 
tells me 1 document was deleted.  And still, I can open a new searcher and search 
for it.
:
: I've tried closing the reader that I used to delete it, still no luck.
:
: Any suggestions?
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Cross-field multi-word and query

2005-10-24 Thread Chris Hostetter

I may be wrong, but i think what you are talking about is a BooleanQuery
containing several MaxDisjunctionQuery.  take a look at the code in this
patch...

http://issues.apache.org/jira/browse/LUCENE-323
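
Sketched with stock Lucene 1.4 classes, a plain BooleanQuery disjunction
standing in for the patch's MaxDisjunctionQuery (field names, boosts, and words
below are made up for illustration):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

String[] fields = {"title", "body"};
float[] boosts  = {2.0f, 1.0f};
String[] words  = {"nike", "shoes"};

// AND across words; OR (disjunction) across fields for each word.
BooleanQuery query = new BooleanQuery();
for (int w = 0; w < words.length; w++) {
    BooleanQuery perWord = new BooleanQuery();
    for (int f = 0; f < fields.length; f++) {
        TermQuery tq = new TermQuery(new Term(fields[f], words[w]));
        tq.setBoost(boosts[f]);
        perWord.add(tq, false, false);  // SHOULD: word may match any field
    }
    query.add(perWord, true, false);    // MUST: every word must match somewhere
}
```

Swapping each per-word BooleanQuery for a MaxDisjunctionQuery keeps the same
structure but scores each word by its best matching field instead of summing
across fields.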


: Date: Mon, 24 Oct 2005 20:13:55 +0300
: From: Maxim Patramanskij <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, Maxim Patramanskij <[EMAIL PROTECTED]>
: To: java-user@lucene.apache.org
: Subject: Cross-field multi-word and query
:
:
: I have the following problem:
:
: I need to construct programmatically a Boolean query against n fields
: having m words in my query.
:
: All possible unique combinations (sub-queries) are disjunctive between
: each other, while the boolean clauses within each combination combine with the
: AND operator.
:
: The reason of such complexity is that I have to find a result of AND
: query against several field, when parts of my query could appear in
: different fields and I can't create just one single field because each
: field has its own boost level.
:
: Does anyone have an experience of writing such query builder?
:
: Best regards,
:  Maxim
:
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]