
Newbie question: optimized files?

2011-01-10 Thread sol myr
Hi, 

I'm new to Lucene (using 3.0.3), and just started to check out the behavior of 
the 'optimize()' method (which is quite important for our application).
Could it be that 'optimize' cancels out the 'compoundFile' mode? Or am I doing 
something wrong? 

Here's my test: I create an indexWriter with compoundFile=true, then perform 
some writes+commits (which generates several 'cfs' files).
Then I call 'optimize()'... I expected this to yield a single optimized 'cfs' 
file, but instead I get lots of different files - 'fdt', 'fdx', 'fnm' etc...

The detailed code:
// Create indexWriter with 'compoundfile=true':
Version version=Version.LUCENE_30;
Directory dir = FSDirectory.open(new File("c:/luceneTemp"));
Analyzer analyzer = new StandardAnalyzer(version);
IndexWriter writer = new IndexWriter(
 dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
writer.setUseCompoundFile(true);

// Perform some writes + commits.
// This yields several 'cfs' files as expected:
writer.addDocument(...);
writer.commit();
writer.addDocument(...);
writer.commit();
Thread.sleep(2000);  // two seconds, to give myself time to see the generated 'cfs' files

// Optimize - why does it yield lots of separate files ('fdt', 'fdx', etc)?
writer.optimize();


Thanks.



  

Re: Newbie question: optimized files?

2011-01-10 Thread sol myr
Hi,

Continuing my question - I now suspect a bug in Lucene 3.0.3, because I ran the 
test with Lucene 3.0.0 and it worked okay (no junk files)... could anyone 
please confirm?


Re: Newbie question: optimized files?

2011-01-12 Thread sol myr
That was it - thanks :)

--- On Tue, 1/11/11, Simon Willnauer  wrote:

From: Simon Willnauer 
Subject: Re: Newbie question: optimized files?
To: java-user@lucene.apache.org
Date: Tuesday, January 11, 2011, 12:06 AM

Hey,

this looks like you are hitting an optimization introduced in LUCENE-2773
(https://issues.apache.org/jira/browse/LUCENE-2773): a merged segment that is
larger than 10% of the index size (by default - see
LogMergePolicy#noCFSRatio) is left in non-compound format, even if compound
format is turned on.

for some background see:
http://lucene.apache.org/java/3_0_3/changes/Changes.html#3.0.3.changes_in_runtime_behavior

so it's a feature, not a bug :)

Simon
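If you really do want everything back in a single .cfs file, the ratio is adjustable. A hypothetical configuration sketch against the 3.0.3 API (assuming the writer is using the default LogMergePolicy; this is a fragment, not a complete program):

```java
// Sketch only: raise the no-CFS ratio so that even a fully optimized (large)
// segment is written in compound format. 0.1 (10%) is the 3.0.3 default.
LogMergePolicy policy = (LogMergePolicy) writer.getMergePolicy();
policy.setNoCFSRatio(1.0); // 1.0 = always use the compound file format
```

The trade-off is that very large compound files cost extra I/O to build, which is exactly why the default leaves big merged segments in multi-file format.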


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




  

Maintaining index for "flattened" database tables

2011-01-13 Thread sol myr
Hi,

Our main data storage is MySql tables.
We index it in Lucene in order to improve the search (boosting, approximate 
spelling, etc).
We naturally maintain it - for example, to insert a new "Contract" entity, we 
have:
addContract(Contract cont){
    // INSERT into MySQL:
    hibernateSession.save(cont);
    // Index in Lucene (note: Document's method is add(), not addField()):
    Document d = new Document();
    d.add(new Field("title", cont.getTitle(), ...));
    d.add(new Field("body", cont.getBody(), ...));
    indexWriter.addDocument(d);
    ...
}

This works, but becomes a maintenance nightmare when the index fields are 
collected from JOINS. For example, we also want Lucene to search contracts by 
"author name", which is taken from another table AUTHOR (through Foreign Key). 
And then if the author name changes (e.g. through marriage), you must remember 
to re-index all her contracts...!

Does anyone have experience with such a database/Lucene combination, and know 
of a solution that would be reasonable to maintain?
Thanks.


  

Re: Maintaining index for "flattened" database tables

2011-01-13 Thread sol myr
Sounds good, thanks :)

--- On Thu, 1/13/11, mark harwood  wrote:

From: mark harwood 
Subject: Re: Maintaining index for "flattened" database tables
To: java-user@lucene.apache.org
Date: Thursday, January 13, 2011, 6:20 AM

Probably off-topic for a Lucene list, but the typical database options are:
1) an auto-updated "last changed" timestamp column on related tables that can 
be queried
2) a database trigger automatically feeding a "to-be-indexed" table

Option 1 would also need a "marked as deleted" column added to help sync 
things up, whereas option 2 can automatically record create/update/delete 
changes cleanly in a separate table.

Either of these options helps you to "remember to reindex" just the changed 
items.

Cheers
Mark
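As a toy illustration of option 1 (the Row class and its values are hypothetical, not from the thread): the delta pass selects only the rows whose "last changed" timestamp is newer than the previous index run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeltaSelector {
    // Hypothetical row shape: id, "last changed" timestamp, deleted flag.
    static final class Row {
        final long id;
        final long lastChangedMillis;
        final boolean deleted;
        Row(long id, long lastChangedMillis, boolean deleted) {
            this.id = id;
            this.lastChangedMillis = lastChangedMillis;
            this.deleted = deleted;
        }
    }

    // Option 1 from the reply: pick only rows whose timestamp is newer than
    // the last successful index run, so we reindex just the changed items.
    static List<Row> changedSince(List<Row> rows, long lastIndexedMillis) {
        List<Row> delta = new ArrayList<>();
        for (Row r : rows) {
            if (r.lastChangedMillis > lastIndexedMillis) {
                delta.add(r); // deleted rows are included so the index can drop them
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, 100, false),
                new Row(2, 500, false),
                new Row(3, 700, true));
        System.out.println(changedSince(rows, 400).size()); // prints 2
    }
}
```

In a real system the selection would be a SQL WHERE clause on the timestamp column; the "marked as deleted" flag is what lets the indexer issue deletes as well as updates.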







  

Newbie: "Life span" of IndexWriter / IndexSearcher?

2011-01-13 Thread sol myr
Hi,

We're writing a web application, which naturally needs
- "IndexSearcher" when users use our search screen
- "IndexWriter" in a background process that periodically updates and optimizes 
our index.
Note our writer is exclusive - no other applications/threads ever write to our 
index files.

What's the common practice in terms of resource creation and sharing?
Specifically:

1) Should I have a single IndexSearcher to serve all (concurrent) users?
I saw such a recommendation in a tutorial, but discovered that an open 
IndexSearcher prevents 'optimize' from merging my files... so should I close it 
just before optimization? Or should I open an individual (short-lived) 
IndexSearcher for each search request?

2) Our tests also imply that IndexWriter.optimize() takes effect only after 
you close() that writer - which is a shame, because I hoped to keep using the 
same writer (I hear it's expensive to instantiate). Am I doing something wrong?

Thanks



  

RE: Newbie: "Life span" of IndexWriter / IndexSearcher?

2011-01-16 Thread sol myr
Hi,

Thank you kindly for replying. 
Unfortunately, reopen() doesn't help me see the changes.
Here's my test:
First I write & commit a document, and run a search - which correctly finds 
this document.
Then I write & commit another document, re-open the reader and run another 
search - this should find 2 documents, but it only finds 1 document (the first 
one).
BTW if instead of 'reader.reopen()' I instantiate a brand-new searcher (and 
reader), it correctly finds 2 documents...

// Shared objects:
Directory directory = FSDirectory.open(new File("c:/myDir"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30); 
IndexWriter writer = new IndexWriter(directory, analyzer, 
 IndexWriter.MaxFieldLength.LIMITED);
Query query =  new TermQuery(new Term("title", "hello"));

// Write document #1:
writer.addDocument(makeDoc("hello world 1")); // Field title="hello world 1"
writer.commit();

// First search (yields document #1 as expected):
IndexReader reader=IndexReader.open(directory, true);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results1 = searcher.search(query, 1);
printResults(searcher, results1);

// Write document #2:
writer.addDocument(makeDoc("hello world 2")); // Field title="hello world 2"
writer.commit();

// Reopen reader, and search (should yield 2 documents, but I only see 1):
reader.reopen(true);
TopDocs results2 = searcher.search(query, 1);
printResults(searcher, results2);


--- On Thu, 1/13/11, Uwe Schindler  wrote:

From: Uwe Schindler 
Subject: RE: Newbie: "Life span" of IndexWriter / IndexSearcher?
To: java-user@lucene.apache.org
Date: Thursday, January 13, 2011, 7:40 AM

You can leave the IndexWriter and IndexSearcher open all the time. The only
important thing is that changes made by IndexWriter's commit() method are only
seen by the IndexSearcher once the underlying IndexReader is reopened (e.g. by
using IndexReader.reopen()). Please note that this only works with direct
access to the IndexReaders, so I would recommend using the constructors of
IndexSearcher that take an IndexReader (the Directory ones are only for easy
beginner's use). 




  

Re: Newbie: "Life span" of IndexWriter / IndexSearcher?

2011-01-16 Thread sol myr
Worked like a charm - thanks a lot.


--- On Sun, 1/16/11, Raf  wrote:

From: Raf 
Subject: Re: Newbie: "Life span" of IndexWriter / IndexSearcher?
To: java-user@lucene.apache.org
Date: Sunday, January 16, 2011, 3:16 AM

Look at the JavaDoc:
http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/index/IndexReader.html#reopen()

The *reopen* method returns a *new reader* if the index has changed since
the original reader was opened.

So, you should do something like this:
IndexReader newReader = reader.reopen(true);
if (newReader != reader) {
    reader.close();
    reader = newReader;
    searcher = new IndexSearcher(reader);
}

instead of
reader.reopen(true);

Bye.
*Raf*




  

Question on writer optimize() / file merging?

2011-01-16 Thread sol myr
Hi,

I'm trying to understand the behavior of file merging / optimization.
I see that whenever my IndexWriter calls 'commit()', it creates a new file (or 
fileS).
I also see these files merged when calling 'optimize()' , as much as allowed by 
the parameter 'NoCFSRatio' .

But I'm still trying to figure out:

1) Will my writer still perform some file merging, even if I don't explicitly 
call 'optimize()'?

2) Is there a way to configure the number of files, or their size?

3) I always keep an open IndexSearcher (and IndexReader). I know they should be 
re-opened when a change occurs, but it's not crucial to see changes 
immediately, so I just poll periodically, and it might be a few minutes before 
my reader is re-opened and allowed to see changes.
But will this approach disturb the writer's ability to optimize / merge files? 
If a reader is open, will it prevent file merging?

Thanks
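On question 1, merging does also happen automatically in the background as segments accumulate, without an explicit optimize() call. On question 2, a hedged sketch of the classic 3.0.x knobs (these are assumptions about the IndexWriter API, not advice taken from a reply in this thread):

```java
// Sketch only (Lucene 3.0.x): tune how eagerly segments are merged.
writer.setMergeFactor(10);     // segments allowed per level before a merge
writer.setMaxMergeDocs(50000); // cap on documents per merged segment
```

A lower merge factor means fewer, larger files (more merge work at write time); a higher one means more files but cheaper writes.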




  

[Lucene] custom Query, and Stop Words

2011-02-09 Thread sol myr
Hi,

I'm building my own BooleanQuery (rather than using Query Parser). That's 
because I need different defaults from my users: 
If a user types:  java program
I need to run the query: +java* +program* (namely AND search, with Prefix so as 
to hit "programS", "programMER").

So naively I split the user's input into words, and build my query term-by-term:
   String[] words = userInput.split(" ");
   BooleanQuery query = new BooleanQuery();
   for (String word : words)
       query.add(new PrefixQuery(new Term("topic", word)), Occur.MUST);

But since my index is built with StandardAnalyzer, I'm having trouble with Stop 
Words.
If the user types: program of java
Then of course my query (+program* +of* +java*) returns zero results.

Is there an easy solution?
I'd like to keep using StandardAnalyzer (I don't want to index by stupid 
keywords such as "of"). I would just like stop-words to be removed from my 
query (as QueryParser does).
Is there some utility method for this? Direct access to the list of stop-words?

Thanks 



 


Re: [Lucene] custom Query, and Stop Words

2011-02-10 Thread sol myr
Thanks so much - I used STOP_WORDS_SET and it works fine (luckily, punctuation 
and case are not a problem in our case).
Thanks !

--- On Wed, 2/9/11, Ian Lea  wrote:

From: Ian Lea 
Subject: Re: [Lucene] custom Query, and Stop Words
To: java-user@lucene.apache.org
Date: Wednesday, February 9, 2011, 1:58 AM

Have you considered using stemming instead?  Sounds like that might
make most of your problems go away and achieve the same result.

I'm not aware of a utility method to remove stop words from a string,
but there are ways of passing data through analyzers/tokenizers and
grabbing the output. StandardAnalyzer stores the stop words in
STOP_WORDS_SET and you might be able to use that directly, or pass in
your own set of stop words, which you would obviously have access to.

If you stick with your split/word-by-word approach, watch out for
punctuation, mixed case and other complications.


--
Ian.
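A minimal, self-contained sketch of the filtering Ian describes, applied before building the term-by-term query. The stop-word set here is a small hypothetical stand-in for StandardAnalyzer's STOP_WORDS_SET, and buildQueryString merely renders the would-be clauses as a string:

```java
import java.util.Locale;
import java.util.Set;

public class StopWordPrefixQuery {
    // Hypothetical subset of StandardAnalyzer.STOP_WORDS_SET.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // Split user input, drop stop words, and render each surviving word as a
    // mandatory prefix clause, e.g. "program of java" -> "+program* +java*".
    static String buildQueryString(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String word : userInput.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(word).append('*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("program of java")); // +program* +java*
    }
}
```

In the real code, each surviving word would become a PrefixQuery added with Occur.MUST, exactly as in the original snippet; lowercasing here mirrors what StandardAnalyzer does to indexed terms.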





  

"shared fields"?

2011-03-09 Thread sol myr
Hi,

I have several documents that share the same (large) searchable data.
For example, say my Documents represent movies, and  2 movies share the same 
actorBiography of Brad Pitt (assuming I want 
to search movies by actorBiography words, far-fetched as it might seem):


Document1:
- movieName="Benjamin Button"
- actorBiography="Brad Pitt was born in 1963 in Oklahoma and raised in..."

Document2:
- movieName="Ocean 11"
- actorBiography="Brad Pitt was born in 1963 in Oklahoma and raised in..."

My question: I'm afraid my index files will become very large, due to the 
duplication of information. Is there any trick that would keep my index files 
in a reasonable size, while still allowing the functionality of "search movie 
by actorBiography"?
Thanks :)
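One common trick (a hypothetical sketch, not something proposed in this thread) is to store each large shared text only once, keyed by its content, and give every movie document just a short reference field; a separate lookup or a second, small index resolves the shared text. The dedup bookkeeping looks like:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedFieldDedup {
    private final Map<Integer, String> bioById = new HashMap<>();
    private final Map<String, Integer> idByBio = new HashMap<>();
    private int nextId = 0;

    // Return a stable id for the biography text, storing the text only once.
    int internBiography(String bio) {
        Integer existing = idByBio.get(bio);
        if (existing != null) return existing;
        int id = nextId++;
        idByBio.put(bio, id);
        bioById.put(id, bio);
        return id;
    }

    int storedCount() { return bioById.size(); }

    public static void main(String[] args) {
        SharedFieldDedup d = new SharedFieldDedup();
        String bio = "Brad Pitt was born in 1963 in Oklahoma and raised in...";
        int a = d.internBiography(bio);  // "Benjamin Button"
        int b = d.internBiography(bio);  // "Ocean 11" reuses the same entry
        System.out.println(a == b);          // true
        System.out.println(d.storedCount()); // 1
    }
}
```

The trade-off: searching movies by biography words then needs two steps (find matching biography ids, then find movies referencing those ids), exchanging index size for an extra lookup per query.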



  

Distributing a Lucene application?

2011-03-22 Thread sol myr
Hi,
What are my options for distributing an application that uses Lucene?

Our current application works against a database of INVENTORY. We schedule
hourly checks for modified items (timestamp-based), and update a single
Lucene index.
Now we want to distribute our application to a grid, with failover, and a bit
of data sharing:
Say we have 2 branches - New York and Los Angeles.

(1) Inventory of the NY branch is handled by 2 application servers, and 2
database copies. They are exact replicas, for failover/load balancing.
Similarly, the LA branch gets 2 application servers and 2 databases.

(2) 90% of the time, each branch "minds its own business" and isn't
interested in the other branch's inventory.
However on rare occasions, an LA administrator needs to search the NY
inventory (we can compromise on data freshness, e.g. show data 10 hours
old).

Does Lucene have built-in support for any of this?
If I'm to do this "from scratch" I'll probably just let each application
server maintain its own copy of Lucene index (with data only from its own
city, and hourly updates as before).
And for the requirement of "LA admin searching the NY inventory" I'd
schedule a task to copy the NY index into the LA server, every 10 hours.

Is this a reasonable approach? Or are there Lucene-management tools that
would handle it better?
Thanks :)


Re: Distributing a Lucene application?

2011-03-23 Thread sol myr
Thanks :)
Thankfully we don't delete from the database - just mark items as "inactive"
(actual delete occurs only in a yearly cleanup process).
We can live with inaccurate results, including deleted/inactive items.

Have you used DBSight? Would you mind sharing your opinion - did you like it
better than, say, SOLR?
Thanks :)


On Tue, Mar 22, 2011 at 7:06 PM, Chris Lu  wrote:

> Each database having its own index should be fine.
> However, just checking modified timestamp may not be enough, since there
> could be items deleted.
>
> You can check DBSight for this purpose. It can do remote index replication
> across WAN.
> But if the NY index is synchronized before the NY database is, the NY index
> could return results not found in the NY database, correct?
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>
>
>
>


Should I use MultiSearcher?

2011-03-24 Thread sol myr
Hi,

I need to search a Catalog.
Most users search *this* year's catalog, but on rare occasions they may ask
for old products (from previous years).
I'm trying to select between 2 options:

1) Keep one big index for all years (where documents have a "year" field,
so I can filter on the current year, when needed)

2) Keep separate indexes - FSDirectory per year:
FSDirectory.open("c:/index_2009/"),  FSDirectory.open("c:/index_2010/") ...
Most searches will run on the current year's FSDirectory, but if I want old
product I can use MultiSearcher.

Which option sounds better?
The 1st seems easier to code.
But I thought the 2nd might have better performance - especially since most
searches are on the current year.
Moreover, since changes occur only in the current year (old products never
change), I thought the 2nd approach would be easier on the IndexWriter
(especially on heavy actions like "optimize()").

What do you think?
Thanks :)


Re: Distributing a Lucene application?

2011-03-31 Thread sol myr
Thanks very much, sounds great :)

On Thu, Mar 24, 2011 at 9:13 PM, Chris Lu  wrote:

> It's great that the requirement is loose...
> But I suppose users would ask for more later.
>
> Well, I worked on DBSight, which covers more than just search. It also
> includes scheduling indexing, reindexing, and even rendering.
> In your case, you just need to specify a SQL and have index up and running
> in several minutes.
> You can even embed a widget to put search UI to any page.
>
> btw, DBSight also has facet search.
>
>
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
>
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>
>


Performance and index size (rephrased question)

2011-03-31 Thread sol myr
Hi,

I'm rephrasing a previous performance question, in light of new data...
I have a Lucene index of about 0.5 GB.
Currently performance is good - up to 200 milliseconds per search (with
complex boolean queries, but never retrieving more than 200 top results).

The question: how much can the index grow, before there's noticeable
performance degradation?

1) Does anyone please have production experience with, say, 5 GB index? 10
GB?
If so, are there recommendations about merge policy, file size
configuration, etc?
If it degrades, I have other solutions (involving a change in logic), but I
don't want to get into it unless necessary.

2) Also, about 5% of my documents are editable (= the application
occasionally deletes them, and adds a modified document instead).
The other 95% are "immutable" (never deleted/edited).
Can Lucene take advantage of this? E.g. will it be smart enough to keep
changes in a single small file (which needs to be optimized), while the
other files remain unchanged?

Thanks :)


PhraseQuery with huge "slop"?

2011-04-07 Thread sol myr
Hi,

I need to run an "AND" query with a twist: give a higher ranking to an
"exact match".
So when searching for  BIG BEN
- Give high rank for the Document  "BIG BEN is in London"
- Lower rank for  "It's a BIG day for my dear friend BEN"

Following good advice from this list, I combined 2 separate queries
(the query "+BIG +BEN"  and the exact-phrase "\"BIG BEN\"").
But someone suggested an alternative: a PhraseQuery with a very large slop.
Such a slop would cover all appearances of these words in the document
(even far apart), while the PhraseQuery would automatically give a higher
ranking when the words are close together.

Does that make sense?

1) What slop is required if my documents are about 100 words each?
Is it simply slop=100, or would it need to be much larger (like 100!)?

2) Will I get reasonable performance?
Or would the large SLOP cause horrible performance degradation?

Thanks :)




Large index merging/optimization?

2011-06-15 Thread sol myr
Hi,

Our Lucene index grew to about 4 GB.
Unfortunately this brought up a performance problem of slow file merging.
We have:
1. A writer thread: once an hour it looks for modified documents and
updates the Lucene index.
Usually there are only a few modifications, but sometimes we switch the
entire content and re-index everything.

2. The default Lucene Merge thread (ConcurrentMergeScheduler)

Usually it works great. But every few hours the
'ConcurrentMergeScheduler' thread gets stuck for hours (I'm guessing
it has reached the point where it needs to merge large files).
During this, our writer thread is stuck (waiting on a lock), so users
see stale data.

My questions please:

1. Is there any configuration that would either speed up file merging,
or allow the IndexWriter to keep writing simultaneously?

2. And when should I call 'optimize'?
Won't it be another very expensive operation that holds the 'write' lock
and prevents updates?

Thanks:)
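One possibly relevant knob, sketched against the 3.x API (hedged — whether it helps depends on why the merges stall): ConcurrentMergeScheduler's thread count can be raised so one huge merge does not queue up behind the smaller ones.

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;

public class MergeSchedulerSetup {
    // Assumes 'writer' is an already-open IndexWriter.
    public static void configure(IndexWriter writer) {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        // Allow more concurrent merge threads, so a single long-running
        // large merge does not block the small merges behind it.
        cms.setMaxThreadCount(3);
        writer.setMergeScheduler(cms);
    }
}
```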




[Lucene] Frequencies and positions - are they stored per field?

2011-10-04 Thread sol myr


Hi,

I use Lucene, but am not familiar with its internals.
I'd appreciate help understanding whether term frequencies and positions are
stored per Document or per Field?
On the one hand, I never ask for "Field.TermVector", because I read it's only
required for "MoreLikeThis" (which I don't need).
On the other hand, my searches *are* based on fields...

Here's my code:
// Write (without Field.TermVector):

Document doc = new Document();
doc.add(new Field("subject", "Requisition request", Store.YES,
Index.ANALYZED));
doc.add(new Field("body", "Attached is an Urgent requisition request",
Store.YES, Index.ANALYZED));
writer.addDocument(doc);

// And my query:
Query query = parser.parse("subject:urgent");

Now how does Lucene handle this query?
I asked it to search the "subject" field.
But if the "inverted index" didn't keep fields, it would only remember that
"the term 'Urgent' appears in SOME FIELD of document #1"...
Isn't that true?

If so, how does it make sure to retrieve only documents that match in the
subject?

Thanks.
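A toy model in plain Java may help frame the question: if the posting-list key is effectively the pair (field, term), then a term indexed only under "body" can never satisfy a "subject" query. This is a deliberately simplified illustration, not Lucene's actual data structures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyFieldIndex {
    // Maps "field:term" -> sorted list of matching document ids.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    public void add(int docId, String field, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            String key = field + ":" + token;   // the key includes the field
            List<Integer> docs = postings.get(key);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                postings.put(key, docs);
            }
            // Avoid duplicate entries for repeated tokens in one document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    public List<Integer> search(String field, String term) {
        List<Integer> docs = postings.get(field + ":" + term.toLowerCase());
        return docs == null ? new ArrayList<Integer>() : docs;
    }
}
```

With the documents from the question, searching ("subject", "urgent") finds nothing, while ("body", "urgent") matches, because "urgent" was only posted under the body field's key.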





Re: [Lucene] Frequencies and positions - are they stored per field?

2011-10-04 Thread sol myr
Thanks a lot.
But then what's the added value of Field.TermVector?

Can't it be deduced from the overall Lucene index? Or is it just inefficient to 
deduce?

Thanks again :)



- Original Message -
From: Uwe Schindler 
To: java-user@lucene.apache.org; 'sol myr' 
Cc: 
Sent: Tuesday, October 4, 2011 11:53 AM
Subject: RE: [Lucene] Frequencies and positions - are they stored per field?

Lucene always uses a field; a query using a term without a field is
impossible. See each field as a parallel inverted index; all statistics are
per field, too. If you pass a query without a field name to QueryParser, it
will choose the default field that is given when creating the QueryParser.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de







Re: [Lucene] Frequencies and positions - are they stored per field?

2011-10-10 Thread sol myr
Thanks so much, this helped a lot :)



- Original Message -
From: Uwe Schindler 
To: java-user@lucene.apache.org; 'sol myr' 
Cc: 
Sent: Tuesday, October 4, 2011 12:14 PM
Subject: RE: [Lucene] Frequencies and positions - are they stored per field?

Hi,

Term Vectors are in a sense duplicate information. They are used to quickly get,
*per document*, all vectors for *one field*. This means you get the positions,
offsets, and frequencies for the requested document as one blob, like a stored
field, that can be used e.g. for MoreLikeThis or highlighting
(FastVectorHighlighter also needs term vectors).

It's analogous to the difference between indexed fields and stored fields (in
fact the information stored if you enable TermVectors during indexing is
similar to stored fields; see it like a binary stored field containing all
vectors for the corresponding document).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de







Using BlockJoinQuery?

2011-10-11 Thread sol myr
Hi,

I noticed that the new Lucene 3.4 supports "BlockJoinQuery" (allowing a
'join' or 'relation' between documents).
I understand the documented limitations of the feature (nowhere near the power
of an SQL join), but it's still very useful for me :)

My question: does anyone have a code example of how to use this query?
Not just the javadoc, but a concrete working example?

Thanks very much.
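Pending a full working example, here is a hedged sketch of the usual pattern against the 3.4 join module (field names and values are illustrative; the essential points are that children and their parent must be indexed together as one block via addDocuments, with the parent last, and that the parents filter must match exactly the parent documents):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BlockJoinQuery;

public class BlockJoinSketch {
    static void indexBlock(IndexWriter writer) throws Exception {
        List<Document> block = new ArrayList<Document>();
        // Children first...
        Document child = new Document();
        child.add(new Field("skill", "java", Field.Store.YES, Field.Index.ANALYZED));
        block.add(child);
        // ...parent last, marked so the parents filter can recognize it.
        Document parent = new Document();
        parent.add(new Field("type", "parent", Field.Store.YES, Field.Index.NOT_ANALYZED));
        parent.add(new Field("name", "alice", Field.Store.YES, Field.Index.ANALYZED));
        block.add(parent);
        writer.addDocuments(block); // indexed as one contiguous block
    }

    static Query buildQuery() {
        // Filter identifying exactly the parent documents.
        Filter parents = new CachingWrapperFilter(
                new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
        Query childQuery = new TermQuery(new Term("skill", "java"));
        // Matches PARENT documents whose child block matches childQuery.
        return new BlockJoinQuery(childQuery, parents, BlockJoinQuery.ScoreMode.Avg);
    }
}
```

Running the returned query through an IndexSearcher should yield parent documents (e.g. "alice") whose children match the child query.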





performance question - number of documents

2011-10-23 Thread sol myr
Hi,

We've noticed a Lucene performance phenomenon, and would appreciate an
explanation from anyone familiar with Lucene internals
(I know Lucene as a user, but haven't looked under its hood).

We have a Lucene index of about 30 million records.
We ran 2 queries: "AND" and "OR" ("+john +doe" versus "john doe").
The AND query had much better performance (AND takes about 500 millis, while OR 
takes about 2000 millis).

We wondered whether this has anything to do with the number of potential 
matches?
Our AND has only about 5000 matches (5000 documents contain *both* "john" and 
"doe").
Our OR has about 8 million matches (8 million documents contain *either* "john" 
or "doe").


Does this explain the performance difference?
But why would it matter, as long as we take only the top 5 matches
(indexSearcher.search(query, 5))...?
Is there any other explanation?

Thanks :)




Re: performance question - number of documents

2011-10-24 Thread sol myr
Hi,

Thanks for this reply.

Could I please just ask - doesn't Lucene keep the data sorted, at least 
partially (heuristically)?

E.g. if the inverted index says "the word DOE appears in documents #1, #7, #5",
won't Lucene do some smart sorting on this list of documents? Maybe by
frequency, first listing documents that contain many appearances of DOE?

I know ranking considers more subtle factors such as document length, "idf" to
prioritize rare words, etc.
But if there are 8 million documents with the word DOE, and I only asked for
the top 5, I might take a risk and limit the candidates to (say) the 1000 documents
that contain the most appearances of that word, and only among those bother to
calculate the exact ranking...

That's not criticism, I'm no algorithms expert, I just raise the question and 
try to learn...
Insights would be appreciated :)
Thanks again.




- Original Message -
From: Erick Erickson 
To: java-user@lucene.apache.org; sol myr 
Cc: 
Sent: Sunday, October 23, 2011 7:18 PM
Subject: Re: performance question - number of documents

"Why would it matter...top 5 matches" Because Lucene has to calculate
the score of all documents in order to insure that it returns those 5 documents.
What if the very last document scored was the most relevant?
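The mechanics described here — score every candidate, keep only the best k — can be illustrated in plain Java with a bounded min-heap, which is essentially what a top-docs collector does:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {
    /** Returns the k largest scores in descending order, scoring every candidate once. */
    public static List<Float> topK(float[] scores, int k) {
        // Min-heap of size <= k: the root is the weakest of the current top k.
        PriorityQueue<Float> heap = new PriorityQueue<Float>();
        for (float s : scores) {            // every document is scored...
            if (heap.size() < k) {
                heap.add(s);
            } else if (s > heap.peek()) {   // ...but only beat-the-worst survive
                heap.poll();
                heap.add(s);
            }
        }
        List<Float> top = new ArrayList<Float>(heap);
        Collections.sort(top, Collections.reverseOrder());
        return top;
    }
}
```

Note that every element of scores is examined once; the heap only bounds memory and the cost of keeping results ordered, not the number of documents scored — which is the answer to "why would it matter".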

Best
Erick





"AND" Query "under the hood" ?

2011-10-25 Thread sol myr
Hi,

Could I please ask another question regarding Lucene "under the hood" / 
performance.

I wondered how "AND" queries are implemented?
Say we query for "+hello +world".
Would Lucene simply find 2 lists of documents ("documents containing HELLO"
and "documents containing WORLD"), and then intersect them (yielding
documents with both words)?
Or does Lucene do smarter tricks?


And regarding performance, is there any importance to query order ("+hello
+world" as opposed to "+world +hello")?


Thanks :)
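Broadly, yes: for a conjunction, Lucene walks the sorted postings of each term in parallel and keeps doc ids present in all of them (with skip lists to leap ahead, rather than this naive step-by-step merge). A plain-Java sketch of the naive two-list case:

```java
import java.util.ArrayList;
import java.util.List;

public class Intersect {
    /** Intersects two sorted postings lists of doc ids. */
    public static List<Integer> and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {          // doc contains both terms
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {    // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

On clause order: to my knowledge it makes no practical difference, since the conjunction scorer orders its sub-scorers itself (typically rarest term first), regardless of how the query text was written.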

