escaping " characters

2004-04-21 Thread Rosen Marinov
Hi all,
I am using the following query to find an exact-match document with the title: "Abramovich says 
Chelsea win showed Russian character".
In both cases, with and without escaping the " character, and with all the other forms from 
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
I receive these exceptions:


CASE 1 - without escaping:
Query == (TITLE:(["Abramovich says Chelsea win showed Russian character" TO 
"Abramovich says Chelsea win showed Russian character"]))
org.apache.lucene.queryParser.ParseException: Encountered "\""
Was expecting:
"]" ...

CASE 2 - with escaping:
Query == (TITLE:([Abramovich says Chelsea win showed \"Russian character\" TO 
Abramovich says Chelsea win showed \"Russian character\"]))
org.apache.lucene.queryParser.ParseException: Encountered "character"
Was expecting:
"]" ...

Question 1:
Where is the problem? (I am using SimpleAnalyzer; is this correct?)

Question 2:
Is there a smarter way to get the document exactly matching this title? (For the record, my 
titles are unique.)

Please answer both questions.

Best regards!

Rosen
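
For reference, the query-syntax page linked above says special characters (including the quote) can be escaped with a backslash. A minimal helper along those lines might look like the sketch below. It is hypothetical, not part of Lucene's API, and it escapes only single special characters (it does not handle the two-character operators && and ||); note also that, as discussed later in this thread, escaping will not make multi-word range queries parse.

```java
// Hypothetical helper (not part of Lucene's API): prefix each character
// that QueryParser treats specially with a backslash.
public class QueryEscaper {
    // Single-character specials from the query syntax documentation.
    private static final String SPECIALS = "+-!(){}[]^\"~*?:\\";

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```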

Re: Does a RAMDirectory ever need to merge segments... (performance issue)

2004-04-21 Thread Gerard Sychay
I've always wondered about this too.  To put it another way, how does
mergeFactor affect an IndexWriter backed by a RAMDirectory?  Can I set
mergeFactor to the highest possible value (given the machine's RAM) in
order to avoid merging segments?

 Kevin A. Burton [EMAIL PROTECTED] 04/20/04 04:40AM 
I've been benchmarking our indexer to find out if I can squeeze any more 
performance out of it.

I noticed one problem with RAMDirectory... I'm storing documents in 
memory and then writing them to disk every once in a while. ...

IndexWriter.maybeMergeSegments is taking up 5% of total runtime. 
DocumentWriter.addDocument is taking up another 17% of total runtime.

Notice that this doesn't add up to 100% because there are other tasks 
taking up CPU before and after Lucene is called.

Anyway... I don't see why RAMDirectory is trying to merge segments.  Is 
there any way to prevent this?  I could just store them in a big 
ArrayList until I'm ready to write them to a disk index, but I'm not 
sure how efficient this would be.

Anyone run into this before?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Searcher not aware of index changes

2004-04-21 Thread lucene
Hi!

My Searcher instance is not aware of changes to the index. I even create a 
new instance, but it seems only a complete restart helps(?):

indexSearcher = new IndexSearcher(IndexReader.open(index));

Timo




Re: query

2004-04-21 Thread Erik Hatcher
On Apr 21, 2004, at 10:01 AM, Rosen Marinov wrote:
Does this query work: "my name is \"Rosen\""?
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis

Short answer: it depends.

Questions for you to answer:
What field type and analyzer did you use during indexing?  What 
analyzer used with QueryParser?  What does the generated Query.toString 
return?



Re: query

2004-04-21 Thread Rosen Marinov
 Short answer: it depends.
 
 Questions for you to answer:
 What field type and analyzer did you use during indexing?  What 
 analyzer used with QueryParser?  What does the generated Query.toString 
 return?

in both cases SimpleAnalyzer
QueryParser.parse("\"abc\"") throws an exception, and I can't see what
Query.toString returns in this case

what analyzer should I use if I want to execute the following queries:
   simple keyword search (+bush -president, etc.)
   range queries including " characters in the search values

thank you






Re: Searcher not aware of index changes

2004-04-21 Thread Stephane James Vaucher
This is not normal behaviour. Normally using a new IndexSearcher should
reflect the modified state of your index. Could you post a more
informative bit of code?

sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:

 Hi!

 My Searcher instance is not aware of changes to the index. I even create a
 new instance, but it seems only a complete restart helps(?):

 indexSearcher = new IndexSearcher(IndexReader.open(index));

 Timo







Re: Searcher not aware of index changes

2004-04-21 Thread lucene
On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote:
 This is not normal behaviour. Normally using a new IndexSearcher should
 reflect the modified state of your index. Could you post a more
 informative bit of code?

BTW, why can't Lucene take care of this itself?


Well, according to my logging it does create a new instance. I use only one 
instance of SearchFacade:

public class SearchFacade extends Observable
{
    protected class IndexObserver implements Observer
    {
        private final Log log = LogFactory.getLog(getClass());

        public Searcher indexSearcher;

        public IndexObserver()
        {
            newSearcher();  // init
        }

        public void update(Observable o, Object arg)
        {
            log.debug("Index has changed, creating new Searcher");
            newSearcher();
        }

        private void newSearcher()
        {
            try
            {
                indexSearcher = new
                    IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
            }
            catch (IOException e)
            {
                log.error("Could not instantiate searcher: " + e);
            }
        }

        public Searcher getIndexSearcher()
        {
            return indexSearcher;
        }
    }

    private IndexObserver indexObserver;

    public SearchFacade()
    {
        addObserver(indexObserver = new IndexObserver());
    }

    public void createIndex()
    {
        ...
        setChanged();   // index has changed
        notifyObservers();
    }

    public Hits search(String query)
    {
        Searcher searcher = indexObserver.getIndexSearcher();
    }
}




Re: Searcher not aware of index changes

2004-04-21 Thread Stephane James Vaucher
Normally the code should work, if you don't keep references to the
old Searcher (i.e., don't try caching it). Make sure you aren't doing this by
mistake.

For the design of your facade, you could always implement Searchable and
do the delegation to the up-to-date instance of IndexSearcher.

Quick comment: you should call close() on your searcher before removing
the reference. If this causes exceptions in future searches, it would
indicate incorrect caching.
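
The swap-then-close pattern described above can be sketched generically. `SearcherHolder` is a hypothetical wrapper, not a Lucene class; it assumes, as noted, that no caller retains a reference to the old searcher.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical holder: installs a fresh searcher and closes the old one.
// Works for any Closeable, so an IndexSearcher would fit.
public class SearcherHolder<S extends Closeable> {
    private S current;

    public synchronized void swap(S fresh) {
        S old = current;
        current = fresh;
        if (old != null) {
            try {
                old.close();
            } catch (IOException e) {
                // A failure here in a later search would suggest
                // something else was still caching the old searcher.
            }
        }
    }

    public synchronized S get() {
        return current;
    }
}
```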

HTH,
sv

On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:

 On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote:
  This is not normal behaviour. Normally using a new IndexSearcher should
  reflect the modified state of your index. Could you post a more
  informative bit of code?

 BTW Why can't Lucene care for it itself?


 Well, according to my logging it does create a new instance. I use only one
 instance of SearchFacade:

 [quoted SearchFacade code snipped]






Re: Does a RAMDirectory ever need to merge segments... (performance issue)

2004-04-21 Thread Kevin A. Burton
Gerard Sychay wrote:

I've always wondered about this too.  To put it another way, how does
mergeFactor affect an IndexWriter backed by a RAMDirectory?  Can I set
mergeFactor to the highest possible value (given the machine's RAM) in
order to avoid merging segments?
 

Yes... actually I was thinking of increasing these vars on the 
RAMDirectory in the hope of avoiding this CPU overhead.

Also, I think the var you want is minMergeDocs, not mergeFactor.  The only 
problem is that the source to maybeMergeSegments says:

  private final void maybeMergeSegments() throws IOException {
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {

So I guess to prevent this we would have to set minMergeDocs to 
maxMergeDocs+1 ... which makes no sense.  Also, by default maxMergeDocs 
is Integer.MAX_VALUE, so that would have to be changed.

Anyway... I'm still playing with this myself. It might be easier to just 
use an ArrayList of N documents if you know for sure how big your RAM 
dir will grow to.

Kevin
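
The ArrayList idea above can be sketched as a small batching buffer. Everything here is hypothetical: `Doc` stands in for Lucene's Document class, and the flush callback is where a batch would actually be written to the on-disk index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: collect documents in memory and hand off a whole
// batch once a threshold is hit, avoiding any per-add merge work.
public class DocumentBuffer<Doc> {
    private final int capacity;
    private final Consumer<List<Doc>> flusher;  // e.g. writes the batch to disk
    private final List<Doc> pending = new ArrayList<>();

    public DocumentBuffer(int capacity, Consumer<List<Doc>> flusher) {
        this.capacity = capacity;
        this.flusher = flusher;
    }

    public void add(Doc doc) {
        pending.add(doc);
        if (pending.size() >= capacity) {
            flush();
        }
    }

    public void flush() {
        if (!pending.isEmpty()) {
            flusher.accept(new ArrayList<>(pending));  // hand off a copy
            pending.clear();
        }
    }
}
```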

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





What web crawler works best with Lucene?

2004-04-21 Thread Tuan Jean Tee
Has anyone implemented an open source web crawler with Lucene? I have
a dynamic website and am looking at putting in a search tool. Your
advice is very much appreciated.

Thank you.


IMPORTANT -

This email and any attachments are confidential and may be privileged in which case 
neither is intended to be waived. If you have received this message in error, please 
notify us and remove it from your system. It is your responsibility to check any 
attachments for viruses and defects before opening or sending them on. Where 
applicable, liability is limited by the Solicitors Scheme approved under the 
Professional Standards Act 1994 (NSW). Minter Ellison collects personal information to 
provide and market our services. For more information about use, disclosure and 
access, see our privacy policy at www.minterellison.com.





Re: query

2004-04-21 Thread Erik Hatcher
On Apr 21, 2004, at 12:17 PM, Rosen Marinov wrote:
Short answer: it depends.

Questions for you to answer:
What field type and analyzer did you use during indexing?  What
analyzer used with QueryParser?  What does the generated 
Query.toString
return?
in both cases SimpleAnalyzer
QueryParser.parse("\"abc\"") throws an exception and I can't see what
Query.toString returns in this case
This is clean and green for me:

  public void testAbc() throws ParseException {
    Query query = QueryParser.parse("\"abc\"", "field", new SimpleAnalyzer());
    assertEquals("abc", query.toString("field"));
  }

Either you're using an old version of Lucene that is broken in this 
regard (I'm at CVS HEAD) or something else is fishy.

Note that a single term in quotes is optimized into a TermQuery, not a 
quoted PhraseQuery in the assert above.

what analyzer should I use if I want to execute the following queries:
   simple keyword search (+bush -president, etc.)
   range queries including " characters in the search values
Ranges with spaces in them don't work.  RangeQuery is for single-term ranges, 
not phrases that were tokenized.  If you indexed the entire phrase as 
a single term (Field.Keyword), then you could do an API RangeQuery, but 
QueryParser won't be happy.
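
A RangeQuery compares whole indexed terms lexicographically, which is why it suits single terms rather than tokenized phrases. A pure-Java sketch of the inclusive bounds check it effectively performs per term (a hypothetical simplification, not Lucene's actual term enumeration):

```java
// Hypothetical simplification of a RangeQuery-style inclusive bounds
// check: a term is in range if it sorts between lower and upper using
// plain lexicographic string order.
public class TermRange {
    private final String lower;
    private final String upper;

    public TermRange(String lower, String upper) {
        this.lower = lower;
        this.upper = upper;
    }

    public boolean contains(String term) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }
}
```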

QueryParser syntax is documented on the Lucene website if you need 
assistance with the type of syntax it supports.

	Erik



Re: What web crawler works best with Lucene?

2004-04-21 Thread Stephane James Vaucher
How big is the site?

I mostly use an in-house solution, but I've used HttpUnit for web scraping
small sites (because of its high-level API).

Here is a hello world example:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample

For a small/simple site, small modifications to this class could suffice.
IT WILL NOT function on large sites because of memory problems.

For larger sites, there are questions like:

- memory:
For example, spidering all links on every page can lead to visiting too
many links, and keeping all visited links in memory can be problematic.

- noise:
If you index every page on your web site, you might be adding noise to the
search engine. Spider navigation rules can help, e.g. saying that you
should only follow links / index documents of a specific form like
www.mysite.com/news/article.jsp?articleid=xxx

- speed:
Too much speed can be bad: doing 100 hits/sec on a site could hurt it
(especially if you are not the webmaster).
Too little speed can be bad if you want to make sure you quickly get new
pages.

- categorisation:
You might want to separate information in your index. For example, you
might want a user to do a search in the documentation section or in the
press release section. This categorisation can be done by specifying
sections of the site, or by a subsequent analysis of available docs.

- up-to-date information:
You'll want to think about your update schedule, so that when you add a new
page it gets indexed quickly. The same applies when you modify an
existing page: you may want the modification to be detected rapidly.
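
The navigation-rule idea above (only follow links of a specific form) can be sketched as a small URL filter. The pattern used in the test mirrors the hypothetical articleid URL shape from the post; it is an illustration, not a real site.

```java
import java.util.regex.Pattern;

// Hypothetical crawler helper: a spider consults shouldFollow() before
// fetching a link, so only URLs matching a site-specific pattern are
// visited and indexed.
public class LinkFilter {
    private final Pattern allowed;

    public LinkFilter(String regex) {
        this.allowed = Pattern.compile(regex);
    }

    public boolean shouldFollow(String url) {
        return allowed.matcher(url).matches();
    }
}
```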

HTH,
sv

On Thu, 22 Apr 2004, Tuan Jean Tee wrote:

 Have anyone implemented any open source web crawler with Lucene? I have
 a dynamic website and are looking at putting in a search tools. Your
 advice is very much appreciated.

 Thank you.




