Some problems with lucene in searching

2005-06-13 Thread sriram Thota
Hi,

   I am working on Lucene. I saw your suggestion about Lucene in a Google
search. I am facing some problems in searching. Please go through my sample
code and suggest where I have gone wrong.

I will be thankful to you.



This is my sample code:

private static Document createDocument(File fFile) {
    Document document = new Document();
    try {
        // the contents of file.txt become a tokenized, unstored
        // "sFileDoc" field of the document
        document.add(Field.Text("sFileDoc", new FileReader(fFile)));
    } catch (FileNotFoundException fnfe) {
        // ignored
    }
    return document;
}


public static void indexDocs(IndexWriter writer, File file)
        throws IOException {
    String[] files = new String[]{};
    File fFile = new File("file.txt");
    Document doc = new Document();
    int iCount = 0;
    BufferedWriter out = null;
    if (file.canRead()) {
        if (file.isDirectory()) {
            files = file.list();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]));
                    iCount++;
                }
            }
        } else {
            // sFile is a class-level String accumulating the file names
            sFile = sFile + file.getName();
        }
    }
    if (files.length == iCount) {
        try {
            out = new BufferedWriter(new FileWriter(fFile));
            // write the collected file names to file.txt
            out.write(sFile.toLowerCase());
            // close first so the FileReader below sees the flushed contents
            out.close();
            // convert file.txt to a Document object and index it
            doc = createDocument(fFile);
            writer.addDocument(doc);
        } catch (Exception excep) {
            // ignored
        }
    }
}

   Here I am able to write the names of the files into file.txt and to
index file.txt, and before indexing I converted the file names to
lowercase. But I am not able to search by filename, even when I give the
complete filename.
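For reference, a hedged sketch of the usual way to make exact filename
lookups work: Field.Text over a Reader is tokenized and unstored, so a
complete filename never exists in the index as a single term. An
untokenized keyword field does; the "filename" field below is an
illustrative assumption, not from the code above.

// Hedged sketch, not from the original post: index each file name as an
// untokenized Field.Keyword, then look it up with an exact TermQuery.
Document d = new Document();
d.add(Field.Keyword("filename", file.getName().toLowerCase()));
writer.addDocument(d);

// at search time the term must match the indexed value exactly:
Hits hits = searcher.search(
    new TermQuery(new Term("filename", "file.txt")));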

Please help me.


  Sriram.T





Re: How to navigate through indexed terms

2005-06-13 Thread Antoine Brun

Hi,

thanks for the hint.
I guess that the best solution would be to implement a previous() method.
I was wondering if anyone has ever planned on doing this?

Antoine Brun



Of course, you could also add a previous() method into the source and
submit the patch, as the code would be very similar to the next()
functionality, I think.
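Until such a patch exists, previous() can be emulated from outside, since
a TermEnum only walks forward. A minimal sketch against the Lucene 1.4
API; it rescans the field's terms from the start, so it is O(number of
terms) per call:

// Hedged sketch: walk the field's terms from the beginning and remember
// the one seen just before the target; TermEnum itself cannot go back.
static Term previousTerm(IndexReader reader, Term current) throws IOException {
    TermEnum terms = reader.terms(new Term(current.field(), ""));
    Term prev = null;
    try {
        while (terms.term() != null
                && terms.term().field().equals(current.field())
                && terms.term().compareTo(current) < 0) {
            prev = terms.term();
            if (!terms.next()) break;
        }
    } finally {
        terms.close();
    }
    return prev;  // null when current is the first term of its field
}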







Re: Ideas Needed - Finding Duplicate Documents

2005-06-13 Thread Paul Libbrecht

Have you tried comparing TermVectors?
I would expect them, or an adjustment of them, to allow the comparison to
focus on "important terms" (e.g. about 100-200 terms) and then allow
a more reasonable computation.
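For what it's worth, a minimal sketch of that direction, assuming the
sentences were indexed with term vectors enabled (storeTermVector=true)
in a field called "sentence" here; only documents that land in the same
bucket need a detailed pairwise compare:

// Hedged sketch: bucket documents by a signature of their indexed terms.
// TermFreqVector.getTerms() returns the terms in sorted order, so the
// hash of that list is a stable per-sentence signature.
static Map bucketBySignature(IndexReader reader, String field)
        throws IOException {
    Map buckets = new HashMap();  // Integer signature -> List of doc ids
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
        if (reader.isDeleted(doc)) continue;
        TermFreqVector tv = reader.getTermFreqVector(doc, field);
        if (tv == null) continue;
        Integer sig = new Integer(Arrays.asList(tv.getTerms()).hashCode());
        List ids = (List) buckets.get(sig);
        if (ids == null) buckets.put(sig, ids = new ArrayList());
        ids.add(new Integer(doc));
    }
    return buckets;  // buckets with more than one id are duplicate candidates
}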


paul


Le 12 juin 05, à 16:37, Dave Kor a écrit :


Hi,

I would like to poll the community's opinion on good strategies for
identifying duplicate documents in a Lucene index.

You see, I have an index containing roughly 25 million Lucene documents.
My task requires me to work at sentence level, so each Lucene document
actually contains exactly one sentence. The issue I have right now is that
sometimes certain sentences are duplicated, and I'd like to be able to
identify them as a BitSet so that I can filter away these duplicates in my
search.

Obviously the brute force method of pairwise compares would take forever.
I have tried grouping sentences using their hashCode() and then doing a
pairwise compare between sentences that have the same hashCode, but even
with a 1GB heap I ran out of memory after comparing 200k sentences.

Any other ideas?


Regards
Dave Kor.




Hypenated word

2005-06-13 Thread Markus Wiederkehr
Hello,

I work on an application that has to index OCR texts of scanned books.
Naturally, many words occur that are hyphenated across lines.

I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses something like that at http://www.lucenebook.com/.

Thanks in advance,

Markus




TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
Hello lucene-list readers,

first I want to introduce myself a little, because I am new on this list:

I am a programmer at a publishing company, 32 years of age, and you can
find my picture at http://www.idowa.de/service/kontakt.
We publish some local newspapers and a website (http://www.idowa.de)
with the main focus on regional content.

We use Lucene to create an index over the whole newspaper and website
content, so there is more than 2GB of text to index.

And now I will tell you my problems with my implementation[1]:

1. Sorting by date is ruinously slow, so I deactivated it.
2. Because the sorting is so slow, I want to allow the user to specify a
date range. But Lucene throws a BooleanQuery$TooManyClauses[2].
Somewhere I read that if you give Lucene a higher MaxClauseCount, this
will solve the problem. But it doesn't work :-(
3. I also read that we should save the date as a YYYYMMDD string. I don't
like this solution, because I don't know that it will work. And then I
would have to reindex the whole data!

So could you give me a little hint how I can solve my date problems?



[1]
Implementation:

  BooleanQuery query = new BooleanQuery();
  BooleanQuery.setMaxClauseCount(262144);
  // queryString is the user's query string; `query` collects the clauses
  Query q1 = QueryParser.parse(queryString, "content", analyzer);
  query.add(q1, true, false);
  if (area.length() > 2)
  {
    Query q2 = new TermQuery(new Term("bereich", area));
    query.add(q2, true, false);
  }
  try {
    DateFormat df = DateFormat.getDateInstance(
        DateFormat.DATE_FIELD, Locale.GERMAN);
    df.setLenient(true);
    Date d1 = df.parse(date_from);
    Date d2 = df.parse(date_to);
    date_from = DateField.dateToString(d1);
    date_to = DateField.dateToString(d2);
  } catch (Exception e) { }
  Query q3 = new RangeQuery(new Term("datum", date_from),
      new Term("datum", date_to), true);
  query.add(q3, true, false);
  /*Sort csort = new Sort();
  if (sort.length() > 2)
  {
    csort.setSort(sort, reverse);
  }*/
  Hits hits = searcher.search(query);
  //Hits hits = searcher.search(query, csort);
  makeOutput(hits, start, length);
  Date ende = new Date();
  long zeit = (ende.getTime() - anfang.getTime()) / 100;
  ausgabe.append("|" + (float) zeit / 10);



  private void makeOutput(Hits hits, int start, int length)
      throws Exception
  {
    int i = start;
    if (hits.length() > 0)
    {
      ausgabe.append("");
      // the markup strings passed to append() were stripped by the list
      // archiver; the loop walks one page of hits
      for (; (i < start + length) && (i < hits.length()); i++)
      {
        Document doc = hits.doc(i);
        ausgabe.append("");
        ausgabe.append(doc.getField("bereich").stringValue());
        ausgabe.append("");
        DateFormat df = DateFormat.getDateInstance(
            DateFormat.DATE_FIELD, Locale.GERMAN);
        df.setLenient(true);
        ausgabe.append(df.format(
            DateField.stringToDate(doc.getField("datum").stringValue())));
        ausgabe.append("");
        ausgabe.append("");
        ausgabe.append(doc.getField("content_vorschau").stringValue());
        ausgabe.append("");
        ausgabe.append("");
      }
      ausgabe.append("");
    }
    ausgabe.append("|X|" + hits.length() + "|" + start + "|" + i);
  }

__

[2]
StackTrace:

org.apache.lucene.search.BooleanQuery$TooManyClauses
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:71)
at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:99)
at
org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
at
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
at org.apache.lucene.search.Query.weight(Query.java:84)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:117)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
at org.apache.lucene.search.Hits.(Hits.java:51)
at org.apache.lucene.search.Searcher.search(Searcher.java:41)
at suchmaschine.LuceneSearcher.erweitert(LuceneSearcher.java:138)
at suchmaschine.XmlRpcSearcher.erweitert(XmlRpcSearcher.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.xmlrpc.Invoker.execute(Invoker.java:168)
at
org.apache.xmlrpc.XmlRpcWorker.invokeHandler(XmlRpcWorker.java:123)
at org.apache.xmlrpc.XmlRpcWorker.execute(XmlRpcWorker.java:185)
at org.apache.xmlrpc.XmlRpcServer.execute(XmlRpcServer.java:151)
at org.apache.xmlrpc.XmlRpcServer.execute(XmlRpcServer.java:139)
at org.apache.xmlrpc.WebServer$Connection.run(WebServer.java:773)
at org.apache.xmlrpc.WebServer$Runner.run(WebServer.java:656)
at java.lang.Thread.run(Thread.java:595)

__
[3]
My Fields:
  neu.setBoost( boost  );
  neu.add(Field.UnStored("content",content));
  neu.add(Field.Keyword("keyword",keyword));
  ConfDate date = new ConfDate(datum);
  neu.add(Field.Keyword("datum",(Date)date.getUtilDate()));
  neu.add(Field.UnIndexed("content_vorschau",content_vorschau));
  neu.add(Field.UnIndexed("content_id",""+content_id));
  neu.add(Field.UnIndexed("zeitstempel",zeitstempel));
  neu.add(Field.UnIndexed("link",link));
  neu.add(Field.Keyword("bereich",bereich));
  index.addDocument(neu);

Re: TooManyClauses in BooleanQuery

2005-06-13 Thread a . herberger
Hi Harald,

it's nice to see that there are others out there in Germany dealing with
the same problems as we have been over the past years :-)

For the "too many clauses" problem I have a solution that I want to share:
just include, somewhere at the very beginning of your program (retrieval
part), the call:

BooleanQuery.setMaxClauseCount(1000*1000);

We have had similar problems (it also applies to searches with left
truncation: *word) and could work around this quite well by increasing
this setting.

Regarding the sorting, we have also implemented our own class (at the time
there was no sorting support in Lucene), but this was very application
specific and we had to limit it to about 5000 sorted hits due to speed
limitations. I can give you more information on this if you want.

Hope I have been of some help.
Best regards from Wiesbaden

Andreas M. Herberger
mailto: [EMAIL PROTECTED]
http://www.makrolog.de







Updating documents

2005-06-13 Thread Markus Wiederkehr
Hi all,

I would like to update a document as follows.

1) retrieve the document from an IndexReader/Searcher
2) delete the document
3) manipulate the document, that is remove and add fields
4) save the document using an IndexWriter

When I do this all fields that were indexed and/or tokenized but not
stored get lost.

So is there any way to preserve fields that were not stored?
Reconstructing these fields is too expensive in my application.
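For reference, a minimal sketch of the four steps above against the
Lucene 1.4 API, showing exactly where the information is lost:
IndexReader.document() can only return stored fields. The "status" field
is an illustrative assumption.

// Hedged sketch of delete-modify-readd; indexed-but-unstored fields do
// not survive step 1, since only stored values can be read back.
IndexReader reader = IndexReader.open(directory);
Document doc = reader.document(docNum);   // 1) only stored fields come back
reader.delete(docNum);                    // 2) delete the old version
reader.close();

doc.removeField("status");                // 3) remove and add fields
doc.add(Field.Keyword("status", "updated"));

IndexWriter writer = new IndexWriter(directory, analyzer, false);
writer.addDocument(doc);                  // 4) re-add; unstored fields are gone
writer.close();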

Thanks in advance,

Markus




Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Erik Hatcher


On Jun 13, 2005, at 7:47 AM, Harald Stowasser wrote:

1. Sorting by Date is ruinously slow. So I deactivated it.


How were you sorting by date?

3. I also read that we should save the date as a YYYYMMDD string. I don't
like this solution, because I don't know that it will work. And then I
would have to reindex the whole data!


It will work :)  Terms need to be lexicographically orderable - and
using YYYYMMDD will do just that, as long as you don't need granularity
beyond day.  However, before reindexing with YYYYMMDD - what are your
searching/sorting needs?  If day is the granularity, then YYYYMMDD will
be fine.  However, you may want to break it into more fields such as
year, month, and day separately.  Note: keep numbers padded to the same
number of characters ("1" for a day field should be "01", for example).

For sorting, you may find that once you've used YYYYMMDD you can then
sort with the field type as INT on that same field (use Field.Keyword
for indexing).



[3]
My Fields:
  neu.setBoost( boost  );
  neu.add(Field.UnStored("content",content));
  neu.add(Field.Keyword("keyword",keyword));
  ConfDate date = new ConfDate(datum);
  neu.add(Field.Keyword("datum",(Date)date.getUtilDate()));
  neu.add(Field.UnIndexed("content_vorschau",content_vorschau));
  neu.add(Field.UnIndexed("content_id",""+content_id));
  neu.add(Field.UnIndexed("zeitstempel",zeitstempel));
  neu.add(Field.UnIndexed("link",link));
  neu.add(Field.Keyword("bereich",bereich));
  index.addDocument(neu);


What kind of granularity for dates does ConfDate.getUtilDate() return?

Using Date for Field.Keyword indexes to the millisecond granularity -  
that is very unlikely to be of use to you at that level.


Erik





OutOfMemory when indexing

2005-06-13 Thread Stanislav Jordanov

Hi guys,
Building some huge index (about 500,000 docs totaling to 10megs of plain 
text) we've run into the following problem:
Most of the time the IndexWriter process consumes a fairly small amount 
of memory (about 32 megs).
However, as the index size grows, the memory usage sporadically bursts 
to levels of (say) 1000 gigs and then falls back to its level.
The problem is that unless the process is started with some option like 
-Xmx1000m this situation causes an OutOfMemoryException which terminates 
the indexing process.


My question is - is there a way to avoid it?

Regards
Stanislav




Re: Hypenated word

2005-06-13 Thread Erik Hatcher


On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:

I work on an application that has to index OCR texts of scanned books.
Naturally there occur many words that are hyphenated across lines.

I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses something like that at http://www.lucenebook.com/.


Markus - you're right, I did develop something to handle hyphenated
words for lucenebook.com.  It was sort of a hack in that I had to
build in a static list of exceptions in how I handled this, so you'll
likely have to use caution as well.  The LiaAnalyzer is this:


  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenFilter filter = new DashSplitterFilter(
        new HyphenatedFilter(
            new DashDashFilter(
                new LiaTokenizer(reader))));

    filter = new LengthFilter(3, filter);
    filter = new StopFilter(filter, stopSet);

    if (stem) {
      filter = new SnowballFilter(filter, "English");
    }

    return filter;
  }


And my HyphenatedFilter is this:

public class HyphenatedFilter extends TokenFilter {
  private HashMap exceptions = new HashMap();

  private static final String[] EXCEPTION_LIST = {
     "full-text", "information-retrieval", "license-code", "old-fashioned",
     "well-designed", "free-form", "file-based", "ramdirectory-based",
     "ram-based", "index-modifying", "read-only",
     "top-scoring", "most-recently-used", "queryparser-parsed",
     "in-order", "per-document", "lower-caser", "domain-specific",
     "high-level", "utf-encoding", "non-english", "phraseprefix-it",
     "all-inclusive", "date-range", "computation-intensive",
     "hits-returning", "lower-level", "number-padding", "utf-address-book",
     "third-party", "plain-text", "google-like", "re-add",
     "english-specific", "file-handling", "already-created", "d-add",
     "d-add", "hits-length", "hits-doc", "hits-score", "d-get",
     "writer-new", "porteranalyzer-new", "writer-set", "document-new",
     "doc-add", "field-keyword", "field-unstored", "writer-add",
     "writer-optimize", "queryparser-new", "porteranalyzer-new",
     "parser-parse", "indexsearcher-new", "hitcollector-new",
     "searcher-doc", "searcher-search", "jakarta-lucene", "www-ibm",
     "java-specific", "non-java", "vis--vis", "medium-sized",
     "browser-based", "utf-before", "concept-based", "natural-language",
     "queue-based", "high-likelihood", "slp-or", "noisy-channel",
     "al-rasheed", "hands-free", "top-notch", "google-esque",
     "search-config", "java-related", "lucene-so", "lucene-tar",
     "lucene-jar", "lucene-demos-jar", "lucene-web", "lucene-webindex",
     "command-line", "lucene-version", "issue-tracking"
  };

  protected HyphenatedFilter(TokenStream tokenStream) {
    super(tokenStream);

    for (int i = 0; i < EXCEPTION_LIST.length; i++) {
      exceptions.put(EXCEPTION_LIST[i], "");
    }
  }

  private Token savedToken;

  public Token next() throws IOException {

    if (savedToken != null) {
      Token token = savedToken;
      savedToken = null;
      return token;
    }

    Token firstToken = input.next();

    if (firstToken == null)
      return firstToken;

    if (firstToken.termText().endsWith("-")) {
      String firstPart;
      firstPart = firstToken.termText();

      // consume next token
      Token secondToken = input.next();
      if (secondToken == null)
        return firstToken;

      String termText = firstPart.substring(0, firstPart.length() - 1)
          + secondToken.termText();

      if (exceptions.containsKey(firstPart + secondToken.termText())) {
        savedToken = secondToken;
        return firstToken;
      }

      return new Token(termText, firstToken.startOffset(),
          firstToken.endOffset() + secondToken.termText().length() + 1);
    }

    return firstToken;
  }
}

Not all that pretty, I'm afraid, but by all means use it if its useful.

Erik





Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
[EMAIL PROTECTED] schrieb:

> Hi Harald,
> 
> its nice too see, that there are others out there in Germany dealing with 
> the same problems as we have been doing in the past years :-)
> 
> So for the "too many clauses" problem I have a solution for you, that I 
> want to share:
> Just include somewhere at the very beginning of your program (retrieval 
> part) the call:
> 
> BooleanQuery.setMaxClauseCount(1000*1000);

As you can see in the source code, I tried this already:
  BooleanQuery.setMaxClauseCount(262144);
It doesn't even work with higher values; it just crashes with a
not-enough-memory error :-(




Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Harald Stowasser
Harald Stowasser schrieb:

P.S.
I have now tried using DateFilter. This works, but it is very slow on
longer date ranges (30 sec.).






Re: OutOfMemory when indexing

2005-06-13 Thread Markus Wiederkehr
I am not an expert, but maybe the occasionally high memory usage is
because Lucene is merging multiple index segments together.

Maybe it would help if you set maxMergeDocs to 10,000 or something. In
your case that would mean that the minimum number of index segments
would be 50.

But again, this may be completely wrong...

Markus
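In Lucene 1.4 those knobs are public fields on IndexWriter; a hedged
sketch with illustrative values:

// Hedged sketch: bound segment merges so that a single merge never has
// to pull an unbounded number of documents together at once.
IndexWriter writer = new IndexWriter(directory, analyzer, true);
writer.maxMergeDocs = 10000;  // never merge segments beyond 10k docs
writer.mergeFactor  = 10;     // the default; lower values merge smaller batches
writer.minMergeDocs = 100;    // docs buffered in RAM before a segment is flushed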

On 6/13/05, Stanislav Jordanov <[EMAIL PROTECTED]> wrote:
> High guys,
> Building some huge index (about 500,000 docs totaling to 10megs of plain
> text) we've run into the following problem:
> Most of the time the IndexWriter process consumes a fairly small amount
> of memory (about 32 megs).
> However, as the index size grows, the memory usage sporadically bursts
> to levels of (say) 1000 gigs and then falls back to its level.
> The problem is that unless te process is started with some option like
> -Xmx1000m this situation causes an OutOfMemoryException which terminates
> the indexing process.
> 
> My question is - is there a way to avoid it?
> 
> Regards
> Stanislav




RE: TooManyClauses in BooleanQuery

2005-06-13 Thread Omar Didi
if you get an OutOfMemoryException, I believe the only thing you can do
is just increase the JVM heap to a larger size.

-Original Message-
From: Harald Stowasser [mailto:[EMAIL PROTECTED]
Sent: Monday, June 13, 2005 8:28 AM
To: java-user@lucene.apache.org
Subject: Re: TooManyClauses in BooleanQuery


[EMAIL PROTECTED] schrieb:

> Hi Harald,
> 
> its nice too see, that there are others out there in Germany dealing
with 
> the same problems as we have been doing in the past years :-)
> 
> So for the "too many clauses" problem I have a solution for you, that
I 
> want to share:
> Just include somewhere at the very beginning of your program
(retrieval 
> part) the call:
> 
> BooleanQuery.setMaxClauseCount(1000*1000);

As you can see in the source code, I tried this already:
  query.setMaxClauseCount(262144);
It even don't work with higher values, it just crashed with Not enough
Memory -Error :-(




Re: Hypenated word

2005-06-13 Thread Markus Wiederkehr
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!

Markus


-- 
Always remember you're unique. Just like everyone else.




Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>

I expect you'll need to do some pre-processing. Read in your text into a 
buffer, line-by-line. If a given line ends with a hyphen, you can manipulate 
the buffer to merge the hyphenated tokens.

Andy





Re: OutOfMemory when indexing

2005-06-13 Thread Harald Stowasser
Stanislav Jordanov schrieb:

> High guys,
> Building some huge index (about 500,000 docs totaling to 10megs of plain
> text) we've run into the following problem:
> Most of the time the IndexWriter process consumes a fairly small amount
> of memory (about 32 megs).
> However, as the index size grows, the memory usage sporadically bursts
> to levels of (say) 1000 gigs and then falls back to its level.
> The problem is that unless te process is started with some option like
> -Xmx1000m this situation causes an OutOfMemoryException which terminates
> the indexing process.
> 
> My question is - is there a way to avoid it?


1.
I start my program with:
java -Xms256M -Xmx512M -jar Suchmaschine.jar &

This protects me from OutOfMemoryExceptions, together with the iterative
subroutines mentioned below.

2.
Free your variables as soon as possible,
e.g. "term=null;"
This will help your garbage collector!

3.
Maybe you should watch totalMemory() and freeMemory() from
Runtime.getRuntime().
That will help you to find the "memory dissipater".

4.
I had the problem when deleting documents from the index. I used a
subroutine to delete single documents.
It runs much better since I replaced it with an "iterative" subroutine
like this:

  public int deleteMany(String keywords)
  {
int anzahl=0;
try
{
  openReader();
  String[] temp = keywords.split(",");
  //Runtime R = Runtime.getRuntime();
  for (int i = 0 ; i < temp.length ; i++)
  {
Term term =new Term("keyword",temp[i]);
anzahl+= mReader.delete(term);
term=null;
/*System.out.println("deleted " + temp[i]
   +" t:"+R.totalMemory()
   +" f:"+R.freeMemory()
   +" m"+R.maxMemory());
*/
  }
  close();
} catch (Exception e){
  cIdowa.error( "Could not delete Documents:" + keywords
+". Because:"+ e.getMessage() + "\n" +e.toString() );
}
return anzahl;
  }







Re: TooManyClauses in BooleanQuery

2005-06-13 Thread Erik Hatcher


On Jun 13, 2005, at 8:44 AM, Harald Stowasser wrote:


Harald Stowasser schrieb:

P.S.
I tried now to use DateFilter. This works, but is very slow on longer
Date-Ranges. (30sec. )


Filters in general were meant for one-time creation and caching.  If
the date ranges are fixed and the index is not being updated, then
DateFilters will work fine, as you only create each filter once.  If
the index updates, that's ok, as you can simply reinstantiate the
filters when that occurs.
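A minimal sketch of that reuse, with Harald's "datum" field and an
assumed German date format (wrap the parse calls in a try/catch as in the
code earlier in the thread):

// Hedged sketch: build the DateFilter once and hand the same instance to
// every search; recreate it only when the index (or the range) changes.
DateFormat df = new SimpleDateFormat("dd.MM.yyyy");
Filter junFilter = new DateFilter("datum",
    df.parse("01.06.2005"), df.parse("30.06.2005"));

// reuse junFilter across queries against the same searcher:
Hits hits = searcher.search(query, junFilter);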


My recommendation is for you to consider using YYYYMMDD format for
your dates to begin with, but I'd like to hear more about the range of
dates that you're indexing and what kind of ranges you need to
accommodate for searching.


Erik






Re: Hypenated word

2005-06-13 Thread Markus Wiederkehr
On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > I see, the list of exceptions makes this a lot more complicated than I
> > thought... Thanks a lot, Erik!
> >
> 
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.

As Erik wrote it is not that simple, unfortunately. For example, if
one line ends with "read-" and the next line begins with "only" the
correct word is "read-only" not "readonly". Whereas "work-" and "ing"
should of course be merged into "working".

Markus




Displaying relevant text with Search results

2005-06-13 Thread Kadlabalu, Hareesh
Hi,

I have a simple index with one default field that is stored and indexed. I
want to display the query results along with some relevant text from the
default field, the way search is implemented at http://www.lucenebook.com/.

For example, searching for 'wonderful'
(http://www.lucenebook.com/search?query=Wonderful) generates results that
have highlighting on relevant words in the result.

One way to implement this would be to get documents from the search result
and physically parse the contents of the default field for the occurrence
of the search word or one of its synonyms (Wonderful: wonder,
wonderfully...). Then display a few words before and after a match for
contextual information.

However, in order to really do it correctly, one needs to get to the 'best'
part of the field's text where the density of searched word(s) is highest.
This could be a very expensive process. Does Lucene give any help in
achieving this?

Thanks

-Hareesh



Re: Hypenated word

2005-06-13 Thread Peter A. Friend


On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:


I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!


There is a section about the problems that hyphens create in  
"Foundations of Statistical Natural Language Processing". Not only  
are the cases numerous, but seemingly simple rules such as joining  
hyphenated forms at the ends of lines does not always work. Sometimes  
the hyphen was added to break the word, sometimes you are already  
dealing with a hyphenated form that just happened to occur at the end  
of a line, so the hyphen serves two purposes. I've toyed with the  
idea of indexing hyphenated words in their raw as well as split  
forms, but I think that would wreak havoc on the word position stuff,  
as well as bloat the index with potentially meaningless gibberish.


Peter





Indexes auto creation

2005-06-13 Thread Stephane Bailliez
I have a very stupid question that puzzles me so far in the API. (I'm
using Lucene 1.4.3)

There is a boolean flag on the creation of the Directory which is
basically: use it as is, or delete the storage area.

Same for the index: the IndexWriter uses a flag 'use the existing one or
create a new one'.

If you're creating an IndexWriter with 'create' set to false, it could
blow up with an IOException because the index does not exist.
But it could also blow up for other reasons with an IOException, which
does not help much in identifying the source of the problem.

What I would like is something like: if the index does not exist,
then create one for me, otherwise use it.

I could do that with something like

try {
   writer = new IndexWriter(directory, analyzer, false);
} catch (IOException e){
   writer = new IndexWriter(directory, analyzer, true);
}

but this is not exactly right, and I could possibly delete an existing
index if an IOException happens which is not due to a non-existing index.

Apparently a way to check that there is an existing index would be (based
on the Lucene source code) to do something like:

try {
   writer = new IndexWriter(directory, analyzer, false);
} catch (IOException e){
   if ( !directory.fileExists("segments") ) {
   // the index really does not exist, so create it
   writer = new IndexWriter(directory, analyzer, true);
   } else {
   throw e;
   }
}

Is this correct, or is there something even simpler that I'm missing?

Ideally I would have liked a subclassed IOException on the IndexWriter
to differentiate the cases (like FileNotFoundException for example) but
maybe I'm missing some trivial thing.






RE: Indexes auto creation

2005-06-13 Thread Kadlabalu, Hareesh
I ran into a related problem; when I create an IndexWriter with an
FSDirectory created with create=true, an existing index would somehow get
corrupted (Luke would come back with a message saying that the index is
corrupt). IndexWriter will tell you that it has 0 documents at that stage,
even though the index had several documents prior to creating this instance
of IndexWriter. It is interesting that the sizes of the index files remain
the same. It seems that creating an IndexWriter with an FSDirectory with
create=true on an existing index somehow corrupts the index.

I figured there must be a bug here (I am using 1.4.3); has anyone run into
this? I had to fall back on an inelegant solution: first check if a
directory exists, if not create a dummy index and close it. After that,
always create the IndexWriter with create=false.

Thanks
-Hareesh







Out of Memory (correction)

2005-06-13 Thread Stanislav Jordanov
A small correction to my last letter: "1000gigs" should be "1000 megs" 
(sorry)

Here's the corrected version:

Hi guys,
Building some huge index (about 500,000 docs totaling to 10megs of plain 
text) we've run into the following problem:
Most of the time the IndexWriter process consumes a fairly small amount 
of memory (about 32 megs).
However, as the index size grows, the memory usage sporadically bursts 
to levels of (say) 1000 megs and then falls back to its level.
The problem is that unless the process is started with some option like 
-Xmx1000m this situation causes an OutOfMemoryException which terminates 
the indexing process.


My question is - is there a way to avoid it?

Regards
Stanislav




Re: Indexes auto creation

2005-06-13 Thread Stephane Bailliez

Stephane Bailliez wrote:
[...]

try {
   writer = new IndexWriter(directory, analyzer, false)
} catch (IOException e){
writer = new IndexWriter(directory, analyzer, true);
}


On a related note, the code above does not work if the index does not 
exist because of the lock created by the first IndexWriter that is still 
lying around.






Re: Indexes auto creation

2005-06-13 Thread Luke Francl
You may want to try using IndexReader's indexExists family of methods.
They will tell you whether or not an index is there.

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(org.apache.lucene.store.Directory)






Re: Indexes auto creation

2005-06-13 Thread Volodymyr Bychkoviak

hello

I'm using the following code at the startup of my program:


   String indexDirectory = //some init
   try {
     if ( !IndexReader.indexExists(indexDirectory)) {
       // working index doesn't exist, so try to create a dummy index
       IndexWriter iw = new IndexWriter(indexDirectory,
           new StandardAnalyzer(), true);
       iw.close();
     } else {
       IndexReader.unlock(FSDirectory.getDirectory(indexDirectory, false));
     }
   } catch (IOException e) {
     // exception happened when trying to unlock the working index
   }

regards,
Volodymyr Bychkoviak





RE: Displaying relevant text with Search results

2005-06-13 Thread Pasha Bizhan
Hi, 

> From: Kadlabalu, Hareesh [mailto:[EMAIL PROTECTED] 

> However, in order to really do it correctly, one needs to get 
> to the 'best'
> part field's text where the density of searched word(s) is 
> highest. This could be a very expensive process. Does Lucene 
> give any help is achieving this? 

You need the Highlighter package.
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/highlighter/src/
java/org/apache/lucene/search/highlight/package.html?view=markup

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/highlighter/
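A minimal sketch of basic Highlighter usage; the "contents" field name
and the fragment sizes are illustrative assumptions:

// Hedged sketch: score fragments against the query and keep the best few.
Query query = QueryParser.parse("wonderful", "contents", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(50));

String text = hits.doc(i).get("contents");   // the field must be stored
TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
String snippet = highlighter.getBestFragments(tokens, text, 3, " ... ");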

Pasha Bizhan







RE: Indexes auto creation

2005-06-13 Thread Pasha Bizhan
Hi, 

> From: news [mailto:[EMAIL PROTECTED] On Behalf Of Stephane Bailliez
> 
> What I would like to is something like: if the index does not 
> exist, then create one for me, otherwise use it.

Look at IndexReader.indexExists method.

Your code will be like this:

boolean createIndex = !IndexReader.indexExists(directory);
writer = new IndexWriter(directory, analyzer, createIndex);

Pasha Bizhan
http://lucenedotnet.com 








Re: Displaying relevant text with Search results

2005-06-13 Thread Erik Hatcher


On Jun 13, 2005, at 10:58 AM, Kadlabalu, Hareesh wrote:

However, in order to really do it correctly, one needs to get to the
'best' part of the field's text where the density of searched word(s) is
highest. This could be a very expensive process. Does Lucene give any
help in achieving this?


The Highlighter (under contrib in the Lucene Subversion repository)
does a bit of finding the best fragments - you can customize this
aspect of it.  Check the source code and test cases for more details
on how to customize this sort of thing.


Erik





Re: Hypenated word

2005-06-13 Thread Erik Hatcher


On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:

On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:

I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!

I expect you'll need to do some pre-processing. Read in your text into a
buffer, line-by-line. If a given line ends with a hyphen, you can
manipulate the buffer to merge the hyphenated tokens.


The problem I encountered when indexing "Lucene in Action" was that I
couldn't just blindly concatenate two tokens because the first ends
with a hyphen.  Some lines ended with a hyphen because it was a dash,
not a hyphenated word.

I'm sure other more clever implementations could do this better, by
looking up the concatenated word in a dictionary for instance.


Erik





Re: Indexes auto creation

2005-06-13 Thread Stephane Bailliez

Luke Francl wrote:

You may want to try using IndexReader's indexExists family of methods.
They will tell you whether or not an index is there.

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(org.apache.lucene.store.Directory)


Good grief ! I missed that one.

Thanks to all who have replied.





Pros/Cons of a split index over a single large index

2005-06-13 Thread Aalap Parikh
Hi,

I just have a general question: what are the pros and cons
of a split index (a number of small indexes) as opposed
to a single large index?

As I have repeatedly seen in various posts in this group, people have
opted for split indexes in cases where they have a large number of
documents (say > 1 million) in the index, and in such cases updating
and merging of indexes becomes costly in terms of performance. I have
also had the same experience. So I was wondering if splitting the index
into a number of smaller indexes would improve the performance of
index-modifying operations, while at the same time NOT
SACRIFICING the index search operation.
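On the search side, Lucene 1.4 can already present several small indexes
as one via MultiSearcher, so splitting need not change the query code; a
hedged sketch with illustrative index paths:

// Hedged sketch: update each small index independently, search them as one.
Searchable[] parts = {
    new IndexSearcher("/indexes/part1"),
    new IndexSearcher("/indexes/part2"),
    new IndexSearcher("/indexes/part3")
};
Searcher searcher = new MultiSearcher(parts);
Hits hits = searcher.search(query);   // results merged and ranked together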

Any ideas on this?

Thanks,
Aalap.




Re: OutOfMemory when indexing

2005-06-13 Thread Gusenbauer Stefan
Harald Stowasser wrote:

>Stanislav Jordanov schrieb:
>
>  
>
>>High guys,
>>Building some huge index (about 500,000 docs totaling to 10megs of plain
>>text) we've run into the following problem:
>>Most of the time the IndexWriter process consumes a fairly small amount
>>of memory (about 32 megs).
>>However, as the index size grows, the memory usage sporadically bursts
>>to levels of (say) 1000 gigs and then falls back to its level.
>>The problem is that unless te process is started with some option like
>>-Xmx1000m this situation causes an OutOfMemoryException which terminates
>>the indexing process.
>>
>>My question is - is there a way to avoid it?
>>
>>
>
>
>1.
>I start my programm with:
>java -Xms256M -Xmx512M -jar Suchmaschine.jar &
>
>This protect me now from OutOfMemoryException. After I use
>iterative-subroutines.
>
>2.
>Free your variables as soon as possible.
>like "term=null;"
>This will help your Garbage-Collector!
>
>3.
>Maybe you should watch totalMemory and R.freeMemory() from
>Runtime.getRuntime()
>That will help you to find the "Memory-dissipater"
>
>4.
>I had the problem when deleting Documents from Index. I used a
>Subroutine to delete single Documents.
>It runs much better when I replaced it into a "iterative" subroutine
>like this:
>
>  public int deleteMany(String keywords)
>  {
>int anzahl=0;
>try
>{
>  openReader();
>  String[] temp = keywords.split(",");
>  //Runtime R = Runtime.getRuntime();
>  for (int i = 0 ; i < temp.length ; i++)
>  {
>Term term =new Term("keyword",temp[i]);
>anzahl+= mReader.delete(term);
>term=null;
>/*System.out.println("deleted " + temp[i]
>   +" t:"+R.totalMemory()
>   +" f:"+R.freeMemory()
>   +" m"+R.maxMemory());
>*/
>  }
>  close();
>} catch (Exception e){
>  cIdowa.error( "Could not delete Documents:" + keywords
>+". Because:"+ e.getMessage() + "\n" +e.toString() );
>}
>return anzahl;
>  }
>
>
>
>  
>
A few weeks ago I had a similar problem. I will describe my problem
and the solution for it:
I'm indexing docs, and every parsed document is stored in an ArrayList.
This solution worked for small directories with a small number of
files in them, but when things grow you're in trouble.
My solution was: whenever I am about to run out of memory, I "save" the
documents. I open the IndexWriter and write every document from the
ArrayList to the index. Then I set the ArrayList and some other stuff to
null and try to invoke the garbage collector. Then I do some
reinitializing and continue indexing.
Looks easy, but it wasn't. How do I check if I will run out of memory?
The Runtime class and its methods for getting information about free
memory were very unreliable.
Therefore I changed to Java 1.5 and implemented a memory notification
listener, which is supported by the java.lang.management package. There
you can adjust a threshold at which you should be informed. After the
notification I perform a "save".
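A minimal sketch of that Java 1.5 mechanism; the 80% threshold and the
flushBufferedDocuments() hook are illustrative assumptions (uses
java.lang.management and javax.management):

// Hedged sketch: arm a usage threshold on the memory pools and flush the
// buffered documents when the JVM posts a threshold-exceeded notification.
NotificationEmitter emitter =
    (NotificationEmitter) ManagementFactory.getMemoryMXBean();
emitter.addNotificationListener(new NotificationListener() {
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED
                .equals(n.getType())) {
            flushBufferedDocuments();  // assumed application hook
        }
    }
}, null, null);

for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
    if (pool.isUsageThresholdSupported()) {
        pool.setUsageThreshold((long) (pool.getUsage().getMax() * 0.8));
    }
}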

Hope this will help you
Stefan




Re: Mobile Lucene

2005-06-13 Thread Dan Funk
I have a Sharp Zaurus SL-C3000 running J2ME - I was able to use the
current Lucene without modification.


christopher may wrote:

Hey all, I am working on a project that requires a search engine on an
embedded Linux that is also Bluetooth capable. Is there a Lucene mobile,
or can I recompile the code in the J2ME wireless toolkit? Any help would
be appreciated. Thanks








--
Dan Funk
Software Engineer

Information Technology Solutions
Battelle Charlottesville Operations
1000 Research Park Boulevard, Suite 105
Charlottesville, Virginia 22911

434.984.0951 x244
434.984.0947 (fax)
[EMAIL PROTECTED]






Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you could do some crude checking against a dictionary. Combine the
word anyway and check if it's in the dictionary. If so, keep it merged;
otherwise it's a compound, so revert back to the hyphenated form.

Word lists come as part of all good OSS dictionary projects, as well as
other language resources, like the BNC word lists etc.
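A small sketch of that check; the dictionary Set is an assumed,
pre-loaded word list:

// Hedged sketch: merge the two halves only when the word list knows the
// merged form; otherwise keep the hyphen and treat it as a compound.
static String rejoin(String firstPart, String secondPart, Set dictionary) {
    // firstPart ends with "-", e.g. "work-" + "ing" or "read-" + "only"
    String merged = firstPart.substring(0, firstPart.length() - 1) + secondPart;
    if (dictionary.contains(merged.toLowerCase())) {
        return merged;                 // "working"
    }
    return firstPart + secondPart;     // "read-only"
}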

Andy




Determining the IDF while searching for documents

2005-06-13 Thread Barbara Krausz

Hi all,

is it possible to determine the IDF (the number of documents in which a
term appears) while searching for documents? I implemented an index based
on trigrams, i.e. the index terms are now strings of 3 characters, so that
my search engine finds documents with OCR errors. When I'm searching for
the term "rainstorm", for example, I split it up into the trigrams __r,
_ra, rai, ain, ins...
First I look for documents which contain at least 8 of the 11 trigrams
of "rainstorm" (the misspelled "ranstorm" contains 8 of the 11
trigrams), then I check if the trigrams form a term like "rainstorm". In
order to compute the TF I count the occurrences of terms which are
similar to the term. But I have problems computing the IDF, because I
must know the number of documents in which the term appears before
searching for the documents (in the method sumOfSquaredWeights() in my
Weight). I used hsqldb during indexing and saved the number of documents
for each term. But it's really slow.
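For concreteness, a small sketch of that expansion; the underscore
padding is taken from the example above:

// Hedged sketch: pad with two placeholder characters on each side and emit
// every 3-character window; "rainstorm" (9 letters) yields 11 trigrams:
//   __r, _ra, rai, ain, ins, nst, sto, tor, orm, rm_, m__
static List trigrams(String term) {
    String padded = "__" + term + "__";
    List grams = new ArrayList();
    for (int i = 0; i + 3 <= padded.length(); i++) {
        grams.add(padded.substring(i, i + 3));
    }
    return grams;
}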
My question is the following: when I'm searching for documents which
contain terms similar to the search term, I actually get the number of
documents that contain the term. But I need the IDF before searching
these documents, for example for BooleanQueries, which need the IDF to
normalize the query vector. Can I solve this problem, i.e. can I
determine the IDF later and normalize the BooleanQuery?


Thanks
Barbara




Re: Indexes auto creation

2005-06-13 Thread Daniel Naber
On Monday 13 June 2005 18:37, Kadlabalu, Hareesh wrote:

> I ran into a related problem; when I create an IndexWriter with a
> FSDirectory created with create=true, an existing index would somehow
> get corrupted

Well, it doesn't get corrupted, it gets deleted. That's what create=true is 
supposed to do, isn't it?
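
A minimal sketch of the two modes, using the Lucene 1.4-era constructor (the 
index path and choice of analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CreateFlagDemo {
    public static void main(String[] args) throws Exception {
        // create=true starts a brand-new index, deleting whatever
        // already lives in that directory
        IndexWriter fresh = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        fresh.close();

        // create=false opens the existing index and appends to it
        IndexWriter append = new IndexWriter("/tmp/index", new StandardAnalyzer(), false);
        append.close();
    }
}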

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Mobile Lucene

2005-06-13 Thread christopher may


 What are you running as far as the OS? And thanks for the response.


From: Dan Funk <[EMAIL PROTECTED]>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Mobile Lucene
Date: Mon, 13 Jun 2005 15:10:46 -0400

I have a Sharp Zaurus SL-C3000 running J2ME - I was able to use the current 
Lucene without modification.


christopher may wrote:

Hey all, I am working on a project that requires a search engine on an 
embedded Linux device that is also Bluetooth capable. Is there a mobile Lucene, 
or can I recompile the code in the J2ME wireless toolkit? Any help would be 
appreciated. Thanks




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
Dan Funk
Software Engineer

Information Technology Solutions
Battelle Charlottesville Operations
1000 Research Park Boulevard, Suite 105
Charlottesville, Virginia 22911

434.984.0951 x244
434.984.0947 (fax)
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Mobile Lucene

2005-06-13 Thread Dan Funk
It comes with Linux installed - it looks like they wrapped their own; there 
is no official distribution.
You can, with a little effort, put Debian on it ( 
http://www.eleves.ens.fr/home/leurent/zaurus.html ).


It doesn't come with Java, but you can get an official version from Sun 
that works very well ( 
http://java.sun.com/developer/earlyAccess/pp4zaurus/ ).


I re-compiled my application for Java 1.1 and put the Lucene binary I 
downloaded on my path - I didn't alter it at all. Everything worked 
right out of the box. I should note that a GUI-based application would 
have been far more difficult to port - all I had was a web service, and 
that moved over without a hitch.



christopher may wrote:



 What are you running as far as the OS? And thanks for the response.


From: Dan Funk <[EMAIL PROTECTED]>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Mobile Lucene
Date: Mon, 13 Jun 2005 15:10:46 -0400

I have a Sharp Zaurus SL-C3000 running J2ME - I was able to use the 
current Lucene without modification.


christopher may wrote:

Hey all, I am working on a project that requires a search engine on an 
embedded Linux device that is also Bluetooth capable. Is there a mobile 
Lucene, or can I recompile the code in the J2ME wireless toolkit? 
Any help would be appreciated. Thanks




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
Dan Funk
Software Engineer

Information Technology Solutions
Battelle Charlottesville Operations
1000 Research Park Boulevard, Suite 105
Charlottesville, Virginia 22911

434.984.0951 x244
434.984.0947 (fax)
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
Dan Funk
Software Engineer

Information Technology Solutions
Battelle Charlottesville Operations
1000 Research Park Boulevard, Suite 105
Charlottesville, Virginia 22911

434.984.0951 x244
434.984.0947 (fax)
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



wiki now sends Vary: Cookie (was Re: DBSight, search on database by Lucene)

2005-06-13 Thread Joshua Slive



Paul Querna wrote:

Joshua Slive wrote:
What we want is for anything with a Cookie: header to totally bypass 
the cache.  I don't know of any way to configure that.



Moin should be sending Cache-Control: private in these cases, in 
addition to the Vary: Cookie header. If it doesn't, things will break with 
other upstream proxies that we have no control over. Fixing it so httpd 
can cache fixes upstream proxies too, so it is the right thing to do.


I've added the Vary: Cookie header.  I believe that even with the 
current naive Vary handling, this should work OK in mod_cache, since it 
won't store any of the logged-in pages, due to the Cache-Control headers.

So the non-cookie version should hang around in the cache.
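
For reference, a logged-in (cookie-bearing) response should then carry 
roughly these two standard HTTP/1.1 headers - this sketch shows only the 
relevant headers, not Moin's full response:

Cache-Control: private
Vary: Cookie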

Anyway, I hope this makes things much less confusing for people trying 
to edit the pages.


Joshua.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Updating documents

2005-06-13 Thread Chris Hostetter

: When I do this, all fields that were indexed and/or tokenized but not
: stored get lost.
:
: So is there any way to preserve fields that were not stored?
: Reconstructing these fields is to expensive in my application.

"preserving" those fields is pretty much the oposite of "not storing"
them.

I think some people have discussed the idea of using the term vector info
to reconstruct the token stream, to recreate a doc with identical
properties from a search perspective; but in general, the most
straightforward way to achieve what you want is to store every field.
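
For instance, a minimal sketch using the Lucene 1.4-era Field factories (the 
field name and helper class are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoredFieldDemo {
    public static Document makeDoc(String contents) {
        Document doc = new Document();
        // Field.Text(String, String) is stored + indexed + tokenized, so the
        // original text survives a delete-and-re-add update
        doc.add(Field.Text("contents", contents));
        // Field.UnStored("contents", contents) would index the text without
        // keeping it, which is exactly what makes the update lossy
        return doc;
    }
}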



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Determining the IDF while searching for documents

2005-06-13 Thread Chris Hostetter

I'm not 100% sure I understand your question, but...

: order to compute the TF I count the occurences of terms which are
: similar to the term. But I've got problems to compute the IDF, because I
: must know the number of documents in which the term appears before
: searching for the documents (in the method sumOfSquaredWeights() in my

...to get the number of docs that contain a specific term, you can use
IndexReader.docFreq(Term).
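
A minimal sketch of that lookup at query time (the index path and field name 
are placeholders, and the idf line just mirrors the shape of Lucene's default 
Similarity, not necessarily what your Weight should use):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DocFreqDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        Term term = new Term("contents", "rai");   // one trigram of "rainstorm"
        int df = reader.docFreq(term);             // docs containing this trigram
        int n = reader.numDocs();
        // same shape as Lucene's default idf: log(numDocs / (docFreq + 1)) + 1
        double idf = Math.log(n / (double) (df + 1)) + 1.0;
        System.out.println("df=" + df + ", idf=" + idf);
        reader.close();
    }
}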



: Date: Mon, 13 Jun 2005 21:30:21 +0200
: From: Barbara Krausz <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Determining the IDF while searching for documents
:
: Hi all,
:
: is it possible to determine the IDF (i.e., the number of documents in
: which a term appears) while searching for documents? I implemented an
: index based on trigrams, i.e. the index terms are now strings of 3
: characters, so that my search engine finds documents with OCR errors.
: When I'm searching for the term "rainstorm", for example, I split it up
: into the trigrams __r, _ra, rai, ain, ins...
: First I look for documents which contain at least 8 of the 11 trigrams
: of "rainstorm" (the misspelled "ranstorm" contains 8 of the 11 trigrams),
: then I check whether the trigrams form a term like "rainstorm". To
: compute the TF, I count the occurrences of terms which are similar to
: the query term. But I have trouble computing the IDF, because I must
: know the number of documents in which the term appears before searching
: for the documents (in the method sumOfSquaredWeights() in my Weight). I
: used hsqldb during indexing and saved the number of documents for each
: term, but it's really slow.
: My question is the following: when I'm searching for documents which
: contain terms similar to the search term, I actually do get the number
: of documents that contain the term. But I need the IDF before searching
: these documents, for example for BooleanQueries, which need the IDF to
: normalize the query vector. Can I solve this problem, i.e. can I
: determine the IDF later and normalize the BooleanQuery afterwards?
:
: Thanks
: Barbara
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]