Re: Exchange/PST/Mail parsing

2007-07-02 Thread jm

We had to develop VB code to convert PST files to EML files.

I am using mbox, which works fine for me. I am also using Aperture, but only
for extracting text from non-mail files (Office documents, etc.); that works fine too.

On 7/2/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Anyone have any recommendations on a decent, open (doesn't have to be
Apache license, but would prefer non-GPL if possible), extractor for
MS Exchange and/or PST files?  The Zoe link on the FAQ [1] seems dead.

For mbox, I think mstor will suffice for me, and I think tropo (from
the FAQ) should work for IMAP.  Does anyone have experience with
http://aperture.sourceforge.net/

[1] http://wiki.apache.org/lucene-java/LuceneFAQ#head-
bcba2effabe224d5fb8c1761e4da1fedceb9800e

Cheers,
Grant



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Pagination

2007-07-02 Thread Lee Li Bin
Hi,

I still have no idea how to get it done. Could you give me some details?

The web application is in JSP, by the way.

Thanks a lot.

 
Regards,
Lee Li Bin
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED] 
Sent: Saturday, June 30, 2007 2:21 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

After the search you just get a Hits object, and you go through the
documents with hits.doc(i).

The pagination is controlled by you. Lucene pre-caches the first 200
documents and lazily loads the rest in batches of 200.
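A minimal sketch of that kind of page-by-page iteration over Hits (Lucene 2.x API); the class name, the "title" field, and the page/pageSize parameters are illustrative assumptions, not part of the poster's setup:

import java.io.PrintWriter;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class HitsPager {
    // Renders one page of results; "title" stands in for whatever stored field the app displays.
    public static void renderPage(Searcher searcher, Query query, int page, int pageSize, PrintWriter out)
            throws Exception {
        Hits hits = searcher.search(query);
        int first = (page - 1) * pageSize;                    // 0-based index of the first hit on this page
        int last = Math.min(first + pageSize, hits.length());
        for (int i = first; i < last; i++) {
            Document doc = hits.doc(i);                       // Hits lazily loads the stored document
            out.println(doc.get("title"));
        }
    }
}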

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 6/29/07, Lee Li Bin <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> does anyone know how to do pagination on a JSP page using the number of
> hits returned? Or any other solutions?
>
>
>
> Do provide me with some sample code if possible, or a step-by-step guide.
> Sorry if I'm asking too much; I'm new to Lucene.
>
>
>
> Thanks
>
>
>
>
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Pagination

2007-07-02 Thread mark harwood
The Hits class is OK but can be inefficient due to re-running the query 
unnecessarily.

The class below illustrates how to efficiently retrieve a particular page of 
results and lends itself to webapps where you don't want to retain server-side 
state (i.e. a Hits object) for each client.
It would make sense to put an upper limit on the "start" parameter (as Google 
etc. do) to avoid consuming too much RAM per client request.

Cheers,
Mark

[Begin code]




package lucene.pagination;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.PriorityQueue;

/**
 * A HitCollector that retrieves a specific page of results.
 * @author maharwood
 */
public class HitPageCollector extends HitCollector
{
    // Demo code showing pagination
    public static void main(String[] args) throws Exception
    {
        IndexSearcher s = new IndexSearcher("/indexes/nasa");
        HitPageCollector hpc = new HitPageCollector(1, 10);
        Query q = new TermQuery(new Term("contents", "sea"));
        s.search(q, hpc);
        ScoreDoc[] sd = hpc.getScores();
        System.out.println("Hits " + hpc.getStart() + " - " + hpc.getEnd() + " of " + hpc.getTotalAvailable());
        for (int i = 0; i < sd.length; i++)
        {
            System.out.println(sd[i].doc);
        }
        s.close();
    }

    int nDocs;
    PriorityQueue hq;
    float minScore = 0.0f;
    int totalHits = 0;
    int start;
    int maxNumHits;
    int totalInThisPage;

    public HitPageCollector(int start, int maxNumHits)
    {
        this.nDocs = start + maxNumHits;
        this.start = start;
        this.maxNumHits = maxNumHits;
        hq = new HitQueue(nDocs);
    }

    public void collect(int doc, float score)
    {
        totalHits++;
        if ((hq.size() < nDocs) || (score >= minScore))
        {
            ScoreDoc scoreDoc = new ScoreDoc(doc, score);
            hq.insert(scoreDoc);                    // update hit queue
            minScore = ((ScoreDoc) hq.top()).score; // reset minScore
        }
        totalInThisPage = hq.size();
    }


    public ScoreDoc[] getScores()
    {
        // Returns just the number of hits required, from the required start point.
        /*
        So, given hits:
            1234567890
        and a start of 2 + maxNumHits of 3, should return:
            234
        or, given hits
            12
        should return
            2
        and so on.
        */
        if (start <= 0)
        {
            throw new IllegalArgumentException("Invalid start: " + start + " - start should be >= 1");
        }
        int numReturned = Math.min(maxNumHits, (hq.size() - (start - 1)));
        if (numReturned <= 0)
        {
            return new ScoreDoc[0];
        }
        ScoreDoc[] scoreDocs = new ScoreDoc[numReturned];
        ScoreDoc scoreDoc;
        for (int i = hq.size() - 1; i >= 0; i--) // put docs in the array, working backwards from the lowest-scoring hit
        {
            scoreDoc = (ScoreDoc) hq.pop();
            if (i < (start - 1))
            {
                break; // off the beginning of the results array
            }
            if (i < (scoreDocs.length + (start - 1)))
            {
                scoreDocs[i - (start - 1)] = scoreDoc; // within scope of the results array
            }
        }
        return scoreDocs;
    }

    public int getTotalAvailable()
    {
        return totalHits;
    }

    public int getStart()
    {
        return start;
    }

    public int getEnd()
    {
        return start + totalInThisPage - 1;
    }

    public class HitQueue extends PriorityQueue
    {
        public HitQueue(int size)
        {
            initialize(size);
        }
        public final boolean lessThan(Object a, Object b)
        {
            ScoreDoc hitA = (ScoreDoc) a;
            ScoreDoc hitB = (ScoreDoc) b;
            if (hitA.score == hitB.score)
                return hitA.doc > hitB.doc;
            else
                return hitA.score < hitB.score;
        }
    }
}
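A hedged sketch of how a webapp might drive the collector above for a given page number; the index path, field, and query mirror the demo in main(), while the page/pageSize arithmetic, the "title" field, and the IndexSearcher.doc() lookup of the stored document are illustrative assumptions:

package lucene.pagination;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class PagedSearchDemo {
    public static void main(String[] args) throws Exception {
        int page = 2;                                        // e.g. taken from a request parameter
        int pageSize = 10;
        int start = (page - 1) * pageSize + 1;               // the collector's "start" is 1-based
        IndexSearcher searcher = new IndexSearcher("/indexes/nasa");
        HitPageCollector hpc = new HitPageCollector(start, pageSize);
        searcher.search(new TermQuery(new Term("contents", "sea")), hpc);
        ScoreDoc[] sd = hpc.getScores();
        for (int i = 0; i < sd.length; i++) {
            Document doc = searcher.doc(sd[i].doc);          // fetch the full stored document by doc id
            System.out.println(doc.get("title"));            // "title" is a placeholder field name
        }
        searcher.close();
    }
}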



- Original Message 
From: Lee Li Bin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, 2 July, 2007 9:59:14 AM
Subject: RE: Pagination

Hi,

I still have no idea of how to get it done. Can give me some details?

The web application is in jsp btw.

Thanks a lot.

 
Regards,
Lee Li Bin
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED] 
Sent: Saturday, June 30, 2007 2:21 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

After search, you will just get an object Hits, and go through all of the
documents by hits.doc(i).

The pagination is controlled by you. Lucene is pre-caching first 200
documents and lazy loading the rest by batch size 200.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database

Re: Exchange/PST/Mail parsing

2007-07-02 Thread Nick Burch

On Sun, 1 Jul 2007, Grant Ingersoll wrote:
Anyone have any recommendations on a decent, open (doesn't have to be 
Apache license, but would prefer non-GPL if possible), extractor for MS 
Exchange and/or PST files?


There has been an offer to contribute a PST parser to Apache POI. We're 
hoping that Travis will have something to go into the POI scratchpad quite 
soon, but we understand he's currently still working on the first version.


I can only suggest you keep an eye on the POI dev list for when the code 
comes through.


Nick

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Geneology, nicknames, levenstein, soundex/metaphone, etc

2007-07-02 Thread Darren Hartford
Thank you for the link to the previous thread, a lot of information there!

*Synonym use of nicknames - that sounds quite feasible.  Do you
specifically mean the WordNet module in the Sandbox, or something
different? 


> -Original Message-
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
> Sent: Friday, June 29, 2007 12:30 PM
> To: java-user@lucene.apache.org
> Subject: Re: Geneology, nicknames, levenstein, soundex/metaphone, etc
> 
> You may find this thread useful: http://www.gossamer-threads.com/
> lists/lucene/java-user/47824?search_string=record%20linkage;#47824
> although it doesn't answer all your ?'s
> 
> > *nickname:  would it be feasible to create an Analyzer that 
> will tie 
> > to an external/internal nickname datasource (datasource would vary 
> > dramatically based on nationality).  Usecase:  Jon, John, Johnny, 
> > Jonathan would have 'weight' in the relevance.  Similarly 'Dick', 
> > 'Chuck', and 'Charles'.
> 
> Maybe you could inject these as synonyms?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Geneology, nicknames, levenstein, soundex/metaphone, etc

2007-07-02 Thread Grant Ingersoll


On Jul 2, 2007, at 8:07 AM, Darren Hartford wrote:

Thank you for the link to the previous thread, lot of information  
there!


*Synonym use of nicknames - that sounds quite feasible.  Do you
specifically mean the WordNet module in the Sandbox, or something
different?


No, I think I was thinking along the lines of the SynonymAnalyzer in  
Lucene In Action, whereby you add the nicknames as tokens at the same  
position as the original, so that searches on the nicknames would  
still match.  I don't know that it solves your need for "weight" in the  
relevance, but maybe it would.
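To make the idea concrete, a minimal sketch in the style of a Lucene 2.x TokenFilter (the pre-2.9 Token API); the class name and the nickname map are made up for illustration, and a real nickname source would vary by nationality, as discussed earlier in the thread:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NicknameSynonymFilter extends TokenFilter {
    private final Map nicknames;                       // e.g. "john" -> String[] {"jon", "johnny", "jonathan"}
    private final LinkedList pending = new LinkedList();

    public NicknameSynonymFilter(TokenStream input, Map nicknames) {
        super(input);
        this.nicknames = nicknames;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();      // emit any queued nickname tokens first
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String[] alts = (String[]) nicknames.get(token.termText());
        if (alts != null) {
            for (int i = 0; i < alts.length; i++) {
                Token syn = new Token(alts[i], token.startOffset(), token.endOffset());
                syn.setPositionIncrement(0);           // stack the nickname at the same position as the original
                pending.add(syn);
            }
        }
        return token;
    }
}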






-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Friday, June 29, 2007 12:30 PM
To: java-user@lucene.apache.org
Subject: Re: Geneology, nicknames, levenstein, soundex/metaphone, etc

You may find this thread useful: http://www.gossamer-threads.com/
lists/lucene/java-user/47824?search_string=record%20linkage;#47824
although it doesn't answer all your ?'s


*nickname:  would it be feasible to create an Analyzer that

will tie

to an external/internal nickname datasource (datasource would vary
dramatically based on nationality).  Usecase:  Jon, John, Johnny,
Jonathan would have 'weight' in the relevance.  Similarly 'Dick',
'Chuck', and 'Charles'.


Maybe you could inject these as synonyms?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Exchange/PST/Mail parsing

2007-07-02 Thread Christiaan Fluit

Hello Grant (cc-ing aperture-devel),

I am one of the Aperture admins, so I can tell you a bit more about 
Aperture's mail facilities.


Short intro: Aperture is a framework for crawling and for full-text and 
metadata extraction from a growing number of sources and file formats. We 
try to select the best of breed among the large number of open source 
libraries that tackle a specific source or format (e.g. PDFBox, POI, 
JavaMail) and write some glue code around them so that they can be invoked 
in a uniform way. It's currently used in a number of desktop and 
enterprise search applications, both research and production systems.


At the moment we support a number of mail systems.

We can crawl IMAP mailboxes through JavaMail. In general it seems to 
work well; problems are usually caused by IMAP servers not conforming to 
the IMAP specs.
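This is not Aperture's actual crawler code -- just a minimal JavaMail sketch of the kind of IMAP access it builds on; the host, credentials, and folder name are placeholders:

import java.util.Properties;
import javax.mail.*;

public class ImapDump {
    public static void main(String[] args) throws Exception {
        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("imaps");
        store.connect("imap.example.com", "user", "password");
        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_ONLY);
        Message[] messages = inbox.getMessages();
        for (int i = 0; i < messages.length; i++) {
            // Subject alone shows the crawl loop; a real crawler would walk the
            // MIME parts and hand the extracted text and metadata to Lucene.
            System.out.println(messages[i].getSubject());
        }
        inbox.close(false);
        store.close();
    }
}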


Some people have used the ImapCrawler to crawl MS Exchange as well. Some 
succeeded, some didn't. I don't really know whether the fault is in 
Aperture's code or in the Exchange configuration but I would be happy to 
take a look at it when someone runs into problems.


Outlook can also be crawled by connecting to a running Outlook process 
through jacob.dll. Others on aperture-devel can tell you more about its 
current status. Besides this crawler, I would also be very interested in 
having a crawler that directly processes .pst files, so as to stay clear 
of communicating with other processes outside your own control.


People have been working on crawling Thunderbird mailboxes but I don't 
know what the current status is.


Ultimately, we try to support any major mail system. In practice, effort 
is usually dependent on knowledge and experience as well as customer demand.


We are happy to help you out with trying to get Aperture working in your 
domain and looking into the problems that you may encounter.



Kind regards,

Chris
--

Grant Ingersoll wrote:
Anyone have any recommendations on a decent, open (doesn't have to be 
Apache license, but would prefer non-GPL if possible), extractor for MS 
Exchange and/or PST files?  The Zoe link on the FAQ [1] seems dead.


For mbox, I think mstor will suffice for me and I think tropo (from the 
FAQ should work for IMAP).  Does anyone have experience with 
http://aperture.sourceforge.net/


[1] 
http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e 



Cheers,
Grant


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Auto Slop

2007-07-02 Thread Walt Stoneburner

I just ran into an interesting problem today, and wanted to know if it
was my understanding or Lucene that was out of whack -- right now I'm
leaning toward a fault between the chair and the keyboard.

I attempted to do a simple phrase query using the StandardAnalyzer:
"United States"

Against my corpus of test documents, I got no results returned, which
surprised me.  I know it's in there.

So, I ran this same query in Luke, and it also returned no results.

Luke explains:
 PhraseQuery: boost=1., slop=0
 pos[0,1]
 Term 0: field='contents' text='united'
 Term 1: field='contents' text='states'

Now I know Lucene handles phrases, so I tried manually setting the
slop to 1, given that there were two terms:  "United States"~1

...and suddenly I got the results I was expecting!

In fact, after a little trial and error with larger phrases, I always
get no results unless I *manually* specify at least a slop value of the
number of terms minus one.

Isn't this supposed to be the default behavior if no slop is specified?

Lucene's standard analyzer, which clearly knows the number of terms,
should be able to deduce the minimum slop amount.  Why must it be
manually specified?

Could I be missing some configuration setting, have a bad
understanding of the query syntax, or is there a clever reason (like
searching for encoding synonyms) that makes more sense as a default
value for slop that I'm not seeing?

Many thanks to all that unravel my confusion.

-wls

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Auto Slop

2007-07-02 Thread Mark Miller
Examine your indexes and analyzers. The default slop is 0, which means 
allow 0 terms between the terms in the phrase. That would be an exact 
match. A slop of 1 is not the default and would allow a term movement of 
one position to match the phrase.
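For reference, a small sketch of how slop looks on the query object itself (Lucene 2.x); the "contents" field matches the Luke output quoted below, and the class name is just for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SlopDemo {
    public static void main(String[] args) {
        PhraseQuery exact = new PhraseQuery();          // slop defaults to 0: exact phrase only
        exact.add(new Term("contents", "united"));
        exact.add(new Term("contents", "states"));
        System.out.println(exact + " slop=" + exact.getSlop());

        PhraseQuery sloppy = new PhraseQuery();
        sloppy.add(new Term("contents", "united"));
        sloppy.add(new Term("contents", "states"));
        sloppy.setSlop(1);                              // equivalent to the query syntax "united states"~1
        System.out.println(sloppy + " slop=" + sloppy.getSlop());
    }
}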


- Mark

Walt Stoneburner wrote:

I just ran into an interesting problem today, and wanted to know if it
was my understanding or Lucene that was out of whack -- right now I'm
leaning toward a fault between the chair and the keyboard.

I attempted to do a simple phrase query using the StandardAnalyzer:
"United States"

Against my corpus of test documents, I got no results returned, which
surprised me.  I know it's in there.

So, I ran this same query in Luke, and it also returned no results.

Luke explains:
 PhraseQuery: boost=1., slop=0
 pos[0,1]
 Term 0: field='contents' text='united'
 Term 1: field='contents' text='states'

Now I know Lucene handles phrases, so I tried manually setting the
slop to 1, given that there were two terms:  "United States"~1

...and suddenly I got the results I was expecting!

In fact, after a little trial and error with larger phrases, I always
get no results unless I *manually* specify at least slop value of the
number of terms minus one.

Isn't this supposed to be the default behavior if no slop is specified?

Lucene's standard analyzer, which clear knows the number of terms,
should be able to deduce the minimum slop amount.  Why must it be
manually specified?

Could I be missing some configuration setting, have a bad
understanding of the query syntax, or is there a clever reason (like
searching for encoding synonyms) that makes more sense as a default
value for slop that I'm not seeing?

Many thanks to all that unravel my confusion.

-wls

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Auto Slop

2007-07-02 Thread Ard Schrijvers

> I just ran into an interesting problem today, and wanted to know if it
> was my understanding or Lucene that was out of whack -- right now I'm
> leaning toward a fault between the chair and the keyboard.
> 
> I attempted to do a simple phrase query using the StandardAnalyzer:
> "United States"

And did you also analyze this particular field with the same StandardAnalyzer 
during indexing? It sounds like you used a different analyzer when creating the index.
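A minimal sketch of that point -- indexing and parsing the phrase with the same StandardAnalyzer (Lucene 2.x API); the index path and sample text are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SameAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index time: the "contents" field is analyzed with StandardAnalyzer.
        IndexWriter writer = new IndexWriter("/tmp/demo-index", analyzer, true);
        Document doc = new Document();
        doc.add(new Field("contents", "the United States of America",
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Query time: parse the phrase with the *same* analyzer.
        Query q = new QueryParser("contents", analyzer).parse("\"United States\"");
        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        System.out.println(searcher.search(q).length() + " hit(s)");   // expect 1 with the default slop of 0
        searcher.close();
    }
}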

Regards Ard

> 
> Against my corpus of test documents, I got no results returned, which
> surprised me.  I know it's in there.
> 
> So, I ran this same query in Luke, and it also returned no results.
> 
> Luke explains:
>   PhraseQuery: boost=1., slop=0
>   pos[0,1]
>   Term 0: field='contents' text='united'
>   Term 1: field='contents' text='states'
> 
> Now I know Lucene handles phrases, so I tried manually setting the
> slop to 1, given that there were two terms:  "United States"~1
> 
> ...and suddenly I got the results I was expecting!
> 
> In fact, after a little trial and error with larger phrases, I always
> get no results unless I *manually* specify at least slop value of the
> number of terms minus one.
> 
> Isn't this supposed to be the default behavior if no slop is 
> specified?
> 
> Lucene's standard analyzer, which clear knows the number of terms,
> should be able to deduce the minimum slop amount.  Why must it be
> manually specified?
> 
> Could I be missing some configuration setting, have a bad
> understanding of the query syntax, or is there a clever reason (like
> searching for encoding synonyms) that makes more sense as a default
> value for slop that I'm not seeing?
> 
> Many thanks to all that unravel my confusion.
> 
> -wls
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



highlighting phrase query

2007-07-02 Thread sandeep chawla

Hi All,

I am developing a search tool using Lucene; I am using Lucene 2.1.

I have a requirement to highlight query words in the results.
Lucene-highlighter 2.1 doesn't work well when highlighting a phrase query.

For example, if I have the query string "lucene java", it highlights
not only occurrences of "lucene java" but also occurrences of "lucene" and
"java" on their own in the text.

I think this is a known problem. Is this issue solved in Lucene 2.2?
My application is almost complete and I really don't want to switch
to Lucene 2.2.

I was going through previous posts but couldn't find a solution to
this problem. There are some alternative highlighters, but it seems they
are not stable and still evolving.

I am looking for a standard, stable API for this purpose.

I'd appreciate any thoughts/guidance on this issue.

Thanks
Sandeep

--
SANDEEP CHAWLA
House No- 23
10th main   
BTM 1st  Stage  
Bangalore   Mobile: 91-9986150603

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Pagination

2007-07-02 Thread Alixandre Santana

Mark,

The ScoreDoc[] contains only the IDs of each Lucene document. What
would be the best way of getting the entire (Lucene) document?
Should I do a new lookup with the ID retrieved from hpc.getScores(),
i.e. searcher.doc(docId)?

thanks.

Alixandre

On 7/2/07, mark harwood <[EMAIL PROTECTED]> wrote:

The Hits class is OK but can be inefficient due to re-running the query 
unnecessarily.

The class below illustrates how to efficiently retrieve a particular page of 
results and lends itself to webapps where you don't want to retain server side 
state (i.e. a Hits object) for each client.
It would make sense to put an upper limit on the "start" parameter (as Google 
etc do) to avoid consuming to much RAM per client request.

Cheers,
Mark

[Begin code]

package lucene.pagination; ... [snip]



- Original Message 
From: Lee Li Bin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, 2 July, 2007 9:59:14 AM
Subject: RE: Pagination

Hi,

I still have no idea of how to get it done. Can give me some details?

The web application is in jsp btw.

Thanks a lot.


Regards,
Lee Li Bin
-Original Message-
From: Chris Lu [mailto:[EMAIL PROTECTED]
Sent: Saturday, June 30, 2007 2:21 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

After search, you will just g

Modify search results

2007-07-02 Thread Robert Mullin

I have managed to download and install Lucene.  In addition, I have reached
the point at which I am able to generate an index and run a search.  The
search returns a 'raw' list of the HTML pages in which my search term
occurs. . . . chapter17, chapter18, etc.

Question: how do I go about manipulating the search results?  Is it possible
to "intercept" the listing of HTML pages returned by the Lucene search
function and modify the report it sends to the screen?

Can this be as simple as adding a line to the Lucene Java code so that,
instead of reporting a simple chapter number, it will report the chapter
surrounded by HTML code? E.g. instead of simply seeing "Chapter 17" on the
screen, I want the report to read Chapter 17, paragraph 3 . . . of course
then I'll need to get that info to an HTML page . . . later . . . later . .

I suspect this will be handled by 1) modification to the Lucene source code
or 2) an addition of javascript or perl-script . . . but am not at all sure.

Thanks in advance for any help that might be provided.

r. mullin


Lucene index in memcache

2007-07-02 Thread Cathy Murphy

Is there a way to store a Lucene index in memcache? During high traffic, search
becomes very slow. :(

--
Cathy
www.nachofoto.com


Re: Lucene index in memcache

2007-07-02 Thread Erick Erickson

You can always read the current index into a RAMDir, but I really
wonder if that will make much of a difference, as your operating system should
be taking care of this kind of thing for you.

How big is your index? What kind of performance are you
seeing? What else is running on that box?

I'd do some profiling to see where things are actually slow. In
particular, think about logging how long each query takes to
complete, just the Lucene part. I've seen similar situations
where the actual time was being taken *outside* of lucene
itself by XML manipulations, for instance.

Also, are you iterating over a Hits object for more than the top
100 entries? That would be very inefficient. Are you using a
collector and calling IndexReader.doc() inside the loop?

I'd *very* strongly recommend that you pinpoint where the
time is actually being spent before jumping to the conclusion
that using a RAMdir would fix your problem. I can't tell you how
many times I've been *sure* I knew where the bottleneck was
only to find out that it's someplace completely different.
You simply cannot reliably optimize performance without
really understanding where the time is being spent. Trust
me on this one ...

Some simple timings with System.currentTimeMillis() will
tell you a lot.
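A minimal sketch of that kind of timing, wrapping just the Lucene call so that time spent outside Lucene (XML handling, rendering, etc.) shows up separately; the class and method names are made up for illustration:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchTimer {
    // Times only the Lucene search call itself.
    public static Hits timedSearch(IndexSearcher searcher, Query query) throws Exception {
        long t0 = System.currentTimeMillis();
        Hits hits = searcher.search(query);
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println(query + ": " + hits.length() + " hits in " + elapsed + " ms");
        return hits;
    }
}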

Best
Erick

On 7/2/07, Cathy Murphy <[EMAIL PROTECTED]> wrote:


Is there a way to store lucene index in memcache. During high traffic
search
becomes very slow. :(

--
Cathy
www.nachofoto.com



Re: highlighting phrase query

2007-07-02 Thread Mark Miller
There has been a lot of Highlighter discussion lately, but just to try 
and sum up the state of Highlighting in the Lucene world:


There are four Highlighter implementations that I know of. From what I 
can tell, only the original Contrib Highlighter has received sustained 
active development by more than one individual.


Contrib Highlighter:
The Contrib Highlighter supports the widest array of analyzers and 
corner cases and has had the widest exposure. It is generally slower on 
larger documents due to the requirement that you re-analyze text and to 
support a wider variety of use cases -- the TokenGroup for token 
overlaps and inspecting every term for Fragmentation contribute to a 
huge performance drain on large documents. This highlighter does not 
support highlighting based on position and all terms from the query will 
be highlighted in the text. You can avoid some of the cost of 
re-analyzing by using the TokenSources class to rebuild a TokenStream 
using stored offsets and/or positions, but this is unlikely to be faster 
unless you are using very large documents with a complex analyzer. 
Getting and sorting offsets/positions is relatively slow and for smaller 
docs it is faster to just re-analyze.
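For readers new to it, a bare-bones sketch of contrib Highlighter usage with re-analysis (Lucene 2.x); the analyzer, field name, and sample text are placeholders, and as noted above a QueryScorer marks the individual phrase terms rather than the phrase as a whole:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class ContribHighlightDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query query = new QueryParser("contents", analyzer).parse("\"lucene java\"");
        String text = "Lucene is a Java library; Lucene Java powers many search apps.";

        // Re-analyzes the text and marks every query term it finds.
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
        System.out.println(highlighter.getBestFragments(tokens, text, 3, "..."));
    }
}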


LUCENE-403:
I have not spent a lot of time with this approach, but it is similar to 
the Contrib Highlighter approach. It almost certainly does not cover as 
many odd corner cases as Contrib Highlighter and the framework is 
lacking, but it does add some support for proper PhraseQuery 
highlighting by implementing some custom PhraseQuery search logic. 
Because LUCENE-403 is not as rigorous as the Contrib Highlighter, it may 
well be a bit faster. The author claims that HTML tags will not be 
broken when fragmenting.


LUCENE-644:
This Highlighter approach requires that you have stored term offsets in 
the index. This Highlighter can be very fast if you are using a 
complicated analyzer since there is no need for re-analyzing the text 
(due to the stored offsets). Also, rather then scoring every term like 
the Contrib Highlighter, only terms from the query are effectively 
"handled". For smaller documents and simpler analyzers there is not much 
speed improvement over the Contrib Highlighter (due to the time it takes 
to retrieve and sort offsets), but for larger documents , especially 
with more complex analyzers,  this Highlighter can be extremely fast. 
Again, positional highlighting for Phrase and Span queries is not 
supported.  

The biggest reason this implementation performs so well is that it deals 
with the text in much bigger chunks. Contrib Highlighter can also avoid 
re-analyzing by storing offsets and positions, but then it scores the 
document and rebuilds the text one token at a time using the performance 
draining TokenGroup (which helps cover some of those corner cases). This 
is very slow on very large documents.
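As a concrete illustration of the "stored offsets" prerequisite mentioned above, this is roughly how a field would be indexed with term vectors carrying positions and offsets (Lucene 2.x); the index path and text are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OffsetIndexingDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/highlight-index", new StandardAnalyzer(), true);
        Document doc = new Document();
        // Storing term vectors with positions and offsets is what lets a highlighter
        // (TokenSources in contrib, or an offset-based approach) skip re-analysis.
        doc.add(new Field("contents", "some body text to highlight later",
                Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();
    }
}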


LUCENE-794:
This approach extends the Contrib Highlighter to support Highlighting 
Span and Phrase queries. The approach used for non position sensitive 
Query clauses is the same as the Contrib Highlighter, and if you use the 
latest CachingTokenFilter the speed is roughly the same. Position 
sensitive Query clauses are a bit slower as a MemoryIndex is used to 
retrieve the correct positions to Highlight. This gives exact 
highlighting without reimplementing search logic. Also, all of the use 
cases and corner cases that have been solved for the Contrib Highlighter 
are retained. All of the deficiencies of the Contrib Highlighter (slower 
on large docs) are also retained. The majority of the code for this 
comes from the Contrib Highlighter -- it uses the Contrib Highlighter 
framework. Which points out a plus for the Contrib Highlighter setup -- 
it allows for an extension like this, while LUCENE-644 could not easily 
be expanded to handle position sensitive queries.



There has been some discussion of getting Lucene to identify correct 
highlights as the search is processed. I am not very optimistic that 
this will be fruitful, but those that are discussing it know more 
about this than I do.


- Mark

sandeep chawla wrote:

Hi All,

I am developing a search tool using lucene. I am using lucene 2.1.

i have a requirement to highlight query words in the results.
.Lucene-highlighter 2.1 doesn't work well in highlighting phase query.

For example - if i have a query string "lucene Java" .It highlights
not only occurrences of "lucene java" but occurrences of lucene and
java too in the text.

I think, this is a known problem..is this issue solved in lucene 2.2.
well my application is almost complete and i really don't wanna switch
to lucene 2.2.

I was going through previous posts but i couldn't find a solution of
this problem. There r some alternate highlighter s but it seems, they
r not stable and still in evolution phase.

I am looking for a standard n stable API for this purpose..

I'd appreciate any thoughts/guidance in this issue.

Thanks
Sandeep



-

Re: Lucene index in memcache

2007-07-02 Thread Chris Hostetter

: Is there a way to store lucene index in memcache. During high traffic search
: becomes very slow. :(

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


If you provide some more info about how you are using Lucene (ie: what you
code looks like) and what the concepts of "high traffic" and "slow" mean
to you, we might be able to help you better.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reusing Document Objects (was Auto Slop)

2007-07-02 Thread Walt Stoneburner

If I create a Document object, can I pass it to multiple index writers
without harm?

Or does the process of being handed to an IndexWriter somehow mutate the
state of the Document object, say during tokenizing, in a way that would
cause problems when it's re-used with a totally separate index ... such as
I'm seeing with slop?

-wls


RE: highlighting phrase query

2007-07-02 Thread Renaud Waldura
Mark:

Thanks a million for this comprehensive analysis. This is going straight to
my manager. :)

--Renaud
 

-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 02, 2007 2:11 PM
To: java-user@lucene.apache.org
Subject: Re: highlighting phrase query

There has been a lot of Highlighter discussion lately, but just to try and
sum up the state of Highlighting in the Lucene world:

There are four Highlighter implementations that I know of. From what I can
tell, only the original Contrib Highlighter has received sustained active
development by more than one individual.

Contrib... [snip]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



multi-term query weighting

2007-07-02 Thread Tim Sturge
I have an index with two different sources of information, one small but 
of high quality (call it "title"), and one large, but of lower quality 
(call it "body").  I give boosts to certain documents related to their 
popularity (this is very similar to what one would do indexing the web).


The problem I have is a query like "John Bush". I translate that into " 
(title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ". But the 
results I get are:


1. George Bush
...
4. John Kerry
...
10. John Bush

The reason is (looking at explain) that George Bush is scored:
169 = sum(
1 =  
)
168 = sum(
160 = 
8 = 
)
)

and John Kerry is similar but reversed. Poor old "John Bush" only scores:

72 = sum(
 40 = (+)
 32 = (+ )
)

because his initial boost was only 1/4 of George's.
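For reference, a sketch of the query construction described above -- one required (title^4.0 OR body) clause per word; the field names and boosts come from the message, while the class and method names are illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PerTermQueryBuilder {
    // Builds one (title:word^4.0 body:word) clause per query word, all required.
    public static BooleanQuery build(String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < words.length; i++) {
            TermQuery title = new TermQuery(new Term("title", words[i]));
            title.setBoost(4.0f);
            TermQuery body = new TermQuery(new Term("body", words[i]));
            BooleanQuery perWord = new BooleanQuery();
            perWord.add(title, BooleanClause.Occur.SHOULD);
            perWord.add(body, BooleanClause.Occur.SHOULD);
            query.add(perWord, BooleanClause.Occur.MUST);
        }
        return query;
    }
}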

The question I have is, how can I tell the searcher to care about 
"balance"? I really want the score over 2 terms to be more like 
(sqrt(X)+sqrt(Y))^2 or maybe even exp(log(X)+log(Y)) rather than just 
X+Y. Is that supported in some obvious way, or is there some other way 
to phrase my query to say "I want both terms, but they should both be 
important if possible"?


Thanks,

Tim







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Pagination

2007-07-02 Thread Lee Li Bin
Hi,

Thanks Mark! 

I do have the same question as Alixandre. How do I get the content of the
document instead of the document id?

Thanks.

Regards,
Lee Li Bin
-Original Message-
From: Alixandre Santana [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 03, 2007 12:55 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

Mark,

The ScoreDoc[] contains only the  IDs of each lucene document. what
would be the best way of getting the entire (lucene)document ?
Should i do a new search with the ID retrivied by hpc.getScores() -
(searcher.doc(idDoc))?

thanks.

Alixandre

On 7/2/07, mark harwood <[EMAIL PROTECTED]> wrote:
> The Hits class is OK but can be inefficient due to re-running the query
unnecessarily.
>
> The class below illustrates how to efficiently retrieve a particular page
of results and lends itself to webapps where you don't want to retain server
side state (i.e. a Hits object) for each client.
> It would make sense to put an upper limit on the "start" parameter (as
Google etc do) to avoid consuming to much RAM per client request.
>
> Cheers,
> Mark
>
> [Begin code]
>
> package lucene.pagination; ... [snip]

RE: Pagination

2007-07-02 Thread Lee Li Bin
Hi Mark,

How do I display results on the second page?
I managed to display one page using your code.

 
Regards,
Lee Li Bin

-Original Message-
From: Alixandre Santana [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 03, 2007 12:55 AM
To: java-user@lucene.apache.org
Subject: Re: Pagination

Mark,

The ScoreDoc[] contains only the  IDs of each lucene document. what
would be the best way of getting the entire (lucene)document ?
Should i do a new search with the ID retrivied by hpc.getScores() -
(searcher.doc(idDoc))?

thanks.

Alixandre

On 7/2/07, mark harwood <[EMAIL PROTECTED]> wrote:
> The Hits class is OK but can be inefficient due to re-running the query
unnecessarily.
>
> The class below illustrates how to efficiently retrieve a particular page
of results and lends itself to webapps where you don't want to retain server
side state (i.e. a Hits object) for each client.
> It would make sense to put an upper limit on the "start" parameter (as
Google etc do) to avoid consuming to much RAM per client request.
>
> Cheers,
> Mark
>
> [Begin code]
>
> package lucene.pagination; ... [snip]

Re: highlighting phrase query

2007-07-02 Thread sandeep chawla

Thanks a lot Mark,

Has anyone used LUCENE-794? How stable is it? Is it widely used in industry?

These are some of my questions :)

Thanks
Sandeep

On 03/07/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:

Mark:

Thanks a million for this comprehensive analysis. This is going straight to
my manager. :)

--Renaud


-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Monday, July 02, 2007 2:11 PM
To: java-user@lucene.apache.org
Subject: Re: highlighting phrase query

There has been a lot of Highlighter discussion lately, but just to try and
sum up the state of Highlighting in the Lucene world:

There are four Highlighter implementations that I know of. From what I can
tell, only the original Contrib Highlighter has received sustained active
development by more than one individual.

Contrib... [snip]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
SANDEEP CHAWLA
House No- 23
10th main
BTM 1st  Stage
Bangalore Mobile: 91-9986150603

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene index in memcache

2007-07-02 Thread Cathy Murphy

Hi Erick & Chris ,
Thanks for your response.
I have done some profiling, and it seems the response is slow when there
are long queries (more than 5-6 words per query).
The way I have implemented it is: I pass in the search query and Lucene
returns the total number of hits, along with ids. I then fetch objects
for only those ids, as required for the pagination.
Also, it is a dedicated search box.

Thanks,
--
Cathy
www.nachofoto.com




On 7/2/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Is there a way to store lucene index in memcache. During high traffic
search
: becomes very slow. :(

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


If you provide some more info about how you are using Lucene (ie: what you
code looks like) and what the concepts of "high traffic" and "slow" mean
to you, we might be able to help you better.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]