Re: JavaCC Download

2007-06-27 Thread Mahdi Rahimi

How can I access the certificate of this site? 


Steven Rowe wrote:
> 
> I don't think you need to register - I am not registered and I can
> download from there.
> 
> My guess is that Mahdi Rahimi's browser doesn't know how to speak the
> HTTPS protocol.
> 
> Here's an invocation of wget (I have version 1.10.2) that works for me
> to get the .zip archive (all on one line):
> 
> wget --no-check-certificate
> https://javacc.dev.java.net/files/documents/17/26777/javacc-4.0.zip
> 
> Or if you want the .tar.gz archive:
> 
> wget --no-check-certificate
> https://javacc.dev.java.net/files/documents/17/26776/javacc-4.0.tar.gz
> 
> jiang jialin wrote:
>> you must register first
>> 
>> 2007/6/23, Mahdi Rahimi <[EMAIL PROTECTED]>:
>>>
>>>
>>> Hi Steven.
>>>
>>> When I access this address, this message appeared:
>>>
>>> Forbidden
>>> You don't have permission to access /servlets/ProjectHome on this
>>> server.
>>>
>>> What's the problem?
>>>
>>> Thanks.
>>>
>>>
>>> Steven Rowe wrote:
>>> >
>>> > Mahdi Rahimi wrote:
>>> >> Hi.
>>> >>
>>> >> How can I access JavaCC??
>>> >>
>>> >> Thanks
>>> >
>>> > https://javacc.dev.java.net/
>>> >
>>> > --
>>> > Steve Rowe
>>> > Center for Natural Language Processing
>>> > http://www.cnlp.org/tech/lucene.asp
> 
> 
> -- 
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/JavaCC-Download-tf3958940.html#a11319544
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Update documents

2007-06-27 Thread WATHELET Thomas
Hi,
Is it possible to update a document's field without deleting the
document and adding it again to the index?


RE: Update documents

2007-06-27 Thread Liu_Andy2
Perhaps it is not possible once you have already written the document to the index.

Andy

-Original Message-
From: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 27, 2007 3:46 PM
To: java-user@lucene.apache.org
Subject: Update documents

Hi,
Is it possible to update a document's field without deleting the
document and adding it again to the index?




Re: Update documents

2007-06-27 Thread Doron Cohen
WATHELET Thomas wrote:

> Is it possible to update a document's field without deleting the
> document and adding it again to the index?

Not really... see the FAQ, especially "How do I update a document or a set
of documents that are already indexed?", and also see javadocs for
IndexWriter's updateDocument() methods.





RE: Update documents

2007-06-27 Thread Liu_Andy2
In effect, IndexWriter's updateDocument() will first delete the documents
containing the specified term, then add the new document. It just wraps
delete & add as a single thread-safe method.

Andy
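As a plain-Java illustration of those semantics (a toy model only, not
Lucene's actual implementation; the SimpleIndex class and its method names
below are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of updateDocument(term, doc):
// delete every document matching the term, then add the new one,
// all under one lock so no reader observes a partial state.
class SimpleIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    public synchronized void updateDocument(String field, String value,
                                            Map<String, String> newDoc) {
        docs.removeIf(d -> value.equals(d.get(field)));  // delete by term
        docs.add(newDoc);                                // add replacement
    }

    public synchronized int count(String field, String value) {
        return (int) docs.stream()
                .filter(d -> value.equals(d.get(field)))
                .count();
    }
}
```

After two calls to updateDocument("id", "42", ...), at most one document
with id 42 remains, which is exactly the wrap-delete-and-add behavior
described above.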

-Original Message-
From: Doron Cohen [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 27, 2007 3:58 PM
To: java-user@lucene.apache.org
Subject: Re: Update documents

WATHELET Thomas wrote:

> Is it possible to update a document's field without deleting the
> document and adding it again to the index?

Not really... see the FAQ, especially "How do I update a document or a
set
of documents that are already indexed?", and also see javadocs for
IndexWriter's updateDocument() methods.








Re: Lucene as primary object storage

2007-06-27 Thread Mohammad Norouzi

Hi Karl,
We did something like Hibernate to map an object (entity) to Lucene by
defining a bunch of annotations, just like the Limax project (which, as far
as I know, is led by you). The only problem we had was how to make a
relationship between two or more separate indexes. I managed to resolve it,
but I don't think it's a very good idea; if only Lucene had some feature to
facilitate this :)
We use these indexes for generating dynamic reports, and we are going to
create a database crawler to surf the DB and find new or deleted records so
we can update the index files.
Our application uses only index files to persist the information coming
from the DB, and also uses that index as a resource.

You are welcome to ask if you want to know how to make a relationship
between two or more indexes.

Good Luck

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


Re: Highlighter that works with phrase and span queries

2007-06-27 Thread Mark Miller



markharw00d wrote:


I was thinking along the lines of wrapping some core classes such as 
IndexReader to somehow observe the query matching process and deduce 
from that what to highlight (avoiding the need for MemoryIndex)  but 
I'm not sure that is viable. It would be nice to get some more match 
info out of the main query logic as it runs to aid highlighting rather 
than reverse engineering the basis of a match after the event. 


I have been thinking about a way to pursue this, and it does not seem 
clear that there is a nice solution. Even if you could wrap Querys or 
other classes to observe matched tokens (non-trivial, since a Query is 
only concerned with whether it matches a doc, not which tokens it matches 
at which positions), you would still have the major problem of which 
matches you keep information for. It does not seem practical to save all 
of the information needed to highlight *any* doc after a search, and it 
also seems unlikely that you would know which docs you wanted to 
highlight before the search. The only compromise I can see is maybe just 
storing info to highlight the first n docs, but even there, while the 
scoring is occurring you do not yet know the return order. Also, there is 
probably little value in knowing which Tokens were matches for 
highlighting unless you have stored offsets as well.


Unless someone has any suggestions on how to accomplish this, I think 
time would be better spent improving the existing Highlighter framework.


Perhaps Ronnie's Highlighter should be added as an alternate Highlighter 
that is less feature rich but much faster on large docs. It looks to me 
like there is unlikely to be a faster Highlighting method for simple 
non-position aware highlighting.


- Mark




several existential issues about Lucene's filesystem

2007-06-27 Thread Samuel LEMOINE

Hi everyone !

I'm working on bibliographical researches on Lucene as an intern in 
Lingway (which uses Lucene in its main product), and I'm currently 
studying Lucene's file system.
There are several things I don't understand in Lucene's file system, and I 
thought this was the right place to ask those questions (I hope that's 
actually the case).

The main resource I used is this document:
http://lucene.apache.org/java/2_1_0/fileformats.html

-in the .tvf file (Term Vector file) in Lucene 2.2.0, positions & offsets 
can optionally be stored in the term vector... I don't understand how this 
works, since there's only one .tvf per segment (according to what I've 
understood), and in the architecture described, there is no information 
given about which documents each term stored in the TermVector appears in 
(the document-related information is in the .tvd file, I assume). The 
position/offset information seems to be simply a list of addresses, but how 
is the document it refers to known? Or is there one .tvf file per document?


-in the .prx file (positions file), payloads are mentioned and allow 
attaching metadata... what's the purpose of such data? Is there a specific 
use, or is it only data for the user's own purposes?


-many addresses in many files are stored as deltas... Doesn't that slow 
down searching the index? I mean, when a keyword is looked for, in order 
to find its position in the right file, Lucene must find the address of 
the previous term and add the "delta"... but the previous term's address 
is also given as a delta, and so on, so that as far as I understand it, 
the whole file must be climbed back, recursively finding the address of 
each term... I assume I've misunderstood something, but I don't know what.


I apologize for the length of my mail, and for my approximate English...
Thanks a lot for reading, and far more for answering ^^

Samuel




Re: Highlighter that works with phrase and span queries

2007-06-27 Thread Mark Miller
Depending on what these guys are doing, here is another possibility if 
TermOffsets and Ronnie's highlighter are not an option.


If you are highlighting whole documents (NullFragmenter) or are not very 
concerned about the fragments you get back, you can change the line in 
the Highlighter at about 255:


    tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));

    TO:

    float score = fragmentScorer.getTokenScore(token);
    if (score > 0) {
        tokenGroup.addToken(token, score);
    }

This is not a full solution yet, but more of a hack. Fragmenters won't 
be given the opportunity to start a new Fragment at every token 
position...no problem if you are highlighting the whole document.


Essentially, instead of the document being rebuilt from the source text 
using each individual token, it is rebuilt from the highlighted tokens and 
the differences in offsets between them. Not so fragment-happy without 
some Fragmenter handling changes.


On a collection of  5,000 documents,  300-900 tokens (weighted toward 
300), this gave an improvement of 37-40%. I imagine the gains grow as 
the document grows.


I am looking into making this a more general solution, but it's a great 
quick hack for speed. It will also work with my SpanScorer that 
correctly highlights Spans and PhraseQuerys.


- Mark

Otis Gospodnetic wrote:

Hi Mark,

I know one large user (meaning: high query/highlight rates) of the current 
Highlighter and this user wasn't too happy with its performance.  I don't know 
the details, other than it was inefficient.  So now I'm wondering if you've 
benchmarked your Highlighter against that/current Highlighter to see not only 
which one is more accurate, but also which one is faster, and by how much?

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 20, 2007 12:39:27 AM
Subject: Highlighter that works with phrase and span queries

I have been working on extending the Highlighter with a new Scorer that 
correctly scores phrase and span queries. The highlighter is working 
great for me, but could really use some more banging on.


If you have a need or an interest in a more accurate Highlighter, please 
give it a whirl and let me know how it went. Unlike most of the other 
alternate Lucene Highlighters, this one builds off the original contrib 
Highlighter so as to retain all of its goodness.


http://myhardshadow.com/qsolreleases/lucene-highlighter-2.2.jar

Example Usage

IndexSearcher searcher = new IndexSearcher(ramDir);
Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
query = query.rewrite(reader); // required to expand search terms
Hits hits = searcher.search(query);

for (int i = 0; i < hits.length(); i++) {
    String text = hits.doc(i).get(FIELD_NAME);
    CachingTokenFilter tokenStream = new CachingTokenFilter(
        analyzer.tokenStream(FIELD_NAME, new StringReader(text)));
    Highlighter highlighter = new Highlighter(
        new SpanScorer(query, FIELD_NAME, tokenStream));

    tokenStream.reset();

    // Get the 3 best fragments and separate them with "..."
    String result = highlighter.getBestFragments(tokenStream, text, 3, "...");

    System.out.println(result);
}

If you make a call to any of the getBestFragments() methods more than 
once, you must call reset() on the SpanScorer between each call.


Pass null as the FIELD_NAME to ignore fields.

If you want to Highlight the whole document, use a NullFragmenter.








  





Re: JavaCC Download

2007-06-27 Thread Steven Rowe
Hi,

I don't know how to access the CA certificate for the web server at
javacc.dev.java.net - my browser automatically does this for me.

Here's an alternate route - I found another javacc-4.0.zip at another
location, and the file I downloaded from there yesterday matched exactly
the version I got from javacc.dev.java.net:

http://atlas.ucpel.tche.br/~dubois/compiladores/javacc-4.0.zip

Good luck,
Steve


Mahdi Rahimi wrote:
> How can I access the certificate of this site? 
> 
> Steven Rowe wrote:
>> I don't think you need to register - I am not registered and I can
>> download from there.
>>
>> My guess is that Mahdi Rahimi's browser doesn't know how to speak the
>> HTTPS protocol.
>>
>> Here's an invocation of wget (I have version 1.10.2) that works for me
>> to get the .zip archive (all on one line):
>>
>> wget --no-check-certificate
>> https://javacc.dev.java.net/files/documents/17/26777/javacc-4.0.zip
>>
>> Or if you want the .tar.gz archive:
>>
>> wget --no-check-certificate
>> https://javacc.dev.java.net/files/documents/17/26776/javacc-4.0.tar.gz
>>
>> jiang jialin wrote:
>>> you must register first
>>>
>>> 2007/6/23, Mahdi Rahimi <[EMAIL PROTECTED]>:

 Hi Steven.

 When I access this address, this message appeared:

 Forbidden
 You don't have permission to access /servlets/ProjectHome on this
 server.

 What's the problem?

 Thanks.


 Steven Rowe wrote:
> Mahdi Rahimi wrote:
>> Hi.
>>
>> How can I access JavaCC??
>>
>> Thanks
> https://javacc.dev.java.net/


-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp




Rewrite one phrase to another in search query

2007-06-27 Thread Aliaksandr Radzivanovich

What if I need to search for synonyms, but synonyms can be expanded to
phrases of several words?
For example, user enters query "tcp", then my application should also
find documents containing phrase "Transmission Control Protocol". And
conversely, user enters "Transmission Control Protocol", then my
application should also find documents with word "tcp".

It seems like Lucene does not support this scenario out of the box.
Where should I look for a solution? What Lucene
extensions/classes/interfaces should I investigate?

Thanks.




Payloads and PhraseQuery

2007-06-27 Thread Peter Keegan

I'm looking at the new Payload api and would like to use it in the following
manner. Meta-data is indexed as a special phrase (all terms at same
position) and a payload is stored with the first term of each phrase. I
would like to create a custom query class that extends PhraseQuery and uses
its PhraseScorer to find matching documents. The custom query class then
reads the payload from the first term of the matching query and uses it to
produce a new score. However, I don't see how to get the payload from the
PhraseScorer's TermPositions. Is this possible?


Peter


Re: Rewrite one phrase to another in search query

2007-06-27 Thread Erick Erickson

The synonym analyzer shown in Lucene In Action is a good place
to start. You need to change *all* occurrences of one form into
another, both at index and at search time, to get consistent results.

There are some "interesting" implications for this, though they
only really need to be considered if you need either phrase or
span queries. For instance, let's say you have the following doc
fragments:
doc1: "this is a tcp interaction that I want to deal with"
doc2: "this is a transmission control protocol interaction that I want to
deal with"

is "this" within 4 of "interaction" in both documents? Do you care?

Also, does the phrase "transmission control protocol" match the
first document? Would the user be confused by matching a document
with "tcp" in it for that phrase?

For that matter, does searching on "transmission" match doc1?
Mostly, these are issues that may or may not be relevant depending
on the intent of the application...

Highlighting also becomes interesting.
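The canonical-form trick described above (rewriting every surface form of a
synonym set to one token, at both index and search time) can be sketched in
plain Java. The SynonymCanonicalizer class and its table below are
hypothetical illustrations, not Lucene's SynonymAnalyzer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Map every surface form of a synonym set (including multi-word
// phrases) to one canonical token. Applied at both index and query
// time, "tcp" and "transmission control protocol" then match.
class SynonymCanonicalizer {
    private final Map<String, String> table = new LinkedHashMap<>();

    SynonymCanonicalizer() {
        // Longer phrases first, so they win over single-word matches.
        table.put("transmission control protocol", "tcp");
    }

    String canonicalize(String text) {
        String result = text.toLowerCase();
        for (Map.Entry<String, String> e : table.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Note that this makes both doc1 and doc2 above read "... a tcp interaction
...", which is exactly why "this" ends up within 4 of "interaction" in both.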

Best
Erick


On 6/27/07, Aliaksandr Radzivanovich <[EMAIL PROTECTED]> wrote:


What if I need to search for synonyms, but synonyms can be expanded to
phrases of several words?
For example, user enters query "tcp", then my application should also
find documents containing phrase "Transmission Control Protocol". And
conversely, user enters "Transmission Control Protocol", then my
application should also find documents with word "tcp".

It seems like Lucene does not support this scenario out of the box.
Then where to look for the solution? What Lucene
extensions/classes/interfaces should I investigate?

Thanks.





Re: Rewrite one phrase to another in search query

2007-06-27 Thread Steven Rowe
Hi Aliaksandr,

Aliaksandr Radzivanovich wrote:
> What if I need to search for synonyms, but synonyms can be expanded to
> phrases of several words?
> For example, user enters query "tcp", then my application should also
> find documents containing phrase "Transmission Control Protocol". And
> conversely, user enters "Transmission Control Protocol", then my
> application should also find documents with word "tcp".

Section 4.6 of Gospodnetić & Hatcher's excellent _Lucene_in_Action_[1]
describes a SynonymAnalyzer class, intended for use at indexing time
(AFAICT, however, their approach does not address multi-word synonyms).
Although a query-time analyzer is not directly discussed, they do say
(on p. 134):

   The awkwardly named PhrasePrefixQuery (see section 5.2)
   is one option to consider, perhaps created through an
   overridden QueryParser.getFieldQuery method; this is a
   possible option to explore if you wish to implement
   synonym injection at query time.

Steve

[1] http://lucenebook.com/

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp




Re: Highlighter that works with phrase and span queries

2007-06-27 Thread mark harwood
>>you would still have the major problem of which matches do you keep 
>>information for

Yes, doing this efficiently is the main issue. Some vague thoughts I had:
1) A special HighlightObserverQuery could wrap any query and use its rewrite 
method to further wrap child component queries if necessary.
2) A ThreadLocal could be used to contain low-level match info generated by 
child query components e.g. position info of phrase/span queries (maybe 
generatable by a HighlightingIndexReader wrapper which observed TermPositions 
access)
3) For each call to scorer.next() on the top level query, the HighlightObserver 
class would check to see if the doc was a "keeper" (i.e. its score places it 
in the required top "n" docs PriorityQueue) and if so, would retain a copy of 
all the transient match info currently held in the ThreadLocal for this doc and 
associate it with the new TopDoc object placed in the top docs PriorityQueue.

This approach tries hard not to require changes to existing Query/scorer 
classes by using wrappers/ThreadLocals and would only hold low-level match 
highlighting info for N documents where "N" is the maximum number of results to 
be returned. 
However there are likely to be many detailed complications with implementing 
this. I haven't pursued this train of thought further because the main killer 
is likely to be the performance overhead from all the unnecessary object 
creation when generating match info objects for documents that don't make the 
final selection anyway. That and the cost of synchronization around ThreadLocal 
accesses.

I think we're right to stick with the existing highlighting approach of 
searching for the top N docs then re-considering the basis of the match for 
just these few docs.

Cheers
Mark




- Original Message 
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 27 June, 2007 12:59:21 PM
Subject: Re: Highlighter that works with phrase and span queries



markharw00d wrote:
>
> I was thinking along the lines of wrapping some core classes such as 
> IndexReader to somehow observe the query matching process and deduce 
> from that what to highlight (avoiding the need for MemoryIndex)  but 
> I'm not sure that is viable. It would be nice to get some more match 
> info out of the main query logic as it runs to aid highlighting rather 
> than reverse engineering the basis of a match after the event. 

I have been thinking about a way to pursue this, and it does not seem 
clear that there is a nice solution. Even if you could wrap Querys or 
other classes to observe matched tokens (non trivial since a Query is 
only concerned if it matches a doc, not which tokens it matches at which 
positions), you would still have the major problem of which matches do 
you keep information for. It does not seem practical to save all of the 
information to highlight *any* doc after a search and it also seems 
unlikely that you would know which docs you wanted to highlight before 
the search. The only compromise that I can see is maybe just storing 
info to highlight the first n docs, but even here, while the scoring is 
occurring you do not yet know the return order. Also, there is probably 
little value in knowing which Tokens were matches for highlighting 
unless you have stored offsets as well.

Unless someone has any suggestions on how to accomplish this, I think 
time would be better spent improving the existing Highlighter framework.

Perhaps Ronnie's Highlighter should be added as an alternate Highlighter 
that is less feature rich but much faster on large docs. It looks to me 
like there is unlikely to be a faster Highlighting method for simple 
non-position aware highlighting.

- Mark











Re: Question about search

2007-06-27 Thread tanya

Hi,

>Have you used Luke to examine your index and try queries? This will tell you a 
>LOT about what's *really* happening.
>Google 'lucene' 'luke' and try it.


I've tried Luke but still have no clue what is going on:
I have the following entry:

2007-06-26T10:56:20-05:00  globus-gatekeeper:  PID: 15986 -- Notice: 5: 
Authorized as local uid: 12967


While searching in Luke with StandardAnalyzer, I can find results for
+uid +12967

but get "No Results" for
+PID +15986

Any idea?
Thanks,

Tanya




Re: Highlighter that works with phrase and span queries

2007-06-27 Thread Paul Elschot
On Wednesday 27 June 2007 17:17, mark harwood wrote:
> >>you would still have the major problem of which matches do you keep 
information for
> 
> Yes, doing this efficiently is the main issue. Some vague thoughts I had:
>...
> 3) For each call to scorer.next() on the top level query, the 
HighlightObserver class would check to see if the doc was a "keeper" (i.e. 
it's score places it in the required top "n" docs PriorityQueue) and if so, 
would retain a copy of all the transient match info currently held in the 
ThreadLocal for this doc and associate it with the new TopDoc object placed 
in the top docs PriorityQueue.

This can be done more efficiently by skipping the Spans themselves to
the next document for which the matches need to be kept. For each doc,
the Spans could then be copied by iterating until the next matching doc
in the search.
Even better would be to use a Filter in the search to limit the results to the 
matches that are immediately needed, but a Filter still requires a BitSet
over all indexed documents, and that is probably overkill for highlighting.
Iterating the Spans will be in doc number order, so some mapping back
to the scored order would still be needed.

I have not looked at any highlighting code yet. Is there already an extension
of PhraseQuery that has getSpans() ?

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question about search

2007-06-27 Thread Erick Erickson

Please take the time, before asking others "what's going on" to at
least format your mail so we can tell what's what. For instance,
what's a field and what's a value in what you sent? I sure can't
tell because there are so many colons. Remember that you're
asking people to contribute time to solve *your* problem so it
would be a good idea to do us the courtesy of taking some time
to make it easier rather than pasting what looks like a log
file entry and expecting us to "just know" what it means.

I can say that your Luke entries are incorrect. Assuming what you're
trying to find is value 15986 in a field PID, the correct form would be
+PID:15986. Which indicates you haven't read the lucene query
syntax documentation very carefully. See
http://lucene.apache.org/java/docs/queryparsersyntax.html


+PID +15986 will look for "PID" and "15986" in whatever the
default field is, which you can identify by looking at the Luke
search page carefully.

None of which may be relevant if there is only one field
called "globus-gatekeeper".

And what analyzer did you use to index the data? And what
was the data you indexed?

Best
Erick

On 6/27/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:



Hi,

>Have you used Luke to examine your index and try queries? This will tell
you a LOT about what's *really* happening.
>Google 'lucene' 'luke' and try it.


I've tried Luke but still have no clue what is going on:
I have the following entry:

2007-06-26T10:56:20-05:00  globus-gatekeeper:  PID: 15986 -- Notice: 5:
Authorized as local uid: 12967


While searching  in Luke with StandardAnalyzer I can find
+uid +12967

but "No Results"
+PID +15986

Any idea?
Thanks,

Tanya





Re: several existential issues about Lucene's filesystem

2007-06-27 Thread Grant Ingersoll


On Jun 27, 2007, at 8:51 AM, Samuel LEMOINE wrote:


Hi everyone !

I'm working on bibliographical researches on Lucene as an intern in  
Lingway (which uses Lucene in its main product), and I'm currently  
studying Lucene's file system.
There are several things I don't catch in Lucene's file system, and  
I thought here was the right place to ask about those questions (I  
hope it's the case actually).

The main resource I used is this document:
http://lucene.apache.org/java/2_1_0/fileformats.html

-in the .tvf file (Term Vector file) in Lucene 2.2.0, position &  
offsets can be possibly given in the term vector... I don't  
understand how it works, since there's only one .tvf per segment  
(according to what I've understood), and in the architecture  
described, there is no information given about the documents in  
which appears each term stored in the TermVector (the informations  
document-related are in the .tvd file I assume). The position/ 
offset informations seems to be simply a list of addresses, but how  
can be known the document it refers to? Or is there one .tvf file  
per document?


Yes, offsets and positions can be associated with a term vector.
When you ask the IndexReader for a term vector, you give it the
document number and, optionally, a field, which it uses to look up
the document's location in the tvd file. The tvd entry then points to
the specific information in the tvf file. Have a look at
TermVectorsReader for the implementation details.


-in the .prx file (prositions file), payloads are mentionned and  
allow to attach meta-data... what's the purpose of such data? is  
there a precise use, or is it only data for the sole user's use?


Payloads have a variety of uses.  Search the java-dev archive for the
word Payload and you will find lots of discussion.  I also have a few
slides on it in my ApacheCon Europe presentation at
http://cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
See also http://wiki.apache.org/jakarta-lucene/Payload_Planning


Essentially, it can be used to store information on a term by term  
level, things like font weight, or XML enclosing tag, or Part of  
Speech.  The sky really is the limit (that and your disk space) on  
what can be stored in a payload.





-many adresses in many files are given under Delta shapes...  
Doesn't it slacken the search among the index ? I mean, when a  
keyword is looked for, in order to find its position in the right  
file, Lucene must find the adress of the previous term and add the  
"delta" address... but the previous term adress is also given by a  
delta address, and so on, so that as far as I understand it, the  
whole file must be climbed back, recursively finding the address of  
each term... I assume I've misunderstood something, but don't know  
what.


Not quite sure what you are asking, but I will take a stab at it.   
Have a look at the section on the Term Dictionary, specifically the  
relationship between the tis file and the tii file.  The storage  
mechanism makes it very easy to find where the keyword is in the file  
so that the rest of the information can be easily looked up.
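To the delta question specifically: decoding is a single forward pass that
keeps a running total, not a recursive climb back through the file. A
minimal sketch of the idea (illustrative only, not Lucene's file-format
code):

```java
// Decode delta-coded addresses: each stored value is the difference
// from the previous absolute address, so one forward pass with a
// running sum recovers every absolute address in order.
class DeltaDecoder {
    static long[] decode(long[] deltas) {
        long[] absolute = new long[deltas.length];
        long current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            absolute[i] = current;
        }
        return absolute;
    }
}
```

Since the term dictionary is read sequentially anyway, with the tii index
supplying periodic absolute entry points, the running sum comes for free;
nothing ever has to walk backwards through the file.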


HTH,
Grant


--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Highlighter that works with phrase and span queries

2007-06-27 Thread Mark Miller



I have not looked at any highlighting code yet. Is there already an extension
of PhraseQuery that has getSpans() ?
  

Currently I am using this code originally by M. Harwood:
    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();
    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];

    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }

    SpanNearQuery sp = new SpanNearQuery(clauses,
            ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit distance, but 
it approximates extremely well in most cases.


I wonder if this approach to Highlighting would be worth it in the end. 
Certainly, it would seem to require that you store offsets or you would 
have to re-tokenize anyway.


Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed in the current Highlighter implementation if 
we grab from the source text in bigger chunks. Ronnie's Highlighter 
appears to be faster than the original for two reasons: it doesn't have 
to re-tokenize text, and it rebuilds the original document in large 
pieces. Depending on how you want to look at it, it loses most of the 
speed gained from looking only at the Query tokens (instead of all 
tokens) to pulling the Term offset information (which appears pretty slow).


If you use a SimpleAnalyzer on docs around 1800 tokens long, you can 
actually match the speed of Ronnies highlighter with the current 
highlighter if you just rebuild the highlighted documents in bigger 
pieces i.e. instead of going through each token and adding the source 
text that it covers, build up the offset information until you get 
another hit and then pull from the source text into the highlighted text 
in one big piece rather than a tokens worth at a time. Of course this is 
not compatible with the way the Fragmenter currently works. If you use 
the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter 
wins because it takes so darn long to re-analyze.
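The chunked rebuild described above can be sketched as a standalone helper (hypothetical code, not the actual Highlighter): accumulate token offsets while they stay close together, and only when a real gap appears copy one big slice of the source text rather than one slice per token:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkCopy {
    /** offsets[i] = {startOffset, endOffset} of the i-th token to highlight,
     *  sorted by start. Spans separated by at most maxGap characters are
     *  merged so each contiguous run is copied from source exactly once. */
    static List<String> copyInChunks(String source, int[][] offsets, int maxGap) {
        List<String> chunks = new ArrayList<>();
        if (offsets.length == 0) return chunks;
        int start = offsets[0][0];
        int end = offsets[0][1];
        for (int i = 1; i < offsets.length; i++) {
            if (offsets[i][0] - end <= maxGap) {
                end = Math.max(end, offsets[i][1]);       // extend the current run
            } else {
                chunks.add(source.substring(start, end)); // flush one big piece
                start = offsets[i][0];
                end = offsets[i][1];
            }
        }
        chunks.add(source.substring(start, end));
        return chunks;
    }
}
```

With whitespace-separated tokens and maxGap = 1, a run of adjacent hits comes out as a single substring call instead of one per token, which is where the speed-up comes from.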


It is also interesting to note that it is very difficult to see a 
gain from using TokenSources to build a TokenStream. Using the 
StandardAnalyzer, it takes docs of around 1800 tokens just to be as fast 
as re-analyzing. Notice I didn't say faster, but "as fast". Anything 
smaller, or with a simpler analyzer, and TokenSources is 
certainly not worth it. It just takes too long to pull the TermVector info.


- Mark






Re: Rewrite one phrase to another in search query

2007-06-27 Thread Chris Hostetter

: (AFACT, however, their approach does not address multi-word synonyms).
: Although a query-time analyzer is not directly discussed, they do say

Solr has a SynonymFilter that does handle multi-word synonyms, and it
can handle query-time synonyms, but there are some caveats to both of
those use cases (mainly that you can have one or the other, but not both)
that you need to consider carefully.  They are well documented in the Solr
wiki...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter




-Hoss





indexing anchor text

2007-06-27 Thread Tim Sturge

Hi,

I'm trying to index some fairly standard HTML documents. For each of the 
documents, there is a unique <title> (which I believe is generally of 
high quality), some <body> content, and some anchor text from the 
linking documents (which is of good but more variable quality).


I'm indexing them into "title", "anchor" and "body" fields.

"title" and "body" are obvious (you just give the text to the 
StandardAnalyzer) but I don't really know how to handle the anchor text. 
Suppose the page with the title "United States" I know has the anchor 
text "USA" 500 times, "United States" 200 times, "United States of 
America" 100 times and "Unite Stats" once.


How do I index this?

1) index a single "anchor" field containing "USA United States United 
States of America Unite Stats",
2) create the field  "USA USA ...500x... USA  United States ...200x... 
United States ... " and index that as "anchor"

3) create 801 "anchor" fields (500 containing "USA", etc.)
4) create 4 "anchor" fields and call setBoost() on each with some 
constants. (how do I calculate them?)


I suspect these give me different results in some way, but I'm having 
trouble understanding what the difference between 2) and 3) is and how 
to make 4) work like 3). I also worry that 2) and 3) are much slower 
than they need to be.


Any help is appreciated,

Tim







breaking a single index in to two indexes

2007-06-27 Thread Les Fletcher
I am in need of some help with the following problem.  I have a single 
index that I am currently searching against, but it has the property 
that a small set of the documents get updated frequently while a large 
majority of them are very static and are rarely updated.  Documents can 
move from being a static type document to one that begins to be updated 
on a frequent basis.  I'd like to break this up into two separate 
indexes with one large one being the index of the static documents and a 
smaller one of the constantly updated documents. 

I am looking for some help on optimal update policies for each index and 
managing of migration of a document from the static index to the active 
index.  Hopefully this should allow me to make much better use of cached 
filters.


I am sure that someone else has run into a very similar problem.  I did 
some poking around, but was obviously not searching for the right 
thing.  Any suggestions or insight into dealing with this would be 
greatly appreciated.


Thanks,
Les




Re: indexing anchor text

2007-06-27 Thread Erick Erickson

Well, to quote the great wise one, "that depends". The reason I'm
being flippant here is that what it depends on is what you want
the result to be.

I'm asking for a use-case scenario here. Something like
"I want the docs to score equally no matter how many
links with 'United States' exist in them". Or
"A document with 100 links mentioning 'United States' should
score way higher than a document with only one link mentioning
'United States'".

Best
Erick









Re: Payloads and PhraseQuery

2007-06-27 Thread Mark Miller
You cannot do it because TermPositions is read in the 
PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and 
loaded into an array which is passed to PhraseScorer. One possibility 
is to extend the Weight as well and pass the payloads through to the Scorer.


- Mark

Peter Keegan wrote:
I'm looking at the new Payload api and would like to use it in the 
following

manner. Meta-data is indexed as a special phrase (all terms at same
position) and a payload is stored with the first term of each phrase. I
would like to create a custom query class that extends PhraseQuery and 
uses

its PhraseScorer to find matching documents. The custom query class then
reads the payload from the first term of the matching query and uses 
it to

produce a new score. However, I don't see how to get the payload from the
PhraseScorer's TermPositions. Is this possible?


Peter






Re: Payloads and PhraseQuery

2007-06-27 Thread Grant Ingersoll
Could you get what you need by combining the BoostingTermQuery with a  
SpanNearQuery to produce a score?  Just guessing here...


At some point, I would like to see more Query classes around the  
payload stuff, so please submit patches/feedback if and when you get  
a solution.




--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ






Re: indexing anchor text

2007-06-27 Thread Tim Sturge
Case B -- I believe the more inbound anchor text, the better the match. 
Right now I'm also boosting the documents by calling


setBoost( log( numInboundLinks+1 ) + 1 )

which seems to be quite effective; is there some sort of guidebook for this?

I'm also interested in figuring out how to set the relative boosts for 
title vs body vs anchor; this seems to be 90% black magic to me.
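For what it's worth, the formula above is easy to inspect in isolation (hypothetical helper name, just the arithmetic): it yields 1.0 for a page with no inbound links and grows logarithmically, so a page with 500 links is favored over one with 100 without swamping the text score:

```java
public class AnchorBoost {
    /** Damped document boost from the inbound-link count:
     *  1.0 at zero links, then logarithmic growth. */
    static float boost(int numInboundLinks) {
        return (float) (Math.log(numInboundLinks + 1) + 1);
    }
}
```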


Thanks,

Tim







