Re: JavaCC Download
How can I access the certificate of this site?

Steven Rowe wrote:
> I don't think you need to register - I am not registered and I can download from there.
>
> My guess is that Mahdi Rahimi's browser doesn't know how to speak the HTTPS protocol.
>
> Here's an invocation of wget (I have version 1.10.2) that works for me to get the .zip archive (all on one line):
>
> wget --no-check-certificate https://javacc.dev.java.net/files/documents/17/26777/javacc-4.0.zip
>
> Or if you want the .tar.gz archive:
>
> wget --no-check-certificate https://javacc.dev.java.net/files/documents/17/26776/javacc-4.0.tar.gz
Update documents
Hi, is it possible to update a document's field without deleting the document and adding it again to the index?
RE: Update documents
Perhaps it is not possible once you have written the document to the index.

Andy
Re: Update documents
WATHELET Thomas wrote:
> Is it possible to update a document's field without deleting the document and adding it again to the index?

Not really... see the FAQ, especially "How do I update a document or a set of documents that are already indexed?", and also see the javadocs for IndexWriter's updateDocument() methods.
RE: Update documents
In effect, IndexWriter's updateDocument() will first delete the documents containing the given term, then add the new document. It just wraps delete & add as a single thread-safe method.

Andy
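A minimal sketch of that delete-then-add call, assuming a Lucene 2.x index whose "id" field uniquely identifies each document (the index path, field names, and values here are made up for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateDemo {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);

            // Build the full replacement document; Lucene cannot change a
            // single field in place, so every field must be re-added.
            Document doc = new Document();
            doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", "the updated text", Field.Store.YES, Field.Index.TOKENIZED));

            // Deletes all documents matching id:doc-42, then adds doc.
            writer.updateDocument(new Term("id", "doc-42"), doc);
            writer.close();
        }
    }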
Re: Lucene as primary object storage
Hi Karl, we did something like Hibernate to map an object (entity) to Lucene by defining a bunch of annotations, much like the Limax project (which, as far as I know, is led by you). The only problem we had was how to make relationships between two or more separate indexes. I managed to resolve it, but I don't think it's a very good idea; if only Lucene had some feature to facilitate this :)

We use these indexes for generating dynamic reports, and we are going to create a database crawler that surfs the DB and finds new or deleted records so the index files can be updated. Our application uses only the index files to persist the information coming from the DB, and also uses those indexes as a resource. I'd be happy to explain how we make relationships between two or more indexes if you're interested.

Good luck

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/
Re: Highlighter that works with phrase and span queries
markharw00d wrote:
> I was thinking along the lines of wrapping some core classes such as IndexReader to somehow observe the query matching process and deduce from that what to highlight (avoiding the need for MemoryIndex) but I'm not sure that is viable. It would be nice to get some more match info out of the main query logic as it runs to aid highlighting rather than reverse engineering the basis of a match after the event.

I have been thinking about a way to pursue this, and it does not seem clear that there is a nice solution. Even if you could wrap Querys or other classes to observe matched tokens (non-trivial, since a Query is only concerned with whether it matches a doc, not which tokens it matches at which positions), you would still have the major problem of which matches you keep information for. It does not seem practical to save all of the information needed to highlight *any* doc after a search, and it also seems unlikely that you would know which docs you wanted to highlight before the search. The only compromise I can see is storing just the info needed to highlight the first n docs, but even there, while scoring is occurring you do not yet know the return order. Also, there is probably little value in knowing which tokens were matches for highlighting unless you have stored offsets as well.

Unless someone has suggestions on how to accomplish this, I think time would be better spent improving the existing Highlighter framework. Perhaps Ronnie's Highlighter should be added as an alternate Highlighter that is less feature-rich but much faster on large docs. It looks to me like there is unlikely to be a faster highlighting method for simple non-position-aware highlighting.

- Mark
several existential issues about Lucene's filesystem
Hi everyone! I'm working on bibliographical research on Lucene as an intern at Lingway (which uses Lucene in its main product), and I'm currently studying Lucene's file system. There are several things I don't catch in Lucene's file format, and I thought this was the right place to ask those questions (I hope that's actually the case). The main resource I used is this document: http://lucene.apache.org/java/2_1_0/fileformats.html

- In the .tvf file (term vector file) in Lucene 2.2.0, positions & offsets can optionally be given in the term vector... I don't understand how this works, since there's only one .tvf per segment (according to what I've understood), and in the architecture described there is no information about which documents each term stored in the term vector appears in (the document-related information is in the .tvd file, I assume). The position/offset information seems to be simply a list of addresses, but how can the document it refers to be known? Or is there one .tvf file per document?

- In the .prx file (positions file), payloads are mentioned and allow meta-data to be attached... what's the purpose of such data? Is there a precise use, or is it only data for the user's own purposes?

- Many addresses in many files are given as deltas... Doesn't that slow down searching the index? I mean, when a keyword is looked up, in order to find its position in the right file, Lucene must find the address of the previous term and add the "delta" address... but the previous term's address is also given by a delta, and so on, so that as far as I understand it, the whole file must be climbed back through, recursively finding the address of each term... I assume I've misunderstood something, but I don't know what.

I apologize for the length of my mail, and for the approximate English... Thanks a lot for reading, and far more for answering ^^

Samuel
Re: Highlighter that works with phrase and span queries
Depending on what these guys are doing, here is another possibility if TermOffsets and Ronnie's highlighter are not an option. If you are highlighting whole documents (NullFragmenter) or are not very concerned about the fragments you get back, you can change the line in the Highlighter at about line 255:

    tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));

to:

    float score = fragmentScorer.getTokenScore(token);
    if (score > 0) {
        tokenGroup.addToken(token, score);
    }

This is not a full solution yet, but more of a hack. Fragmenters won't be given the opportunity to start a new fragment at every token position... no problem if you are highlighting the whole document. Essentially, instead of the document being rebuilt from the source text using each individual token, it is rebuilt from the highlighted tokens and the differences in offsets between them. Not so fragment-happy without some Fragmenter handling changes. On a collection of 5,000 documents of 300-900 tokens (weighted toward 300), this gave an improvement of 37-40%. I imagine the gains grow as the document grows. I am looking into making this a more general solution, but it's a great quick hack for speed. It will also work with my SpanScorer that correctly highlights span and phrase queries.

- Mark

Otis Gospodnetic wrote:
> Hi Mark, I know one large user (meaning: high query/highlight rates) of the current Highlighter and this user wasn't too happy with its performance. I don't know the details, other than it was inefficient. So now I'm wondering if you've benchmarked your Highlighter against that/current Highlighter to see not only which one is more accurate, but also which one is faster, and by how much? Thanks, Otis

- Original Message -
From: Mark Miller <[EMAIL PROTECTED]>
Subject: Highlighter that works with phrase and span queries

I have been working on extending the Highlighter with a new Scorer that correctly scores phrase and span queries. The highlighter is working great for me, but could really use some more banging on. If you have a need or an interest in a more accurate Highlighter, please give it a whirl and let me know how it went. Unlike most of the other alternate Lucene Highlighters, this one builds off the original contrib Highlighter so as to retain all of its goodness.

http://myhardshadow.com/qsolreleases/lucene-highlighter-2.2.jar

Example usage:

    IndexSearcher searcher = new IndexSearcher(ramDir);
    Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
    query = query.rewrite(reader); // required to expand search terms
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        String text = hits.doc(i).get(FIELD_NAME);
        CachingTokenFilter tokenStream = new CachingTokenFilter(
            analyzer.tokenStream(FIELD_NAME, new StringReader(text)));
        Highlighter highlighter = new Highlighter(
            new SpanScorer(query, FIELD_NAME, tokenStream));
        tokenStream.reset();
        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
        System.out.println(result);
    }

If you make a call to any of the getBestFragments() methods more than once, you must call reset() on the SpanScorer between each call. Pass null as the FIELD_NAME to ignore fields. If you want to highlight the whole document, use a NullFragmenter.
Re: JavaCC Download
Hi, I don't know how to access the CA certificate for the web server at javacc.dev.java.net - my browser automatically handles this for me. Here's an alternate route: I found another javacc-4.0.zip at a different location, and the file I downloaded from there yesterday exactly matched the version I got from javacc.dev.java.net:

http://atlas.ucpel.tche.br/~dubois/compiladores/javacc-4.0.zip

Good luck,
Steve

Mahdi Rahimi wrote:
> How can I access the certificate of this site?

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Rewrite one phrase to another in search query
What if I need to search for synonyms, but the synonyms can expand to phrases of several words? For example, the user enters the query "tcp"; then my application should also find documents containing the phrase "Transmission Control Protocol". Conversely, the user enters "Transmission Control Protocol"; then my application should also find documents with the word "tcp". It seems that Lucene does not support this scenario out of the box. Where should I look for a solution? What Lucene extensions/classes/interfaces should I investigate? Thanks.
Payloads and PhraseQuery
I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible? Peter
Re: Rewrite one phrase to another in search query
The synonym analyzer shown in Lucene in Action is a good place to start. You need to change *all* occurrences of one form into another, both at index and search time, to get consistent results. There are some "interesting" implications to this, though they only really need to be considered if you use either phrase or span queries. For instance, say you have the following doc fragments:

doc1: "this is a tcp interaction that I want to deal with"
doc2: "this is a transmission control protocol interaction that I want to deal with"

Is "this" within 4 of "interaction" in both documents? Do you care? Also, is the phrase "transmission control protocol" a match for the first document? Would the user be confused by a document with "tcp" in it matching that phrase? For that matter, does searching on "transmission" match doc1? Mostly, these are issues that may or may not be relevant depending on the intent of the application... Highlighting also becomes interesting.

Best
Erick
Re: Rewrite one phrase to another in search query
Hi Aliaksandr,

Aliaksandr Radzivanovich wrote:
> What if I need to search for synonyms, but the synonyms can expand to phrases of several words? For example, the user enters the query "tcp"; then my application should also find documents containing the phrase "Transmission Control Protocol". Conversely, the user enters "Transmission Control Protocol"; then my application should also find documents with the word "tcp".

Section 4.6 of Gospodnetić & Hatcher's excellent _Lucene_in_Action_ [1] describes a SynonymAnalyzer class, intended for use at indexing time (AFAICT, however, their approach does not address multi-word synonyms). Although a query-time analyzer is not directly discussed, they do say (on p. 134):

  The awkwardly named PhrasePrefixQuery (see section 5.2) is one option to consider, perhaps created through an overridden QueryParser.getFieldQuery method; this is a possible option to explore if you wish to implement synonym injection at query time.

Steve

[1] http://lucenebook.com/

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
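For reference, the book's approach boils down to a TokenFilter that stacks synonyms at the same position as the original token. Here is a minimal sketch against the Lucene 2.x TokenStream API; the SynonymEngine lookup interface is an assumption for illustration, not part of Lucene itself:

    import java.io.IOException;
    import java.util.Stack;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Assumed lookup interface: maps a term to its synonyms.
    interface SynonymEngine {
        String[] lookup(String word);
    }

    public class SynonymFilter extends TokenFilter {
        private final Stack pending = new Stack(); // queued synonym Tokens
        private final SynonymEngine engine;

        public SynonymFilter(TokenStream in, SynonymEngine engine) {
            super(in);
            this.engine = engine;
        }

        public Token next() throws IOException {
            if (!pending.isEmpty()) {
                return (Token) pending.pop(); // emit queued synonyms first
            }
            Token token = input.next();
            if (token == null) {
                return null;
            }
            String[] synonyms = engine.lookup(token.termText());
            for (int i = 0; i < synonyms.length; i++) {
                Token syn = new Token(synonyms[i], token.startOffset(), token.endOffset());
                syn.setPositionIncrement(0); // stack the synonym at the same position
                pending.push(syn);
            }
            return token;
        }
    }

Because a synonym sits at position increment 0, a single-word synonym like "tcp" lines up with the original token for phrase queries; multi-word expansions like "transmission control protocol" still run into the caveats Erick and the book describe.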
Re: Highlighter that works with phrase and span queries
>> you would still have the major problem of which matches do you keep information for

Yes, doing this efficiently is the main issue. Some vague thoughts I had:

1) A special HighlightObserverQuery could wrap any query and use its rewrite method to further wrap child component queries if necessary.

2) A ThreadLocal could be used to contain low-level match info generated by child query components, e.g. position info of phrase/span queries (perhaps generated by a HighlightingIndexReader wrapper which observed TermPositions access).

3) For each call to scorer.next() on the top-level query, the HighlightObserver class would check whether the doc was a "keeper" (i.e. its score places it in the required top-"n"-docs PriorityQueue) and, if so, would retain a copy of all the transient match info currently held in the ThreadLocal for this doc and associate it with the new TopDoc object placed in the top-docs PriorityQueue.

This approach tries hard not to require changes to existing Query/Scorer classes by using wrappers/ThreadLocals, and would only hold low-level match highlighting info for N documents, where N is the maximum number of results to be returned. However, there are likely to be many detailed complications in implementing this. I haven't pursued this train of thought further because the main killer is likely to be the performance overhead from all the unnecessary object creation when generating match info objects for documents that don't make the final selection anyway. That, and the cost of synchronization around ThreadLocal accesses.

I think we're right to stick with the existing highlighting approach of searching for the top N docs and then re-deriving the basis of the match for just those few docs.

Cheers
Mark
Re: Question about search
Hi,

> Have you used Luke to examine your index and try queries? This will tell you a LOT about what's *really* happening. Google 'lucene' 'luke' and try it.

I've tried Luke but still have no clue what is going on. I have the following log entry:

2007-06-26T10:56:20-05:00 globus-gatekeeper: PID: 15986 -- Notice: 5: Authorized as local uid: 12967

While searching in Luke with StandardAnalyzer I can find

+uid +12967

but get "No Results" for

+PID +15986

Any idea?

Thanks,
Tanya
Re: Highlighter that works with phrase and span queries
On Wednesday 27 June 2007 17:17, mark harwood wrote:
> >> you would still have the major problem of which matches do you keep information for
>
> Yes, doing this efficiently is the main issue. Some vague thoughts I had:
> ...
> 3) For each call to scorer.next() on the top-level query, the HighlightObserver class would check whether the doc was a "keeper" (i.e. its score places it in the required top-"n"-docs PriorityQueue) and, if so, would retain a copy of all the transient match info currently held in the ThreadLocal for this doc.

This can be done more efficiently by skipping the Spans themselves to the next document for which the matches need to be kept. For each doc, the Spans could then be copied by iterating until the next matching doc in the search. Even better would be to use a Filter in the search to limit the results to the matches that are immediately needed, but a Filter still requires a BitSet over all indexed documents, and that is probably overkill for highlighting. Iterating the Spans will be in doc-number order, so some mapping back to the scored order would still be needed.

I have not looked at any highlighting code yet. Is there already an extension of PhraseQuery that has getSpans()?

Regards,
Paul Elschot
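A minimal sketch of the skipping idea Paul describes, using the Lucene 2.x Spans API (the span query and target doc number are assumed to come from the search results):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.Spans;

    public class SpanCollector {
        // Collect match positions for one result doc without iterating every doc.
        public static void collect(SpanQuery query, IndexReader reader, int targetDoc)
                throws IOException {
            Spans spans = query.getSpans(reader);
            if (!spans.skipTo(targetDoc)) {
                return; // no matches at or after targetDoc
            }
            while (spans.doc() == targetDoc) {
                // [spans.start(), spans.end()) is a matching position range
                System.out.println("match at positions " + spans.start() + "-" + spans.end());
                if (!spans.next()) {
                    break;
                }
            }
        }
    }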
Re: Question about search
Please take the time, before asking others "what's going on", to at least format your mail so we can tell what's what. For instance, what's a field and what's a value in what you sent? I sure can't tell, because there are so many colons. Remember that you're asking people to contribute time to solving *your* problem, so it would be a good idea to do us the courtesy of taking some time to make it easier, rather than pasting what looks like a log file entry and expecting us to "just know" what it means.

I can say that your Luke entries are incorrect. Assuming what you're trying to find is the value 15986 in a field PID, the correct form would be +PID:15986. Which indicates you haven't read the Lucene query syntax documentation very carefully. See http://lucene.apache.org/java/docs/queryparsersyntax.html

+PID +15986 will look for "PID" and "15986" in whatever the default field is, which you can identify by looking at the Luke search page carefully. None of which may be relevant if there is only one field called "globus-gatekeeper". And what analyzer did you use to index the data? And what was the data you indexed?

Best
Erick
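To see the difference Erick points out, here is a small sketch (Lucene 2.x QueryParser; the default field name "contents" is an assumption) that prints how each query string actually parses. Note also that StandardAnalyzer lowercases, so the query term "PID" becomes "pid":

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class ParseDemo {
        public static void main(String[] args) throws Exception {
            // "contents" stands in for whatever the index's default field is
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

            Query q1 = parser.parse("+PID +15986");
            System.out.println(q1); // both terms hit the default field: +contents:pid +contents:15986

            Query q2 = parser.parse("+PID:15986");
            System.out.println(q2); // now the PID field is actually searched
        }
    }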
Re: several existential issues about Lucene's filesystem
On Jun 27, 2007, at 8:51 AM, Samuel LEMOINE wrote:
> In the .tvf file (term vector file) in Lucene 2.2.0, positions & offsets can optionally be given in the term vector... I don't understand how this works, since there's only one .tvf per segment [...] how can the document it refers to be known? Or is there one .tvf file per document?

Yes, offsets and positions can be associated with a term vector. When you ask the IndexReader for a term vector, you give it the document number and, optionally, a field, which it uses to look up the document's location in the .tvd file. The .tvd file then points to the specific information in the .tvf file. Have a look at TermVectorsReader for the details of the implementation.

> In the .prx file (positions file), payloads are mentioned and allow meta-data to be attached... what's the purpose of such data?

Payloads have a variety of uses. Search the java-dev archive for the word "payload" and you will find lots of discussion. I also have a few slides on it in my ApacheCon Europe presentation at http://cnlp.org/presentations/slides/AdvancedLuceneEU.pdf See also http://wiki.apache.org/jakarta-lucene/Payload_Planning Essentially, a payload can be used to store information on a term-by-term level: things like font weight, the enclosing XML tag, or part of speech. The sky really is the limit (that, and your disk space) on what can be stored in a payload.

> Many addresses in many files are given as deltas... Doesn't that slow down searching the index? [...] the whole file must be climbed back through, recursively finding the address of each term.

Not quite sure what you are asking, but I will take a stab at it. Have a look at the section on the term dictionary, specifically the relationship between the .tis file and the .tii file. The storage mechanism makes it very easy to find where the keyword is in the file so that the rest of the information can be easily looked up.

HTH,
Grant

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
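To make the per-document lookup concrete, here is a minimal sketch of retrieving a term vector with positions and offsets (Lucene 2.2 API; the index path, document number, and field name are assumptions):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    public class TermVectorDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");

            // The document number is the key: the .tvd entry for doc 42
            // points at the right slice of the segment's shared .tvf file.
            TermFreqVector vector = reader.getTermFreqVector(42, "body");
            if (vector instanceof TermPositionVector) {
                TermPositionVector tpv = (TermPositionVector) vector;
                String[] terms = tpv.getTerms();
                for (int i = 0; i < terms.length; i++) {
                    int[] positions = tpv.getTermPositions(i);
                    TermVectorOffsetInfo[] offsets = tpv.getOffsets(i); // null if not stored
                    System.out.println(terms[i] + ": " + positions.length + " positions, "
                        + (offsets == null ? 0 : offsets.length) + " offsets");
                }
            }
            reader.close();
        }
    }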
Re: Highlighter that works with phrase and span queries
> I have not looked at any highlighting code yet. Is there already an extension of PhraseQuery that has getSpans()?

Currently I am using this code, originally by M. Harwood:

    Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();
    SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
    for (int i = 0; i < phraseQueryTerms.length; i++) {
        clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
    }
    SpanNearQuery sp = new SpanNearQuery(clauses, ((PhraseQuery) query).getSlop(), false);
    sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit distance, but it approximates it extremely well in most cases.

I wonder if this approach to highlighting would be worth it in the end. Certainly, it would seem to require that you store offsets, or you would have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods: we can gain a lot of speed in the current Highlighter implementation if we grab the source text in bigger chunks. Ronnie's Highlighter appears to be faster than the original due to two things: he doesn't have to re-tokenize text, and he rebuilds the original document in large pieces. Depending on how you want to look at it, he loses most of the speed gained from looking only at the query tokens instead of all tokens to pulling the term offset information (which appears pretty slow). If you use a SimpleAnalyzer on docs around 1800 tokens long, you can actually match the speed of Ronnie's highlighter with the current highlighter if you just rebuild the highlighted documents in bigger pieces, i.e. instead of going through each token and adding the source text it covers, build up the offset information until you get another hit and then pull from the source text into the highlighted text in one big piece rather than a token's worth at a time. Of course, this is not compatible with the way the Fragmenter currently works. If you use the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter wins, because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see a gain from using TokenSources to build a TokenStream. Using the StandardAnalyzer, it takes docs of about 1800 tokens just to be as fast as re-analyzing. Notice I didn't say fast, but "as fast". Anything smaller, or if you're using a simpler analyzer, and TokenSources is certainly not worth it. It just takes too long to pull the TermVector info.

- Mark
Re: Rewrite one phrase to another in search query
: (AFAICT, however, their approach does not address multi-word synonyms).
: Although a query-time analyzer is not directly discussed, they do say

Solr has a SynonymFilter that does handle multi-word synonyms, and it can handle query-time synonyms, but there are some caveats to both of those use cases (mainly that you can have one or the other but not both) that you need to consider carefully. They are well documented in the Solr wiki:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

-Hoss
indexing anchor text
Hi,

I'm trying to index some fairly standard HTML documents. For each of the documents, there is a unique title (which I believe is generally of high quality), some content, and some anchor text from the linking documents (which is of good but more variable quality). I'm indexing them in the fields "title", "anchor" and "body". "title" and "body" are obvious (you just give the text to the StandardAnalyzer), but I don't really know how to handle the anchor text.

Suppose I know the page with the title "United States" has the anchor text "USA" 500 times, "United States" 200 times, "United States of America" 100 times and "Unite Stats" once. How do I index this?

1) index a single "anchor" field containing "USA United States United States of America Unite Stats",
2) create the field "USA USA ...500x... USA United States ...200x... United States ..." and index that as "anchor",
3) create 801 "anchor" fields (500 containing "USA", etc.), or
4) create 4 "anchor" fields and call setBoost() on each with some constants (how do I calculate them?)

I suspect these give me different results in some way, but I'm having trouble understanding what the difference between 2) and 3) is, and how to make 4) work like 3). I also worry that 2) and 3) are much slower than they need to be.

Any help is appreciated,
Tim
breaking a single index into two indexes
I am in need of some help with the following problem. I have a single index that I am currently searching against, but it has the property that a small set of the documents get updated frequently while the large majority are very static and rarely updated. Documents can move from being static to being updated on a frequent basis. I'd like to break this up into two separate indexes: one large index of the static documents and a smaller one of the constantly updated documents. I am looking for some help on optimal update policies for each index and on managing the migration of a document from the static index to the active index. Hopefully this should allow me to make much better use of cached filters. I am sure that someone else has run into a very similar problem; I did some poking around, but was obviously not searching for the right thing. Any suggestions or insight into dealing with this would be greatly appreciated.

Thanks,
Les
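One common building block for this kind of split is to keep searching both indexes as a single logical index. A minimal sketch with Lucene 2.x's MultiSearcher (the index paths and query are hypothetical; the update and migration policies are still up to you):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class TwoIndexSearchDemo {
        public static void main(String[] args) throws Exception {
            // Large, rarely-updated index: reopen seldom, cache filters against it.
            IndexSearcher staticSearcher = new IndexSearcher("/indexes/static");
            // Small, frequently-updated index: cheap to reopen after each batch of updates.
            IndexSearcher activeSearcher = new IndexSearcher("/indexes/active");

            MultiSearcher searcher = new MultiSearcher(
                new Searchable[] { staticSearcher, activeSearcher });

            Hits hits = searcher.search(new TermQuery(new Term("body", "lucene")));
            System.out.println(hits.length() + " hits across both indexes");
            searcher.close();
        }
    }

The win Les is after comes from the fact that a cached filter only needs rebuilding when its own underlying index changes, and the static index changes rarely.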
Re: indexing anchor text
Well, to quote the great wise one, "that depends". The reason I'm being flippant here is that what it depends on is what you want the result to be. I'm asking for a use-case scenario here. Something like "I want the docs to score equally no matter how many links with 'United States' exist in them". Or "A document with 100 links mentioning 'United States' should score way higher than a document with only one link mentioning 'United States'".

Best
Erick
Re: Payloads and PhraseQuery
You cannot do it, because the TermPositions are read in the PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and loaded into an array which is passed to PhraseScorer. Extending the Weight as well, and passing the payload through to the Scorer, is a possibility.

- Mark

Peter Keegan wrote:
> I'm looking at the new Payload API and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at the same position) and a payload is stored with the first term of each phrase. [...] However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible?
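For reference, outside of a Scorer payloads can be read directly from TermPositions. A minimal sketch with the Lucene 2.2 payload API (the index path, field, and term names are made up for illustration):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    public class PayloadReadDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");

            // "meta:phraseStart" stands in for the first term of the special phrase.
            TermPositions tp = reader.termPositions(new Term("meta", "phraseStart"));
            while (tp.next()) {
                int freq = tp.freq();
                for (int i = 0; i < freq; i++) {
                    tp.nextPosition(); // must advance before reading the payload
                    if (tp.isPayloadAvailable()) {
                        byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                        // use payload to recompute the score for document tp.doc()
                        System.out.println("doc " + tp.doc() + ": " + payload.length + " payload bytes");
                    }
                }
            }
            tp.close();
            reader.close();
        }
    }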
Re: Payloads and PhraseQuery
Could you get what you need by combining the BoostingTermQuery with a SpanNearQuery to produce a score? Just guessing here... At some point, I would like to see more Query classes around the payload stuff, so please submit patches/feedback if and when you get a solution.

Grant

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
Re: indexing anchor text
Case B -- I believe the more inbound anchor text, the better the match. Right now I'm also boosting the documents by calling setBoost(log(numInboundLinks + 1) + 1), which seems to be quite effective; is there some sort of guidebook for this? I'm also interested in figuring out how to weight the boosts for title vs. body vs. anchor; this seems to be 90% black magic to me.

Thanks,
Tim

Erick Erickson wrote:
> Well, to quote the great wise one, "that depends". [...] Something like "I want the docs to score equally no matter how many links with 'United States' exist in them". Or "A document with 100 links mentioning 'United States' should score way higher than a document with only one link mentioning 'United States'".
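To tie this back to option (4) from the original message: here is a sketch of adding one boosted "anchor" field per distinct anchor string (Lucene 2.x; the log-damping mirrors what Tim describes, but the constants are guesses, not a recipe):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class AnchorFieldDemo {
        // Adds one boosted "anchor" field per distinct anchor string.
        public static void addAnchors(Document doc, String[] anchors, int[] counts) {
            for (int i = 0; i < anchors.length; i++) {
                Field anchor = new Field("anchor", anchors[i],
                    Field.Store.NO, Field.Index.TOKENIZED);
                // Damp raw counts so 500x "USA" doesn't swamp everything else.
                anchor.setBoost((float) (Math.log(counts[i] + 1) + 1));
                doc.add(anchor);
            }
        }
    }

One caveat worth knowing: Lucene folds the boosts of all same-named fields into a single field norm (they are multiplied together), so this weights the "anchor" field as a whole rather than weighting each anchor string independently -- which is part of why options (3) and (4) behave differently.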