UWTV Program: Google: A Behind-the-Scenes Look
Just came across this interesting webcast. Check it out. -- Chakra "Google: A Behind-the-Scenes Look Search is one of the most important applications used on the internet and poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines. In this program, Jeff Dean of Google describes some of these challenges, discusses applications Google has developed, and highlights systems they've built, including GFS, a large-scale distributed file system, and MapReduce, a library for automatic parallelization and distribution of large-scale computation. He also shares some interesting observations derived from Google's web data." http://www.uwtv.org/programs/displayevent.asp?rid=2459 -- Visit my weblog: http://www.jroller.com/page/cyblogue - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: 1.4.x TermInfosWriter.indexInterval not public static ?
Doug Cutting wrote:
> The default value is probably good for all but folks with very large indexes, who may wish to increase the default somewhat. Also folks with smaller indexes and very high query volumes may wish to decrease the default. It's a classic time/memory tradeoff. Higher values use less memory and make searches a bit slower, smaller values use more memory and make searches a bit faster.
BTW, can you define "a bit"? Is "a bit" 5%? 10%? Benchmarks would be nice, but I'm not that picky. I just want to see what performance hits/benefits I could get by tweaking the values.
Kevin
--
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat.
Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod!
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
PDF Highlighter Package
For those of you who support indexing PDF documents, PDFBox now supports Adobe's PDF Highlight specification (http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf). PDFBox is now capable of generating an XML document that describes which words in a PDF document to highlight.
An "in action" example can be seen at http://pavilion.csh.rit.edu:8080/pdfbox/index.html
You can enter any web-accessible PDF and any keywords. The PDF will open normally and, after a short pause (this is running on an old, slow server), will jump to the first selected keyword.
Source code is available in CVS or in tonight's nightly build. Any comments/suggestions are welcome.
Special thanks to Stephan Lagraulet, who made this possible with code contributions.
Ben
http://www.pdfbox.org
Re: 1.4.x TermInfosWriter.indexInterval not public static ?
Chris Hostetter wrote:
> 1) If making it mutable requires changes to other classes to propagate it, then why is it now an instance variable instead of a static? (Presumably making it an instance variable allows subclasses to override the value, but if other classes have internal expectations of the value, that doesn't seem safe)
It's an instance variable because it can vary from instance to instance. This value is specified when an index segment is written, and subsequently read from disk and used when reading that segment. It's an instance variable in both the writing and reading code. The thing that's lacking is a way to pass in alternate values to the writing code. The reason that other classes are involved is that the reading and writing code are in non-public classes. We don't want to expose the implementation too much by making these public, but would rather expose these as getter/setter methods on the relevant public API.
> 2) Should it be configurable through a get/set method, or through a system property? (which rehashes the instance/global question)
That's indeed the question. My guess is that a system property would probably be sufficient for most, but perhaps not for all. Similarly with a static setter/getter. But a getter/setter on IndexWriter would make everyone happy.
> 3) Is it important that a writer updating an existing index use the same value as the writer that initially created the index? If so, should there really be a "preferedIndexInterval" variable which is mutable, and a "currentIndexInterval" which is set to the value of the index currently being updated, such that preferedIndexInterval is used when making an index from scratch and currentIndexInterval is used when adding segments to a new index?
It's used whenever an index segment is created. Index segments are created when documents are added and when index segments are merged to form larger index segments. Merging happens frequently while indexing. Optimization merges all segments. The value can vary in each segment.
The default value is probably good for all but folks with very large indexes, who may wish to increase the default somewhat. Also folks with smaller indexes and very high query volumes may wish to decrease the default. It's a classic time/memory tradeoff. Higher values use less memory and make searches a bit slower, smaller values use more memory and make searches a bit faster.
Unless there are objections I will add this as:
IndexWriter.setTermIndexInterval()
IndexWriter.getTermIndexInterval()
Both will be marked "Expert".
Further discussion should move to the lucene-dev list.
Doug
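The time/memory tradeoff described above can be illustrated with a toy model: a sorted term list of which only every interval-th term is held in memory, so a lookup binary-searches the in-memory sample and then scans at most "interval" entries of the full list. This is only a sketch of the idea, not Lucene's actual reader code; the class SparseTermIndex and all of its members are hypothetical names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a sparse term index. NOT Lucene code: SparseTermIndex and its
// members are hypothetical names used only to illustrate the tradeoff.
final class SparseTermIndex {
    private final List<String> allTerms;                    // stands in for the sorted on-disk term list
    private final List<String> sampled = new ArrayList<>(); // every interval-th term, held "in memory"
    private final int interval;
    int lastScanLength;                                     // entries scanned by the most recent lookup

    SparseTermIndex(List<String> sortedTerms, int interval) {
        this.allTerms = sortedTerms;
        this.interval = interval;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            sampled.add(sortedTerms.get(i));
        }
    }

    int memoryEntries() {
        return sampled.size();
    }

    // Returns the position of the term in the full list, or -1 if it is absent.
    int lookup(String term) {
        // Binary search for the last sampled term that is <= the target.
        int lo = 0, hi = sampled.size() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        lastScanLength = 0;
        if (block < 0) {
            return -1; // target sorts before the first term
        }
        // Sequential scan of at most `interval` entries of the full list.
        int start = block * interval;
        int end = Math.min(start + interval, allTerms.size());
        for (int i = start; i < end; i++) {
            lastScanLength++;
            int c = allTerms.get(i).compareTo(term);
            if (c == 0) return i;
            if (c > 0) break;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            terms.add(String.format("term%04d", i));
        }
        for (int interval : new int[] {16, 128, 512}) {
            SparseTermIndex idx = new SparseTermIndex(terms, interval);
            int pos = idx.lookup("term0999");
            System.out.println("interval=" + interval
                    + " memoryEntries=" + idx.memoryEntries()
                    + " position=" + pos
                    + " scanned=" + idx.lastScanLength);
        }
    }
}
```

In this model a larger interval keeps fewer entries in memory but scans more entries per lookup, which is the direction of the tradeoff described in the message. How large "a bit" is for a real index would still have to be measured.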
Re: Boost doesn't works
Claude Libois wrote:
> The explanation given by the IndexSearcher indicates that the boost of my title is 1.0 where it should be 10.0. I really don't understand what is wrong.
You're seeing the boost for the query term, not the boost for the document's field. The boost for the field in the document is multiplied by its lengthNorm. This product is displayed in explanations as the "fieldNorm".
Doug
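As a rough worked example of the product described above, the sketch below multiplies a field boost by a length norm. It assumes the default length normalization is 1/sqrt(number of terms in the field), which is my understanding of the default Similarity but is not stated in this thread. The class and method names are hypothetical, and the sketch ignores any rounding applied when the norm is stored in the index.

```java
// Illustration only: FieldNormSketch is a hypothetical class, not part of Lucene.
final class FieldNormSketch {

    // ASSUMPTION: default length normalization is 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // The product that the message above says is displayed as "fieldNorm".
    static double fieldNorm(double fieldBoost, int numTerms) {
        return fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        int titleTerms = 4; // hypothetical four-word title
        System.out.println("boost 1.0  -> fieldNorm " + fieldNorm(1.0, titleTerms));
        System.out.println("boost 10.0 -> fieldNorm " + fieldNorm(10.0, titleTerms));
    }
}
```

Under these assumptions a boost of 10 on a four-term title shows up as a fieldNorm ten times larger than the unboosted one, rather than as a separate "boost" entry in the explanation.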
IndexSearch and IndexWriter on 2 CPU's
Hello.
I have a dual-CPU box running RH Linux. I run two processes on this box:
1. An IndexWriter, which adds new documents to the index constantly, 24/7/365 :)
2. An IndexSearcher, which performs searches against this index.
Sometimes the "writer" begins to merge the index (this is caused by mergeFactor and the structure of the Lucene index) "inside" the addDocument method. When a merge begins, my "writer" process takes the time of both CPUs (180-200% in total). Actually, most of the time goes to IO operations. While the merge is running, all searches performed by the IndexSearcher on this computer slow down severely because all of the CPU time goes to the first process.
How can I "give" the second process more CPU time, or how can I reduce the IO time of the first process? Maybe I can tweak something in the index configuration. I have set:
writer.mergeFactor = 2
writer.minMergeDocs = 2500
Yura Smolsky.
ANN: LuceGene bioinformatics application updated
LuceGene release 1.4 is available now at http://www.gmod.org/lucegene/ and http://eugenes.org/gmod/lucegene/
LuceGene is an open-source document/object search and retrieval system specially tuned for bioinformatics text databases and documents. It is similar in concept to the commercial SRS package (Sequence Retrieval System). LuceGene is written in Java, built with the open-source Lucene package [http://jakarta.apache.org/lucene/]
This release includes an easy-to-use demonstration. Pop it into a Tomcat web server and run.
LuceGene adds these bioinformatics methods to Lucene:
* Indexing adaptors for formats such as XML, PDF Documents, Biosequences, Spreadsheets, HTML, and others, with fine tuning by data field.
* Configurations for bio-data include UniProt/Swiss-Prot, Fasta and GenBank sequences, BIND protein interactions, BLAST outputs, Medline and others.
* Support for batch-list look-ups and searches by ID, gene names, etc.
* Web interface with paged results, batch downloads, search refinement and search-linking among data libraries.
* Web Services support with a SOAP interface.
* Output support for data-field selection and formats such as Spreadsheet, XML, HTML, and others.
It can take as little as a few hours of engineering time to add new databank parsing, making it a cost-effective way to use many bioinformatics data sets.
LuceGene is speedy with big data sets: indexing and searching the UniProt library of 1.7 million sequences with LuceGene is comparable to using SRS. Gene Annotation object search and retrieval with LuceGene is 10x to 20x faster than using a Postgres Chado database.
--
Don Gilbert
Genome Informatics Lab
Indiana University, Bloomington IN
http://iubio.bio.indiana.edu/gil/
Re: Fast access to a random page of the search results.
On Feb 28, 2005, at 10:39 AM, Stanislav Jordanov wrote:
>> What did you do in your private investigation?
> 1. empirical tests with an index of nearly 75,000 docs (I am attaching the test source)
Only certain (.txt?) attachments are allowed to come through on the mailing list.
>> Sorted by descending relevance (the default), or in some other way?
> In some other way - sorted by some column (asc or desc - doesn't matter)
Using IndexSearcher(query, sort)?
>> If a search is fast enough, as you report, then you can simply start your access to Hits at the appropriate spot. For the current systems I'm working on, this is the approach I've used - start iterating hits at (pageNumber - 1) * numberOfItemsPerPage.
>>
>> Is that approach insufficient?
> I'm afraid this is not sufficient; either I am doing something wrong, or it is not that simple. Following is a log from my test session. It appears that IndexSearcher.search(...) finishes rather fast compared to the time it takes to fetch the last document from the Hits object.
I assume you are only accessing the documents you wish to display rather than all of them up to where you need. Also keep in mind that accessing a Document is when the document is pulled from the index. If you have a large amount of data in a document it will take a corresponding amount of time to load it. You may need to restructure what you store in a document to reduce the load times. Or perhaps you need to investigate the (is it in the codebase already?) patch to load fields lazily upon demand instead.
Erik
> The log starts here:
> pa Found 74222 document(s) that matched query 'pa' Sorting by "sfile_name" query executed in 16ms Last doc accessed in 375ms
> us Found 74222 document(s) that matched query 'us' Sorting by "sfile_name" query executed in 31ms Last doc accessed in 219ms
> 1 Found 74222 document(s) that matched query '1' Sorting by "sfile_name" query executed in 15ms Last doc accessed in 235ms
> 5 Found 74222 document(s) that matched query '5' Sorting by "sfile_name" query executed in 422ms Last doc accessed in 219ms
> 6 Found 72759 document(s) that matched query '6' Sorting by "sfile_name" query executed in 344ms Last doc accessed in 250ms
Re: Fast access to a random page of the search results.
> What did you do in your private investigation?
1. empirical tests with an index of nearly 75,000 docs (I am attaching the test source)
2. reviewing and tracing the source code of Lucene (I do not claim I have gained a deep understanding of it ;-)
> Sorted by descending relevance (the default), or in some other way?
In some other way - sorted by some column (asc or desc - doesn't matter)
> If a search is fast enough, as you report, then you can simply start your access to Hits at the appropriate spot. For the current systems I'm working on, this is the approach I've used - start iterating hits at (pageNumber - 1) * numberOfItemsPerPage.
>
> Is that approach insufficient?
I'm afraid this is not sufficient; either I am doing something wrong, or it is not that simple. Following is a log from my test session. It appears that IndexSearcher.search(...) finishes rather fast compared to the time it takes to fetch the last document from the Hits object.
The log starts here:
pa Found 74222 document(s) that matched query 'pa' Sorting by "sfile_name" query executed in 16ms Last doc accessed in 375ms
us Found 74222 document(s) that matched query 'us' Sorting by "sfile_name" query executed in 31ms Last doc accessed in 219ms
1 Found 74222 document(s) that matched query '1' Sorting by "sfile_name" query executed in 15ms Last doc accessed in 235ms
5 Found 74222 document(s) that matched query '5' Sorting by "sfile_name" query executed in 422ms Last doc accessed in 219ms
6 Found 72759 document(s) that matched query '6' Sorting by "sfile_name" query executed in 344ms Last doc accessed in 250ms
Re: Fast access to a random page of the search results.
On Feb 28, 2005, at 6:00 AM, Stanislav Jordanov wrote:
> my private investigation already left me sceptical about the outcome of this issue, but I've decided to post it as a final resort.
What did you do in your private investigation?
> Suppose I have an index of about 5,000,000 docs and I am running single-term queries against it, including queries which return say 1,000,000 or even more hits. The hits are sorted by some column and I am happy with the query execution time (i.e. the time spent in the IndexSearcher.query(...) method). Now comes the problem: it is a product requirement that the client is allowed to quickly access (by scrolling) a random page of the result set. Put in different words, the app must quickly (in less than a second) respond to requests like: "Give me the results from No 567100 to No 567200" (remember the results are sorted, thus ordered).
Sorted by descending relevance (the default), or in some other way?
If a search is fast enough, as you report, then you can simply start your access to Hits at the appropriate spot. For the current systems I'm working on, this is the approach I've used - start iterating hits at (pageNumber - 1) * numberOfItemsPerPage.
Is that approach insufficient?
Erik
> I took a look at Lucene's internals which only left me with the suspicion that this is an impossible task. Would anyone, please, prove my suspicion wrong?
> Regards
> Stanislav
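The offset arithmetic suggested above, (pageNumber - 1) * numberOfItemsPerPage, can be sketched against a plain java.util.List standing in for the search results. This is a minimal illustration, not code from the thread: PageWindow and its methods are hypothetical names, and a real Hits object would be read by index in the same range rather than through subList.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating page-offset arithmetic; not part of Lucene.
final class PageWindow {

    // Zero-based index of the first result on a one-based page number.
    static int firstIndex(int pageNumber, int itemsPerPage) {
        if (pageNumber < 1 || itemsPerPage < 1) {
            throw new IllegalArgumentException("pageNumber and itemsPerPage must be >= 1");
        }
        return (pageNumber - 1) * itemsPerPage;
    }

    // Copies only the requested page; entries before it are never touched.
    static <T> List<T> page(List<T> hits, int pageNumber, int itemsPerPage) {
        int from = Math.min(firstIndex(pageNumber, itemsPerPage), hits.size());
        int to = Math.min(from + itemsPerPage, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        // The request from the thread: results 567100..567200 with 100 per page.
        System.out.println(firstIndex(5672, 100)); // prints 567100

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            hits.add(i);
        }
        System.out.println(page(hits, 3, 10)); // last, partial page
    }
}
```

The point of the sketch is that only the entries inside the requested window are read, so the cost of serving a page should not depend on how deep into the result set the page lies, provided the underlying collection supports cheap access by index.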
Re: Fast access to a random page of the search results.
Just retrieve Documents 567100 to 567200 from the Hits object you got while searching.
Stanislav Jordanov wrote:
> Guys, my private investigation already left me sceptical about the outcome of this issue, but I've decided to post it as a final resort. Perhaps the gurus know the right answer :-)
> Suppose I have an index of about 5,000,000 docs and I am running single-term queries against it, including queries which return say 1,000,000 or even more hits. The hits are sorted by some column and I am happy with the query execution time (i.e. the time spent in the IndexSearcher.query(...) method). Now comes the problem: it is a product requirement that the client is allowed to quickly access (by scrolling) a random page of the result set. Put in different words, the app must quickly (in less than a second) respond to requests like: "Give me the results from No 567100 to No 567200" (remember the results are sorted, thus ordered).
> I took a look at Lucene's internals which only left me with the suspicion that this is an impossible task. Would anyone, please, prove my suspicion wrong?
> Regards
> Stanislav
RE: Search performance with one index vs. many indexes
Hi All,
Sorry about that, please disregard that last email. I must not be fully awake yet.
Sorry,
Kevin Runde
-----Original Message-----
From: Runde, Kevin [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 7:34 AM
To: Lucene Users List
Subject: RE: Search performance with one index vs. many indexes
Follow Up to the article from Friday
-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 1:30 AM
To: Lucene Users List
Subject: Re: Search performance with one index vs. many indexes
Jochen Franke writes:
> Topic: Search performance with large numbers of indexes vs. one large index
>
> My questions are:
>
> - Is the size of the "wordlist" the problem?
> - Would we be a lot faster, when we have a smaller number of files per index?
sure.
Look:
Index lookup of a word is O(ln(n)) where n is the number of words.
Index lookup of a word in k indexes having m words is O( k ln(m) )
In the best case all word lists are distinct (purely theoretical), that is n = k*m or m = n/k
For n = 15 million, k = 800:
ln(n) = 16.5
k*ln(n/k) = 7871
In a realistic case, m is much bigger since word lists won't be distinct. But it's the linear factor k that bites you. In the worst case (all words in all indices) you have
k*ln(n) = 13218.8
HTH
Morus
RE: Search performance with one index vs. many indexes
Follow Up to the article from Friday
-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 1:30 AM
To: Lucene Users List
Subject: Re: Search performance with one index vs. many indexes
Jochen Franke writes:
> Topic: Search performance with large numbers of indexes vs. one large index
>
> My questions are:
>
> - Is the size of the "wordlist" the problem?
> - Would we be a lot faster, when we have a smaller number of files per index?
sure.
Look:
Index lookup of a word is O(ln(n)) where n is the number of words.
Index lookup of a word in k indexes having m words is O( k ln(m) )
In the best case all word lists are distinct (purely theoretical), that is n = k*m or m = n/k
For n = 15 million, k = 800:
ln(n) = 16.5
k*ln(n/k) = 7871
In a realistic case, m is much bigger since word lists won't be distinct. But it's the linear factor k that bites you. In the worst case (all words in all indices) you have
k*ln(n) = 13218.8
HTH
Morus
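The figures quoted above can be checked with a few lines of arithmetic. The sketch below simply evaluates ln(n), k*ln(n/k) and k*ln(n) for n = 15 million and k = 800; the class and method names are hypothetical, and the values are relative cost estimates in the sense of the O() expressions above, not measured lookup times.

```java
// Reproduces the back-of-the-envelope numbers from the message above.
// LookupCost is a hypothetical name used only for this illustration.
final class LookupCost {

    // Relative cost of one lookup in a single index of n words: ln(n).
    static double singleIndex(double n) {
        return Math.log(n);
    }

    // Relative cost of one lookup in each of k indexes of m words: k * ln(m).
    static double manyIndexes(int k, double m) {
        return k * Math.log(m);
    }

    public static void main(String[] args) {
        double n = 15_000_000; // total number of distinct words
        int k = 800;           // number of indexes

        System.out.println("one index, ln(n):           " + singleIndex(n));
        System.out.println("best case, k*ln(n/k):       " + manyIndexes(k, n / k));
        System.out.println("worst case, k*ln(n):        " + manyIndexes(k, n));
        System.out.println("worst case / single index:  " + manyIndexes(k, n) / singleIndex(n));
    }
}
```

The last line makes the "linear factor k" explicit: in the worst case the ratio between the many-index cost and the single-index cost is exactly k.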
Re: Boost doesn't works
Claude Libois writes:
> The explanation given by the IndexSearcher indicates that the boost of my title is 1.0 where it should be 10.0.
> I really don't understand what is wrong.
AFAIK you cannot get the boost of a field from the index because it's not stored as such. It's folded into the field's length norm or something like that during indexing. Search the list archives for details.
Morus
Re: Boost doesn't works
The explanation given by the IndexSearcher indicates that the boost of my title is 1.0 where it should be 10.0. I really don't understand what is wrong.
Claude Libois
[EMAIL PROTECTED]
Technical associate - Unisys
----- Original Message -----
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Monday, February 28, 2005 11:10 AM
Subject: Re: Boost doesn't works
> Use the IndexSearcher.explain() feature to look at how Lucene is calculating the score.
>
> Erik
>
> On Feb 28, 2005, at 3:32 AM, Claude Libois wrote:
>
> > I use MultiFieldQueryParser (search only done on summary, title and content) with a FilteredQuery.
> > Claude Libois
> > [EMAIL PROTECTED]
> > Technical associate - Unisys
> >
> > ----- Original Message -----
> > From: "Morus Walter" <[EMAIL PROTECTED]>
> > To: "Lucene Users List"
> > Sent: Monday, February 28, 2005 9:28 AM
> > Subject: Re: Boost doesn't works
> >
> >> Claude Libois writes:
> >>> Hello. I'm using Lucene for an application and I want to boost the title of my documents.
> >>> For that I use the setBoost method that is applied on the title field.
> >>> However when I look with luke(1.6) I don't see any boost on this field and when I do a search the score doesn't change. What's wrong?
> >>
> >> How do you search?
> >> I guess you cannot see a change unless you combine searches in different fields, since scores are normalized.
> >>
> >> Morus
Fast access to a random page of the search results.
Guys,
my private investigation already left me sceptical about the outcome of this issue, but I've decided to post it as a final resort. Perhaps the gurus know the right answer :-)
Suppose I have an index of about 5,000,000 docs and I am running single-term queries against it, including queries which return say 1,000,000 or even more hits. The hits are sorted by some column and I am happy with the query execution time (i.e. the time spent in the IndexSearcher.query(...) method).
Now comes the problem: it is a product requirement that the client is allowed to quickly access (by scrolling) a random page of the result set. Put in different words, the app must quickly (in less than a second) respond to requests like: "Give me the results from No 567100 to No 567200" (remember the results are sorted, thus ordered).
I took a look at Lucene's internals which only left me with the suspicion that this is an impossible task. Would anyone, please, prove my suspicion wrong?
Regards
Stanislav
Re: Boost doesn't works
Use the IndexSearcher.explain() feature to look at how Lucene is calculating the score.
Erik
On Feb 28, 2005, at 3:32 AM, Claude Libois wrote:
> I use MultiFieldQueryParser (search only done on summary, title and content) with a FilteredQuery.
> Claude Libois
> [EMAIL PROTECTED]
> Technical associate - Unisys
>
> ----- Original Message -----
> From: "Morus Walter" <[EMAIL PROTECTED]>
> To: "Lucene Users List"
> Sent: Monday, February 28, 2005 9:28 AM
> Subject: Re: Boost doesn't works
>
>> Claude Libois writes:
>>> Hello. I'm using Lucene for an application and I want to boost the title of my documents. For that I use the setBoost method that is applied on the title field. However when I look with luke(1.6) I don't see any boost on this field and when I do a search the score doesn't change. What's wrong?
>>
>> How do you search?
>> I guess you cannot see a change unless you combine searches in different fields, since scores are normalized.
>>
>> Morus
Re: Boost doesn't works
I use MultiFieldQueryParser (search only done on summary, title and content) with a FilteredQuery.
Claude Libois
[EMAIL PROTECTED]
Technical associate - Unisys
----- Original Message -----
From: "Morus Walter" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Monday, February 28, 2005 9:28 AM
Subject: Re: Boost doesn't works
> Claude Libois writes:
> > Hello. I'm using Lucene for an application and I want to boost the title of my documents.
> > For that I use the setBoost method that is applied on the title field.
> > However when I look with luke(1.6) I don't see any boost on this field and when I do a search the score doesn't change. What's wrong?
>
> How do you search?
> I guess you cannot see a change unless you combine searches in different fields, since scores are normalized.
>
> Morus
Re: Boost doesn't works
Claude Libois writes:
> Hello. I'm using Lucene for an application and I want to boost the title of my documents.
> For that I use the setBoost method that is applied on the title field.
> However when I look with luke(1.6) I don't see any boost on this field and when I do a search the score doesn't change. What's wrong?
How do you search?
I guess you cannot see a change unless you combine searches in different fields, since scores are normalized.
Morus
Boost doesn't works
Hello. I'm using Lucene for an application and I want to boost the title of my documents. For that I use the setBoost method, applied to the title field. However, when I look with luke(1.6) I don't see any boost on this field, and when I do a search the score doesn't change. What's wrong?
Here is the code where I set the boost factor:
    public Document getDocument() throws TechnicalException {
        Document doc = new Document();
        log.trace(new TraceMessage("will add title,resume,content,date to the Lucene Document"));
        // Content is indexed but not stored.
        Field field = Field.UnStored("Content", content);
        doc.add(field);
        doc.add(Field.Text("Summary", summary));
        // Boost the title field before adding it to the document.
        field = Field.Text("Title", title);
        field.setBoost(10);
        doc.add(field);
        return doc;
    }
Do I have to do something else to activate boosting?
Claude Libois
[EMAIL PROTECTED]
Technical associate - Unisys