delete entries from posting list Lucene 4.0
I need to delete entries from a posting list. How can I do this in Lucene 4.0? I need it to test different pruning algorithms. Thanks in advance, ZP
Re: delete entries from posting list Lucene 4.0
That is perfect. Thank you very much. Best regards, ZP
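For reference, one general way to do this in Lucene 4.0 is to wrap the source reader in a FilterAtomicReader whose DocsEnum silently skips the documents whose postings should be removed, then copy the wrapped reader into a fresh index with IndexWriter.addIndexes. The sketch below only illustrates that idea and is not necessarily what was suggested in the reply: the class name and the droppedDocs set are made up for the example, term statistics are not adjusted, and fields indexed with positions would additionally need the analogous FilterDocsAndPositionsEnum override.

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

// Hypothetical example: hides, for every term, the postings of the documents
// whose (segment-local) IDs are listed in droppedDocs.
public class PostingPruningReader extends FilterAtomicReader {

  private final Set<Integer> droppedDocs;

  public PostingPruningReader(AtomicReader in, Set<Integer> droppedDocs) {
    super(in);
    this.droppedDocs = droppedDocs;
  }

  @Override
  public Fields fields() throws IOException {
    return new FilterFields(super.fields()) {
      @Override
      public Terms terms(String field) throws IOException {
        Terms terms = super.terms(field);
        if (terms == null) return null;
        return new FilterTerms(terms) {
          @Override
          public TermsEnum iterator(TermsEnum reuse) throws IOException {
            return new FilterTermsEnum(super.iterator(reuse)) {
              @Override
              public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
                return new FilterDocsEnum(super.docs(liveDocs, reuse, flags)) {
                  @Override
                  public int nextDoc() throws IOException {
                    int doc;
                    do {
                      doc = super.nextDoc();  // skip the documents we want pruned
                    } while (doc != DocIdSetIterator.NO_MORE_DOCS
                             && droppedDocs.contains(doc));
                    return doc;
                  }
                };
              }
            };
          }
        };
      }
    };
  }
}

// Usage sketch: merge the filtered view into a new, physically pruned index.
//   AtomicReader src = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(sourceDir));
//   IndexWriter w = new IndexWriter(prunedDir, new IndexWriterConfig(Version.LUCENE_40, analyzer));
//   w.addIndexes(new PostingPruningReader(src, droppedDocs));
//   w.close();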
Re: delete entries from posting list Lucene 4.0
While using the pruning package, I realised that residual IDF is calculated in RIDFTermPruningPolicy as follows:

Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df

However, according to the original paper on residual IDF (Blanco et al.), it should be -log(df/D) + log(1 - e^(-tf/D)). So in the code, Math.pow should be Math.pow(Math.E, -(termPositions.freq() / maxDoc)). Am I missing something in the calculation, or is this a bug? Thanks in advance, ZP
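For reference, a minimal sketch of the formula exactly as quoted from the paper, written as a helper method with illustrative names (df = document frequency, tf = collection term frequency, D = maxDoc):

// ridf = -log(df / D) + log(1 - e^(-tf / D))   (Blanco et al., as quoted above)
static double residualIdf(int df, long tf, int maxDoc) {
  double D = maxDoc;
  // Math.pow(Math.E, -(tf / D)) is the same as Math.exp(-tf / D)
  return -Math.log(df / D) + Math.log(1.0 - Math.pow(Math.E, -(tf / D)));
}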
Wikipedia revision history dump + lucene benchmark
The wikipedia.alg file in benchmark is only able to extract and index the current-pages dumps; it does not take revisions into account. Do you know any way to do this, or should I change EnwikiContentSource to handle the versions? Although Wikipedia dumps are widely used, especially for research purposes, as far as I know there are no topics/qrels for them, except the ones at http://www.mpi-inf.mpg.de/~kberberi/ecir2010/ for the 2001-2005 revision history dump, which are annotated based on temporal expressions. Do you know of any others? By the way, I think that in wikipedia.alg, query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker should be replaced by EnwikiQueryMaker (see the property below). Thanks in advance, Best regards, ZP
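For clarity, the suggested change amounts to this single line in conf/wikipedia.alg (the surrounding lines of the stock file may differ):

# use the Wikipedia query maker instead of the Reuters one
query.maker=org.apache.lucene.benchmark.byTask.feeds.EnwikiQueryMaker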
Re: delete entries from posting list Lucene 4.0
Hi, Thanks for the fix. I also wonder if you know of any free collections for testing pruning approaches. Almost all the papers use TREC collections, which I don't have. For now I use the Reuters-21578 collection and Carmel's Kendall's tau extension to measure similarity, but I need a collection with relevance judgements. Thanks in advance, Best regards, ZP
pruning package- pruneAllPositions
Hi, In the pruning package, pruneAllPositions throws an exception. In the code it is commented that this should not happen:

// should not happen!
throw new IOException("termPositions.doc > docs[docsPos].doc");

Can you please explain why it happens and what I should do to fix it? Thanks in advance, Best regards, ZP
Re: pruning package- pruneAllPositions
Thanks for the link; I reviewed it. Here are more details about the exception. I used contrib/benchmark/conf/wikipedia.alg to index a Wikipedia dump with MAddDocs: 20. I wanted to index only a specific period of time, so I added an if statement in doLogic of the AddDocTask class. I then tried to prune the index with the pruning package (CarmelTopKPruning) and got the exception. I added System.out.println(term); as the first line of initPositionsTerm and System.out.println("***" + term); as its last line. The Carmel top-k exception comes from pruneAllPositions (throw new IOException("termPositions.doc > docs[docsPos].doc");). For example, for the token body:freely the output was as follows:

body:freely
***body:freely
body:freely
***body:freely
body:freely
***body:freely
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4995)
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4996)
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4997)
..
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
body:freely
***body:freely
Carmel topk in exception
Carmel topk in exception
body:freely
***body:freely
body:freely
***body:freely

I hope my problem is clearer now. Thanks in advance, Best regards, ZP
Re: Measuring precision and recall in lucene to compare two sets of results
Hi, You can use Kendall's tau. The article "Comparing top k lists" by Ronald Fagin, Ravi Kumar and D. Sivakumar explains different methods. Best regards, ZP
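Since the question is about comparing two ranked result lists, here is a minimal sketch of the plain Kendall's tau distance for two rankings of the same items (class, method and variable names are illustrative; the top-k variants in the Fagin et al. paper extend this with a penalty parameter for items that appear in only one of the lists):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class KendallTau {
  // Normalized Kendall's tau distance: the fraction of item pairs ordered
  // differently by the two lists (0 = identical order, 1 = reversed order).
  // Assumes both lists contain exactly the same items.
  static double kendallTauDistance(List<String> a, List<String> b) {
    Map<String, Integer> posInB = new HashMap<String, Integer>();
    for (int i = 0; i < b.size(); i++) {
      posInB.put(b.get(i), i);
    }
    int n = a.size();
    int discordant = 0;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        // a ranks item i before item j; the pair is discordant if b disagrees
        if (posInB.get(a.get(i)) > posInB.get(a.get(j))) {
          discordant++;
        }
      }
    }
    return discordant / (n * (n - 1) / 2.0);
  }
}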
Re: pruning package- pruneAllPositions
Hi, Thanks for your fix. I used it, but I think something is still wrong: I am using the LA Times collection, and with epsilon = 0.1 and k = 10 I get a 97% pruned index, i.e. only 3% of the index remains after pruning. In the original paper, "Static index pruning for IR systems", they report 36.4% for the same data set with the same parameters. Has anyone used this package with the LA Times dataset? Thanks in advance, Best regards, ZP
threshold calculation in CarmelTopKTermPruningPolicy
Hi, In the CarmelTopKTermPruningPolicy class, the threshold is calculated as follows:

float threshold = docs[k - 1].score - scoreDelta;

Here docs[k - 1].score corresponds to z_t in the original paper (Carmel et al. 2001) and scoreDelta = epsilon * r. Could you please explain why it is calculated as "z_t - scoreDelta"? I am not able to find the corresponding part in the paper. Thanks in advance, Best regards, ZP
pruning package- question about termpositions && skipTo
Hi to all, In the pruning package, the documentation of the pruneAllPositions(TermPositions termPositions, Term t) method says: "termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek)." Why? Here is why I need skipTo: I added a new pruning class with a public void initPositionsTerm(TermPositions tp, Term t, ScoreDoc[] sdoc) method. I needed it because my ScoreDoc[] is generated from Lucene's baseline results together with additional external parameters. In my initPositionsTerm, instead of letting the method collect the docs as the other policies do, docs is simply set to sdocs. For example, for a term x, sdocs = {42813, 123472, 22477, 76995, 47086, 106424, 68570, 26708, 49740, 116472}, so sorted docs = {22477, 26708, 42813, 47086, ...}. I just want to keep these postings in my pruned index. The problem is that when I call pruneAllPositions as it is, it returns only {22477, 26708, 107377}. After 28118, super.next() is false in PruningTermPositions.next(), so (termPositions.doc() == docs[docsPos].doc) is never true for docIds > 28118. (I have no idea where 107377 comes from; it is not even in my docs.) However, when I check termPositions inside pruneAllPositions with the code below, it does contain all the docids I need:

while (termPositions.next()) {
  System.out.println(termPositions.doc());
}

That is why I wonder why I cannot call skipTo and why this happens with termPositions. Thanks in advance, Best regards
Re: pruning package- question about termpositions && skipTo
Hi to all, I found the problem and the solution. PruningReader uses super.getSequentialSubReaders(); after 28118, super.next() is false because the reader is a sub-reader for a single segment and indexReader.maxDoc() for that segment is 28118. In pruneAllPositions, instead of comparing termPositions.doc() to the docid, I now compare in.document(termPositions.doc()).getField("docid").stringValue() to the docid. This only happened because of my custom initPositionsTerm method (public void initPositionsTerm(TermPositions tp, Term t, ScoreDoc[] sdoc)); there is no problem with the other pruning policies. For reference, the mapping between my doc IDs and termPositions.doc():

DocID     termPositions.doc()
22477     22477
26708     26708
42813     14093
47086     18366
49740     21020
68570     11760
76995     20185
106424    21524
116472    502
123472    1992

Best regards
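A minimal sketch of the workaround described above, assuming a helper set keptExternalIds built from the external IDs behind the ScoreDoc[] passed to the custom initPositionsTerm ("docid" is the stored field name used here, and in is the policy's underlying reader, as in the message):

// Resolve the segment-local doc number to the stored external id before
// deciding whether to keep this posting, instead of comparing raw doc numbers.
String externalId = in.document(termPositions.doc()).getField("docid").stringValue();
boolean keep = keptExternalIds.contains(externalId);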
test LA Times with pruning package
Hi to all, I used the pruning package with the LA Times collection. The initial LA Times index was created with the Lucene benchmark/conf/*.alg files; Luke shows 131896 documents with 635614 terms for the initial index. I pruned with the CarmelTopKPruning policy with epsilon = 0.1, varying k. However, my results do not correspond to the original paper's results (Static Index Pruning for Information Retrieval Systems by Carmel et al.). The Lucene score function could be one reason, but the difference is large, so I wonder whether the package has been tested with LA Times and similar results were obtained. What could be the reason for such a difference? I count the number of postings by summing te.docFreq() over all terms; the non-pruned index has 37860694 postings.

k    Prune (%), original paper    Prune (%), pruning package    # postings in pruned index
1    49.2                         91                            3663309
5    40.2                         90                            4139019
10   36.4                         89                            4485072
15   34.2                         88                            4743474
50   x                            69                            11990022

Do you know of any paper that uses this package for experiments? Thanks in advance, Best regards, ZP
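For reference, the posting count mentioned above can be computed like this with the Lucene 3.x API (assuming reader is an already-open IndexReader on the index being measured):

TermEnum te = reader.terms();
long counter = 0;
while (te.next()) {
  counter += te.docFreq();  // number of documents (postings) for the current term
}
te.close();
System.out.println("postings: " + counter);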
pruning & Lucene 4.0
Hi, Do you have any information about when the pruning package will be available for Lucene 4.0? Thanks in advance, Best regards, ZP
Lucene 4.0 benchmark bug?
Hi to all, I started using benchmark 4.0 to create submission report files with the following code:

BufferedReader br = new BufferedReader(fr);
QualityQuery qqs[] = qReader.readQueries(br);
QualityQueryParser qqParser = new SimpleQQParser("title", "body");
QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docname");
SubmissionReport submitLog = new SubmissionReport(loggertest, "test");
QualityStats stats[] = qrun.execute(null, submitLog, null);

My index was created with Lucene 3.6 and I use the LA Times topics 401-450. With 3.6 there is no problem. However, with benchmark 4.0 I realised that it returns results only for the first query, 401, which is "foreign minorities, Germany". When I debug the code, the boolean query generated in SimpleQQParser is "body:foreign" without the other keywords. Debugging further, the problem seems to be raised in QueryParserBase.newFieldQuery, which returns null for all the remaining queries and for the other keywords of the same query. I patched the code for my ad hoc use; otherwise I don't know how to fix it. Has this happened to anyone else? A second problem: for the same collection I get MAP = 0.17 with the default similarity and MAP = 0.07 with the Lucene 4.0 BM25 similarity (b = 0.75, k1 = 1.2). I get MAP = 0.14 with a BM25 implemented following http://ipl.cs.aueb.gr/stougianni/bm25_2.html. However, this collection is reported in the literature with MAP around 0.25 with the BM25 scoring function. Has someone evaluated the different similarities and can share the results? Best regards, ZP
Re: pruning & Lucene 4.0
Hi, any news since? Thanks, Best regards, ZP
Re: Scoring function in LMDirichletSimilarity Class
Hi, I have the same question for the LMJelinekMercerSimilarity class:

protected float score(BasicStats stats, float freq, float docLen) {
  return stats.getTotalBoost() *
      (float)Math.log(1 + ((1 - lambda) * freq / docLen) /
          (lambda * ((LMStats)stats).getCollectionProbability()));
}

whereas the standard Jelinek-Mercer formula would be

score = Math.log((1 - lambda) * freq / docLen + lambda * ((LMStats)stats).getCollectionProbability());

I also get much worse results when I change the code to the latter. Why is it calculated this way? Thanks in advance, Best regards, ZP
P.S. Instead of creating a new question, I used your thread because I believe the reason should be the same.
this IndexReader is closed only with jar
Hi, I am having a weird experience. I made a few changes to the source code (Lucene 3.3) and created a basic application to test it. First, I added the Lucene 3.3 project to the basic project as a "required project on the build path" so that I could debug. When everything was OK, I removed it from the required projects, built it, and added the resulting jar to the basic application. When I run my basic application with the jar, I get a "this IndexReader is closed" error. When I remove the jar and add the Lucene 3.3 project again as a required project, everything is fine. I have no explanation; can someone explain why this happens? Thanks in advance, ZP

org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:260)
    at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:502)
    at org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:77)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:82)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:66)
    at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:53)
    at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:198)