delete entries from posting list Lucene 4.0

2012-03-19 Thread Zeynep P.
I need to delete entries from a posting list. How can I do this in Lucene
4.0? I need it to test different pruning algorithms.

Thanks in advance

ZP


--
View this message in context: 
http://lucene.472066.n3.nabble.com/delete-entries-from-posting-list-Lucene-4-0-tp3838649p3838649.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: delete entries from posting list Lucene 4.0

2012-03-19 Thread Zeynep P.
That is perfect.
Thank you very much.

Best regards
ZP




Re: delete entries from posting list Lucene 4.0

2012-03-27 Thread Zeynep P.
While using the pruning package, I realised that ridf is calculated in
RIDFTermPruningPolicy as follows:

Math.log(1 - Math.pow(Math.E, termPositions.freq() / maxDoc)) - df

However, according to the original paper (Blanco et al.), residual idf
should be -log(df/D) + log(1 - e^(-tf/D)). Thus, in the code above,
Math.pow should be Math.pow(Math.E, -(termPositions.freq() / maxDoc));
the exponent is missing its negative sign.

Am I missing something in the calculation, or is this a bug?
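As a sanity check, here is a small standalone sketch (my own, not the package's code) of the formula from the paper, with df = document frequency, tf = total term frequency, and D = number of documents. Note the operands are doubles, so the division is not truncated:

```java
public class ResidualIdf {
    // Residual IDF as given in Blanco et al.:
    //   ridf = -log(df / D) + log(1 - e^(-tf / D))
    static double ridf(double df, double tf, double D) {
        double idf = -Math.log(df / D);                    // plain idf term
        double expected = Math.log(1 - Math.exp(-tf / D)); // Poisson-expected part
        return idf + expected;
    }

    public static void main(String[] args) {
        // e.g. a term occurring 150 times in 100 of 10000 documents
        System.out.println(ridf(100, 150, 10000));
    }
}
```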

Thanks in advance
ZP





Wikipedia revision history dump + lucene benchmark

2012-04-10 Thread Zeynep P.
The wikipedia.alg in benchmark is only able to extract and index
current-pages dumps; it does not take revisions into account. Do you know
any way to do this, or should I change EnwikiContentSource to handle the
versions?

Although Wikipedia dumps are widely used, especially for research purposes,
as far as I know there are no topics/qrels for them, except the one at
http://www.mpi-inf.mpg.de/~kberberi/ecir2010/ for the 2001-2005 revision
history dump, which is annotated based on temporal expressions. Do you know
of any others?

By the way, I think that in wikipedia.alg

query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

should be replaced by EnwikiQueryMaker.

Thanks in advance,
Best regards
-- 
ZP




Re: delete entries from posting list Lucene 4.0

2012-04-23 Thread Zeynep P.
Hi,

Thanks for the fix. 

I also wonder if you know of any freely available collections for testing
pruning approaches. Almost all the papers use TREC collections, which I
don't have. For now, I use the Reuters-21578 collection and Carmel's
Kendall's tau extension to measure similarity, but I need a collection with
relevance judgements.

Thanks in advance,
Best Regards
ZP




pruning package- pruneAllPositions

2012-05-02 Thread Zeynep P.
Hi,

In the pruning package, pruneAllPositions throws an exception, although a
comment in the code says it should not happen:

// should not happen!
throw new IOException("termPositions.doc > docs[docsPos].doc");

Can you please explain why it happens and what I should do to fix it?

Thanks in advance,
Best regards
ZP




Re: pruning package- pruneAllPositions

2012-05-07 Thread Zeynep P.
Thanks for the link. I reviewed it. 
Here are more details about the exception:

I used contrib/benchmark/conf/wikipedia.alg to index a Wikipedia dump with
MAddDocs: 20. I wanted to index only a specific period of time, so I
added an if statement in doLogic of the AddDocTask class.
I then tried to prune the index using the pruning package
(CarmelTopKPruning) and got the exception.

I added System.out.println(term); as the first line of initPositionsTerm
and System.out.println("***" + term); as its last line. The Carmel top-k
exception comes from pruneAllPositions (throw new
IOException("termPositions.doc > docs[docsPos].doc");).

For example, for the token body:freely the output was as follows:

body:freely
***body:freely
body:freely
***body:freely
body:freely
***body:freely
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4995)
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4996)
Carmel topk in exception (docs[docsPos].doc = 4414, termPositions.doc() = 4997)
...
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
Carmel topk in exception
body:freely
***body:freely
Carmel topk in exception
Carmel topk in exception
body:freely
***body:freely
body:freely
***body:freely

I hope my problem is clearer now.

Thanks in advance,
Best Regards 
ZP




Re: Measuring precision and recall in lucene to compare two sets of results

2012-05-07 Thread Zeynep P.

Hi,

You can use Kendall's tau. The article "Comparing Top k Lists" by Ronald
Fagin, Ravi Kumar and D. Sivakumar explains different methods.
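For what it's worth, a minimal standalone sketch (my own, not from any package) of Kendall's tau-a over two full rankings; r1[i] and r2[i] are the ranks item i receives in each list, and ties are assumed absent:

```java
public class KendallTau {
    // tau-a = (concordant - discordant) / (n choose 2)
    static double tau(int[] r1, int[] r2) {
        int n = r1.length, concordant = 0, discordant = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // a pair is concordant when both rankings order it the same way
                int a = Integer.compare(r1[i], r1[j]);
                int b = Integer.compare(r2[i], r2[j]);
                if (a * b > 0) concordant++;
                else if (a * b < 0) discordant++;
            }
        }
        return 2.0 * (concordant - discordant) / (n * (n - 1));
    }

    public static void main(String[] args) {
        System.out.println(tau(new int[]{1, 2, 3, 4}, new int[]{1, 2, 3, 4})); // identical: 1.0
        System.out.println(tau(new int[]{1, 2, 3, 4}, new int[]{4, 3, 2, 1})); // reversed: -1.0
    }
}
```

The Fagin et al. paper extends this to top-k lists that do not contain the same items, which is the case the Carmel-style extensions address.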

Best Regards,
ZP





Re: pruning package- pruneAllPositions

2012-06-04 Thread Zeynep P.

Hi,

Thanks for your fix. I used it, but I think there may still be something
wrong, because with the LATimes collection, epsilon = 0.1 and k = 10, I got
a 97% pruned index, i.e. only 3% of the index is left after pruning. In the
original paper, "Static index pruning for IR systems", for the same data
set with the same parameters, they report 36.4%. Has anyone used this
package with the LATimes dataset?

Thanks in advance,
Best regards

ZP




threshold calculation in CarmelTopKTermPruningPolicy

2012-06-12 Thread Zeynep P.
Hi,

In the CarmelTopKTermPruningPolicy class, the threshold is calculated as
follows:

float threshold = docs[k - 1].score - scoreDelta;

docs[k - 1].score corresponds to z_t in the original paper (Carmel et al.
2001), and scoreDelta = epsilon * r.

Could you please explain why it is calculated as "z_t - scoreDelta"? I am
not able to find the corresponding part in the paper.
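To make the question concrete, the computation in the class amounts to something like this (paraphrased by me, not the actual source), where scores are sorted in descending order:

```java
public class CarmelThreshold {
    // docs are sorted by score, descending; z_t is the k-th best score,
    // and scoreDelta = epsilon * r as in the class under discussion.
    static float threshold(float[] scoresDesc, int k, float epsilon, float r) {
        float zt = scoresDesc[k - 1];   // z_t of Carmel et al. 2001
        float scoreDelta = epsilon * r;
        return zt - scoreDelta;         // the "z_t - scoreDelta" I am asking about
    }

    public static void main(String[] args) {
        float[] scores = {5f, 4f, 3f, 2f, 1f};
        System.out.println(threshold(scores, 3, 0.1f, 10f)); // 3.0 - 1.0 = 2.0
    }
}
```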

Thanks in advance
Best Regards,
ZP






pruning package- question about termpositions && skipTo

2012-08-14 Thread Zeynep P.
Hi to all,

In the pruning package, the documentation of the
pruneAllPositions(TermPositions termPositions, Term t) method says:

"termPositions - positioned term positions. Implementations MUST NOT advance
this by calling TermPositions methods that advance either the position
pointer (next, skipTo) or term pointer (seek)."

Why?

Why I need skipTo:

I added a new pruning class with a public void
initPositionsTerm(TermPositions tp, Term t, ScoreDoc[] sdoc) method. I
needed it because my ScoreDoc[] is generated with different external
parameters based on Lucene's basic results. Then, in initPositionsTerm,
instead of letting the method collect docs as the other classes do, docs is
simply set to sdocs. For example, for a term x, sdocs = {42813, 123472,
22477, 76995, 47086, 106424, 68570, 26708, 49740, 116472}; sorted, docs =
{22477, 26708, 42813, 47086, ...}. I just want to keep these postings in my
pruned index.

The problem is that when I call pruneAllPositions as it is, it returns only
{22477, 26708, 107377}. After 28118, super.next() is false in
PruningTermPositions.next(), so (termPositions.doc() == docs[docsPos].doc)
is never true for docIds > 28118. (I have no idea where 107377 comes from;
it is not even in my docs.) However, when I inspect termPositions inside
pruneAllPositions with the code below, it contains all the docIds I need.
That is why I wonder why I cannot call skipTo, and why this happens with
termPositions:

while (termPositions.next())
{
    System.out.println(termPositions.doc());
}

Thanks in advance,
Best Regards






Re: pruning package- question about termpositions && skipTo

2012-08-22 Thread Zeynep P.
Hi to all,

I found the problem and the solution. PruningReader uses
super.getSequentialSubReaders(). After 28118, super.next() is false because
the reader is a sub-reader for a single segment, and indexReader.maxDoc()
is 28118 for that segment. In pruneAllPositions, instead of comparing
termPositions.doc() to the doc id, I compared
in.document(termPositions.doc()).getField("docid").stringValue() to the doc
id.

It happened because of my custom initPositionsTerm method (public void
initPositionsTerm(TermPositions tp, Term t, ScoreDoc[] sdoc)). There is no
problem with the other pruning policies.

DocID     termPositions.doc()
22477     22477
26708     26708
42813     14093
47086     18366
49740     21020
68570     11760
76995     20185
106424    21524
116472    502
123472    1992
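A standalone sketch of what was going on (with hypothetical docBase values, not my index's real ones): each segment sub-reader numbers documents from 0, so a global doc ID must have its segment's docBase subtracted before it can match termPositions.doc() from that sub-reader. With the boundaries chosen below, global id 42813 maps to local id 14093, as in the table.

```java
public class DocBase {
    // starts[i] is the docBase of segment i; the last entry is the index's maxDoc.
    // Returns {segmentIndex, segmentLocalDocId} for a global doc ID.
    static int[] segmentAndLocal(int globalDoc, int[] starts) {
        int seg = 0;
        while (seg + 1 < starts.length - 1 && globalDoc >= starts[seg + 1]) {
            seg++;
        }
        return new int[]{seg, globalDoc - starts[seg]};
    }

    public static void main(String[] args) {
        int[] starts = {0, 28720, 56810, 131896}; // hypothetical segment boundaries
        int[] r = segmentAndLocal(42813, starts);
        System.out.println("segment " + r[0] + ", local id " + r[1]); // 42813 - 28720 = 14093
    }
}
```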

Best Regards








test LA Times with pruning package

2012-09-14 Thread Zeynep P.
Hi to all,

I used the pruning package with the LA Times collection. The initial LA
Times index was created by the lucene benchmark/conf/*.alg; Luke shows
131896 documents with 635614 terms for it. I pruned with the
CarmelTopKPruning policy with epsilon = 0.1, varying k. However, my results
do not correspond to the original paper's (Static Index Pruning for
Information Retrieval Systems by Carmel et al.). The Lucene score function
could be part of the reason, but the difference is large, so I wonder
whether the package has been tested with LA Times and similar results
obtained.

What can be the reason for such a difference? I count the number of
postings by summing, over all terms, counter += te.docFreq();

Do you know of any paper that uses this package for experiments?

k    Prune(%) Original Paper    Prune(%) Pruning Package    # postings in pruned index
1    49,2                       91                          3663309
5    40,2                       90                          4139019
10   36,4                       89                          4485072
15   34,2                       88                          4743474
50   x                          69                          11990022

(# postings in the unpruned index: 37860694)

Thanks in advance,
Best Regards
ZP







pruning & Lucene 4.0

2012-10-12 Thread Zeynep P.
Hi,

Do you have any information about when the pruning package will be
available for Lucene 4.0?

Best Regards
Thanks in advance
ZP








Lucene 4.0 benchmark bug?

2012-10-17 Thread Zeynep P.
Hi to all,

I started to use benchmark 4.0 to create submission report files with the
following code:

BufferedReader br = new BufferedReader(fr);
QualityQuery qqs[] = qReader.readQueries(br);
QualityQueryParser qqParser = new SimpleQQParser("title", "body");
QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docname");
SubmissionReport submitLog = new SubmissionReport(loggertest, "test");
QualityStats stats[] = qrun.execute(null, submitLog, null);

My index was created by Lucene 3.6, and I use the LA Times topics 401-450.
With 3.6 there is no problem. However, with benchmark 4.0 I realised that
it returns results only for the first query, 401, which is "foreign
minorities, Germany". When I debug the code, the boolean query generated at
SimpleQQParser is "body:foreign", without the other keywords. Debugging
further, the problem seems to arise at QueryParserBase.newFieldQuery, which
returns null for all the remaining queries and for the other keywords of
the same query. I patched the code for my ad-hoc use; otherwise I don't
know how to fix it. Has this happened to anyone else?


Second problem: for the same collection, MAP = 0.17 with the default
similarity and MAP = 0.07 with the Lucene 4.0 BM25 similarity (b=0.75,
k1=1.2). I got MAP = 0.14 with a BM25 implemented based on
http://ipl.cs.aueb.gr/stougianni/bm25_2.html. However, this collection is
reported in the literature with MAP around 0.25 with the BM25 scoring
function. Has someone evaluated the different similarities and can share
the results?
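For comparison, this is the textbook BM25 term score I am measuring against (my own sketch; Lucene 4.0's BM25Similarity differs in details such as how it encodes document length norms, so some difference is expected):

```java
public class Bm25 {
    // Classic BM25 weight of one query term in one document:
    //   idf(df) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    static double score(double tf, double df, double N,
                        double dl, double avgdl, double k1, double b) {
        double idf = Math.log((N - df + 0.5) / (df + 0.5));
        double norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl));
        return idf * norm;
    }

    public static void main(String[] args) {
        // a term with tf=3, df=100 in a 131896-doc collection,
        // in a document slightly longer than average
        System.out.println(score(3, 100, 131896, 500, 450, 1.2, 0.75));
    }
}
```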

Best Regards,
ZP








Re: pruning & Lucene 4.0

2013-02-20 Thread Zeynep P.
Hi,

any news since? 

Thanks,
Best regards,
ZP






Re: Scoring function in LMDirichletSimilarity Class

2013-04-02 Thread Zeynep P.
Hi,

I have the same question about the LMJelinekMercerSimilarity class:

  protected float score(BasicStats stats, float freq, float docLen) {
    return stats.getTotalBoost() *
        (float)Math.log(1 + ((1 - lambda) * freq / docLen) /
            (lambda * ((LMStats)stats).getCollectionProbability()));
  }

whereas the standard Jelinek-Mercer formulation would be

  score = Math.log((1 - lambda) * freq / docLen +
      lambda * ((LMStats)stats).getCollectionProbability());

I also get much worse results after updating the code to the latter.

Why is it calculated this way?
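One possible explanation (my guess, not something I found documented): the two forms differ only by log(lambda * p(t|C)), a term that does not depend on the document, since log((1-l)*tf/dl + l*pC) = log(l*pC) + log(1 + ((1-l)*tf/dl)/(l*pC)). Dropping a document-independent term does not change the ranking but keeps each term's contribution positive. A quick numeric check of that identity:

```java
public class JmIdentity {
    // Standard Jelinek-Mercer log score
    static double full(double lambda, double freq, double docLen, double pC) {
        return Math.log((1 - lambda) * freq / docLen + lambda * pC);
    }

    // The form used in LMJelinekMercerSimilarity.score (boost omitted)
    static double lucene(double lambda, double freq, double docLen, double pC) {
        return Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda * pC));
    }

    public static void main(String[] args) {
        double lambda = 0.7, pC = 0.001;
        double diff = full(lambda, 3, 100, pC) - lucene(lambda, 3, 100, pC);
        // diff equals log(lambda * pC), independent of freq and docLen
        System.out.println(diff + "  vs  " + Math.log(lambda * pC));
    }
}
```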

Thanks in advance,

Best regards,
ZP

P.S.: Instead of creating a new question, I am using your thread because I
believe the reason is the same.






this IndexReader is closed only with jar

2011-10-17 Thread Zeynep P.
Hi,

I am having a weird experience. I made a few changes to the source code
(Lucene 3.3) and created a basic application to test them. First, I added
the Lucene 3.3 project to the basic project as a "required project on the
build path" to be able to debug. When everything was OK, I removed it from
the required projects, built it, and added the jar to the basic
application. When I run my basic application with the jar, I get a "this
IndexReader is closed" error. When I remove the jar and add the Lucene 3.3
project again as a required project, everything is OK. I have no
explanation. Can someone explain why this happens?

Thanks in advance
ZP

org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:260)
    at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:502)
    at org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:77)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:82)
    at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:66)
    at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:53)
    at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:198)
