I have an index of about 30 million short strings; the index size is
about 3GB on disk.
I have given the JVM 5GB of memory with default settings, on Ubuntu 12.04 with Sun JDK 7.
When I use 20 threads, it's OK. But if I run 30 threads, after a
while the JVM is doing nothing but GC.
the lucene version is
ConcurrentHashMap$HashEntry and
IdentityWeakReference. it seems lucene wants to cache something using
WeakIdentityMap.
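If the thread count itself is the trigger, one generic mitigation (independent of the WeakIdentityMap caching above) is to cap search concurrency with a fixed pool instead of running one thread per request, so extra requests queue rather than each holding live per-thread search state on the heap. This is a plain-Java sketch; `runSearches` and the thread cap are illustrative, not part of any Lucene API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledSearch {
    // Run `tasks` search jobs through a pool capped at `maxThreads`.
    // Requests beyond the cap wait in the pool's queue instead of all
    // allocating per-thread search state at once.
    static int runSearches(int tasks, int maxThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                // the actual Lucene search would run here
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return completed.get();
    }
}
```

With a cap of 20 the 30 requests still all complete; they just never run more than 20 at a time.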
On Mon, Sep 22, 2014 at 9:17 PM, Shawn Heisey s...@elyograg.org wrote:
On 9/22/2014 6:42 AM, Li Li wrote:
I have an index of about 30 million short strings, the index size is
about
I have read
http://lucene.472066.n3.nabble.com/Indexing-Boolean-Expressions-td3762960.html.
is it now available in lucene?
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail:
configuration should be used will be
welcome.
Dawid
On Sat, Jun 23, 2012 at 12:38 PM, Li Li fancye...@gmail.com wrote:
http://mg4j.di.unimi.it/
http://vigna.di.unimi.it/papers.php#VigQSI
sounds very interesting and attractive
what new features are added in the 4.0 alpha? And what is not finished for the 4.0
final release?
On 2012-6-26 at 5:28 AM, Robert Muir rcm...@gmail.com wrote:
artifacts are here:
http://people.apache.org/~rmuir/staging_area/lucene-solr-4.0aRC1-rev1353699/
Here is my +1
--
lucidimagination.com
hi all,
I am looking for a 'BooleanMatcher' in lucene. for many
applications, we don't need to order matched documents by relevance scores;
we just want the boolean query. But BooleanScorer/BooleanScorer2
is a little bit heavy because of its relevance scoring.
one use case is: we have
what's the hardware configuration of your machines?
if you have enough RAM, you could use RAMDirectory.
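Before choosing RAMDirectory, a quick sanity check is to compare the on-disk index size against the free heap. This is a plain-Java sketch (the directory path is an example, and it ignores Lucene's own in-memory overhead on top of the raw file bytes):

```java
import java.io.File;

public class IndexSizeCheck {
    // Sum the sizes of all files in an index directory. If the total is
    // comfortably below the free heap, loading the index into a
    // RAMDirectory is at least plausible.
    static long indexSizeBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) total += f.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // "index" is an example path; pass your real index directory.
        long size = indexSizeBytes(new File(args.length > 0 ? args[0] : "index"));
        Runtime rt = Runtime.getRuntime();
        long freeHeap = rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
        System.out.println("index=" + size + " bytes, free heap~" + freeHeap + " bytes");
    }
}
```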
On Tue, May 8, 2012 at 2:52 PM, parkhekishor kishor.par...@highmark.in wrote:
Hi,
I have an index with size 1GB. Each of its documents consists of five fields which
are used for search. For a single
as for versions below 4.0, it's not possible, because of lucene's scoring
model. position information is stored, but only used to support phrase
queries. it just tells us whether a document is matched, but we can't boost
a document by position. A similar problem is how to implement proximity boost.
for 2 search
any one could help? thanks
-- Forwarded message --
From: Li Li fancye...@gmail.com
Date: Sat, Apr 28, 2012 at 7:11 PM
Subject: question about NRT(soft commit) and Transaction Log in trunk
To: solr-u...@lucene.apache.org
hi
I checked out the trunk and played with its new soft
minimumNrMatchers do you have? can you upload your test to github?
(the mailing list strips attachments off)
On Thu, Apr 19, 2012 at 7:34 AM, Li Li fancye...@gmail.com wrote:
Michael McCandless wrote:
So... the good news is I made a new scorer (basically copied
DisjunctionMaxScorer and then tweaked from
I found the patch has removed the score(Collector) and
score(Collector,int,int) methods. This seems to mean that this scorer cannot
be used as a top-level scorer. Why did the old implementation have these methods?
Does anything use DisjunctionSumScorer as a top-level scorer?
On Wed, Apr 18, 2012 at 12:18 PM, Li
small addition, I'll post it in comments soon). By using it I
have disjunction summing query with steady subscorers.
Regards
On Tue, Apr 17, 2012 at 2:37 PM, Li Li fancye...@gmail.com wrote:
hi all,
I am now hacking the BooleanScorer2 to let it keep the docID() of the
leaf scorer(mostly
hi all,
I am now hacking the BooleanScorer2 to let it keep the docID() of the
leaf scorer(mostly possible TermScorer) the same as the top-level Scorer.
Why I want to do this is: When I Collect a doc, I want to know which term
is matched(especially for BooleanClause whose Occur is SHOULD). we
some mistakes in the example:
after the first call, advance(5),
currentDoc=6;
the first scorer's nextDoc has already been called in advance, so the heap is empty now.
then we call advance(6):
because scorerDocQueue.size() < minimumNrMatchers, it just returns
NO_MORE_DOCS
On Tue, Apr 17, 2012 at 6:37 PM, Li Li
is absolutely useful
(with one small addition, I'll post it in comments soon). By using it I
have disjunction summing query with steady subscorers.
Regards
On Tue, Apr 17, 2012 at 2:37 PM, Li Li fancye...@gmail.com wrote:
hi all,
I am now hacking the BooleanScorer2 to let it keep the docID
me too. maybe it should provide a one-stage component.
On Fri, Apr 13, 2012 at 1:41 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
Btw, I always wanted to ask why it's always done in two stages. It seems
to me that it's purposed for a specific use case. But shouldn't we have an
it's not possible now, because lucene doesn't support this.
when doing a disjunction query, it only records how many terms match the
document.
I think this is a common requirement for many users.
I suggest lucene should divide the scorer into a matcher and a scorer:
the matcher just returns which doc is
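The proposed matcher/scorer split could look roughly like this. These are hypothetical interfaces written to illustrate the suggestion, not Lucene's actual classes:

```java
// Hypothetical sketch of the proposed split: a Matcher only answers
// "which docs match"; a DocScorer adds relevance on top. Not Lucene's API.
interface Matcher {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Advance to the next matching doc, or NO_MORE_DOCS when exhausted.
    int nextDoc();
}

interface DocScorer extends Matcher {
    // Relevance of the current doc; pure-boolean queries never call this.
    float score();
}

// A matcher over a sorted posting list needs no scoring state at all.
class PostingListMatcher implements Matcher {
    private final int[] postings;
    private int pos = -1;

    PostingListMatcher(int[] postings) { this.postings = postings; }

    public int nextDoc() {
        pos++;
        return pos < postings.length ? postings[pos] : NO_MORE_DOCS;
    }
}
```

A pure-boolean query engine would then consume only the Matcher side and skip all score bookkeeping.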
-- Forwarded message --
From: Li Li fancye...@gmail.com
Date: Wed, Apr 11, 2012 at 4:59 PM
Subject: Re: using solr to do a 'match'
To: solr-u...@lucene.apache.org
I searched my mail but found nothing.
the thread found by the keywords "boolean expression" is Indexing Boolean
hi all,
maybe it's not a suitable question here, but I want advice from
experts who know the details of lucene indexing.
we modified lucene 2.9.1 and added a feature we call attribute fields,
which might be called column-based storage in many K-V systems,
because we need to frequently update
hi all,
we have used solr to provide search services in many products. I
found that for each product, we have to write some configuration and query
expressions.
our users are not used to this. they are familiar with SQL, and they may
describe a need like this: I want a query that can search books whose
I just want solr providing this new feature and also want to know whether
any other users need this feature. if possible, I'd like to participate in
it.
On Tue, Feb 7, 2012 at 5:31 PM, Michael Wechner
michael.wech...@wyona.comwrote:
Am 07.02.12 10:24, schrieb Li Li:
hi all,
we have used
from http://jsqlparser.sourceforge.net/
I have experimented with it to check for SQL injection.
On Tue, Feb 7, 2012 at 5:54 PM, Michael Wechner
michael.wech...@wyona.comwrote:
Am 07.02.12 10:43, schrieb Li Li:
I just want solr providing this new feature and also want to know whether
any other users
hi all
when I run/debug solr in eclipse, an error occurred.
Caused by: java.lang.IllegalArgumentException: A SPI class of type
org.apache.lucene.index.codecs.Codec with name '' does not exist. You need to add the
corresponding JAR file supporting this SPI to your classpath. The current
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
*From:* Li Li [mailto:fancye...@gmail.com]
*Sent:* Thursday, December 01, 2011 12:28 PM
*To:* dev@lucene.apache.org
*Subject:* Codec with name 'Lucene40
hi all,
In current lucene versions (2.x/3.x), we can hardly modify the scoring
of documents, because lucene originally adopted the VSM model and the matching
phase and ranking phase are integrated.
But in many situations, we usually use a complicated boolean query to
filter out unrelated documents
hi all,
I tested it following the instructions at
http://wiki.apache.org/solr/SpellCheckComponent, but it seems something is
wrong.
the sample url in the wiki is
hi all,
I follow the wiki http://wiki.apache.org/solr/SpellCheckComponent
but there is something wrong.
the URL given by the wiki is
hi all
I read the Lucene Google Summer of Code 2011 page at
http://wiki.apache.org/lucene-java/SummerOfCode2011. Document
identifier reassignment
could make the index smaller. Even simply sorting documents by URL will
achieve many benefits. There are many research papers about this
topic. Maybe
we
for some time
delPolicy.setReserveDuration(indexVersion, reserveCommitDuration);
}
So my supposed extreme will never happen.
2011/3/11 Li Li fancye...@gmail.com:
-- Forwarded message --
From: Li Li fancye...@gmail.com
Date: 2011/3/11
Subject: Problem
hi all,
The replication handler in solr 1.4 which we use seems to be a
little problematic in some extreme situations.
The default reserve duration is 10s and can't be modified by any method:
private Integer reserveCommitDuration =
SnapPuller.readInterval("00:00:10");
The current
-- Forwarded message --
From: Steven A Rowe sar...@syr.edu
Date: 2011/3/11
Subject: RE: Problem of Replication Reservation Duration
To: solr-...@lucene.apache.org solr-...@lucene.apache.org
Hi Li Li,
Please do not use the solr-dev mailing list - Solr and Lucene
development both
-- Forwarded message --
From: Li Li fancye...@gmail.com
Date: 2011/3/11
Subject: Problem of Replication Reservation Duration
To: solr-...@lucene.apache.org
hi all,
The replication handler in solr 1.4 which we use seems to be a
little problematic in some extreme situations
hi
when we have more and more machines, we have replicas of indexes and
also many shards which each hold a part of the whole index.
we have to do many things: fail-over, load balancing, log collection,
monitoring each machine's status,
Solr only does a small number of these now.
decoder?
a partial decoder for PFOR may need many if/else branches and will be slower.
Does anyone have a solution for this?
2010/12/27 Li Li fancye...@gmail.com:
I integrated the pfor codec into lucene 2.9.3, and the search time
comparison is as follows:
single term and query
great things.
But I think the patch is different from the method in that paper.
my colleague tested this patch but didn't get good results
(I don't know the details well; he just told me his experience)
2011/2/15 Andrzej Bialecki a...@getopt.org:
On 2/15/11 11:57 AM, Li Li wrote:
hi all
hi all,
I recently read the paper "Pruning Policies for Two-Tiered Inverted
Index with Correctness Guarantee". Its idea is interesting, and I
have some questions I'd like to share with you.
its idea is to prune unlikely documents for certain terms
e.g.
term1 d1 d3 d6 | d9 d7 d8
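The two-tier layout in the example (term1 d1 d3 d6 | d9 d7 d8) can be sketched as splitting each term's posting list by a per-document impact threshold: a small first tier searched first, and a pruned second tier consulted only when needed. The impact-threshold rule below is a simplified stand-in for the paper's actual pruning policy:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoTierPostings {
    // Split one term's postings into a "likely" first tier (high impact)
    // and a pruned second tier. Queries scan tier 1 first and fall back
    // to tier 2 only when tier 1 cannot guarantee a correct top-k.
    static List<int[]> split(int[] docs, float[] impacts, float threshold) {
        List<Integer> tier1 = new ArrayList<>(), tier2 = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            (impacts[i] >= threshold ? tier1 : tier2).add(docs[i]);
        }
        return List.of(toArray(tier1), toArray(tier2));
    }

    private static int[] toArray(List<Integer> xs) {
        int[] out = new int[xs.size()];
        for (int i = 0; i < out.length; i++) out[i] = xs.get(i);
        return out;
    }
}
```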
hi all
when using post.jar to post very large XML files to solr, like
java -Xmx1g -Durl=http://localhost/solr/update -jar post.jar
sample.xml,
it will use a lot of Java heap space; e.g. in our case, we will post XML
files larger than 1GB.
Because the UpdateHandler in solr will do many things. It
?
This could also be a hardware issue in your test env. If you run
CheckIndex on the corrupt index does it always fail in the same way?
Mike
On Fri, Jan 14, 2011 at 6:43 AM, Li Li fancye...@gmail.com wrote:
hi all,
we have confronted this problem 3 times when testing
The exception stack
does docvalues (which adds column-stride fields) mean stored-but-not-indexed fields
which can be modified without needing a reindex?
we simply implemented this based on lucene 2.9.1 and integrated it into solr 1.4.
it works well for short fields such as click count, page rank, etc.
these values
changed
hi all,
we have confronted this problem 3 times when testing
The exception stack is
Exception in thread Lucene Merge Thread #2
org.apache.lucene.index.MergePolicy$MergeException:
org.apache.lucene.index.CorruptIndexException: docs out of order (7286
<= 7286 )
at
we have recently become interested in this problem. if we come up with a
patch, I'd like
to share it with everyone.
2011/1/4 Michael McCandless luc...@mikemccandless.com:
2011/1/4 Li Li fancye...@gmail.com:
I agree with you that we should not tie concurrency w/in a single search to
index segments
collector per thread and then merge in the end (this is a similar
tradeoff as we've discussed on the per-segment collectors).
Mike
2011/1/1 Li Li fancye...@gmail.com:
I sent a mail to MG4J group and Sebastiano Vigna recommended the paper
Reducing query latencies in web search using fine
also very useful.
On Dec 31, 2010, at 7:25 AM, Li Li wrote:
Which one is used in MG4J to support multithreaded searching? Are
2010/12/31 Li Li fancye...@gmail.com:
is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/)
it says Multithreading. Indices can be queried and scored
is coarse-
grained.
2010/12/30 Michael McCandless luc...@mikemccandless.com:
On Mon, Dec 27, 2010 at 5:08 AM, Li Li fancye...@gmail.com wrote:
I integrated the pfor codec into lucene 2.9.3, and the search time
comparison is as follows:
single term and query
searching multiple segments is an alternative solution, but it has some
disadvantages.
1. idf is not global? (I am not familiar with its implementation) maybe
it's easy to solve by sharing a global idf
2. each segment will have its own tii and tis files, which may make
search slower (that's why
plus,
2 means searching a term needs to seek many times in tis (if it's not cached in tii)
2010/12/31 Li Li fancye...@gmail.com:
searching multiple segments is an alternative solution, but it has some
disadvantages.
1. idf is not global? (I am not familiar with its implementation) maybe
it's easy to solve
is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/)
it says Multithreading. Indices can be queried and scored concurrently.
maybe we can learn something from it.
2010/12/31 Li Li fancye...@gmail.com:
plus,
2 means searching a term needs to seek many times in tis (if it's not cached in tii
was created).
Mike
On Wed, Dec 22, 2010 at 9:45 PM, Li Li fancye...@gmail.com wrote:
I used the bulkpostings
branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene)
does trunk have PForDelta decoder/encoder ?
2010/12/23 Michael McCandless luc...@mikemccandless.com:
Those
I am also interested in this question.
And my understanding may be wrong.
2010/12/27 xu cheng xcheng@gmail.com:
Hi all:
I'm new to lucene dev. These days I'm reading the lucene source code, and
now I have some difficulties understanding the index chain.
I could not understand
branch for this
test?
Mike
On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote:
great improvement!
I did a test on our data set. The doc count is about 2M+ and the index size
after optimization is about 13.3GB (including fdt).
it seems lucene4's index format is better than lucene2.9.3's
(I think
maybe because Simple16 can't handle ints >= 2^28?).
Mike
On Sun, Dec 19, 2010 at 10:06 PM, Li Li fancye...@gmail.com wrote:
is ForDecompressImpl generated by code or manually coded?
I am frustrated by
http://code.google.com/p/integer-array-compress-kit/ which contains
too many bugs
OK we should have a look at that one still. We need to converge on a
good default codec for 4.0. Fortunately it's trivial to take any int
block encoder (fixed or variable block) and make a Lucene codec out of
it!
I suggest you not use this one; I fixed dozens of bugs but it
still failed
to a clean PFor/For impl) now...
Mike
On Thu, Dec 16, 2010 at 4:29 AM, Li Li fancye...@gmail.com wrote:
hi Michael,
lucene 4 has so many changes that I don't know how to index and
search with a specified codec. could you please give me some code
snippets using the PFor codec so I can trace
McCandless luc...@mikemccandless.com:
Hi Li Li,
That issue has such a big patch, and enough of us are now iterating on
it, that we cut a dedicated branch for it.
But note that this branch is off of trunk (to be 4.0).
You should be able to do this:
svn checkout
https://svn.apache.org/repos/asf
get a
sense of what speedups a real app will see... micro-benching is nearly
impossible in Java since Hotspot acts very differently vs the real
test.
Mike
On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote:
Hi
I tried to integrate PForDelta into lucene 2.9 but confronted
thanks.
I'd like to try it and do some experiments on our dataset.
2010/12/15 Michael McCandless luc...@mikemccandless.com:
Hi Li Li,
That issue has such a big patch, and enough of us are now iterating on
it, that we cut a dedicated branch for it.
But note that this branch is off of trunk
then we get a
sense of what speedups a real app will see... micro-benching is nearly
impossible in Java since Hotspot acts very differently vs the real
test.
Mike
On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote:
Hi
I tried to integrate PForDelta into lucene 2.9 but confronted
Hi
I tried to integrate PForDelta into lucene 2.9 but confronted a problem.
I use the implementation in
http://code.google.com/p/integer-array-compress-kit/
it implements a basic PForDelta algorithm and an improved one (which is
called NewPForDelta, but there are many bugs and I have fixed
have to walk the exceptions up
front?
That's a nice idea to interleave the encoding of 128 docDeltas then
128 docFreqs... in trunk's Sep codec we currently put them in separate
files.
Mike
On Wed, Nov 24, 2010 at 5:16 AM, Li Li fancye...@gmail.com wrote:
hi all
I want to improve our
hi all
I want to improve our search engine's throughput without any help from
hardware improvements. My task now is optimizing the index format, and we
have some experience modifying the index format, from reading the code of
lucene and writing a little code, such as implementing a bitmap for
high-frequency terms,
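The bitmap-for-high-frequency-terms idea can be sketched with java.util.BitSet. The sizes and doc IDs below are illustrative, and a real codec would also need to pick a document-frequency cutoff above which one bit per doc beats a delta-compressed doc list:

```java
import java.util.BitSet;

public class BitmapPostings {
    // For a term matching a large fraction of documents, one bit per doc
    // is smaller than a compressed doc list and supports O(1) membership
    // tests plus fast intersection via bitwise AND.
    private final BitSet bits;

    BitmapPostings(int maxDoc, int[] docs) {
        bits = new BitSet(maxDoc);
        for (int d : docs) bits.set(d);
    }

    boolean contains(int doc) {
        return bits.get(doc);
    }

    // Intersect with another posting list in place (boolean AND query).
    BitmapPostings and(BitmapPostings other) {
        bits.and(other.bits);
        return this;
    }

    int cardinality() {
        return bits.cardinality();
    }
}
```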
hi all
I confronted a strange problem when feeding data to solr. I started
feeding and then hit Ctrl+C to kill the feed program (post.jar). Then, because
the XML stream was terminated abnormally, DirectUpdateHandler2 threw
an exception. I went to the index directory and sorted it by date; the
newest files are
thank you.
2010/11/8 Uwe Schindler u...@thetaphi.de:
You have to also use Solr 4.0 :-)
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
-Original Message-
From: Li Li [mailto:fancye...@gmail.com]
Sent: Monday, November 08
?
it seems http://svn.apache.org/repos/asf/lucene/dev/ is the currently
developed version, and
http://svn.apache.org/repos/asf/lucene/java/ is the old version before 3.0.
So I just need http://svn.apache.org/repos/asf/lucene/dev/ and
https://svn.apache.org/repos/asf/lucene/dev/?
2010/11/8 Li Li fancye...@gmail.com
hi all,
When will lucene 4.0 be released?
I want to replace VInt compression with faster ones such as
PForDelta. In my application, decompressing a docList of 10M takes
about 300ms. In "Performance of Compressed Inverted List Caching in
Search Engines" by J. Zhang and X. Long, 17th
thank you.
so if I want to use a new compress/decompress algorithm, I must use
lucene 4.0 from svn? Is there any patch for an old release such as
2.9? Because I need solr 1.4, which is based on lucene 2.9
2010/11/8 Simon Willnauer simon.willna...@googlemail.com:
Li Li,
there is no official
hi all
we found that function calls in Java can cost much time, e.g. replacing
Math.min(a, b) with a < b ? a : b will make it faster. Another example is lessThan
in PriorityQueue when using a Collector to gather the top K documents. Yes,
using functions and subclasses makes code easy to maintain and extend; in
C/C++, we can use
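The Math.min claim is easy to check for correctness, though the speed difference should be measured rather than assumed: on modern HotSpot, Math.min is usually compiled as an intrinsic, so any gap between the two forms may vanish after JIT warmup. A minimal sketch:

```java
public class MinCall {
    // The ternary form suggested in the post as a replacement for Math.min.
    static int ternaryMin(int a, int b) {
        return a < b ? a : b;
    }

    public static void main(String[] args) {
        // Correctness first: both forms must agree on every input.
        for (int a = -3; a <= 3; a++) {
            for (int b = -3; b <= 3; b++) {
                if (ternaryMin(a, b) != Math.min(a, b)) {
                    throw new AssertionError(a + "," + b);
                }
            }
        }
        // Any timing comparison belongs in a proper benchmark harness,
        // with warmup, not in a naive loop here.
        System.out.println("ok");
    }
}
```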
is there anyone who could help me?
2010/10/11 Li Li fancye...@gmail.com:
hi all,
I want to know the details of IndexReader in SolrCore. I have read a
little of SolrCore's code. Here is my understanding; is it correct?
Each SolrCore has many SolrIndexSearchers and keeps them in
_searchers
hi all
I want to speed up search time for my application. In a query, the
time is largely spent reading posting lists (IO with frq files) and
calculating scores and collecting results (CPU, with a priority queue). IO is
hard to optimize, or is already partly optimized by nio. So I want to use
multithreading to
yes, there is a MultiSearcher in lucene, but its idf across the 2 indexes is
not global. maybe I can modify it and also the index, like:
term1 df=5 doc1 doc3 doc5
term1 df=5 doc2 doc4
2010/9/28 Li Li fancye...@gmail.com:
hi all
I want to speed up search time for my application. In a query
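Making idf global across two indexes amounts to summing each term's df over all shards before computing idf, so every shard scores with the same value. The sketch below uses the classic log(N/df) form, which differs from Lucene's actual Similarity formula:

```java
public class GlobalIdf {
    // Sum per-shard document frequencies for a term, then compute one
    // idf that every shard uses, instead of each shard's local idf.
    static double globalIdf(long totalDocs, long[] perShardDf) {
        long df = 0;
        for (long d : perShardDf) df += d;
        return Math.log((double) totalDocs / df);
    }
}
```

For the "term1 df=5" example above, two shards with df 3 and 2 over a combined corpus would both score with idf computed from df=5.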
I think the current implementation is slow, because it does collapsing on all
the hit docs. In our environment, it will take more than 1s when using
collapsing and only 200ms-300ms when not. So we
modified it as follows: when the user needs the top 100 docs, we collect the top 200 docs
and do collapsing within these 200
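The modification described (collect the top 200, then collapse down to 100) can be sketched as a late collapse over an already-ranked candidate list. The collapse keys and counts here are placeholders; a real implementation would pull the key from a field cache per doc:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LateCollapse {
    // Take ~2*k already-ranked candidate docs and keep only the first
    // (best-ranked) doc per collapse key, stopping at k survivors,
    // instead of collapsing every hit in the index.
    static int[] collapseTopK(int[] rankedDocs, String[] keys, int k) {
        Set<String> seen = new HashSet<>();
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < rankedDocs.length && out.size() < k; i++) {
            if (seen.add(keys[i])) out.add(rankedDocs[i]);
        }
        int[] res = new int[out.size()];
        for (int i = 0; i < res.length; i++) res[i] = out.get(i);
        return res;
    }
}
```

The tradeoff is that if more than half of the 200 candidates share keys, fewer than 100 distinct docs survive, which matches the approximation the post accepts.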
I have about 70k documents; the total indexed size is about 15MB (the
original text files' size).
dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, ...);
for (loop) {
    writer.addDocument(doc);
}
create a web project
copy all source codes to src
copy all jsp to WebContent
configure tomcat with -Dsolr.solr.home=
2010/7/23 pavan kumar donepudi pavan.donep...@gmail.com:
HI,
Can anyone help me with instructions on how to use eclipse for solr
development? I want to configure Solr in
Or where can I find any improvement proposals for lucene?
e.g. I want to change floating-point multiplication to integer
multiplication, or use a bitmap for high-frequency terms, or something
else like this. Is there any place where I can find such resources or
people?
thanks.
hi all,
I want to implement a query that takes positions and terms'
relative positions into consideration. lucene only supports multi-term
queries like the boolean OR query,
but I want to consider term position and terms' relative positions.
e.g. there are two docs:
doc1 apache lucene is a
...
On Jun 4, 2010, at 2:36 AM, Li Li wrote:
hi all,
I want to implement a query that takes positions and terms'
relative positions into consideration. lucene only supports multi-term
queries like the boolean OR query,
but I want to consider term position and terms' relative positions.
e.g