Re: Weighted Query Sequence
Sounds custom-made for boosting. Depending on how you are structuring your fields and queries you could use either index-time or query-time boosts, or even both.
http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_difference_between_field_.28or_document.29_boosting_and_query_boosting.3F

--
Ian.

2011/10/31 Shengtao Lei :
> Hello everyone!
>
> I'm struggling with my degree paper. My research project is to build a search
> engine for a language which has many affixes and prefixes.
> I have read many papers; the common approach is stemming.
> My segmentation processor can cut off the affixes and prefixes, but for this
> language I can't simply remove them (my supervisor said so).
>
> What I should do is:
> If the user inputs a query like "root + affix1 + affix2", it means "root" is
> the most important term, and "affix1" and "affix2" follow "root".
> If "root + affix1 + affix2" is found in the doc, that is the best result. If
> not, a "root + affix1" match is better; if not, a "root" match is also OK.
>
> How can I construct my query and search using the existing API?
> Every piece of advice is appreciated! Thank you very much!
>
> Sincerely
> Scott Lei

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
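[Editor's note: a minimal sketch of the query-time boosting Ian suggests, using the Lucene 3.x API current at the time of this thread. The field name and boost values are illustrative assumptions, not part of the original message. The idea: add the full phrase and each shorter prefix of it as SHOULD clauses, with the longest sequence boosted highest, so "root + affix1 + affix2" matches outrank "root + affix1", which outrank bare "root".]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class AffixBoostQuery {

    // Build a query where "root affix1 affix2" scores higher than
    // "root affix1", which scores higher than "root" alone.
    public static BooleanQuery build(String field, String root, String... affixes) {
        BooleanQuery bq = new BooleanQuery();
        // Add one phrase clause per prefix length, longest first.
        for (int len = affixes.length; len >= 0; len--) {
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term(field, root));
            for (int i = 0; i < len; i++) {
                pq.add(new Term(field, affixes[i]));
            }
            pq.setBoost(1.0f + len);  // illustrative: longer match, higher boost
            bq.add(pq, BooleanClause.Occur.SHOULD);
        }
        return bq;
    }
}
```

Because every clause is SHOULD, a document matching only the bare root still matches, but anything matching a longer affix sequence also matches the higher-boosted clauses and ranks above it.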
Re: index bigger than it should be?
Do the individual docs get bigger after 28 million? Can you try loading the last few million docs, from when the size jumps, and see what happens? Or load them in reverse order or something, again to see what happens? I don't have indexes with that many docs, but I believe that plenty of people do.

--
Ian.

On Sun, Oct 30, 2011 at 9:01 AM, wrote:
> Hi,
>
> I did the following on the existing index:
> - expunge deletes
> - optimize(5)
> - check index
>
> Then from the existing index I exported all docs into a new one, then on
> the new one I did:
> - optimize(5)
> - check index
>
> The entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt
>
> During the export, I also monitored the size on disk at each chunk of
> 10 docs added to the new index:
> http://dl.dropbox.com/u/47469698/lucene/index.xls
>
> What I found was that the index was taking around 2400 Mb/million docs
> almost all the time, and from time to time it would take a little bit more
> (<3500) during a short period of time. This stays true until around 28
> million docs, where the size on disk increases a lot (4500 Mb/million docs
> = 135 Gb on disk) until the end of the export (my index contains 32
> million docs). At the end the space on disk went from 134 Gb to 91 Gb
> thanks to the optimize. But even at 91 Gb for 32 million docs, it is
> still 3000 Mb/million docs, far more than the 2400 I was seeing most of
> the time.
>
> I understand that merges happen; what I was surprised about was that the
> behavior between 28 and 32 million docs was a lot bigger in scale than the
> other merges before, and even an optimize would not solve this entirely.
> Did I reach a limit? Should I maintain the index at 25 million to avoid
> this behavior?
>
> I am using lucene 3.4 with the tiered merge policy and all the fields are
> stored.
>
> Thanks,
>
> Vincent Sevel
>
>
> Ian Lea
> Sent by: java-user-return-51136-v.sevel=lombardodier@lucene.apache.org
> 27.10.2011 15:28
> Please respond to java-user@lucene.apache.org
>
> To: java-user@lucene.apache.org
> Subject: Re: index bigger than it should be?
>
> There's org.apache.lucene.index.CheckIndex which will report assorted
> stats about the index, as well as checking it for correctness. It can
> fix it too but you don't need that. I hope. Will take quite a while
> to run on a large index.
>
> What version of lucene? Does a before/after (or large/small)
> directory listing give any clues?
>
> --
> Ian.
>
> On Thu, Oct 27, 2011 at 12:44 PM, wrote:
>> Hi,
>>
>> I have an application that has an index with 30 million docs in it. Every
>> day, I add around 1 million docs, and I remove the oldest 1 million, to
>> keep it stable at 30 million.
>> For the most part doc fields are indexed and stored. Each doc weighs
>> from a few Kb to 1 Mb (a few Mb in some cases).
>> I used to be able to maintain the index at around 60 Gb on disk, but
>> recently the index has had a tendency to keep growing (90 Gb). I can see
>> that the expunge is doing what it should do, because after it executes,
>> the size on disk does go down, but never as low as the previous day. From
>> the outside, it looks like a leak, but since I do not remove the docs I
>> added during the day, it might be that the new docs are just bigger than
>> the old ones. Still, I am surprised by the increase.
>>
>> Are there any tools to dig into the index structure and help justify the
>> space taken on disk?
>> I was thinking about something that would help identify terms that take up
>> the most space, or some sort of dump that I could compare from one day to
>> the other.
>>
>> Any help appreciated,
>>
>> Thanks,
>>
>> Vince
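[Editor's note: the CheckIndex tool Ian mentions can also be run from the command line. The jar name and index path below are illustrative; adjust them to your installation.]

```shell
# Run CheckIndex as a diagnostic against an index directory.
# Without -fix it is read-only and just reports per-segment stats
# (doc counts, deletions, sizes) plus any corruption it finds.
java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex /path/to/index
```

The per-segment size breakdown in its output is a reasonable first step toward explaining where the disk space is going.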
Re: IndexReader#reopen() on externally changed index
That's a good idea, if your index is "large enough", and/or you make heavy use of FieldCache (eg, sorting by field), regardless of whether you use NRT or "normal" commit + reopen to reopen your reader.

Mike McCandless
http://blog.mikemccandless.com

On Sun, Oct 30, 2011 at 7:36 PM, Denis Bazhenov wrote:
> Well, if so I guess I should use IndexWarmer to warm up the IndexReader before
> publishing the reference to search clients. At least it will pre-read all the
> segments into RAM before issuing a search.
>
> On Oct 17, 2011, at 9:47 PM, Michael McCandless wrote:
>
>> You'll have to call .commit() from the IndexWriter to make the changes
>> externally visible.
>>
>> Then call IndexReader.reopen to get a reader seeing the committed
>> changes; the reopen will be efficient (it only opens "new" segments vs the
>> old reader).
>>
>> It's still best to use a near-real-time reader when possible (ie, open
>> the IndexReader from the IndexWriter), but it sounds like in your case
>> this is not possible since the writer and reader are on different
>> JVMs/machines across a network.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Sun, Oct 16, 2011 at 10:32 PM, Denis Bazhenov wrote:
>>> We have a situation where the lucene index is replicated over the network, and on that
>>> machine a reader reopen doesn't make new documents visible to a search.
>>>
>>> As far as I know the IndexReader.reopen() call only works if changes are
>>> applied using the linked IndexWriter. My question is: how can I implement
>>> an efficient index reopen (only new segments should be read) when the index is
>>> changed externally?
>>> ---
>>> Denis Bazhenov
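[Editor's note: a minimal sketch of the commit + reopen pattern Mike describes, against the Lucene 3.x API of this thread. Method and class names are from that API; the wrapper class itself is illustrative.]

```java
import org.apache.lucene.index.IndexReader;

public class ReaderRefresh {

    // Refresh a reader after the index has been changed (and committed)
    // externally. reopen() is cheap: the returned reader shares the
    // unchanged segments with the old one and only opens new segments.
    // When a new reader is returned, the caller owns closing the old one.
    public static IndexReader refresh(IndexReader current) throws Exception {
        IndexReader newReader = current.reopen();
        if (newReader != current) {
            current.close();  // release the superseded reader
        }
        return newReader;
    }
}
```

Warming (running representative queries and sorts against the new reader before publishing it to search clients, as Denis suggests) fits naturally between `reopen()` and returning the reader.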
Re: multiple phrase search for topic
Thanks Ian for your response. This is a one-time offline program so I am not bothered about performance (i.e. speed etc.).

One more question: there are some situations where I need to run an AND clause (i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My approach was something like:

**
String searchString = "(" + phrase1 + ")" + " AND " + "(" + phrase2 + ")";
QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content", new StandardAnalyzer(Version.LUCENE_33));
Query query = queryParser.parse(searchString);
bQuery.add(query, BooleanClause.Occur.SHOULD);
**

Thanks for the carrot2 pointer.

-d

--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-phrase-search-for-topic-tp3461423p3468005.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: multiple phrase search for topic
Nice not to have to worry about performance.

You say there is another question, but not what it is. The code you show looks like it should do what you want. For anything non-trivial I prefer to build the queries directly in code rather than concatenating strings to be parsed, because I find it hard to work out the quotes and brackets and what the result will be. But your way is fine.

--
Ian.

On Mon, Oct 31, 2011 at 2:51 PM, deb.lucene wrote:
> Thanks Ian for your response. This is a one-time offline program so I am not
> bothered about performance (i.e. speed etc.).
>
> One more question: there are some situations where I need to run an AND
> clause (i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My
> approach was something like:
>
> String searchString = "(" + phrase1 + ")" + " AND " + "(" + phrase2 + ")";
> QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content", new
> StandardAnalyzer(Version.LUCENE_33));
>
> Query query = queryParser.parse(searchString);
> bQuery.add(query, BooleanClause.Occur.SHOULD);
>
> Thanks for the carrot2 pointer.
>
> -d
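[Editor's note: a sketch of the "build queries directly in code" approach Ian prefers, for the two-phrase AND case in this thread, using the Lucene 3.x API. The field name is the "content" field from the quoted code; the terms assume the index was built with an analyzer that lowercases.]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class TwoPhraseAnd {

    // Equivalent of parsing '("Apple") AND ("Steve Jobs")', but with no
    // quoting or bracket ambiguity: both phrases are MUST clauses.
    public static BooleanQuery build(String field) {
        PhraseQuery apple = new PhraseQuery();
        apple.add(new Term(field, "apple"));

        PhraseQuery steveJobs = new PhraseQuery();
        steveJobs.add(new Term(field, "steve"));
        steveJobs.add(new Term(field, "jobs"));

        BooleanQuery bq = new BooleanQuery();
        bq.add(apple, BooleanClause.Occur.MUST);      // AND semantics
        bq.add(steveJobs, BooleanClause.Occur.MUST);
        return bq;
    }
}
```

One caveat with hand-built queries: the terms bypass the analyzer, so they must already match the indexed token form (lowercased here, on that assumption).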
Re: Bet you didn't know Lucene can...
On 22/10/2011 11:11, Grant Ingersoll wrote:
> Hi All,
>
> I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..."
> (http://na11.apachecon.com/talks/18396). It's based on my observation that, over the
> years, a number of us in the community have done some pretty cool things using Lucene
> that don't fit under the core premise of full text search. I've got a fair number of
> ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your
> stories of ways you've (ab)used Lucene and Solr, to see if we couldn't extend the
> conversation beyond the conference and also see if I can't inject more ideas beyond
> the ones I have. I don't need deep technical details, just the high-level use case
> and the basic insight that led you to believe Lucene could solve the problem.

Better late than never ... :)

I briefly mentioned this use case to you at Eurocon, but here it is for the record. I used Lucene in a duplicate-detection scenario where instead of documents, individual sentences would be indexed (with a fuzz). A similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exactly the same) hashes, which in this case meant finding a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.

The solution is described in SOLR-1918 - Bit-wise scoring field type.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: Bet you didn't know Lucene can...
On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> A similarity-preserving hash function was calculated on each sentence, and the
> hash was added as a field. The property of the hash was that similar
> documents (sentences) would produce a similar hash, with only some bit-level
> perturbation. The challenge was to find a ranked list of possible duplicates
> with similar (not exact same) hashes, which in this case meant to find a
> ranked list of documents that have the smallest bit-level distance in their
> hashes from the query hash.
>
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
http://www.matpalm.com/resemblance/simhash/
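[Editor's note: a minimal plain-Java simhash sketch following the Charikar construction linked above: each token's hash casts a +1/-1 vote per bit position, and near-identical token sets end up with fingerprints at a small Hamming distance. This is a generic illustration, not the application-specific hash used in SOLR-1918; the 64-bit string hash is an arbitrary FNV-1a-style choice.]

```java
public class SimHash {

    // 64-bit simhash over a token sequence: each token's hash contributes
    // +1/-1 votes per bit position; the sign of the total sets the bit.
    public static long simhash(String[] tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    // Bit-level distance between fingerprints: the ranking key from the
    // thread (smallest distance = most similar).
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // FNV-1a-style 64-bit string hash (illustrative choice).
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

Ranking candidates then reduces to sorting stored fingerprints by `hammingDistance` from the query fingerprint, which is what the bit-wise scoring field type in SOLR-1918 does inside scoring.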
Re: idf calculation in Lucene ?
Thanks! Is there any way to extend the Similarity class to override the behavior (e.g., using the max idf instead of the sum of the individual term idfs)?

On Thu, Oct 27, 2011 at 5:41 AM, Robert Muir wrote:
> On Thu, Oct 20, 2011 at 3:11 PM, David Ryan wrote:
>
>> However, in some cases, when I search for o'reilly, I see
>>
>> 44.0865 = idf(title: o''reilli=4 o=1488 reilli=14 oreilli=4)
>>
>> In this case, how is the IDF calculated?
>
> That's a phrase or multiphrase query.
>
> In this case it sums up the idf of each term:
> http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/search/Similarity.html#idfExplain(java.util.Collection, org.apache.lucene.search.Searcher)
>
> --
> lucidimagination.com
Re: idf calculation in Lucene ?
Yes: override that method, idfExplain(java.util.Collection, org.apache.lucene.search.Searcher).

On Mon, Oct 31, 2011 at 5:24 PM, David Ryan wrote:
> Thanks! Is there any way to extend the Similarity class to override the
> behavior (e.g., using the max idf instead of the sum of the individual term idfs)?

--
lucidimagination.com
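[Editor's note: a hedged sketch of the override Robert describes, against the Lucene 3.x Similarity API linked in the thread. The explanation string format is an illustrative assumption; check the 3.4 javadocs for the exact `IDFExplanation` contract before relying on this.]

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Explanation.IDFExplanation;
import org.apache.lucene.search.Searcher;

// Score phrase/multiphrase queries by the maximum term idf
// instead of the default sum over all terms.
public class MaxIdfSimilarity extends DefaultSimilarity {

    @Override
    public IDFExplanation idfExplain(Collection<Term> terms, Searcher searcher)
            throws IOException {
        final int maxDoc = searcher.maxDoc();
        float max = 0f;
        final StringBuilder exp = new StringBuilder();
        for (Term term : terms) {
            int df = searcher.docFreq(term);
            max = Math.max(max, idf(df, maxDoc));  // keep only the largest idf
            exp.append(term.text()).append('=').append(df).append(' ');
        }
        final float maxIdf = max;
        return new IDFExplanation() {
            @Override public float getIdf() { return maxIdf; }
            @Override public String explain() { return "max idf(" + exp + ")"; }
        };
    }
}
```

Install it via Searcher.setSimilarity (or IndexWriterConfig for index-time factors) so phrase queries pick it up.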
Re: Bet you didn't know Lucene can...
On 31/10/2011 21:42, Petite Abeille wrote:

> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different application-specific hash.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com