Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
On 31/10/2011 21:42, Petite Abeille wrote: On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote: similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Petite Abeille
On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote: > similarity-preserving hash function was calculated on each sentence, and the > hash was added as a field. The property of the hash was that similar > documents (sentences) would produce a similar hash, with only some bit-level > perturbati
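For readers unfamiliar with the idea quoted above, here is a minimal sketch of one similarity-preserving hash. The message does not say which hash function was used, so this assumes a simhash-style fingerprint; the names `simhash` and `hamming_distance`, the whitespace tokenizer and the 64-bit width are all illustrative choices, not anything taken from the thread. Sentences that share most of their words tend to get fingerprints that differ in only a few bits.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash-style fingerprint of the words in `text`.

    Assumption: plain whitespace tokenization and at most 64 bits,
    because only the first 8 bytes of each token's MD5 digest are used.
    """
    if not 1 <= bits <= 64:
        raise ValueError("bits must be between 1 and 64")
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A fingerprint like this could be stored as an extra field on each document, and candidate near-duplicates compared by `hamming_distance`. How the original system indexed and queried the hash is not visible in the truncated message.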

Re: Bet you didn't know Lucene can...

2011-10-31 Thread Andrzej Bialecki
On 22/10/2011 11:11, Grant Ingersoll wrote: Hi All, I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396). It's based on my observation, that over the years, a number of us in the community have done some pretty cool things using Luc

Re: Bet you didn't know Lucene can...

2011-10-26 Thread Dawid Weiss
m also using public domain Wikipedia data so can release the code and data > somewhere if that's of interest. > > Cheers > Mark > > > > - Original Message - > From: Dawid Weiss > To: java-user@lucene.apache.org > Cc: > Sent: Tuesday, 25 October 2011,

Re: Bet you didn't know Lucene can...

2011-10-26 Thread mark harwood
pache.org Cc: Sent: Tuesday, 25 October 2011, 23:17 Subject: Re: Bet you didn't know Lucene can... > Lucene started out at an avg 3ms but subsequent runs took it down > dramatically due to OS file caching. The all-in-memory hashset implementation > clearly did not demonstrate th

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Dawid Weiss
> Lucene started out at an avg 3ms but subsequent runs took it down > dramatically due to OS file caching. The all-in-memory hashset implementation > clearly did not demonstrate the same speed ups between runs. I don't say the benchmark was wrong or anything, but this is surprising. I mean, the

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Mark Harwood
> Avg lookup time slightly less than a HashSet? Interesting. Yep, HashSet comparison was a surprise to me too. I threw it in as a datapoint for what I thought would be the fastest option on the example dataset but clearly not a long-term answer to my problem as it costs so much in RAM. Lucene s

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Dawid Weiss
Avg lookup time slightly less than a HashSet? Interesting. Is the code to these benchmarks available somewhere? Dawid On Tue, Oct 25, 2011 at 9:57 PM, Grant Ingersoll wrote: > > On Oct 25, 2011, at 11:26 AM, mark harwood wrote: > using Lucene that don't fit under the core premise of full te

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Grant Ingersoll
On Oct 25, 2011, at 11:26 AM, mark harwood wrote: >>> using Lucene that don't fit under the core premise of full text search > > I've had several use cases over the years that use features peculiar to > Lucene but here's a very simple one I came across today that illustrates its > raw index l

Re: Bet you didn't know Lucene can...

2011-10-25 Thread Erik Hatcher
At the group where I worked at UVa once upon a time, a coworker built Juxta, this way cool tool to diff multiple versions of a document visually with heat maps and "difference"-o-meters, and it leverages Lucene analyzers to extract words and positions and such. You can find it here: http://www.

Re: Bet you didn't know Lucene can...

2011-10-25 Thread mark harwood
>>using Lucene that don't fit under the core premise of full text search  I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability: I needed a fast, scalable and persistent "S

Re: Bet you didn't know Lucene can...

2011-10-23 Thread Dawid Weiss
Hi Grant, In Carrot2 (and Carrot Search's commercial products) we're not using Lucene as an indexing/ search service directly, but we are re-using a lot of internal infrastructure (like analyzers, ported snowball stemmers and other segmentation stuff). We also plan on using the new language identi

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Shashi Kant
Using Lucene as a recommendation engine. On Sat, Oct 22, 2011 at 6:33 PM, Grant Ingersoll wrote: > > On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote: > >> Hi Grant, >> >> Not sure if this qualifies as a "bet you didn't know", but one could use >> Lucene term vectors to construct document vectors for

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Grant Ingersoll
On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote: > Hi Grant, > > Not sure if this qualifies as a "bet you didn't know", but one could use > Lucene term vectors to construct document vectors for similarity, > clustering and classification tasks. I found this out recently (although > I am probably no

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Wouter Heijke
Hi Grant, These are two cases from work I've done that I can think of: - use Lucene to match products in a database with eBay auctions; the title of the auction is used as the query to Lucene. - use a servlet filter and Lucene to map well-formed URLs into a website to its individual (product) page

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Sujit Pal
Hi Grant, Not sure if this qualifies as a "bet you didn't know", but one could use Lucene term vectors to construct document vectors for similarity, clustering and classification tasks. I found this out recently (although I am probably not the first one), and I think this could be quite useful. -
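The document-vector idea above can be sketched without Lucene at all. The snippet below assumes a term vector is simply a map from term to frequency, which is roughly what a Lucene term vector exposes per field, and compares two such vectors with cosine similarity. The helpers `term_vector` and `cosine_similarity` are hypothetical names for illustration; reading real term vectors out of an index would go through Lucene's own API, which is not shown here.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Build a term -> frequency map (assumption: whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    shared_terms = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

The same vectors could feed clustering or classification; raw term frequency is used here only to keep the sketch short, and a weighting such as TF-IDF is a common alternative.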

Re: Bet you didn't know Lucene can...

2011-10-22 Thread Paul Libbrecht
Grant, for years the ActiveMath learning environment has been using Lucene as its storage engine. At the time (~2004), it was by far the best storage engine doable in a pure-Java world. It is still perfect in terms of performance. We had an issue with the separate versions where the stored-fields w