Re: lucene query (sql kind)

2005-01-28 Thread jian chen
I like your idea and think you are quite right. I see quite a few
people using Lucene to the extreme, to the point where relational
database functionality is replaced by Lucene.

However, storing everything in Lucene and using it as a relational
type of database would be reinventing the wheel; think of sorting on
a date field, or any other range query.

I think the better way is to look at integrating Lucene tightly
into a Java relational database, such as HSQL, McKoi or Derby.

In particular, that integration would make it possible to support
queries like contains(...), which is part of MySQL's full-text search
syntax and offered by other major relational database vendors.

I would like to contribute any possible help I could for that to happen.

Thanks,

Jian

On Fri, 28 Jan 2005 13:01:40 +0000 (GMT), mark harwood
[EMAIL PROTECTED] wrote:
 I've added some user-defined lucene functions to
 HSQLDB and I've been able to run queries like the
 following one:
 
 select top 10 lucene_highlight(adText) from ads where
 pricePounds < 200 and lucene_query('bass guitar
 drums', id) > 0 order by lucene_score(id) DESC
 
 I've had similar success with Derby (Cloudscape).
 This approach has some appeal and I've been able to
 use the same class as a UDF in both databases but it
 does have issues: it looks like this UDF-based
 integration won't scale. The above query took 80
 milliseconds using 10,000 records. Another
 index/database with 50,000 records was taking a matter
 of seconds. I think a scalable integration is likely
 to require modification of the core RDBMS code.
 
 I think it is worth considering developing such a
 tight RDBMS integration if you consider the issues
 commonly associated with using Lucene:
 1) Sorting on float/date fields and associated memory
 consumption
 2) Representing numbers/dates in Lucene (eg having to
 pad with sufficient leading zeros, adding to the index's
 list of terms)
 3) Retrieving only certain stored fields from a
 document (all storage can be done in db)
 4) Issues to do with updating volatile data eg price
 data used in sorts
 5) Manually coding joins with RDBMS content as custom
 filters
 6) Too-many terms exceptions produced by range queries
 7) Grouping results eg by website
 8) Boosting docs based on stored content eg date
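[Editor's note on point 2 above: the padding issue exists because Lucene compares terms lexicographically, so range queries only work over numbers that are left-padded to a fixed width. A minimal stand-alone illustration, with the field width chosen arbitrarily:]

```java
// Lexicographic order matches numeric order only when the numbers are
// left-padded with zeros to a fixed width; range queries over padded
// terms then behave correctly.
public class PadDemo {
    static final int WIDTH = 9; // illustrative; must exceed the max digit count

    static String pad(long n) {
        StringBuilder sb = new StringBuilder(Long.toString(n));
        while (sb.length() < WIDTH) sb.insert(0, '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // unpadded: "9" sorts after "100" lexicographically -- wrong for ranges
        System.out.println("9".compareTo("100") > 0);
        // padded: "000000009" sorts before "000000100" -- correct
        System.out.println(pad(9).compareTo(pad(100)) < 0);
    }
}
```

The cost mentioned in point 2 is that every distinct padded value becomes another term in the index's term dictionary.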
 
 I'm not saying there aren't answers to the above using
 Lucene. However, I do wonder if these can be addressed
 more effectively in a project which seeks tighter
 integration with an RDBMS and leveraging its
 capabilities.
 
 Any one else been down this route?
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 





google mini? who needs it when Lucene is there

2005-01-27 Thread jian chen
Hi,

I was searching with Google and just found that there is a new
feature called Google Mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite a bit
of money ($4,995) for the hardware and software. (I guess the
proprietary software costs a whole lot more than the actual hardware.)

The "nice" part is that at this price you can only index up to 50,000
documents. If you need to index more, sorry, send in another
check...

It seems to me that any small business would be ripped off installing
this Google Mini thing, compared to using Lucene to build an
easy-to-use search application that could handle however many
documents you can imagine.

I hope the Lucene project gets more exposure in the enterprise, so
that people know they have not only a cheaper but, more importantly,
a BETTER alternative.

Jian




Re: google mini? who needs it when Lucene is there

2005-01-27 Thread jian chen
Overall, even if the Google Mini offers a lot of cool features compared
to a bare-bones Lucene project, what good is it with the 50,000-document
limit? It is useless with that limit. That is just their way of trying
to turn it into another cash cow.

Jian


On Thu, 27 Jan 2005 17:45:03 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 500 times the original data?  Not true! :)
 
 Otis
 
 --- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I agree that Google mini is quite expensive.  It might be similar to
  the desktop version in quality.  Does anyone know Google's ratio of
  index to text?  Is it true that Lucene's index is about 500 times the
  original text size (not including image size)?  I don't have one
  installed, so I cannot measure.
 
  Best,
 
  Sharon
 
  jian chen [EMAIL PROTECTED] wrote:
  Hi,
 
  I was searching with Google and just found that there is a new
  feature called Google Mini. Initially I thought it was another free
  service for small companies. Then I realized that it costs quite a
  bit of money ($4,995) for the hardware and software. (I guess the
  proprietary software costs a whole lot more than the actual hardware.)
 
  The "nice" part is that at this price you can only index up to 50,000
  documents. If you need to index more, sorry, send in another
  check...
 
  It seems to me that any small business would be ripped off installing
  this Google Mini thing, compared to using Lucene to build an
  easy-to-use search application that could handle however many
  documents you can imagine.
 
  I hope the Lucene project gets more exposure in the enterprise, so
  that people know they have not only a cheaper but, more importantly,
  a BETTER alternative.
 
  Jian
 
 
 
 
 
 





Re: Suggestions for documentation or LIA

2005-01-26 Thread jian chen
Hi,

Just to continue this discussion: I think right now Lucene's retrieval
algorithm is based purely on the vector space model, which is simple
and efficient.

However, there may be cases where folks like me want to use a
completely different set of ranking algorithms, ones that do not even
use tf/idf.

For example, I am thinking about adding the cover density ranking
algorithm to Lucene, which is based purely on proximity information
and does not require any global ranking statistics. But looking into
the Lucene code, it does not seem easy to hack that in; at least not
for a novice Lucene user like me.

I read on the Lucene 2.0 whiteboard that Lucene will accommodate more
in terms of what can be indexed and how. That move might be good for
implementing other or ad hoc ranking algorithms.
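For reference, the core of a cover-density-style score can be sketched without touching Lucene at all. The toy class below (all names hypothetical, and only the central idea of the algorithm) scores a document by the length of the smallest window of token positions that covers every query term; shorter covers score higher:

```java
// Hypothetical sketch of the cover-density core: given sorted position
// lists (one per query term), find the smallest window containing at
// least one occurrence of every term, and score inversely to its length.
public class CoverDensity {
    static int minimalCover(int[][] positions) {
        int k = positions.length;
        int[] idx = new int[k];          // cursor into each term's position list
        int best = Integer.MAX_VALUE;
        while (true) {
            int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE, loTerm = -1;
            for (int t = 0; t < k; t++) {
                int p = positions[t][idx[t]];
                if (p < lo) { lo = p; loTerm = t; }
                if (p > hi) hi = p;
            }
            best = Math.min(best, hi - lo + 1);
            // advance the cursor of the leftmost term; stop when exhausted
            if (++idx[loTerm] == positions[loTerm].length) return best;
        }
    }

    static double score(int[][] positions) {
        return 1.0 / minimalCover(positions); // shorter cover => higher score
    }

    public static void main(String[] args) {
        // "bass" at positions 3,17 and "guitar" at 5,40: minimal cover is [3..5]
        int[][] pos = { {3, 17}, {5, 40} };
        System.out.println(minimalCover(pos)); // prints 3
    }
}
```

As the email says, this needs only per-document proximity information, no global statistics; the hard part is wiring such a scorer into Lucene's Scorer/Similarity machinery.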

Cheers,

Jian


On Wed, 26 Jan 2005 10:25:15 -0500, Ian Soboroff [EMAIL PROTECTED] wrote:
 Erik Hatcher [EMAIL PROTECTED] writes:
 
  By all means, if you have other suggestions for our site, let us know
  at [EMAIL PROTECTED]
 
 One of the things I would like to see, but which isn't either in the
 Lucene site, documentation, or Lucene in Action, is a complete
 description of how the retrieval algorithm works.  That is, how the
 HitCollector, Scorers, Similarity, etc all fit together.
 
 I'm involved in a project which to some degree is looking at poking
 deeply into this part of the Lucene code.  We have a nice (non-Lucene)
 framework for working with more different kinds of similarity
 functions (beyond tf-idf) which should also be expandable to include
 query expansion, relevance feedback, and the like.
 
 I used to think that integrating it would be as simple as hacking in
 Similarity, but I'm beginning to think it might need broader changes.
 I could obviously hook in our whole retrieval setup by just diving for
 an IndexReader and doing it all by hand, but then I would have to redo
 the incremental search and possibly the rich query structure, which
 would be a lose.
 
 So anyway, I got LIA hoping for a good explanation (not a good
 Explanation) on this bit, but it wasn't there.  There are some hints
 on the Lucene site, but nothing complete.  If I muddle it out before
 anything gets contributed, I'll try to write something up, but don't
 expect anything too soon...
 
 Ian
 
 





Re: Suggestions for documentation or LIA

2005-01-26 Thread jian chen
Hi, Ian,

Thanks for your information. It would be really helpful to have some
documentation maybe on the WIKI about retrieval algorithm and how to
hack it. At least, something there even if like several paragraphs to
get started...

Thanks,

Jian

On Wed, 26 Jan 2005 12:40:54 -0500, Ian Soboroff [EMAIL PROTECTED] wrote:
 jian chen [EMAIL PROTECTED] writes:
 
  Just to continue this discussion. I think right now Lucene's retrieval
  algorithm is based purely on Vector Space Model, which is simple and
  efficient.
 
 As I understand it, it's indeed a tf-idf vector space approach, except
 that the queries are structured; as such, the tf-idf weights are
 totaled as a straight cosine among siblings of a BooleanQuery, but
 other query nodes may do things differently. For example, I haven't
 read the code, but I assume PhraseQueries require all terms to be
 present and adjacent in order to contribute to the score.
 
 There is also a document-specific boost factor in the equation which
 is essentially a hook for document things like recency, PageRank, etc
 etc.
 
 You can tweak this by defining custom Similarity classes which can say
 what the tf, idf, norm, and boost mean.  You can also affect the
 term normalization at the query end in BooleanScorer (I think? through
 the sumOfSquares method?).
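[Editor's note: the factors Ian mentions can be illustrated with the arithmetic Lucene's default Similarity is commonly documented to use -- tf = sqrt(freq), idf = log(N/(df+1)) + 1, lengthNorm = 1/sqrt(numTerms). The stand-alone class below just reproduces those formulas in isolation; treat the exact constants as an assumption about the 1.x defaults, not a specification:]

```java
// Stand-alone reproduction of the tf-idf factors a custom Similarity
// subclass can override; no Lucene classes are involved.
public class TfIdfSketch {
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Per-term contribution before query normalization and coord().
    static float weight(float freq, int docFreq, int numDocs, int numTerms) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // term appearing 4x in a 100-term doc, present in 10 of 1000 docs
        System.out.println(weight(4f, 10, 1000, 100));
    }
}
```

A custom Similarity swaps in different bodies for these methods while the surrounding HitCollector/Scorer plumbing stays untouched.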
 
 We've implemented something kind of like the Similarity class but
 based on a model which describes a larger family of similarity
 functions.  (For the curious or similarly IR-geeky, it's from Justin
 Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
 need more general hooks than the Lucene Similarity provides.  I think
 those hooks might exist, but I'm not sure I know which classes they're
 in.
 
 I'm also interested in things like relevance feedback which can affect
 term weights as well as adding terms to the query... just how many
 places in the code do I have to subclass or change?
 
 It's clear that if I'm interested in a completely different model like
 language modeling the IndexReader is the way to go.  In which case,
 what parts of the Lucene class structure should I adapt to maintain
 the incremental-results-return, inverted list skips, and other
 features which make the inverted search fast?
 
 Ian
 
 





Re: How to give recent documents a boost?

2005-01-25 Thread jian chen
Hi,

I think setting a boost for recent documents is tricky. There is no
clear-cut way to get the boost value right other than trial and error.

Could you let the user specify a date range and sort the documents
within that range by relevance? That way, users get exactly what they
specified, and won't be annoyed by an improper setting of the boost
factor.
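If a hard date range feels too blunt, another option sometimes tried is to fold a gentle recency decay into the relevance score. A hypothetical blend (the 30-day half-life and the alpha weight are pure illustration, to be tuned by the trial and error mentioned above):

```java
// Hypothetical recency blend: relevance dominates, recency nudges.
public class RecencyBlend {
    // Exponential decay with a 30-day half-life (illustrative constant).
    static double recencyFactor(double ageDays) {
        return Math.pow(0.5, ageDays / 30.0);
    }

    // alpha in [0,1] controls how much recency is allowed to matter.
    static double blended(double relevance, double ageDays, double alpha) {
        return relevance * (1.0 - alpha + alpha * recencyFactor(ageDays));
    }

    public static void main(String[] args) {
        // doc1: score 0.50, one week old; doc2: score 0.55, one month old
        double d1 = blended(0.50, 7, 0.5);
        double d2 = blended(0.55, 30, 0.5);
        System.out.println(d1 > d2); // the recent doc edges ahead
    }
}
```

This reproduces the behavior asked for in the quoted question: a slightly lower-scoring but fresher document can overtake a close competitor, while a very low score stays at the bottom regardless of age.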

Workable?

Thanks,

Jian

On Tue, 25 Jan 2005 10:30:21 -0800, aurora [EMAIL PROTECTED] wrote:
 What is the best way to give recent documents a boost? Not sorting them by
 strict date order but to give them some preference. If document 1 filed
 last week has a score of 0.5 and document 2 filed last month has a score
 of 0.55, then list document 1 first. But if document 1 has a score of only
 0.05, then keep it at the end. Any experience of fine tuning by date order?
 
 





Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread jian chen
Hi,

If it is really the case that every 128th term is loaded into memory,
could you use a relational database or a b-tree to do the indexing of
the terms instead?

Even if you create another level of indexing on top of the .tii file,
it is just a hack and would not scale well.

I would think a B/B+-tree based approach is the way to go for better
memory utilization.
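For context, the .tii design discussed below is essentially a sampled index: every 128th term is held in memory, and a lookup binary-searches the sample and then scans at most 127 entries of the full list. A toy in-memory model (the interval matches the file-format docs; everything else is hypothetical):

```java
import java.util.Arrays;

// Toy model of Lucene's sampled term index: keep every INTERVALth term
// in memory (the ".tii"), binary-search that sample, then scan forward
// in the full sorted list (the ".tis") for the exact term.
public class SampledTermIndex {
    static final int INTERVAL = 128;
    private final String[] allTerms; // stands in for the on-disk .tis file
    private final String[] sample;   // stands in for the in-memory .tii

    SampledTermIndex(String[] sortedTerms) {
        this.allTerms = sortedTerms;
        int n = (sortedTerms.length + INTERVAL - 1) / INTERVAL;
        this.sample = new String[n];
        for (int i = 0; i < n; i++) sample[i] = sortedTerms[i * INTERVAL];
    }

    // Returns the term's position in the full list, or -1 if absent.
    int lookup(String term) {
        int s = Arrays.binarySearch(sample, term);
        int start = (s >= 0) ? s * INTERVAL : (-s - 2) * INTERVAL;
        if (start < 0) return -1; // term sorts before the first sampled term
        for (int i = start; i < allTerms.length && i < start + INTERVAL; i++)
            if (allTerms[i].equals(term)) return i;
        return -1;
    }
}
```

The memory cost is one in-memory entry per 128 terms, which is exactly why a 60G index's term dictionary can still demand hundreds of megabytes; a B-tree keeps only its upper levels resident instead.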

Cheers,

Jian


On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 There Kevin, that's what I was referring to, the .tii file.
 
 Otis
 
 --- Paul Elschot [EMAIL PROTECTED] wrote:
 
  On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
   Kevin A. Burton wrote:
  
We have one large index right now... its about 60G ... When I
  open it
the Java VM used 940M of memory.  The VM does nothing else
  besides
open this index.
  
   After thinking about it I guess 1.5% of memory per index really
  isn't
   THAT bad.  What would be nice is if there were a way to do this from
   disk and then use a buffer (either via the filesystem or in-VM
   memory) to access these variables.
 
  It's even documented. From:
  http://jakarta.apache.org/lucene/docs/fileformats.html :
 
  The term info index, or .tii file.
  This contains every IndexIntervalth entry from the .tis file, along
  with its
  location in the tis file. This is designed to be read entirely
  into memory
  and used to provide random access to the tis file.
 
  My guess is that this is what you see happening.
   To see the actual .tii file, you need the non-default file format.
 
   Once searching starts you'll also see that the field norms are
   loaded; these take one byte per searched field per document.
 
   This would be similar to the way the MySQL index cache works...
 
  It would be possible to add another level of indexing to the terms.
   No one has done this yet, so I guess it's preferred to buy RAM
   instead...
 
  Regards,
  Paul Elschot
 
 
 
 
 
 





Re: Lucene in Action

2005-01-22 Thread jian chen
Hi,

I am not sure. However, I see that the book has an electronic version
you can buy online...

Cheers,

Jian


On Sun, 23 Jan 2005 10:30:24 +0800, ansi [EMAIL PROTECTED] wrote:
 hi,all
 
 Does anyone know how to buy Lucene in Action in China?
 
 Ansi
 
 





Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread jian chen
Hi,

One thing to point out: I think Lucene is not using LSI as the
underlying retrieval model. It uses the vector space model and also
proximity-based retrieval.

Personally, I don't know much about LSI, and I don't think fancy stuff
like LSI is workable in industry. I believe we are far from the era of
artificial intelligence, and from using any such elusive approach to
information retrieval.

Cheers,

Jian


On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore [EMAIL PROTECTED] wrote:
 Hi .. I'm new to the list so forgive a dumb question or two as I get
 started.
 
 We're in the midst of converting a small collection (1200-1500
 currently) of scientific literature to be easily searchable/navigable.
 We'll likely provide both a text query interface as well as a graphical
 way to search and discover.
 
 Our initial approach will be vector based, looking at Latent Semantic
 Indexing (LSI) as a potential tool, although if that's not needed,
 we'll stop at reasonably simple stemming with a weighted document term
 matrix (DTM).  (Bear in mind I couldn't even pronounce most of these
 concepts last week, so go easy if I'm incoherent!)
 
 It looks to me that Lucene has a quite well factored architecture.  I
 should at the very least be able to use the analyzer and stemmer to
 create a good starting point in the project.  I'd also like to leave a
 nice architecture behind in case we or others end up experimenting
 with, or extending, the system.
 
 So a couple of questions:
 
 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
 apparently produces non-word stems .. i.e. not really human readable.
 (Example: generate, generates, generated, generating -> generat)
 Although in typical queries this is not important because the result of
 the search is a document list, it *would* be important if we use the
 stems within a graphical navigation interface.
  So the question is: Is there a way to have the stemmer produce
  English base forms of the words being stemmed?
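[Editor's note on question 1: Snowball stems are indeed not words. One low-tech workaround (a sketch, not a Lucene feature) is to record, per stem, the most frequent surface form seen at index time and display that form in the navigation UI instead of the raw stem:]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: remember, per stem, the surface word observed most often,
// and use it as the human-readable label for that stem in a UI.
public class StemLabels {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Call once for every (stem, original word) pair seen during analysis.
    void observe(String stem, String surfaceForm) {
        counts.computeIfAbsent(stem, k -> new HashMap<>())
              .merge(surfaceForm, 1, Integer::sum);
    }

    // Most frequent surface form, falling back to the raw stem.
    String labelFor(String stem) {
        Map<String, Integer> m = counts.get(stem);
        if (m == null) return stem;
        return m.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
```

So "generat" would be displayed as "generate" if that was the most common inflection in the collection, without changing what the index actually stores.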
 
 2 - We're probably using Lucene in ways it was not designed for, such
 as DTM/LSI and graphical clustering and navigation.  Naturally we'll
 provide code for these parts that are not in Lucene.
  But the question arises: is this kinda dumb?!  Has anyone stretched
  Lucene's design center with positive results?  Are we barking up the
  wrong tree?
 
 3 - A nit on hyphenation: Our collection is scientific so has many
 hyphenated words.  I'm wondering about your experiences with
 hyphenation.  In our collection, things like self-organization,
 power-law, space-time, small-world, agent-based, etc. occur often, for
 example.
  So the question is: Do folks break up hyphenated words?  If not, do
  you stem the parts and glue them back together?  Do you apply
  stoplists to the parts?
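[Editor's note on question 3: a common compromise, sketched below rather than prescribed, is to index the hyphenated token both whole and split, so queries on "small-world" and on "world" both match. Stoplisting and stemming of the parts would happen after this expansion step:]

```java
import java.util.ArrayList;
import java.util.List;

// Emit the whole hyphenated token plus each non-empty part; downstream
// analysis (stemming, stoplists) then applies to every emitted token.
public class HyphenTokens {
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        if (token.indexOf('-') >= 0) {
            for (String part : token.split("-"))
                if (!part.isEmpty()) out.add(part);
        }
        return out;
    }
}
```

Usage: expand("self-organization") yields the whole token plus "self" and "organization"; a plain token passes through unchanged.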
 
 Thanks for any help and pointers you can fling along,
 
 Owen    http://backspaces.net/    http://redfish.com/
 
 

