Using Lucene to store document

2004-11-09 Thread Nhan Nguyen Dang
Hi all,
I'm using Lucene to index XML documents/files (maybe millions of documents in
the future, each about 5-10 KB).
Besides the index for searching, I want to use Lucene to store the whole
document content in an UnIndexed "content" field (instead of storing each
document in an XML file). All the document content would be stored in a
separate index. Each time I need access to a document, I would let Lucene
retrieve it.
 
I am comparing this with another option: using the file system to store the
document content in separate XML files, meaning 400K documents would be
stored as 400K XML files in the file system.
 
The purpose of this is to access each document rapidly. Can anybody who has
experience with this problem advise me which method is more suitable? Is it
better to collect all documents into one Lucene index, or to store them
separately in the file system?
 
Thanks,
Dang Nhan





-
Do you Yahoo!?
 Check out the new Yahoo! Front Page. www.yahoo.com

Re: Need Help

2004-11-09 Thread Chandrashekhar

Hi,
Thank you for help.
I found a solution for this: a Lucene 1.3 index works with CLucene 0.8.12.


- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 08, 2004 11:01 PM
Subject: Re: Need Help


 Hello,
 
 You should double-check with the CLucene community, but from my research
 for Lucene in Action, CLucene's index is not compatible with that of
 Lucene 1.4, so you will not be able to use the same index with both
 Lucene and CLucene.
 
 Otis
 
 --- Chandrashekhar [EMAIL PROTECTED] wrote:
 
  Hi,
  I have a query regarding index file portability between Lucene 1.4 and
  CLucene 0.8.12.
  I have created an index file in Java (Lucene 1.4) and now want to
  search for some term in the same index file using CLucene.
  I am not getting results when I do that.
  So I just wanted to make sure: does it support this kind of
  interoperability?
  
  

  With Regards,
  Chandrashekhar V Deshmukh
  Sr. System Analyst
  Cybage Software Pvt. Ltd. (a CMM Level 3 company)
  Phone(O) : 91-20-4041700, 91-20-4044700 Ext: 804
  Cell : 91-9822749239
  Fax : 91-20-4041701 , 4041702
  [EMAIL PROTECTED]
  www.cybage.com
 
 
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Searching in keyword field ?

2004-11-09 Thread Thierry Ferrero (Itldev.info)
Hi All,

Can I search for only one word in a Keyword field which contains a few words?
I know a Keyword field isn't tokenized. After many tests, I think it is
impossible.
Can someone confirm this for me?

Why don't I use a Text field? Because the users choose the category from a
list (e.g. category ABC, category DEF GHI, category JKL ...) and the Keyword
field 'category' can contain several terms (ABC, DEF GHI, OPQ RST).
I use a SnowballAnalyzer for Text fields when indexing.
Perhaps the better way for me is to use a Text field with the value ABC
DEF_GHI JKL_NOPQ, where multi-word categories are concatenated with a '_'.
Thanks for your reply !

Thierry.






Re: Searching in keyword field ?

2004-11-09 Thread Justin Swanhart
You can add the category keyword to a document multiple times.

Instead of separating your categories with a delimiter, just add the
keyword field once per category:

doc.add(Field.Keyword("category", "ABC"));
doc.add(Field.Keyword("category", "DEF GHI"));
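Why this works can be illustrated outside of Lucene: an untokenized Keyword field stores each added value as one exact term, so a document can carry several independent exact values under the same field name. Here is a toy Python sketch of that idea (not Lucene code; the class and names are invented for illustration):

```python
# Toy model of multi-valued keyword fields: each add() stores a separate
# exact, untokenized value, so a keyword query can match any one of them.
from collections import defaultdict

class ToyIndex:
    def __init__(self):
        # (field, exact_value) -> set of doc ids
        self.postings = defaultdict(set)

    def add_keyword(self, doc_id, field, value):
        # Keyword fields are not tokenized: the whole string is one term.
        self.postings[(field, value)].add(doc_id)

    def search_keyword(self, field, value):
        return sorted(self.postings[(field, value)])

idx = ToyIndex()
# Adding the field twice, like calling doc.add(Field.Keyword(...)) twice:
idx.add_keyword(1, "category", "ABC")
idx.add_keyword(1, "category", "DEF GHI")
# A delimiter-joined single value, by contrast, is one opaque term:
idx.add_keyword(2, "category", "ABC|DEF GHI")

print(idx.search_keyword("category", "DEF GHI"))  # [1] - doc 2 does not match
```

With the joined value, a search for the single category "DEF GHI" cannot match doc 2, which is exactly why adding the field once per category is preferable.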






Re: Using Lucene to store document

2004-11-09 Thread Otis Gospodnetic
It is difficult to give a general answer.  You can certainly store the
whole XML in the Lucene index, just don't tokenize it.  The HEAD
version of Lucene even has some compression that you may find handy. 
On the other hand, storing XML in the FS would allow you to store the XML
files wherever you want, even on separate disk(s).  If there are lots
of parallel searches/reads, this can be handy.  If you want to be able
to see the XML files without going through the index, this is also
handy.  So it depends on what you prefer, but both approaches are
doable.

Otis







What is the difference between these searches?

2004-11-09 Thread Luke Francl
Hi,

I've implemented a converter to translate our system's internal Query
objects to Lucene's Query model.

I recently realized that my implementation of OR NOT was not working
as I would expect and I was wondering if anyone on this list could give
me some advice.

I am converting a query that means "foo or not bar" into the following:

+item_type:xyz +(field_name:foo -field_name:bar)

This returns only Documents where field_name contains "foo". I would
expect it to return all the Documents where field_name contains "foo" or
field_name doesn't contain "bar".

Fiddling around with the Lucene Index Toolbox, I think that this query
does what I want:

+item_type:xyz field_name:foo -field_name:bar

Can someone explain to me why these queries return different results?

Thanks,
Luke Francl





Re: What is the difference between these searches?

2004-11-09 Thread Erik Hatcher
On Nov 9, 2004, at 2:58 PM, Luke Francl wrote:
> I recently realized that my implementation of OR NOT was not working
> as I would expect and I was wondering if anyone on this list could give
> me some advice.

Lucene's BooleanQuery does not really have the concept of OR NOT.  It's
really an AND NOT.

> I am converting a query that means foo or not bar into the following:
> +item_type:xyz +(field_name:foo -field_name:bar)
> This returns only Documents where field_name contains foo. I would
> expect it to return all the Documents where field_name contains foo or
> field_name doesn't contain bar.

What you experienced is how Lucene operates.  It's more of a fail-safe
mode, because doing pure NOT queries is more likely to get out of
control.

> Fiddling around with the Lucene Index Toolbox, I think that this query
> does what I want:
> +item_type:xyz field_name:foo -field_name:bar
> Can someone explain to me why these queries return different results?

This last query has a required clause, which is what BooleanQuery
requires when there is a NOT clause.  You're getting what you want here
because you've got an item_type:xyz clause as required.  In your first
example, you're requiring field_name:foo, whereas in this one it is not
mandatory.

Erik


can lucene be backed to have an update field

2004-11-09 Thread Chris Fraschetti
Is it possible to modify the lucene source to create an
updateDocument(doc#, FIELD, value)  function ? 

I know there's a lot of work that goes on behind the scenes when
.add(doc) is called, but can some of that functionality be adapted to
make the update a reality?

-Chris




Re: What is the difference between these searches?

2004-11-09 Thread Paul Elschot
Luke,

On Tuesday 09 November 2004 20:58, you wrote:
 Hi,
 
 I've implemented a converter to translate our system's internal Query
 objects to Lucene's Query model.
 
 I recently realized that my implementation of OR NOT was not working
 as I would expect and I was wondering if anyone on this list could give
 me some advice.

Could you explain OR NOT ? 

Lucene has no provision for matching by being prohibited only. This can
be achieved by indexing something for each document that can be
used in queries to always match, combined with something prohibited
in the query.
But doing this is bad for performance when querying larger numbers of docs.

Lucene's - prefix in queries means AND NOT, i.e. the term with the - prefix
prohibits the matching of a document.
 
 I am converting a query that means foo or not bar into the following:
 
 +item_type:xyz +(field_name:foo -field_name:bar)
 
 This returns only Documents where field_name contains foo. I would
 expect it to return all the Documents where field_name contains foo or
 field_name doesn't contain bar.
 
 Fiddling around with the Lucene Index Toolbox, I think that this query
 does what I want:
 
 +item_type:xyz field_name:foo -field_name:bar
 
 Can someone explain to me why these queries return different results?

A bit dense, but anyway:

Anything prefixed with + is required.
Anything without a + or - prefix is optional and only influences the score.
In case nothing is required by a + prefix, at least one of the clauses
without a prefix must match.
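These clause rules can be sketched outside of Lucene. The following Python fragment is an illustrative model of the required/optional/prohibited semantics (not Lucene's actual matching code), applied to the two queries from this thread:

```python
# Model of BooleanQuery clause semantics:
# '+' = required, '-' = prohibited, '' (no prefix) = optional.
def matches(doc_terms, clauses):
    """clauses: list of (prefix, term) with prefix in {'+', '-', ''}."""
    required = [t for p, t in clauses if p == '+']
    prohibited = [t for p, t in clauses if p == '-']
    optional = [t for p, t in clauses if p == '']
    if any(t in doc_terms for t in prohibited):
        return False
    if not all(t in doc_terms for t in required):
        return False
    # With no required clause, at least one optional clause must match.
    if not required and optional:
        return any(t in doc_terms for t in optional)
    return True

doc = {"item_type:xyz", "field_name:foo"}   # contains foo, not bar
doc2 = {"item_type:xyz"}                    # contains neither foo nor bar

# +item_type:xyz +(field_name:foo -field_name:bar): the inner group is a
# required sub-query, and inside it foo becomes effectively mandatory.
inner = lambda d: matches(d, [('', 'field_name:foo'), ('-', 'field_name:bar')])
q1 = lambda d: ('item_type:xyz' in d) and inner(d)

# +item_type:xyz field_name:foo -field_name:bar: here a required clause
# already exists, so foo is merely optional (score-only).
q2 = lambda d: matches(d, [('+', 'item_type:xyz'),
                           ('', 'field_name:foo'),
                           ('-', 'field_name:bar')])

print(q1(doc), q1(doc2))  # True False
print(q2(doc), q2(doc2))  # True True
```

This reproduces the behavior observed in the thread: the first query drops documents lacking foo, while the second also matches documents that simply don't contain bar.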

Regards,
Paul Elschot.





Re: What is the difference between these searches?

2004-11-09 Thread Luke Francl
On Tue, 2004-11-09 at 15:48, Erik Hatcher wrote:

 This last query has a required clause, which is what BooleanQuery 
 requires when there is a NOT clause.  You're getting what you want here 
 because you've got an item_type:xyz clause as required.  In your first 
 example, you're requiring field_name:foo, whereas in this one it is not 
 mandatory.

So, essentially, my query:

+item_type:xyz +(field_name:foo -field_name:bar)

Gets translated to:

+item_type:xyz +field_name:foo -field_name:bar

Whereas the more lenient one does not require field_name:foo and returns
what I expect.

Is that right?

Now, to decide whether to try to make this work the way I thought it
would, or just document that it doesn't. ;)





Re: What is the difference between these searches?

2004-11-09 Thread Luke Francl
On Tue, 2004-11-09 at 16:00, Paul Elschot wrote:

 Lucene has no provision for matching by being prohibited only. This can
 be achieved by indexing something for each document that can be
 used in queries to match always, combined with something prohibited
 in a query.
 But doing this is bad for performance for querying larger nrs of docs.

I'm familiar with Lucene's restrictions on prohibited queries, and I
have a required clause for a field that will always be part of the query
(it's not a nonsense value, it's the item type of the object in a CMS). 

My problem is that I had only been considering the whole query object that
I've generated. Every BooleanQuery that's a part of my finished query
must also have a required clause if it has a prohibited clause.

I'm thinking of refactoring my code so that instead of joining together
Query objects into a large BooleanQuery, it passes around BooleanClauses
and assembles them into a single BooleanQuery.

Thanks for your help,
Luke





Re: can lucene be backed to have an update field

2004-11-09 Thread Paul Elschot
Chris,

On Tuesday 09 November 2004 22:54, Chris Fraschetti wrote:
 Is it possible to modify the lucene source to create an
 updateDocument(doc#, FIELD, value)  function ? 

It's possible, but an implementation would not be efficient
when the field is indexed. The current index structure
has no room to spare for insertions, and no provision for
deleted terms.

Some time ago an extra level was added in the index
for skipping ahead more efficiently. Perhaps that could
be combined with a gap for insertions. But when such a gap
would fill up there would again be no choice but to delete and add 
the changed document.
Also adding a document without optimizing is quite efficient
already, so there is probably not much interest in adding
such gaps.

In case the field is stored only and the value would have the
same length as the currently stored value it would be possible
to replace the value efficiently.
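That last point can be illustrated with plain file I/O: overwriting a stored value in place is only safe when the new value has exactly the same byte length as the old one. A hypothetical Python sketch (not a Lucene API; the file layout is invented for illustration):

```python
# In-place update of a stored value of identical length: seek to the
# value's offset and overwrite it, leaving the rest of the file intact.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "stored.dat")
with open(path, "wb") as f:
    f.write(b"doc1:AAAA;doc2:BBBB;")   # two fixed-length stored values

# Replace doc1's value (offset 5, length 4) with a same-length value.
with open(path, "r+b") as f:
    f.seek(5)
    f.write(b"CCCC")                    # same byte length: safe in place

with open(path, "rb") as f:
    print(f.read())                     # b'doc1:CCCC;doc2:BBBB;'
```

A value of a different length would shift every byte after it, which is why the general case still requires deleting and re-adding the document.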

The only updates available are on the field norms.
 
Regards,
Paul Elschot





Re: What is the difference between these searches?

2004-11-09 Thread Paul Elschot
On Tuesday 09 November 2004 23:14, Luke Francl wrote:
 I'm familiar with Lucene's restrictions on prohibited queries, and I
 have a required clause for a field that will always be part of the query
 (it's not a nonsense value, it's the item type of the object in a CMS). 

That might also be mapped  to a filter.
 
 My problem is that I have been considering the whole query object that
 I've generated. Every BooleanQuery that's a part of my finished query
 must also have a required clause if it has a prohibited clause.
 
 I'm thinking of refactoring my code so that instead of joining together
 Query objects into a large BooleanQuery, it passes around BooleanClauses
 and assembles them into a single BooleanQuery.

It may not be possible to flatten a boolean query to a single level, e.g.:

(+aa +bb) (+cc +dd)
+(a1 a2) +(b1 b2)

These will generate nested BooleanQuerys, IIRC.

Regards,
Paul





Re: Lucene external field storage contribution.

2004-11-09 Thread Terry Steichen
Kevin,

Sorry for the delay in replying.  I think your idea for an external field 
storage mechanism is excellent.  I'd love to see it, and if I can, will be 
willing to help make that happen.

Regards,

Terry
  - Original Message - 
  From: Kevin A. Burton 
  To: Lucene Users List 
  Sent: Sunday, November 07, 2004 4:47 PM
  Subject: Lucene external field storage contribution.


  About 3 months ago I developed an external storage engine which ties into 
  Lucene. 

  I'd like to discuss making a contribution so that this is integrated 
  into a future version of Lucene.

  I'm going to paste my original PROPOSAL in this email. 

  There wasn't a ton of feedback the first time around, but I figure the 
  squeaky wheel gets the grease...


  
  
   I created this proposal because we need this fixed at work. I want to 
   go ahead and work on a vertical fix for our version of lucene and then 
   submit this back to Jakarta.
   There seems to be a lot of interest here and I wanted to get feedback 
   from the list before moving forward ...
  
   Should I put this in the wiki?!
  
   Kevin
  
   ** OVERVIEW **
  
   Currently Lucene supports 'stored fields', where the content of these 
   fields is kept within the Lucene index for future use.
  
   While acceptable for small indexes, larger amounts of stored fields 
   prevent:
  
   - Fast index merges since the full content has to be continually merged.
  
   - Storing the indexes in memory (since a LOT of memory would be 
   required and
   this is cost prohibitive)
  
   - Fast queries since block caching can't be used on the index data.
  
   For example in our current setup our index size is 20G.  Nearly 90% of 
   this is
   content.  If we could store the content outside of Lucene our merges and
   searches would be MUCH faster.  If we could store the index in MEMORY 
   this could
   be orders of magnitude faster.
  
   ** PROPOSAL **
  
   Provide an external field storage mechanism which supports legacy indexes
   without modification.  Content is stored in a content segment. The only
   changes would be a field with 3(or 4 if checksum enabled) values.
  
   - CS_SEGMENT
  
 Logical ID of the content segment.  This is an integer value.  
   There is
 a global Lucene property named CS_ROOT which stores all the 
   content.
 The segments are just flat files with pointers.  Segments are 
   broken
 into logical pieces by time and size.  Usually 100M of content 
   would be
 in one segment.
  
   - CS_OFFSET
  
 The byte offset of the field.
  
   - CS_LENGTH
  
 The length of the field.
  
   - CS_CHECKSUM
  
 Optional checksum to verify that the content is correct when 
   fetched
 from the index.
  
   - The field value here would be exactly 'N:O:L' where N is the segment 
   number,
 O is the offset, and L is the length.  O and L are 64bit values.  N 
   is a 32
 bit value (though 64bit wouldn't really hurt).
  
   This mechanism allows for the external storage of any named field.

   CS_OFFSET, and CS_LENGTH allow use with RandomAccessFile and new NIO 
   code for
   efficient content lookup.  (Though filehandle caching should probably 
   be used).
  
   Since content is broken into logical 100M segments, the underlying 
   filesystem can organize the file into contiguous blocks for efficient, 
   non-fragmented lookup.
  
   File manipulation is easy and indexes can be merged by simply 
   concatenating the
   second file to the end of the first.  (Though the segment, offset, and 
   length
   need to be updated).  (FIXME: I think I need to think about this more 
   since I
   will have  100M per syncs)
  
   Supporting full Unicode is important.  Full java.lang.String storage 
   is used with String.getBytes(), so we should be able to avoid Unicode 
   issues.  If Java has a correct java.lang.String representation, it's 
   possible to easily add Unicode support just by serializing the byte 
   representation. (Note that the JDK says that the DEFAULT system char 
   encoding is used, so if this is ever changed it might break the index.)
  
   While Linux and modern versions of Windows (not sure about OSX) 
   support 64bit filesystems, the 4G storage boundary of 32bit 
   filesystems (ext2 is an example) is an issue.  Using smaller indexes 
   can prevent this, but eventually segment lookup in the filesystem 
   will be slow.  This will only happen within terabyte storage systems, 
   so hopefully the developer has migrated to another (modern) 
   filesystem such as XFS.
  
   ** FEATURES **
  
 - Must be able to replicate indexes easily to other hosts.
  
 - Adding content to the index must be CHEAP
  
 - Deletes need to be cheap (these are cheap for older content.  Just 
   discard
   older indexes)
  
 - Filesystem needs to be able to optimize storage
  
 - Must support UNICODE and binary content (images, 
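The "N:O:L" pointer scheme at the heart of the proposal can be sketched with plain files. The following Python fragment is an illustrative model under the proposal's assumptions (a CS_ROOT directory holding flat segment files); the function names are invented, not part of the proposed patch:

```python
# Model of external field storage: the value kept in the Lucene index is
# just a "segment:offset:length" pointer, and the content bytes live in
# a flat segment file read back via seek().
import os
import tempfile

cs_root = tempfile.mkdtemp()            # stand-in for the CS_ROOT property

def append_content(segment, data):
    """Append bytes to a segment file; return the 'N:O:L' pointer."""
    path = os.path.join(cs_root, "segment_%d.dat" % segment)
    with open(path, "ab") as f:
        offset = f.tell()               # current end of file
        f.write(data)
    return "%d:%d:%d" % (segment, offset, len(data))

def fetch_content(pointer):
    """Resolve an 'N:O:L' pointer back to the stored bytes."""
    segment, offset, length = map(int, pointer.split(":"))
    path = os.path.join(cs_root, "segment_%d.dat" % segment)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

p1 = append_content(0, b"<doc>first</doc>")
p2 = append_content(0, b"<doc>second</doc>")
print(p1, fetch_content(p1))   # 0:0:16 b'<doc>first</doc>'
print(fetch_content(p2))       # b'<doc>second</doc>'
```

Appending is cheap (a single sequential write), and merging two segment files really is just concatenation followed by rewriting the offsets, which is the property the proposal relies on.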

LUCENE + DATA RETRIEVAL

2004-11-09 Thread Karthik N S

Hi guys,


Apologies...


Has anyone on the forum attempted to retrieve data from and index
Macromedia FLASH-based files?
If there is an example, please share it; it may be useful for
developers.


Thx in advance






  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







Lucene1.4.1 + OutOf Memory

2004-11-09 Thread Karthik N S

Hi guys,

Apologies..


History:

First setup: 4 subindexes + MultiSearcher + search on the Content field
only, for 2000 hits
Exception: [ Too many files open ]


Second setup: 40 merged indexes [1000 subindexes each] + MultiSearcher
/ParallelSearcher + search on the Content field only, for 2 hits
Exception: [ Out of memory ]


System config [same for both setups]:

AMD processor [high-end single]
RAM: 1 GB
OS: Linux ( jantoo type )
Appserver: Tomcat 5.05
JDK [ IBM Blackdown-1.4.1-01 ( == JDK 1.4.1 ) ]

The index contains 15 fields.
Search is done on only 1 field;
11 corresponding fields are retrieved;
3 fields are for debug details.


We switched from the first setup to the second.

Can somebody suggest why this is happening?

Thx in advance




  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







Re: Lucene1.4.1 + OutOf Memory

2004-11-09 Thread yahootintin-lucene
There is a memory leak in the sorting code of Lucene 1.4.1. 
1.4.2 has the fix!






Re: Locking issue

2004-11-09 Thread yahootintin . 1247688
Otis or Erik, do you know if a Reader continuously opening should cause the
Writer to fail with a "Lock obtain timed out" error?



--- Lucene Users List [EMAIL PROTECTED] wrote:

 The attached Java file shows a locking issue that occurs with
 Lucene.

 One thread opens and closes an IndexReader.  The other thread
 opens an IndexWriter, adds a document and then closes the
 IndexWriter.  I would expect that this app should be able to
 happily run without any issues.

 It fails with:

   java.io.IOException: Lock obtain timed out

 Is this expected?  I thought a Reader could be opened while a
 Writer is adding a document.

 Any help is appreciated.
