Re: Index-Format difference between 1.4.3 and 2.0

2006-08-25 Thread lude

Hi Andrzej,

a month ago you mentioned a new Lucene 2.0-compatible version of Luke.
Does it exist somewhere?

Thanks
lude


On 7/20/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:


lude wrote:
>> As Luke was released with Lucene 1.9 
>
> Where did you get this information? From all I know Luke is based on
> Lucene
> Version 1.4.3.
>

The latest version of Luke was released with an early snapshot of 1.9. I
plan to release a 2.0-based version in a few days.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Index-Format difference between 1.4.3 and 2.0

2006-08-25 Thread Gopikrishnan Subramani

Not sure if it helps, but I have been using Luke (the Web Start version) from
its website for quite some time now to inspect and manipulate my
indexes built with Lucene 2.0. I may not be a power user of Luke in that
sense, but I haven't hit any issues with the basic features.

Gopi


On 8/25/06, lude <[EMAIL PROTECTED]> wrote:


Hi Andrzej,

a month ago you mentioned a new Lucene 2.0 compatible Version of luke.
Does it exist somewhere?

Thanks
lude


On 7/20/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> lude wrote:
> >> As Luke was release with a Lucene-1.9 
> >
> > Where did you get this information? From all I know Luke is based on
> > Lucene
> > Version 1.4.3.
> >
>
> The latest version of Luke was released with an early snapshot of 1.9. I
> plan to release a 2.0-based version in a few days.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>




Re: Upgrade from 1.4.3 to 1.9.1. Any problems with using existing index files?

2006-08-25 Thread Michael McCandless


We are upgrading from Lucene 1.4.3 to 1.9.1, and have many customers 
with large existing index files. In our testing we have reused large 
indexes created in 1.4.3 in 1.9.1 without incident. We have looked 
through the changelog and the code and can't see any reason there should 
be any problems doing so.


So, we're just wondering, has anyone had any problems, or is there 
anything we need to look out for?


Looking at the code and also at the file formats specification:

  http://lucene.apache.org/java/docs/fileformats.html

I believe this is completely fine.  Meaning, the 1.9.x code can open the 
older index format for both searching and writing (either deleting or 
adding documents) without issue.
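For concreteness, here is a minimal sketch of "searching and writing" an existing index through the 1.9.x API. The index path and the "id" field are made-up examples; nothing here is upgrade-specific, which is the point: the old-format index is opened with the ordinary calls.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class UpgradeCheck {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/old-index";   // hypothetical 1.4.3-built index

        // Searching the old-format index with the 1.9.x jars.
        IndexSearcher searcher = new IndexSearcher(path);
        Hits hits = searcher.search(new TermQuery(new Term("id", "42")));
        System.out.println(hits.length() + " hits");
        searcher.close();

        // Deleting from it.
        IndexReader reader = IndexReader.open(path);
        reader.deleteDocuments(new Term("id", "42"));
        reader.close();

        // Appending to it (create == false keeps the existing segments).
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("id", "43", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```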


Mike




Index Stat Functions

2006-08-25 Thread Mag Gam

Hi All,

I am trying to get some stats on my index, such as:

1) When it was created
2) The size of the index in MB
3) The name, size, and last-modification date of each file I indexed
(similar to a Unix "ls -la /path/to/file"). For example, if I index 100
files, is it possible to get that information for each of them?

tia
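For what it's worth: as far as I know Lucene does not record (1) itself, but the filesystem does, and (2) can be read with plain java.io from the index directory. For (3), the name, size, and date of each *source* file would have to be stored as fields at indexing time and read back from the hits. A rough sketch of the filesystem side (the "index" directory path is an assumption):

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

public class IndexStats {

    // Total size in bytes of the files directly inside dir (e.g. the index directory).
    public static long totalSize(File dir) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files != null) {
            for (int i = 0; i < files.length; i++) {
                if (files[i].isFile()) total += files[i].length();
            }
        }
        return total;
    }

    // An "ls -la"-style listing: name, size in bytes, last-modified date.
    public static void listFiles(File dir) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        File[] files = dir.listFiles();
        for (int i = 0; files != null && i < files.length; i++) {
            if (files[i].isFile()) {
                System.out.println(files[i].getName() + "\t" + files[i].length()
                        + "\t" + fmt.format(new Date(files[i].lastModified())));
            }
        }
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : "index"); // assumed path
        listFiles(indexDir);
        System.out.println("Total: " + (totalSize(indexDir) / (1024.0 * 1024.0)) + " MB");
    }
}
```

The oldest file's lastModified() is a reasonable proxy for (1).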


what do i get with FieldCache.DEFAULT.getStrings(...);

2006-08-25 Thread Martin Braun
hello,
I am using FieldCache.DEFAULT.getStrings in combination with my own
HitCollector (I loop through all results and count the number of
occurrences of a field value in the results).

My problem is that I have field values like dt.|lat or ger.|eng., and it
seems that only the last token of the field's value is stored in the
array returned by FieldCache.DEFAULT.getStrings(is.getIndexReader(),
category).

But both values are stored in the index (I can find dt. and lat.)

The same issue occurs with another field that contains the word cd-rom: I
get only "rom" back.

Is this an analyzer problem? How do I get all tokens?



tia,
martin







Re: controlled vocabulary

2006-08-25 Thread Zhao, Xin

Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we apply one doc for each MeSH term, it would be 26 docs for each 
item digested (we actually need the top 25 MeSH terms generated). Would it 
be a problem if there are too many documents? If we apply field names like 
"mesh_1", "mesh_2", ..., then when it comes to search we will have to generate 
a loop for each single one of the query terms (there will be more than 20-30 
terms on average, since we are using the semantic web to implement concept 
search). Do you think it would affect the performance in a very bad way?

Regards,
Xin


- Original Message - 
From: "Dedian Guo" <[EMAIL PROTECTED]>

To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u can 
group
your mesh terms as one string then add into a field, which requires a 
simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from medical
citations, and then use Lucene to save the terms and citations for future
search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField = Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myField);
end for

as you could see we generate all the terms under named field "mesh". If I
understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized into
filed boost. In this case, we wouldn't be able to save separate score, so
the information is lost. Am I correct? Is there anyway we could change 
it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin











Re: Lucene vs Database Search

2006-08-25 Thread Chris Lu

Performance-wise, Lucene search is much faster for full-text search.
If you only search by "Employee ID", or do exact matches on names,
a database search can already do a good job.

As for index maintenance, you should have an updated_at
column for each record, select the latest records out, and do an
"update" to the index periodically.
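As a sketch, the "update" step might look like this per changed row (the field names, and the idea that employee_id is the record's primary key, are my assumptions; the SQL side that selects rows by updated_at is omitted):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IncrementalUpdater {
    // Called for every row whose updated_at is newer than the last run.
    public static void upsert(String indexPath, String id, String first, String last)
            throws Exception {
        // 1) Drop any stale copy of this record (a no-op if it was never indexed).
        IndexReader reader = IndexReader.open(indexPath);
        reader.deleteDocuments(new Term("employee_id", id));
        reader.close();

        // 2) Re-add the current version. Exact-match fields stay untokenized.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("employee_id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("first_name", first, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("last_name", last, Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}
```

In practice you would batch all deletes through one reader and all adds through one writer per run, rather than reopening per record.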

Chris Lu
-
Lucene Search for Any Databases/Applications
http://www.dbsight.net

On 8/24/06, kalpesh patel <[EMAIL PROTECTED]> wrote:

Hi,

  I have an application with a large number of records (around 1.2 million), 
with a possibility of doubling every year. On average around 3000 records are 
added per day, distributed over the day. An inserted record has to be 
searchable immediately once it is entered into the database and the index 
updated. I have created a Lucene index, and its size is around 0.5 GB.

  The search does NOT require full-text search. It just includes search by First 
Name, Last Name, and Employee ID.

  Which would be the better solution in the existing situation and in the long 
run: keeping all the searchable records in one database table (issuing a select 
query against one table) or using a Lucene index?

  Thanks in advance.

  -Kalpesh








Re: what do i get with FieldCache.DEFAULT.getStrings(...);

2006-08-25 Thread Chris Lu

I'm not sure of the solution, but FieldCache.DEFAULT.getStrings()
returns a String[], with one String for each document. It seems your
field is analyzed into multiple String values, so only one of them
can end up in that per-document slot.
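To illustrate the one-term-per-document contract (the "category" field name and index path are assumptions): if the whole value is indexed as a single term, each slot of the returned array holds the full value instead of the last analyzed token.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;

public class CategoryCacheDemo {
    // At indexing time: one term per document, so the cache sees the whole value.
    static Document makeDoc(String category) {
        Document doc = new Document();
        doc.add(new Field("category", category,          // e.g. "dt.|lat"
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // At search time: one value per document, indexed by doc id.
    static String[] loadCategories(String indexPath) throws Exception {
        IndexSearcher is = new IndexSearcher(indexPath); // hypothetical path
        return FieldCache.DEFAULT.getStrings(is.getIndexReader(), "category");
    }
}
```

The HitCollector can then split each cached value on "|" to count the individual sub-values.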

Chris Lu
---
Lucene Search on Any Databases/Applications
http://www.dbsight.net

On 8/25/06, Martin Braun <[EMAIL PROTECTED]> wrote:

hello,
I am using FieldCache.DEFAULT.getStrings in combination with an own
HitCollector (I loop through all results and count the number of
occurences of a fieldvalue in the results).

My Problem is that I have Filed values like dt.|lat or ger.|eng. an it
seems that only the last token of the fields value is stored in the
returned array of FieldCache.DEFAULT.getStrings(is.getIndexReader(),
category).

But both values are Stored in the Index (I can find dt. and lat.)

The same issue is with another field which contains the word cd-rom an I
get only "rom" back.

Is this an Analyzer Problem? How do I get all tokens?



tia,
martin











Re: WIll storing docs affect lucene's search performance ?

2006-08-25 Thread Rupinder Singh Mazara

Where can I find information on which version / tag to check out so as to
get the lazy-loading variety of Lucene?


Grant Ingersoll wrote:


Large stored fields can affect performance when you are iterating over 
your hits (assuming you are not interested in the value of the stored 
field at that point in time) for a results display since all Fields 
are loaded when getting the Document.  The SVN trunk has a version of 
lazy loading that allows you to specify which fields are loaded and 
which ones are lazy, so you can avoid loading fields that a user will 
never view.


-Grant

On Aug 11, 2006, at 9:07 AM, Prasenjit Mukherjee wrote:

I have a requirement ( use highlighter) to  store the doc content 
somewhere., and I am not allowed to use a RDBMS. I am thinking of 
using Lucene's Field with (Field.Store.YES and Field.Index.NO) to 
store the doc content. Will it have any negative affect on my search 
performance ?
I think I have read somewhere that  Lucene shouldn't be used(or 
misused)  to provide RDBMS like storage.


--prasen




--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886

















Re: controlled vocabulary

2006-08-25 Thread Rupinder Singh Mazara

hi Xin

 take a look at this: you can add multiple fields with the name 
"mesh":

for (int i = 0; i < meshList.size(); i++) {
   MeshTerm meshTerm = (MeshTerm) meshList.get(i);
   document.add(new Field("mesh", meshTerm.semanticWebConceptId,
       Field.Store.YES, Field.Index.NO_NORMS));
}

 when querying this index, create an analyzer that infers the text 
string and generates ids that correspond to the mesh terms in the 
semantic web




Zhao, Xin wrote:

Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we apply one doc for each MeSH term, it would be 26 docs 
for each item digested(we actually need the top 25 MeSH terms 
generated), would it be any problem if there are too many documents? 
If we apply field name like "mesh_1", "mesh_2"..., when it comes to 
search, we will have to generate a loop for each single one of the 
query terms( there will be more than 20-30 terms on average, since we 
are using sematic web to implement concept search), do you think it 
would affect the performance in a very bad way?

Regards,
Xin


- Original Message - From: "Dedian Guo" <[EMAIL PROTECTED]>
To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u 
can group
your mesh terms as one string then add into a field, which requires a 
simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from 
medical
citations, and then use Lucene to save the terms and citations for 
future

search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for

as you could see we generate all the terms under named field "mesh". 
If I

understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized into
filed boost. In this case, we wouldn't be able to save separate 
score, so
the information is lost. Am I correct? Is there anyway we could 
change it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin




















Re: controlled vocabulary

2006-08-25 Thread Zhao, Xin

Hi, Rupinder,
My understanding is that Field.Index.NO_NORMS disables index-time boosting and 
field-length normalization at the same time. But I do need index-time 
boosting to store the scoring of each mesh term. Have I missed anything?

Thank you very much for your help,
Xin

- Original Message - 
From: "Rupinder Singh Mazara" <[EMAIL PROTECTED]>

To: 
Sent: Friday, August 25, 2006 10:49 AM
Subject: Re: controlled vocabulary



hi Xin

 this is take a look at this you can add multiple fields with the name 
mesh

for ( i=0; i< meshList.size() ; i++ ){
   meshTerm = meshList.get(i)
 document.addField( new Field( "mesh", meshTerm.semanticWebConceptId, 
Field.Store.YES , Field.Index.NO_NORMS  );

}

 when querying this index, create a analyzer that infers the text string 
and generates id's that correspond to the mesh term in the semantic web




Zhao, Xin wrote:

Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we apply one doc for each MeSH term, it would be 26 docs for 
each item digested(we actually need the top 25 MeSH terms generated), 
would it be any problem if there are too many documents? If we apply 
field name like "mesh_1", "mesh_2"..., when it comes to search, we will 
have to generate a loop for each single one of the query terms( there 
will be more than 20-30 terms on average, since we are using sematic web 
to implement concept search), do you think it would affect the 
performance in a very bad way?

Regards,
Xin


- Original Message - From: "Dedian Guo" <[EMAIL PROTECTED]>
To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u can 
group
your mesh terms as one string then add into a field, which requires a 
simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from 
medical
citations, and then use Lucene to save the terms and citations for 
future

search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for

as you could see we generate all the terms under named field "mesh". If 
I

understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized into
filed boost. In this case, we wouldn't be able to save separate score, 
so
the information is lost. Am I correct? Is there anyway we could change 
it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin
























Re: controlled vocabulary

2006-08-25 Thread Rupinder Singh Mazara

Hi Xin

  Then perhaps you can change it to Field.Index.TOKENIZED. I was 
not aware that PubMed boosts MeSH terms; it broadly classifies terms as 
major and minor. If you plan to use this simple system of classification, 
consider adding the major terms twice to the document?
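A sketch of that suggestion, with a hypothetical MeshTerm value object (not part of Lucene) holding the term text and a major/minor flag. Repeating the field raises the term's frequency in the document, which raises its score under the default similarity:

```java
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MeshIndexing {
    // Hypothetical value object for illustration only.
    static class MeshTerm {
        String name; boolean major;
        MeshTerm(String name, boolean major) { this.name = name; this.major = major; }
    }

    static void addMeshTerms(Document doc, List meshList) {
        for (Iterator it = meshList.iterator(); it.hasNext(); ) {
            MeshTerm t = (MeshTerm) it.next();
            int copies = t.major ? 2 : 1;      // major terms are counted twice
            for (int i = 0; i < copies; i++) {
                doc.add(new Field("mesh", t.name.toLowerCase(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
        }
    }
}
```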


Zhao, Xin wrote:

Hi, Rupinder,
My understanding is Field.Index.NO_NORMS disables  index-time boosting 
and field length normalization at the same time. But I do need 
index-time boosting to store the scoring of each mesh term. Have I 
missed anything?

Thank you very much for your help,
Xin

- Original Message - From: "Rupinder Singh Mazara" 
<[EMAIL PROTECTED]>

To: 
Sent: Friday, August 25, 2006 10:49 AM
Subject: Re: controlled vocabulary



hi Xin

 this is take a look at this you can add multiple fields with the 
name mesh

for ( i=0; i< meshList.size() ; i++ ){
   meshTerm = meshList.get(i)
 document.addField( new Field( "mesh", meshTerm.semanticWebConceptId, 
Field.Store.YES , Field.Index.NO_NORMS  );

}

 when querying this index, create a analyzer that infers the text 
string and generates id's that correspond to the mesh term in the 
semantic web




Zhao, Xin wrote:

Hi,
Thank you for your reply. I had thought about the first two 
solutions before. If we apply one doc for each MeSH term, it would 
be 26 docs for each item digested(we actually need the top 25 MeSH 
terms generated), would it be any problem if there are too many 
documents? If we apply field name like "mesh_1", "mesh_2"..., when 
it comes to search, we will have to generate a loop for each single 
one of the query terms( there will be more than 20-30 terms on 
average, since we are using sematic web to implement concept 
search), do you think it would affect the performance in a very bad 
way?

Regards,
Xin


- Original Message - From: "Dedian Guo" <[EMAIL PROTECTED]>
To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u 
can group
your mesh terms as one string then add into a field, which requires 
a simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from 
medical
citations, and then use Lucene to save the terms and citations for 
future

search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for

as you could see we generate all the terms under named field 
"mesh". If I

understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized 
into
filed boost. In this case, we wouldn't be able to save separate 
score, so
the information is lost. Am I correct? Is there anyway we could 
change it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin

































Re: controlled vocabulary

2006-08-25 Thread Zhao, Xin

Hi, Rupinder,
Our algorithm is a little different from what PubMed does. We have scoring 
for each mesh term, which will affect the search result.

What do you think the difference would be between these two:
document.add(Field.Keyword("mesh", ""));
and
document.add(new Field("mesh", "", Field.Store.YES, 
Field.Index.TOKENIZED));


Thank you,
Xin



- Original Message - 
From: "Rupinder Singh Mazara" <[EMAIL PROTECTED]>

To: 
Sent: Friday, August 25, 2006 11:27 AM
Subject: Re: controlled vocabulary



Hi Xin

  then perhaps you can change it to Field.Index.TOKENIZED, but i was not 
aware that pubmed boosts mesh terms, they broadly classify terms as major 
and minor, if you plan to use this simple system of classification 
consider adding the major terms twice to the document ?


Zhao, Xin wrote:

Hi, Rupinder,
My understanding is Field.Index.NO_NORMS disables  index-time boosting 
and field length normalization at the same time. But I do need index-time 
boosting to store the scoring of each mesh term. Have I missed anything?

Thank you very much for your help,
Xin

- Original Message - From: "Rupinder Singh Mazara" 
<[EMAIL PROTECTED]>

To: 
Sent: Friday, August 25, 2006 10:49 AM
Subject: Re: controlled vocabulary



hi Xin

 this is take a look at this you can add multiple fields with the name 
mesh

for ( i=0; i< meshList.size() ; i++ ){
   meshTerm = meshList.get(i)
 document.addField( new Field( "mesh", meshTerm.semanticWebConceptId, 
Field.Store.YES , Field.Index.NO_NORMS  );

}

 when querying this index, create a analyzer that infers the text string 
and generates id's that correspond to the mesh term in the semantic web




Zhao, Xin wrote:

Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we apply one doc for each MeSH term, it would be 26 docs for 
each item digested(we actually need the top 25 MeSH terms generated), 
would it be any problem if there are too many documents? If we apply 
field name like "mesh_1", "mesh_2"..., when it comes to search, we will 
have to generate a loop for each single one of the query terms( there 
will be more than 20-30 terms on average, since we are using sematic 
web to implement concept search), do you think it would affect the 
performance in a very bad way?

Regards,
Xin


- Original Message - From: "Dedian Guo" <[EMAIL PROTECTED]>
To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u can 
group
your mesh terms as one string then add into a field, which requires a 
simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from 
medical
citations, and then use Lucene to save the terms and citations for 
future

search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for

as you could see we generate all the terms under named field "mesh". 
If I

understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized 
into
filed boost. In this case, we wouldn't be able to save separate 
score, so
the information is lost. Am I correct? Is there anyway we could 
change it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin





































Re: Test new query parser?

2006-08-25 Thread Mark Miller
I have received a few inquires about my new query parser. I apologize 
for making that announcement a little premature. My current 
implementation only allows simple mixing of proximity queries with 
boolean queries...complex mixing would result in an incorrect search. A 
reply to my first email made me consider this more (I had done that part 
a while ago) and I came to the conclusion that it was obviously 
unacceptable to release the parser to anyone in this hobbled form. The 
parser must support arbitrary mixing of boolean and proximity searches.


I think I have cracked this. I would say I am 90% of the way to a 
solution and can see the light at the end of the tunnel. When I have 
resolved this issue, I will contact those that have expressed interest 
and provide them with the parser. With some feedback and improvements I 
will think about how to release it generally.


- Mark




Re: controlled vocabulary

2006-08-25 Thread Zhao, Xin
Now I have second thoughts about one mesh term per document: the scoring 
formula (hits too) is based on the document, right? Does that mean we 
shouldn't have more than one document for each object indexed?
For example, when I index a publication, for some of the information, 
like title and abstract, I would like to store and index it using the default 
similarity, while for the other information I would like to use a customized 
similarity. I should probably use a different index directory and writer 
instead of two documents in the same index, right?
Thank you for helping me. You can see that I am in the early learning 
stage now.

Xin



- Original Message - 
From: "Zhao, Xin" <[EMAIL PROTECTED]>

To: 
Sent: Friday, August 25, 2006 10:21 AM
Subject: Re: controlled vocabulary



Hi,
Thank you for your reply. I had thought about the first two solutions 
before. If we apply one doc for each MeSH term, it would be 26 docs for 
each item digested(we actually need the top 25 MeSH terms generated), 
would it be any problem if there are too many documents? If we apply field 
name like "mesh_1", "mesh_2"..., when it comes to search, we will have to 
generate a loop for each single one of the query terms( there will be more 
than 20-30 terms on average, since we are using sematic web to implement 
concept search), do you think it would affect the performance in a very 
bad way?

Regards,
Xin


- Original Message - 
From: "Dedian Guo" <[EMAIL PROTECTED]>

To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library


in my solution, you can apply one doc for each mesh term, or apply 
different
keyword such as "mesh_1""mesh_10" for your top 10 terms...or u can 
group
your mesh terms as one string then add into a field, which requires a 
simple

string parser for the group string when you wanna read the terms...

not sure if that works or answers your question...

On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from 
medical
citations, and then use Lucene to save the terms and citations for 
future

search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
---
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for

as you could see we generate all the terms under named field "mesh". If 
I

understand correctly, all the fields under the same name would
eventually  save into one field, with all the scores be normalized into
filed boost. In this case, we wouldn't be able to save separate score, 
so
the information is lost. Am I correct? Is there anyway we could change 
it? I
understand Lucene is for keyword search, and what we try to do is 
Controlled

Vocabulary search, Any other tool we could use?

Thank you,
Xin















Re: WIll storing docs affect lucene's search performance ?

2006-08-25 Thread Grant Ingersoll

It is on the HEAD version in SVN.

See  http://wiki.apache.org/jakarta-lucene/SourceRepository for info  
on checking out from SVN.
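For reference, the lazy-loading API on the trunk looks roughly like this; class and method names are from the development trunk at the time and may still change before a release:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.SetBasedFieldSelector;
import org.apache.lucene.index.IndexReader;

public class LazyLoadDemo {
    // Load "title" eagerly, defer "content" until its value is actually asked for.
    static Document loadPartially(IndexReader reader, int docId) throws Exception {
        Set eager = new HashSet();
        eager.add("title");                    // loaded immediately
        Set lazy = new HashSet();
        lazy.add("content");                   // bytes read only on demand
        FieldSelector selector = new SetBasedFieldSelector(eager, lazy);
        Document doc = reader.document(docId, selector);
        String title = doc.get("title");       // already in memory
        // doc.getFieldable("content").stringValue() would trigger the lazy read.
        return doc;
    }
}
```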



-Grant

On Aug 25, 2006, at 10:44 AM, Rupinder Singh Mazara wrote:


Where can I find information which version / tag to checkout so as to
get the lazy loading verity of lucene


Grant Ingersoll wrote:


Large stored fields can affect performance when you are iterating  
over your hits (assuming you are not interested in the value of  
the stored field at that point in time) for a results display  
since all Fields are loaded when getting the Document.  The SVN  
trunk has a version of lazy loading that allows you to specify  
which fields are loaded and which ones are lazy, so you can avoid  
loading fields that a user will never view.


-Grant

On Aug 11, 2006, at 9:07 AM, Prasenjit Mukherjee wrote:

I have a requirement ( use highlighter) to  store the doc content  
somewhere., and I am not allowed to use a RDBMS. I am thinking of  
using Lucene's Field with (Field.Store.YES and Field.Index.NO) to  
store the doc content. Will it have any negative affect on my  
search performance ?
I think I have read somewhere that  Lucene shouldn't be used(or  
misused)  to provide RDBMS like storage.


--prasen

 



--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886

















--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886







Re: Will storing docs affect lucene's search performance ?

2006-08-25 Thread Grant Ingersoll

It is on the HEAD version in SVN.

See  http://wiki.apache.org/jakarta-lucene/SourceRepository for info  
on checking out from SVN.



--
Grant Ingersoll
http://www.grantingersoll.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Sharing Documents between Lucene and DotLucene

2006-08-25 Thread d rj

Hello-

I am just wondering if anyone has encountered any good strategies for
sharing search records between a Linux-based server using Lucene and a
Windows-based client using DotLucene.

I am doing all the indexing on the server (i.e. the master index is
contained on the server) and I would like to transfer parts of that index
across the wire to a client.

Presently I am creating a temporary sub-index on the server, adding the
appropriate Document objects to that index, then transferring the entire
index to the client, which then merges it into any existing index it may
already have.  However, I would like to avoid building/transferring a
sub-index.

I would like to know if anyone has attempted to directly marshall Document
objects from Java to C#, or if there are any other good approaches for
sharing individual Document objects between Lucene and DotLucene.

Thanks.
-drj


Re: controlled vocabulary

2006-08-25 Thread Dedian Guo

Hi Xin, in my understanding, a document in Lucene is a collection of
fields, and a field is a pair of keyword and value, though it can be
indexed or stored or both. That is a flat structure. If you want to index
a deep tree structure, such as complex objects, and keep those
relationships inside, I guess we need to do something tricky. So in my
suggested solution, I would do something with the keyword of a document
(here, a document represents an object...). The score problem you
mentioned in your question is similar; I mean, the score is actually an
attribute of the MeSH object, so you want to index information which has a
tree-like structure (I met a similar problem when indexing XML-based
pages, esp. those that have lots of deep element nodes, where a deep index
is needed for deep searching).

Correct me if I am wrong, or if there are better solutions...
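The "group your mesh terms as one string" idea from the earlier mail can be sketched with plain string handling. The delimiter, the "term=score" format, and the class name here are all made up for illustration -- this is not a Lucene API, just the packing/parsing step that would run before storing and after retrieving the field:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MeshFieldCodec {
    // Pack term/score pairs into one string, e.g. "asthma=0.91|lung=0.47".
    static String encode(Map<String, Float> termScores) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            if (sb.length() > 0) sb.append('|');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Parse the packed string back into term/score pairs.
    static Map<String, Float> decode(String packed) {
        Map<String, Float> out = new LinkedHashMap<>();
        for (String pair : packed.split("\\|")) {
            int i = pair.indexOf('=');
            out.put(pair.substring(0, i), Float.parseFloat(pair.substring(i + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("asthma", 0.91f);
        scores.put("lung", 0.47f);
        String packed = encode(scores);       // store this in one field
        System.out.println(packed);
        System.out.println(decode(packed));   // recover per-term scores
    }
}
```

The upside of this scheme is that the per-term scores survive in the stored value, instead of being folded into a single field boost; the downside is that the packed string is for display/retrieval, not for scoring at search time.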

On 8/25/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


Now I have a second thought about one MeSH term per document. The scoring
formula (and Hits too) is based on the document, right? Does it mean that
we shouldn't have more than one document for each object indexed?
For example, when I try to index a publication, for some of the
information, like title and abstract, I would like to store and index them
using the default similarity, while for the other information I would like
to use a customized similarity. I probably should use a different indexing
directory and writer instead of two documents in the same index, right?
Thank you for helping me. You can see that I am in the early learning
stage now.
Xin



- Original Message -
From: "Zhao, Xin" <[EMAIL PROTECTED]>
To: 
Sent: Friday, August 25, 2006 10:21 AM
Subject: Re: controlled vocabulary


> Hi,
> Thank you for your reply. I had thought about the first two solutions
> before. If we apply one doc for each MeSH term, it would be 26 docs for
> each item digested (we actually need the top 25 MeSH terms generated);
> would it be any problem if there are too many documents? If we apply a
> field name like "mesh_1", "mesh_2"..., when it comes to search, we will
> have to generate a loop for each single one of the query terms (there
> will be more than 20-30 terms on average, since we are using the
> semantic web to implement concept search). Do you think it would affect
> the performance in a very bad way?
> Regards,
> Xin
>
>
> - Original Message -
> From: "Dedian Guo" <[EMAIL PROTECTED]>
> To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
> Sent: Thursday, August 24, 2006 4:22 PM
> Subject: Re: controlled library
>
>
>> in my solution, you can apply one doc for each mesh term, or apply
>> different keywords such as "mesh_1"..."mesh_10" for your top 10
>> terms... or you can group your mesh terms as one string and then add it
>> into a field, which requires a simple string parser for the group
>> string when you wanna read the terms...
>>
>> not sure if that works or answers your question...
>>
>> On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>> I have a design question. Here is what we try to do for indexing:
>>> We designed an indexing tool to generate standard MeSH terms from
>>> medical
>>> citations, and then use Lucene to save the terms and citations for
>>> future
>>> search. The information we need to save are:
>>> a) the exact mesh terms (top 10)
>>> b) the score for each term
>>> so the codings are like
>>> ---
>>> for the top 10 MeSH Terms
>>> myField=Field.Keyword("mesh", mesh.toLowerCase());
>>> myField.setBoost(score);
>>> doc.add(myField);
>>> end for
>>> 
>>> as you could see, we generate all the terms under the named field
>>> "mesh". If I understand correctly, all the fields under the same name
>>> would eventually be saved into one field, with all the scores
>>> normalized into the field boost. In this case, we wouldn't be able to
>>> save separate scores, so the information is lost. Am I correct? Is
>>> there any way we could change it? I understand Lucene is for keyword
>>> search, and what we try to do is Controlled Vocabulary search. Any
>>> other tool we could use?
>>>
>>> Thank you,
>>> Xin
>>>
>>>
>>>
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Sharing Documents between Lucene and DotLucene

2006-08-25 Thread George Aroush
Hi,

I am the developer and maintainer of Lucene.Net.

DotLucene is the old name, Lucene.Net is the official name.  You can find
out more about Lucene.Net by visiting this link:
http://incubator.apache.org/lucene.net/

I am not sure what you mean by "marshall Document objects from Java to C#".
However, if you mean sharing an index created by Jakarta Lucene and having
it searched/updated by Lucene.Net, and vice versa, then the answer is yes.
In fact, if you share the lock file, you can have concurrent access and
updates to the Lucene index from both Jakarta Lucene and Lucene.Net.

As part of each Lucene.Net release, I always test and validate this test-case.

Regards,

-- George Aroush


-Original Message-
From: d rj [mailto:[EMAIL PROTECTED]
Sent: Friday, August 25, 2006 5:33 PM
To: java-user@lucene.apache.org
Subject: Sharing Documents between Lucene and DotLucene

Hello-

I am just wondering if anyone has encountered any good strategies for
sharing search records between a Linux-based server using Lucene and a
Windows-based client using DotLucene.

I am doing all the indexing on the server ( i.e. the master index is
contained on the server) and I would like to transfer parts of that index
across the wire to a client.

Presently I am creating a temporary sub-index on the server, adding the
appropriate Document objects to that index, then transferring the entire
index to the client, which then merges it into any existing index it may
already have.  However, I would like to avoid building/transferring a
sub-index.

I would like to know if anyone has attempted to directly marshall Document
objects from Java to C#, or if there are any other good approaches for
sharing individual Document objects between Lucene and DotLucene.

Thanks.
-drj


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: what do i get with FieldCache.DEFAULT.getStrings(...);

2006-08-25 Thread Chris Hostetter

FieldCache was designed with sorting in mind, where there can only be a
single indexed Term for each doc (otherwise how would you sort a doc that
had two Terms "a" and "z"?).  I'm actually surprised you are getting any
values out instead of an Exception.

If you index your field as UN_TOKENIZED you should get the results you
expect -- but then searching on individual words may not work the way you
expect; adding the data to two different fields (one TOKENIZED for
search and one UN_TOKENIZED for sorting/FieldCache) is the typical solution.
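That single-value constraint can be sketched with a plain array. This only mimics the shape of the cache (one slot per doc id, each term writing into its documents' slots); it is not the actual FieldCache code:

```java
public class SortCacheSketch {
    // Fill one sort-key slot per document. When a document has several
    // indexed terms, each later term overwrites the earlier one in its
    // slot -- which is why only the "last" token appears to survive.
    static String[] fillCache(String[][] termsPerDoc) {
        String[] sortKeys = new String[termsPerDoc.length];
        for (int doc = 0; doc < termsPerDoc.length; doc++) {
            for (String term : termsPerDoc[doc]) {
                sortKeys[doc] = term; // one slot per doc: last write wins
            }
        }
        return sortKeys;
    }

    public static void main(String[] args) {
        // doc 0 indexed one term; doc 1 indexed both "dt." and "lat"
        String[] keys = fillCache(new String[][] {{"ger."}, {"dt.", "lat"}});
        System.out.println(keys[0]);
        System.out.println(keys[1]); // only "lat" is left for doc 1
    }
}
```

That matches the symptom in the original mail: for a field tokenized into dt. and lat, only one of the tokens comes back per document.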

You may also want to look at the lazy field loading using the Fieldable APIs;
they are for accessing the STORED fields of a Document, and are
apparently much faster than the old method of pulling out the whole
Document ... whether they are as fast as FieldCache or not I don't know.


: Date: Fri, 25 Aug 2006 15:26:38 +0200
: From: Martin Braun <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
: To: java-user@lucene.apache.org
: Subject: what do i get with FieldCache.DEFAULT.getStrings(...);
:
: hello,
: I am using FieldCache.DEFAULT.getStrings in combination with my own
: HitCollector (I loop through all the results and count the number of
: occurrences of a field value in the results).
:
: My problem is that I have field values like dt.|lat or ger.|eng., and it
: seems that only the last token of the field's value is stored in the
: array returned by FieldCache.DEFAULT.getStrings(is.getIndexReader(),
: category).
:
: But both values are stored in the index (I can find dt. and lat.)
:
: The same issue is with another field which contains the word cd-rom, and
: I get only "rom" back.
:
: Is this an Analyzer problem? How do I get all the tokens?
:
:
:
: tia,
: martin
:
:
:
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



A problem on performance

2006-08-25 Thread luan xl
I have nearly 4 million Chinese documents, each ranging in size from 1k to
300k, so I use org.apache.lucene.analysis.cn.ChineseAnalyzer as the
analyzer for the text. The index has four fields:


content - tokenized, not stored
title - tokenized and stored
path - stored only
date - stored only

For some reason, I divide these documents into 12 sets and use an
IndexSearcher over a MultiReader for search. For all English queries the
speed is very fast, costing only about 10-100 ms. But when I use Chinese
words in the query, the situation is a bit confusing:
if the word is only one char, so that the Query is actually a TermQuery,
the speed is very fast. However, if the word is more than one char, the
Query is actually a PhraseQuery with slop 0, and

IndexSearcher usually costs 3000-5000 ms to return the Hits.

I have also tested with the QueryParser and get the same results. My
environment is a Dell PE2600, 2G*2 Xeon, 2G RAM, 1R/s SCSI, Debian/sarge,
Sun JDK 1.5 + Lucene 2.0.0.


thanks.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Incremental Updating

2006-08-25 Thread neils

Hi,

I have two applications on a Windows machine. One is the search engine
where the index can be searched.
The second application runs once a day and updates (deletions/additions)
the index.

My question:
the index is already opened (IndexReader) by the first application. Is
there a problem when the second application accesses the same index files
for updating at the same time? I tried it and I get no exception, but when
I search for the documents whose values were changed (first delete, then
add the new document), I can only find them with the old values, not with
the new ones.
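The behavior described is consistent with point-in-time (snapshot) semantics: an already-open reader keeps seeing the index as it was when it was opened, and has to be reopened to see later changes. A plain-Java sketch of that idea (not Lucene code; the "index" here is just a map):

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotDemo {
    // A "reader" sees a copy of the index taken when it was opened.
    static Map<String, String> openReader(Map<String, String> index) {
        return new HashMap<>(index);
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("doc1", "old value");

        Map<String, String> reader = openReader(index); // app 1 opens a reader

        index.remove("doc1");                // app 2 deletes the document ...
        index.put("doc1", "new value");      // ... and adds the new version

        System.out.println(reader.get("doc1"));            // still the old value
        System.out.println(openReader(index).get("doc1")); // reopened: new value
    }
}
```

If that matches what is happening here, closing and reopening the IndexReader in the first application after the daily update would be the thing to check.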


Thanks a lot for your help :-))

-- 
View this message in context: 
http://www.nabble.com/Incemental-Updating-tf2168389.html#a5995363
Sent from the Lucene - Java Users forum at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]