Re: Near Duplicate Documents

2007-11-18 Thread rishabh9

Can anyone help me?

Rishabh


rishabh9 wrote:
 
 Hi,
 
 I am evaluating Solr 1.2 for my project and wanted to know if it can
 return near duplicate documents ("near dups") and how I would go about it.
 I am not sure, but is MoreLikeThisHandler the implementation for near dups?
 
 Rishabh
 
 




Re: Query multiple fields

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 1:50 AM, Dave C. [EMAIL PROTECTED] wrote:
 Maybe you can help me with this related problem I am having.
 My query is: q=description:(test)&!(type:10)&!(type:14).

 However, my results are not as expected (55 results instead of the expected 
 23)

 The response header shows:
 "responseHeader":{
   "status":0,
   "QTime":1,
   "params":{
 "wt":"json",
 "!(type:10)":"",
 "!(type:14)":"",
 "indent":"on",
 "q":"description:(test)",
 "fl":"*"}},

 I am confused about why the !(type:10)&!(type:14) is not in the 'q' 
 parameter.

Looks like the first & in your query is being interpreted as a
divider in the query string.  You probably need to escape every & as
%26 in your query.
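
For example (untested), the escaped form of the query above would be:

  q=description:(test)%26!(type:10)%26!(type:14)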

-Stuart


Re: Near Duplicate Documents

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
 We have a scenario where we want to find documents which are similar in
 content. To elaborate a little more on what we mean here, let's take an
 example.

 This email thread we are interacting in can be used to illustrate the
 concept of near dupes (we are not confusing them with threads; they are two
 different things). Each email in this thread is treated as a document by
 the system. A reply to the original mail also includes the original mail,
 in which case it becomes a near duplicate of the original mail (depending
 on the percentage of similarity). Similarly it goes on. The near dupes need
 not be limited to emails.

I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.

-Stuart


Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.

This email thread we are interacting in can be used to illustrate the
concept of near dupes (we are not confusing them with threads; they are two
different things). Each email in this thread is treated as a document by the
system. A reply to the original mail also includes the original mail, in
which case it becomes a near duplicate of the original mail (depending on
the percentage of similarity). Similarly it goes on. The near dupes need not
be limited to emails.

If we want to have such a capability using Solr, can we use
MoreLikeThisHandler, or is there another appropriate handler in Solr that
we can use? What is the best way to achieve such functionality?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley [EMAIL PROTECTED] wrote:

 I'm not sure I understand your question...

 A near duplicate document could mean a LOT of things depending on the
 context.

 perhaps you just need fuzzy searching?
 http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches

 or proximity searches?

 http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
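
 For example (syntax from those pages):

   roam~0.8              (fuzzy: terms similar to "roam")
   "original mail"~10    (proximity: terms within 10 positions of each other)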


 MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
 used to search for other similar documents based on the results of
 another query.

 ryan


 rishabh9 wrote:
  Can anyone help me?
 
  Rishabh
 
 
  rishabh9 wrote:
  Hi,
 
  I am evaluating Solr 1.2 for my project and wanted to know if it can
  return near duplicate documents ("near dups") and how I would go about
  it. I am not sure, but is MoreLikeThisHandler the implementation for
  near dups?
 
  Rishabh
 
 
 




Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
Is there any plan to implement that feature in an upcoming release?

Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

 On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
  We have a scenario where we want to find documents which are similar in
  content. To elaborate a little more on what we mean here, let's take an
  example.
 
  This email thread we are interacting in can be used to illustrate the
  concept of near dupes (we are not confusing them with threads; they are
  two different things). Each email in this thread is treated as a document
  by the system. A reply to the original mail also includes the original
  mail, in which case it becomes a near duplicate of the original mail
  (depending on the percentage of similarity). Similarly it goes on. The
  near dupes need not be limited to emails.

 I think this is what's known as shingling.  See
 http://en.wikipedia.org/wiki/W-shingling
 Lucene (and therefore Solr) does not implement shingling.  The
 MoreLikeThis query might be close enough, however.

 -Stuart



Performance of Solr on different Platforms

2007-11-18 Thread Eswar K
Hi,

I understand that Solr can be used on different Linux flavors. Is there any
preferred flavor (like Red Hat, Ubuntu, etc.)?
Also, what kind of hardware configuration (processors, RAM, etc.) would be
best suited for the install?
We expect to load it with millions of documents (varying from 2 to 20
million). There might be around 1000 concurrent users.

Your help in this regard will be appreciated.

Regards,
Eswar


Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley

Eswar K wrote:

We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.

This email thread we are interacting in can be used to illustrate the
concept of near dupes (we are not confusing them with threads; they are two
different things). Each email in this thread is treated as a document by the
system. A reply to the original mail also includes the original mail, in
which case it becomes a near duplicate of the original mail (depending on
the percentage of similarity). Similarly it goes on. The near dupes need not
be limited to emails.

If we want to have such a capability using Solr, can we use
MoreLikeThisHandler, or is there another appropriate handler in Solr that
we can use? What is the best way to achieve such functionality?



Mess around with the MoreLikeThisHandler and see if it gives you what you 
are looking for.


Check:
http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field 
(email) is in the mlt.fl param.  Perhaps: mlt.fl=type,content
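
For instance, assuming the handler is registered at /mlt in solrconfig.xml 
and your documents have a unique 'id' field (both assumptions), a request 
could look like:

http://localhost:8983/solr/mlt?q=id:1234&mlt.fl=type,content&mlt.mindf=1&mlt.mintf=1

(mlt.mindf and mlt.mintf just lower the default frequency cutoffs so that 
small test indexes still return matches.)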


Finding all possible synonyms for a word

2007-11-18 Thread Kishore AVK. Veleti
Hi All,

I am new to Lucene / SOLR and am developing a POC as part of research. Below 
are my requirement and problem statement. I need help on how to index the 
data such that I get very good search functionality in my POC.

--
Requirement:
--

Assume my web application is an online book store and it sells all categories 
of books, like Computers, Social Studies, Physical Sciences, etc. Each of 
these categories has sub-categories. For example, Computers has 
sub-categories like Software Engineering, Java, SQL Server, etc.

I have a database table called Categories and it contains both Parent Category 
descriptions and also Child Category descriptions.

Data structure of Category table is:

Category_ID_Primay_Key  integer
Parent_Category_ID  integer
Category_Name varchar(100)
Category_Description varchar(1000)


--
My Search UI:
--

My search page is very simple. We have a text field with a Search button.

--
User Action:
--

The user enters the below search text in the above text field and clicks the Search button.

Books on Data Center

--
What is my expected behavior:
--

Since the phrase "Data Center" is more relevant to computers, I should show 
books related to computers.

--
My Problem statement and Question to you all:
--

To have better search in my web application, what kind of strategy should I 
follow, and how should I index the data accordingly in SOLR/Lucene?

In my Lucene index I may or may not have the words "data center". Still, I 
should be able to return results for "data center".

One thought I have is as follows:

Modify the Category table by adding one more column to it:

Category_ID_Primay_Key  integer
Parent_Category_ID  integer
Category_Name varchar(100)
Category_Description varchar(1000)
Category_Description_Keywords varchar(8000)

Now take each word in Category_Description, find its synonyms, and store 
that data in the Category_Description_Keywords column. After doing that, 
index the Category table records in SOLR/Lucene.

Below are my questions to you all:

Question 1:
I need your feedback on the above approach, or any other approach that would 
help me make my search better and return the most relevant results to the 
user.

Question 2:
Can you suggest the best Java-based open source or commercial synonym 
engines? I want a synonym engine that gives me all possible synonyms of a 
word.



Thanks in Advance,
Kishore Veleti A.V.K.


RE: Query multiple fields

2007-11-18 Thread Stu Hood
 q=description:(test)&!(type:10)&!(type:14)

You can't use an '&' symbol in your query (without escaping it). The boolean 
operator for 'and' in Lucene is 'AND', and it is case sensitive. Your query 
should probably look like:

 q=description:test AND -type:10 AND -type:14

See the Lucene query syntax here:

http://lucene.apache.org/java/docs/queryparsersyntax.html#Boolean%20operators
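
Over HTTP, the spaces in that query also need URL escaping, so the raw 
query string would look something like:

 q=description:test%20AND%20-type:10%20AND%20-type:14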

Thanks,
Stu


-Original Message-
From: Dave C. [EMAIL PROTECTED]
Sent: Sunday, November 18, 2007 1:50am
To: solr-user@lucene.apache.org
Subject: RE: Query multiple fields

Hi Nick,

Maybe you can help me with this related problem I am having.
My query is: q=description:(test)&!(type:10)&!(type:14).

However, my results are not as expected (55 results instead of the expected 23)

The response header shows: 
"responseHeader":{
  "status":0,
  "QTime":1,
  "params":{
"wt":"json",
"!(type:10)":"",
"!(type:14)":"",
"indent":"on",
"q":"description:(test)",
"fl":"*"}},

I am confused about why the !(type:10)&!(type:14) is not in the 'q' 
parameter.

Any ideas?

Thanks,
David


 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: RE: Query multiple fields
 Date: Sun, 18 Nov 2007 03:18:12 +
 
 oh, awesome thanks
 
 -david
 
 
 
  Date: Sun, 18 Nov 2007 15:24:00 +1300
  From: [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Subject: Re: Query multiple fields
  
  Hi David
  You had it right in your example :)
  
  description:test AND type:10
  
  But it would probably be wise to wrap any text in parentheses:
  
  description:(test foo bar baz) AND type:10
  
  You can find more info on the query syntax here:
  http://lucene.apache.org/java/docs/queryparsersyntax.html
  -Nick
  On 11/18/07, Dave C. [EMAIL PROTECTED] wrote:
   Hello,
  
   I've been trying to figure out how to query multiple fields at a time.
   For example, I want to do something like: description:test AND type:10.
    I've tried things like: ?q=description:test&type:10 etc, but I keep 
    getting syntax errors.
  
   Can anyone tell me how this can be done?
  
   Thanks,
   David
  
   P.S. Perhaps the solution to this could/should be added to the 
   FAQ/tutorial?
  
   _
   You keep typing, we keep giving. Download Messenger and join the i'm 
   Initiative now.
   http://im.live.com/messenger/im/home/?source=TAGLM
 
 _
 You keep typing, we keep giving. Download Messenger and join the i’m 
 Initiative now.
 http://im.live.com/messenger/im/home/?source=TAGLM

_
You keep typing, we keep giving. Download Messenger and join the i’m Initiative 
now.
http://im.live.com/messenger/im/home/?source=TAGLM



Re: Payloads in Solr

2007-11-18 Thread Tricia Williams

Thanks for your comments, Yonik!

All for it... depending on what one means by payload functionality of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

  
It sounds like Lucene 2.3 is going to be released soonish 
(http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605).  As 
best I can tell it will include the Payload stuff marked experimental.  
The new Lucene version will have many improvements besides Payloads 
which would benefit Solr (examples galore in CHANGES.txt 
http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).  
So I find it hard to believe that the new release will not be included.  
I recognize that the experimental status would be worrisome.  What will 
it take to get Payloads to the place where they would be accepted for use 
in the Solr community?  You probably know more about the projected 
changes to the API than I.  Care to fill me in or suggest who I should 
ask?  On the [EMAIL PROTECTED] list Grant Ingersoll 
suggested that the Payload object would be done away with and the API 
would just deal with byte arrays directly.

That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?

  
I agree that is a lot of data to associate with every token - especially 
since the data is repetitive in nature.  Erik Hatcher suggested I store 
a representation of the structure of the document in a separate field, 
store a numeric representation of the mapping of the token to the 
structure as the payload for each token, and do a lookup at query time 
based on the numeric mapping in the payload at the position hit to get 
the structure/context back for the token.
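
As a rough sketch of that suggestion (this assumes the Lucene 2.2 Payload 
API; the structureId mapping is made up):

  // encode a numeric structure mapping as a 4-byte payload
  int structureId = 42;  // hypothetical index into the stored structure field
  byte[] bytes = new byte[] {
      (byte)(structureId >>> 24), (byte)(structureId >>> 16),
      (byte)(structureId >>> 8),  (byte) structureId };
  token.setPayload(new org.apache.lucene.index.Payload(bytes));

At query time the four bytes would be decoded back into the id and looked 
up against the stored structure field.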


I'm also wondering how others have accomplished this.  Grant Ingersoll 
noted that one of the original use cases was XPath queries so I'm 
particularly interested in finding out if anyone has implemented that, 
and how.

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

  
I suppose that it is only fair to take this on a case by case basis.  
Maybe we will have to write new TokenFilters for each Tokenizer that 
uses Payloads (but I sure hope not!).  Maybe we can build some optional 
configuration options into the TokenFilter constructor that guide their 
behavior with regard to Payloads.  Maybe there is something stored in 
the TokenStream that dictates how the Payloads are handled by the 
TokenFilters.  Maybe there is no case where identical payloads would not 
be created for new tokens and we can just change the TokenFilter to deal 
with payloads directly in a uniform way.


Tricia


solrj users -- API feedback, suggestions, etc

2007-11-18 Thread Ryan McKinley

Hello-

Solrj has been out there for a while, but is not yet baked into an 
official release.  If there is anything major to change just so it feels 
better, now is the time.  Here are a few things I'm thinking about:


1. The setFields() behavior
Currently:
   query.setFields( "name,id" );
generates:
   fl=name,id

while:
   query.setFields( "name", "id" );
generates:
   fl=name&fl=id  (undefined behavior, it will probably just use 'name')
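
One option (a sketch of a possible varargs signature, not the current API) 
would make the two call styles equivalent:

  public void setFields(String... fields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(fields[i]);
    }
    setParam("fl", sb.toString());  // setParam is a stand-in for the real setter
  }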


2. Though maybe it's just because I'm looking at it, but request & response 
are split into two packages when it seems like the request/response pair 
should sit next to each other.


3. Interface vs abstract super class?  I know interfaces are an OO 
standard, but I have found they are a pain to maintain across releases 
(you can't add a function to an interface without breaking existing 
implementations).  Perhaps we should convert the interfaces to abstract 
super classes where possible.



Other thoughts?

ryan


Re: Payloads, Tokenizers, and Filters. Oh My!

2007-11-18 Thread Tricia Williams
I apologize for cross-posting but  I believe both Solr and Lucene users 
and developers should be concerned with this.  I am not aware of a 
better way to reach both communities.


In this email I'm looking for comments on:

   * Do TokenFilters belong in the Solr code base at all?
   * How to deal with TokenFilters that add new Tokens to the stream?
   * How to patch TokenFilters and Tokenizers using the model of
 LUCENE-969 in the Solr code base and in Lucene contrib?

Earlier in this thread I identified that at least one TokenFilter is 
eating Payloads (WordDelimiterFilter).


Yonik pointed out:

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

And I responded: 
I suppose that it is only fair to take this on a case by case basis.  
Maybe we will have to write new TokenFilters for each Tokenizer that 
uses Payloads (but I sure hope not!).  Maybe we can build some 
optional configuration options into the TokenFilter constructor that 
guide their behavior with regard to Payloads.  Maybe there is 
something stored in the TokenStream that dictates how the Payloads are 
handled by the TokenFilters.  Maybe there is no case where identical 
payloads would not be created for new tokens and we can just change 
the TokenFilter to deal with payloads directly in a uniform way. 


I thought it might be useful to figure out which existing TokenFilters 
need to know about Payloads.  To this end I have taken an inventory of 
the TokenFilters out there.  I think it is fair to categorize them by 
Add (A), Delete (D), Modify (M), Observe (O):


org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M

Some characteristics of Add (A), Delete (D), Modify (M), Observe (O):
Add: new Token() and a buffer of Tokens to consider before addressing 
input.next()

Delete: loop ignoring tokens based on some criteria
Modify: new Token(), or use of Token set methods
Observe: rare; CachingTokenFilter

The categories of TokenFilters that are affected by Payloads are add and 
modify.  The default behavior of TokenFilters which only delete or 
observe is to return the Token fed through intact; hence the Payload will 
remain intact.
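
To make the concern concrete, here is a minimal sketch of a "modify" 
filter that preserves payloads, written against the Lucene 2.2-era 
TokenFilter API (the lower-casing transform is just a stand-in):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class PayloadPreservingFilter extends TokenFilter {
    public PayloadPreservingFilter(TokenStream input) { super(input); }

    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      // build the replacement token, keeping offsets, type, and position
      Token out = new Token(t.termText().toLowerCase(),
                            t.startOffset(), t.endOffset(), t.type());
      out.setPositionIncrement(t.getPositionIncrement());
      out.setPayload(t.getPayload()); // the step many filters currently skip
      return out;
    }
  }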


Maybe the Lucene community has thought about this problem?  I noticed 
that the org.apache.lucene.analysis TokenFilters in the modify category 
(there are none in the add category) refrain from using new Token().  
That led me to the comment in the JavaDocs:


NOTE: As of 2.3, Token stores the term text internally as a 
malleable char[] termBuffer instead of String termText. The indexing 
code and core tokenizers have been changed to re-use a single Token 
instance, changing its buffer and other fields in-place as the Token 
is processed. This provides substantially better indexing performance 
as it saves the GC cost of new'ing a Token and String for every term. 
The APIs that accept String termText are still available but a warning 
about the associated performance cost has been added (below). The 
termText() method has been deprecated.
(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#termText%28%29)


Tokenizers and filters should try to re-use a Token instance when 
possible for best performance, by implementing the 
TokenStream.next(Token) API.

Re: Query multiple fields

2007-11-18 Thread Yonik Seeley
On Nov 18, 2007 9:58 PM, Dave C. [EMAIL PROTECTED] wrote:
 According to the Lucene query syntax:
 The symbol && can be used in place of the word AND.   So, I shouldn't have 
 to use 'AND'.

Yes, but before the query parser can even get the query string, the
servlet container parses query args and & is a delimiter.  Hence you
need to escape '&' for the sake of the servlet container.

 If I do the same query: q=description:(test)&!(type:10)&!(type:14) in the 
 Solr admin interface, I get the correct results.

Right, because the browser knows to escape '&' for you.

You need to escape '&' as %26, since that is how URL escaping works
(it has nothing to do with lucene syntax.)
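
A quick sketch of doing that escaping from Java (java.net.URLEncoder 
produces application/x-www-form-urlencoded output, which is what the 
servlet container expects for query args; the URL is the stock example 
install):

  String q = "description:(test) && !(type:10) && !(type:14)";
  // encode(..., "UTF-8") declares UnsupportedEncodingException,
  // which cannot actually happen for UTF-8
  String url = "http://localhost:8983/solr/select?q="
             + java.net.URLEncoder.encode(q, "UTF-8");  // '&' becomes %26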

-Yonik


RE: Query multiple fields

2007-11-18 Thread Dave C .
okay thanks for the details

- David



 Date: Sun, 18 Nov 2007 22:14:23 -0500
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Re: Query multiple fields
 
 On Nov 18, 2007 9:58 PM, Dave C. [EMAIL PROTECTED] wrote:
  According to the Lucene query syntax:
  The symbol && can be used in place of the word AND.   So, I shouldn't 
  have to use 'AND'.
 
  Yes, but before the query parser can even get the query string, the
  servlet container parses query args and & is a delimiter.  Hence you
  need to escape '&' for the sake of the servlet container.
 
   If I do the same query: q=description:(test)&!(type:10)&!(type:14) in the 
   Solr admin interface, I get the correct results.
 
  Right, because the browser knows to escape '&' for you.
 
  You need to escape '&' as %26, since that is how URL escaping works
  (it has nothing to do with lucene syntax.)
 
 -Yonik

_
Connect and share in new ways with Windows Live.
http://www.windowslive.com/connect.html?ocid=TXT_TAGLM_Wave2_newways_112007

Re: Near Duplicate Documents

2007-11-18 Thread Mike Klaas

On 18-Nov-07, at 8:17 AM, Eswar K wrote:


Is there any plan to implement that feature in an upcoming release?


Not currently.  Feel free to contribute something if you find a good  
solution <g>.


-Mike



On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:


On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:

We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.

This email thread we are interacting in can be used to illustrate the
concept of near dupes (we are not confusing them with threads; they are
two different things). Each email in this thread is treated as a document
by the system. A reply to the original mail also includes the original
mail, in which case it becomes a near duplicate of the original mail
(depending on the percentage of similarity). Similarly it goes on. The
near dupes need not be limited to emails.


I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.

-Stuart





Re: Finding all possible synonyms for a word

2007-11-18 Thread Eswar K
Kishore,

Solr has a SynonymFilterFactory which might be of use to you (
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46)
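
A sketch of what that could look like in your schema.xml analyzer chain 
(index-time expansion; the synonyms.txt entries below are only illustrative):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>

and in synonyms.txt:

  data center, datacenter, server farm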


Regards,
Eswar

On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti [EMAIL PROTECTED]
wrote:

 Hi All,

 I am new to Lucene / SOLR and am developing a POC as part of research.
 Below are my requirement and problem statement. I need help on how to
 index the data such that I get very good search functionality in my POC.

 --
 Requirement:
 --

 Assume my web application is an online book store and it sells all
 categories of books, like Computers, Social Studies, Physical Sciences,
 etc. Each of these categories has sub-categories. For example, Computers
 has sub-categories like Software Engineering, Java, SQL Server, etc.

 I have a database table called Categories and it contains both Parent
 Category descriptions and also Child Category descriptions.

 Data structure of Category table is:

 Category_ID_Primay_Key  integer
 Parent_Category_ID  integer
 Category_Name varchar(100)
 Category_Description varchar(1000)


 --
 My Search UI:
 --

 My search page is very simple. We have a text field with a Search button.

 --
 User Action:
 --

 The user enters the below search text in the above text field and clicks
 the Search button.

 Books on Data Center

 --
 What is my expected behavior:
 --

 Since the phrase "Data Center" is more relevant to computers, I should
 show books related to computers.

 --
 My Problem statement and Question to you all:
 --

 To have better search in my web application, what kind of strategy
 should I follow, and how should I index the data accordingly in
 SOLR/Lucene?

 In my Lucene index I may or may not have the words "data center". Still,
 I should be able to return results for "data center".

 One thought I have is as follows:

 Modify the Category table by adding one more column to it:

 Category_ID_Primay_Key  integer
 Parent_Category_ID  integer
 Category_Name varchar(100)
 Category_Description varchar(1000)
 Category_Description_Keywords varchar(8000)

 Now take each word in Category_Description, find its synonyms, and store
 that data in the Category_Description_Keywords column. After doing that,
 index the Category table records in SOLR/Lucene.

 Below are my questions to you all:

 Question 1:
 I need your feedback on the above approach, or any other approach that
 would help me make my search better and return the most relevant results
 to the user.

 Question 2:
 Can you suggest the best Java-based open source or commercial synonym
 engines? I want a synonym engine that gives me all possible synonyms of
 a word.



 Thanks in Advance,
 Kishore Veleti A.V.K.



Re: multiple delete by id in one delete command?

2007-11-18 Thread climbingrose
The easiest solution I know is:
<delete><query>id:1 OR id:2 OR ...</query></delete>
If you know that all of these ids can be found by issuing a query, you
can do a delete by query:
<delete><query>YOUR_DELETE_QUERY_HERE</query></delete>
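
For example, via curl against the stock example install (adjust the update
URL to yours), followed by a commit as usual:

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
  --data-binary '<delete><query>id:(1 OR 2 OR 3)</query></delete>'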

Cheers

On Nov 19, 2007 4:18 PM, Norberto Meijome [EMAIL PROTECTED] wrote:
 Hi everyone,

 I'm trying to issue, via curl to SOLR (testing at the moment), 3 deletes by 
 id.
 I tried sending:

 <delete><id>1</id><id>2</id><id>3</id></delete>

 and solr didn't like it at all.

 When I changed it to :

 <delete><id>1</id></delete><delete><id>2</id></delete><delete><id>3</id></delete>

 as in :

 curl http://localhost:8983/vcs/update -H "Content-Type: text/xml" 
 --data-binary 
 '<delete><id>816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3</id></delete><delete><id>53f3f80e65482a5be353e7110f5308949d51dfa93dbe3c1eca169edd19b3</id></delete>'

 only the 1st (id = 1, or id = 
 816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3) gets deleted 
 (after a commit, of course).

 So I figure I will have to issue a series of independent 
 <delete><id>xxx</id></delete> commands... Is it not possible to bunch them 
 all together as it's possible with <add><doc>..</doc><doc>...</doc></add> ?


 thanks!!
 Beto
 _
 {Beto|Norberto|Numard} Meijome

 Imagination is more important than knowledge.
   Albert Einstein, On Science

 I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
 Reading disclaimers makes you go blind. Writing them is worse. You have been 
 Warned.




-- 
Regards,

Cuong Hoang


RE: I18N with SOLR?

2007-11-18 Thread Dilip.TS
  Hello,
  Does SOLR support searching for a keyword which has a
combination of more than one language within the same search page?



  -Original Message-
  From: Guglielmo Celata [mailto:[EMAIL PROTECTED]
  Sent: Thursday, November 15, 2007 7:39 PM
  To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
  Subject: Re: I18N with SOLR?


  Hi Dilip,
  I don't know if this helps, but I have set up a textIt field in the
config/schema.xml file, in order to index Italian text.
  It works pretty well with non-ASCII characters (we do have some accented
vowels, even if not as many as the French).
  It also works with stopwords (and I assume with protwords as well, though
I didn't try). I created an italian-stopwords.txt file in the config/ path.
  I think the SnowballPorterFilterFactory is a default usable class in Solr,
although I remember having read it's a bit slower than other libraries.
  But I am no expert.


  <fieldtype name="textIt" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory"
        words="italian-stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory"
        language="Italian"/>
    </analyzer>
  </fieldtype>



  On 15/11/2007, Dilip.TS [EMAIL PROTECTED] wrote:
Hi Ed,
  Thanks for the help, but I have some queries.
  I understand that we need to have stopwords_french.txt and
protwords_french.txt files, say for French, in the solr/conf directory.
  Is it like we need to write classes like FrenchStopFilterFactory and
FrenchPorterFilterFactory for each language,
  or do we have these classes built into Solr? I didn't find them in the
SOLR/Lucene APIs.
  I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer
etc., in lucene-analyzers.jar.
  Any idea what this class is used for?

Thanks in advance,

Regards
Dilip

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Ed
Summers
Sent: Monday, November 12, 2007 7:00 PM
To: solr-user@lucene.apache.org ; [EMAIL PROTECTED]
Subject: Re: I18N with SOLR?


I'd say yes. Solr supports Unicode and ships with language specific
analyzers, and allows you to provide your own custom analyzers if you
need them. This allows you to create different fieldType definitions
for the languages you want to support. For example here is an example
field type for French text which uses a French stopword list and
French stemming.

 <fieldType
   name="text_french"
   class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter
       class="solr.FrenchStopFilterFactory"
       ignoreCase="true"
       words="stopwords_french.txt" />
     <filter
       class="solr.FrenchPorterFilterFactory"
       protected="protwords_french.txt" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
   </analyzer>
 </fieldType>

Then you can create dynamicField definitions that allow you to
index and query your documents using the correct field type:

 <dynamicField
   name="*_french"
   type="text_french"
   indexed="true"
   stored="true"/>

This means that when you index you need to know what language your
data is in so that you know what field names to use in your document
(e.g. title_french). And at search time you need to know what language
you are in so you know which fields to search.  Most user interfaces
are in a single language context so from the query perspective you'll
most likely know the language they want to search in. If you don't
know the language context in either case you could try to guess using
something like org.apache.nutch.analysis.lang.LanguageIdentifier.

I hope this helps. We used this technique (without the guessing) quite
effectively at the Library of Congress recently for a prototype
application that needed to provide search functionality in 7 different
languages.

//Ed

On Nov 12, 2007 1:56 AM, Dilip.TS  [EMAIL PROTECTED] wrote:
 Hello,

   Does SOLR support I18N (with multiple language support)?
   Thanks in advance.

 Regards,
 Dilip TS







RE: I18N with SOLR?

2007-11-18 Thread Dilip.TS
   Hello,

  Also, can we have something like this? I.e., having multiple
defaultSearchField entries in the schema.xml while searching for a keyword
which has a combination of more than one language:

  <defaultSearchField>text</defaultSearchField>
  <defaultSearchField>text_french</defaultSearchField>...
  -Original Message-
  From: Dilip.TS [mailto:[EMAIL PROTECTED]
  Sent: Monday, November 19, 2007 11:29 AM
  To: solr-user@lucene.apache.org
  Subject: RE: I18N with SOLR?


Hello,
 Does SOLR support searching for a keyword which has a
combination of more than one language within the same search page?



-Original Message-
From: Guglielmo Celata [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 15, 2007 7:39 PM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: I18N with SOLR?


 Hi Dilip,
 I don't know if this helps, but I have set up a textIt field in the
config/schema.xml file, in order to index Italian text.
 It works pretty well with non-ASCII characters (we do have some accented
vowels, even if not as many as the French).
 It also works with stopwords (and I assume with protwords as well,
though I didn't try). I created an italian-stopwords.txt file in the config/
path.
 I think the SnowballPorterFilterFactory is a default usable class in
Solr, although I remember having read it's a bit slower than other
libraries.
 But I am no expert.


 <fieldtype name="textIt" class="solr.TextField"
   positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.ISOLatin1AccentFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
       generateWordParts="1" generateNumberParts="1" catenateWords="1"
       catenateNumbers="1" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory"
       words="italian-stopwords.txt" ignoreCase="true"/>
     <filter class="solr.SnowballPorterFilterFactory"
       language="Italian"/>
   </analyzer>
 </fieldtype>



On 15/11/2007, Dilip.TS [EMAIL PROTECTED] wrote:
  Hi Ed,
    Thanks for the help, but I have some queries.
    I understand that we need to have stopwords_french.txt and
  protwords_french.txt files, say for French, in the solr/conf directory.
    Is it like we need to write classes like FrenchStopFilterFactory and
  FrenchPorterFilterFactory for each language,
    or do we have these classes built into Solr? I didn't find them in
  the SOLR/Lucene APIs.
    I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer
  etc., in lucene-analyzers.jar.
    Any idea what this class is used for?

  Thanks in advance,

  Regards
  Dilip

  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of
Ed
  Summers
  Sent: Monday, November 12, 2007 7:00 PM
  To: solr-user@lucene.apache.org ; [EMAIL PROTECTED]
  Subject: Re: I18N with SOLR?


  I'd say yes. Solr supports Unicode and ships with language specific
  analyzers, and allows you to provide your own custom analyzers if you
  need them. This allows you to create different fieldType definitions
  for the languages you want to support. For example here is an example
  field type for French text which uses a French stopword list and
  French stemming.

   <fieldType
     name="text_french"
     class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter
         class="solr.FrenchStopFilterFactory"
         ignoreCase="true"
         words="stopwords_french.txt" />
       <filter
         class="solr.FrenchPorterFilterFactory"
         protected="protwords_french.txt" />
       <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
     </analyzer>
   </fieldType>

  Then you can create a dynamicField definitions that allow you to
  index and query your documents using the correct field type:

   <dynamicField
     name="*_french"
     type="text_french"
     indexed="true"
     stored="true"/>

  This means that when you index you need to know what language your
  data is in so that you know what field names to use in your document
  (e.g. title_french). And at search time you need to know what language
  you are in so you know which fields to search.  Most user interfaces
  are in a single language context so from the query perspective you'll
  most likely know the language they want to search in. If you don't
  know the language context in either case you could try to guess using
  something like org.apache.nutch.analysis.lang.LanguageIdentifier.

  I hope this helps. We used this technique (without the guessing) quite
  

Re: multiple delete by id in one delete command?

2007-11-18 Thread Norberto Meijome
On Mon, 19 Nov 2007 16:53:17 +1100
climbingrose [EMAIL PROTECTED] wrote:

  The easiest solution I know is:
  <delete><query>id:1 OR id:2 OR ...</query></delete>
  If you know that all of these ids can be found by issuing a query, you
  can do a delete by query:
  <delete><query>YOUR_DELETE_QUERY_HERE</query></delete>

thanks, so I'm not going nuts (at least not due to this :) )

I may just change the way I handle deletes ...

thanks,
B

_
{Beto|Norberto|Numard} Meijome

I've had a perfectly wonderful evening. But, this wasn't it.
  Groucho Marx

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.