Facing problem with the FieldType of UniqueField

2008-02-14 Thread Rishabh Joshi
Hi,

Initially, in my schema, my uniqueKey field's type was string and
everything worked fine. But then the users of my application wanted to
search on the unique field, and they entered values in a different case
than what was indexed. They never got proper results, and at times no results at all.

I noticed this happened because the field type was string. I then changed
it to a custom text type that specified only the whitespace tokenizer and
the lowercase filter. It worked: the users were able to search on the
uniqueKey field irrespective of the case in which the values were entered.
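
For reference, the custom type is essentially this (a schema.xml sketch; 'text_lc' is just my name for it):

<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>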

But now another problem has arisen. If I update a document, it is not
really updated; it is added as a separate document with the same uniqueKey
value, and the contents of the old document are merged with the new one.

So, what I now want to achieve is that users should be able to search on
the uniqueKey field, irrespective of the case in which the values are entered,
and that updating a document should not leave duplicate documents (documents
with the same uniqueKey value) in the index. Can anyone help me with how this
can be done?
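
One idea I have been toying with (just a guess on my part; the field names are made up) is to keep the uniqueKey field a plain string and copy it into a separate lowercased field (of the text_lc type above) used only for searching:

<field name="id" type="string" indexed="true" stored="true"/>
<field name="id_search" type="text_lc" indexed="true" stored="false"/>
<copyField source="id" dest="id_search"/>

But I do not know whether that is the recommended approach.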

Regards,
Rishabh


Re: Facing problem with the FieldType of UniqueField

2008-02-14 Thread Rishabh Joshi
Ryan,

Using the KeywordTokenizer does not help. And there are no spaces in
the unique keys; the keys are alphanumeric, e.g. AA-23-E1.
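
What I tried is roughly this (a sketch of the field type):

<fieldType name="key_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>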

Regards,
Rishabh

On Thu, Feb 14, 2008 at 10:28 PM, Ryan McKinley [EMAIL PROTECTED] wrote:


  I noticed this happened because the field type was string. I then
 changed
  it to a custom text type and had specified only the whitespace tokenizer
 and
  lowercase filter. It worked. The users were able to search on the

 are there spaces in your unique key?  Try using the KeywordTokenizer --
 the main concern with the field type for uniqueKey is to make sure it
 only has one token.

 ryan



Restrict values in a multivalued field

2008-01-12 Thread Rishabh Joshi
Hi,

In my schema I have a multivalued field whose values are stored and
indexed. I wanted to know if it is possible to restrict the number of values
returned from that field on a search, and if so, how? Let's say the
multivalued field holds thousands of values; returning all of them would put
a lot of load on the system, so I want to restrict the response to only,
say, 50 values out of the thousands.

Regards,
Rishabh


How to perform a double query in one

2008-01-02 Thread Rishabh Joshi
Hi,

Is there a way to perform two search queries in one search request and then
return their combined results?

Currently I am performing the following:

I have documents consisting of an id field, which is the unique
identifier; an info field; and an xid field, which contains the ids of the
other documents (also indexed) that it relates to. The xid field thus
provides a mapping between two or more documents.
Now, in my first search, I look a document up by a given id, say XYZ, on
the id field. This gives me exactly one document, and I retrieve the
contents of its xid field.
Then I search for the same id, XYZ, in the xid field and retrieve the xid
contents of all the matching documents.
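
To illustrate, the two requests look something like this (the host and the id value XYZ are placeholders):

http://localhost:8080/solr/select?q=id:XYZ
http://localhost:8080/solr/select?q=xid:XYZ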

Can I perform the same operation in one query? If yes, how do I go about it?
Do I need to write a custom request handler? If not, is there any other
efficient way to do the same?


Regards,
Rishabh


Retrieving Tokens

2007-12-19 Thread Rishabh Joshi
Hi,

I have created my own Tokenizer and I am indexing documents with it.

I wanted to know if there is a way to retrieve the tokens (created by my
custom tokenizer) from the index, or do we have to modify the code to get
at these tokens?
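
Something like listing a field's indexed terms through faceting might be close to what I am after, if I understand faceting right ('content' here is a made-up field name):

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=content&facet.limit=50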

Regards,
Rishabh


Creating user-defined field types

2007-12-11 Thread Rishabh Joshi
Hi,

Can anyone guide me as to how one can go about implementing a user-defined
field type in Solr? I could not find anything on the Solr wiki. Help of any
kind would be appreciated.
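
To frame the question: so far, all I know of is combining existing tokenizers and filters into a new type in schema.xml, along these lines (sketch):

<fieldType name="myCustomType" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Is there also a way to plug in one's own field type class?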

Regards,
Rishabh


How to store a HashSet in the index?

2007-12-10 Thread Rishabh Joshi
Hi,

Can anyone help me with how I can go about efficiently storing in the
index, and retrieving, a HashSet object which contains multiple string
arrays?
I just want to store the HashSet in the index, not search on it. The
HashSet should be returned with the document when I perform a search on any
other fields.

Regards,
Rishabh


Re: How to store a HashSet in the index?

2007-12-10 Thread Rishabh Joshi
Thanks Erik!
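
For the archives, here is a rough sketch of the round trip you describe below (my own illustration; java.util.Base64 standing in for uuencoding):

import java.io.*;
import java.util.Base64;
import java.util.HashSet;

public class HashSetFieldCodec {
    // Serialize the HashSet and stringify it, giving a value that can be
    // stored in an untokenized, stored Solr field.
    public static String encode(HashSet<String[]> set) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(set);
        out.close();
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Deserialize the stored field value back into the original HashSet.
    @SuppressWarnings("unchecked")
    public static HashSet<String[]> decode(String stored)
            throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(stored);
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
        return (HashSet<String[]>) in.readObject();
    }
}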

Rishabh

On Dec 10, 2007 3:30 PM, Erik Hatcher [EMAIL PROTECTED] wrote:

 On Dec 10, 2007, at 3:10 AM, Rishabh Joshi wrote:
  Can anyone help me on, as to how I can go about efficiently indexing
  (actually, storing in the index) and retrieving, a HashSet object,
  which
  contains multiple string arrays?
  I just want to store the HashSet in the index, and not search on
  it. The
  HashSet should be returned with the document when I perform a
  search on any
  other fields.

 If you have Java on indexing and querying side of things, you could
 simply serialize and stringify (via uuencoding perhaps) the HashSet,
 and deserialize it on retrieval.  Just be sure to set the field to be
 untokenized and stored.

Erik




Re: Strange behavior MoreLikeThis Feature

2007-11-22 Thread Rishabh Joshi
Thanks Ryan. I now know the reason why.
Before I explain it, let me correct a mistake I made in my earlier
mail. I was not using the first document mentioned in the XML. Instead it
was this one:
<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>

The reason I was getting strange results was the character "i".
Here is what I learned from the debug info:

"debug": {
  "rawquerystring": "id:neardup06",
  "querystring": "id:neardup06",
  "parsedquery": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
  "parsedquery_toString": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
  "explain": {
    "id=IW-02,internal_docid=8": "\n0.0050230525 = (MATCH) product of:\n  0.12557632 = (MATCH) sum of:\n    0.12557632 = (MATCH) weight(features:i in 8), product of:\n      0.17474915 = queryWeight(features:i), product of:\n        1.9162908 = idf(docFreq=3)\n        0.09119135 = queryNorm\n      0.71860904 = (MATCH) fieldWeight(features:i in 8), product of:\n        1.0 = tf(termFreq(features:i)=1)\n        1.9162908 = idf(docFreq=3)\n        0.375 = fieldNorm(field=features, doc=8)\n    0.04 = coord(1/25)\n"}}}

The "features" field uses the default field type, "text", from schema.xml.
The problem was solved by adding the character "i" to the stopwords.txt
file: the "i" terms in document 2 were being matched against the "i" in
"iPod" of document 1.

I still have to figure out why the single character "i" matched the "i" in
a word like "iPod". (My current guess: the default "text" type includes
WordDelimiterFilterFactory with splitOnCaseChange, which would break "iPod"
into "i" and "Pod".)

Regards,
Rishabh

On 22/11/2007, Ryan McKinley [EMAIL PROTECTED] wrote:

 
  Now when I run the following query:
 
 http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
 

 try adding:
   debugQuery=on

 to your query string and you can see why each document matches...

 My guess is that features uses a text field with stemming and a
 stemmed word matches

 ryan



Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
Thanks for the info Cuong!
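
For anyone else reading the archive, here is a toy sketch of the minhash idea mentioned below, as I understand it (my own rough illustration, certainly not the patented variant). Each document is reduced to a set of shingles; the signature keeps, for each of several hash functions, the minimum hash over the shingle set, and the fraction of matching signature slots estimates the Jaccard similarity of two shingle sets:

import java.util.*;

public class MinHashSketch {
    private static final int PRIME = 2147483647; // 2^31 - 1, a Mersenne prime
    private final int numHashes;
    private final int[] seedsA, seedsB;

    public MinHashSketch(int numHashes, long seed) {
        this.numHashes = numHashes;
        Random rnd = new Random(seed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = 1 + rnd.nextInt(PRIME - 1); // multiplier must be non-zero
            seedsB[i] = rnd.nextInt(PRIME);
        }
    }

    // Signature slot i = minimum of hash function i over all shingles.
    public int[] signature(Set<String> shingles) {
        int[] sig = new int[numHashes];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String shingle : shingles) {
            int h = shingle.hashCode() & 0x7fffffff;
            for (int i = 0; i < numHashes; i++) {
                int hi = (int) (((long) seedsA[i] * h + seedsB[i]) % PRIME);
                if (hi < sig[i]) sig[i] = hi;
            }
        }
        return sig;
    }

    // Fraction of agreeing slots approximates Jaccard similarity; near-dups
    // are pairs whose estimate exceeds some chosen threshold.
    public static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }
}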

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose [EMAIL PROTECTED] wrote:

 The duplicate-detection mechanism in Nutch is quite primitive. I
 think it uses an MD5 signature generated from the content of a field.
 The generation algorithm is described here:

 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html

 The problem with this approach is that an MD5 hash is very sensitive: a
 one-letter difference will generate a completely different hash. You
 probably have to roll your own near-duplicate detection algorithm.
 My advice is to have a look at the existing literature on near-duplicate
 detection techniques and then implement one of them. I know Google has
 some papers describing a technique called minhash. I read the paper
 and found it very interesting. I'm not sure if you can implement the
 algorithm, because they have patented it. That said, there is plenty of
 literature on near-dup detection, so you should be able to get one for
 free!

 On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
  Otis,
 
  Thanks for your response.
 
  I just took a quick look at the Nutch forum and found that there is an
  implementation for de-duplicating documents/pages, but none for
  near-duplicate documents. Can you guide me a little further as to where
  exactly under Nutch I should be concentrating, regarding near-duplicate
  documents?
 
  Regards,
  Rishabh
 
  On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
  wrote:
 
 
   To whomever started this thread: look at Nutch.  I believe something
   related to this already exists in Nutch for near-duplicate detection.
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
   - Original Message 
   From: Mike Klaas [EMAIL PROTECTED]
   To: solr-user@lucene.apache.org
   Sent: Sunday, November 18, 2007 11:08:38 PM
   Subject: Re: Near Duplicate Documents
  
   On 18-Nov-07, at 8:17 AM, Eswar K wrote:
  
Is there any idea implementing that feature in the up coming
releases?
  
   Not currently.  Feel free to contribute something if you find a good
    solution <g>.
  
   -Mike
  
  
On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED]
 wrote:
   
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
We have a scenario, where we want to find out documents which are
similar in
content. To elaborate a little more on what we mean here, lets
take an
example.
   
The example of this email chain in which we are interacting on,
can be
best
used for illustrating the concept of near dupes (We are not
 getting
confused
with threads, they are two different things.). Each email in this
thread
is
treated as a document by the system. A reply to the original mail
also
includes the original mail in which case it becomes a near
duplicate of
the
orginal mail (depending on the percentage of similarity).
Similarly it
goes
on. The near dupes need not be limited to emails.
   
I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.
   
-Stuart
   
  
  
  
  
  
 



 --
 Regards,

 Cuong Hoang



Re: Performance of Solr on different Platforms

2007-11-20 Thread Rishabh Joshi
Eswar,

This link would give you a fair idea of how Solr is used by some of the
sites/companies -
http://wiki.apache.org/solr/SolrPerformanceData

Rishabh

On Nov 20, 2007 10:49 AM, Eswar K [EMAIL PROTECTED] wrote:

 In our case, the load is kind of distributed. On average, the QPS would
 be much less than that; 1000 qps is the peak load we ever expect to
 reach. However, the number of documents is going to be in the range of
 2-20 million.

 We would possibly distribute the indexes to different solr instances and
 possibly direct it accordingly to reduce the QPS.

 - Eswar

 On Nov 20, 2007 10:42 AM, Walter Underwood [EMAIL PROTECTED] wrote:

  1000 qps is a lot of load, at least 30M queries/day.
 
  We are running dual CPU Power P5 machines and getting about 80 qps
  with worst case response times of 5 seconds. 90% of responses are
  under 70 msec.
 
  Our expected peak load is 300 qps on our back-end Solr farm.
  We execute multiple back-end queries for each query page.
 
  With N+1 sizing (full throughput with one server down), we
  have five servers to do that. We have a separate server
  for indexing and use the Solr distribution scripts.
 
  We have a relatively small index, about 250K docs.
 
  wunder
 
 
  On 11/19/07 8:48 PM, Eswar K [EMAIL PROTECTED] wrote:
 
   Its not going to hit 1000 all the time, its the expected peak value.
  
   I guess for distributing the load we should be using collections and I
  was
   looking at the collections documentation (
   http://wiki.apache.org/solr/CollectionDistribution) .
  
   - Eswar
   On Nov 20, 2007 12:07 AM, Matthew Runo [EMAIL PROTECTED] wrote:
  
   I'd think that any platform that can run Java would be fine to run
   SOLR on. Maybe this is more a question of preferred platforms for
 Java
   deployments? That is quite the load for SOLR though, you may find
 that
   you want more than one server.
  
   Do you mean that you're expecting about 1000 QPS over an index with
 up
   to 20 million documents?
  
   --Matthew
  
   On Nov 19, 2007, at 6:00 AM, Eswar K wrote:
  
   All,
  
   Can you give some information on this or atleast let me know where I
   can
   find this information if its already listed out anywhere.
  
   Regards,
   Eswar
  
   On Nov 18, 2007 9:45 PM, Eswar K [EMAIL PROTECTED] wrote:
  
   Hi,
  
   I understand that Solr can be used on different Linux flavors. Is
   there
   any preferred flavor (Like Red Hat, Ubuntu, etc)?
   Also what is the kind of configuration of hardware (Processors,
   RAM, etc)
   be best suited for the install?
   We expect to load it with millions of documents (varying from 2 -
 20
   million). There might be around 1000 concurrent users.
  
   Your help in this regard will be appreciated.
  
   Regards,
   Eswar
  
  
  
  
 
 



rows=VERY_LARGE_VALUE throws exception, and error in some cases

2007-11-20 Thread Rishabh Joshi
Hi,

We are using Solr 1.2 for our project and have come across the following
exception and error:

Exception:
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36)

Steps to reproduce:
1. Restart your Web Server.
2. Enter a query with VERY_LARGE_VALUE in the rows parameter. For example:
http://xx.xx.xx.xx:8080/solr/select?q=unix&start=0&fl=id&indent=off&rows=99999999
3. Press enter or click on the 'Go' button on the browser.

NOTE:
1. This exception is thrown if '9999999' (seven digits) <= VERY_LARGE_VALUE
<= '999999999' (nine digits).
2. The exception DOES NOT APPEAR AGAIN if we change VERY_LARGE_VALUE to
'999999' or less, execute the query, and then change VERY_LARGE_VALUE back
to its original value and execute the query again.
3. If VERY_LARGE_VALUE = '9999999999' (ten digits) we get the following
error:

Error:
HTTP Status 400 - For input string: 9999999999

Has anyone come across this scenario before?

Regards,
Rishabh


Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
Otis,

Thanks for your response.

I just took a quick look at the Nutch forum and found that there is an
implementation for de-duplicating documents/pages, but none for
near-duplicate documents. Can you guide me a little further as to where
exactly under Nutch I should be concentrating, regarding near-duplicate
documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 To whomever started this thread: look at Nutch.  I believe something
 related to this already exists in Nutch for near-duplicate detection.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Sunday, November 18, 2007 11:08:38 PM
 Subject: Re: Near Duplicate Documents

 On 18-Nov-07, at 8:17 AM, Eswar K wrote:

  Is there any idea implementing that feature in the up coming
  releases?

 Not currently.  Feel free to contribute something if you find a good
 solution <g>.

 -Mike


  On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
 
  On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
  We have a scenario, where we want to find out documents which are
  similar in
  content. To elaborate a little more on what we mean here, lets
  take an
  example.
 
  The example of this email chain in which we are interacting on,
  can be
  best
  used for illustrating the concept of near dupes (We are not getting
  confused
  with threads, they are two different things.). Each email in this
  thread
  is
  treated as a document by the system. A reply to the original mail
  also
  includes the original mail in which case it becomes a near
  duplicate of
  the
  orginal mail (depending on the percentage of similarity).
  Similarly it
  goes
  on. The near dupes need not be limited to emails.
 
  I think this is what's known as shingling.  See
  http://en.wikipedia.org/wiki/W-shingling
  Lucene (and therefore Solr) does not implement shingling.  The
  MoreLikeThis query might be close enough, however.
 
  -Stuart
 







Near Duplicate Documents

2007-11-16 Thread Rishabh Joshi
Hi,

I am evaluating Solr 1.2 for my project and wanted to know if it can
return near-duplicate documents (near dups), and how I would go about it. I
am not sure, but is MoreLikeThisHandler the implementation for near dups?

Rishabh


RE: Best way to create multiple indexes

2007-11-12 Thread Rishabh Joshi

Ryan,

We currently have 8-9 million documents to index, and this number will grow in
the future. Also, we will never have a query that searches across groups,
but we will certainly have queries that search across sub-groups.
Now, keeping this in mind, we were thinking we could have multiple indexes at
the 'group' level at least.
Also, can multiple indexes be created dynamically? For example: in my
application, if I create a 'logical group', then an index should be created for
that group.

Rishabh

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Monday, November 12, 2007 7:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to create multiple indexes

For starters, do you need to be able to search across groups or
sub-groups (in one query?)

If so, then you have to stick everything in one index.

You can add a field to each document saying what 'group' or 'sub-group'
it is in and then limit it at query time

  q=kittens +group:A

The advantage to splitting it into multiple indexes is that you could
put each index on independent hardware.  Depending on your queries and
index size that may make a big difference.
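
For example (a sketch; the field and group names are made up):

  <field name="group" type="string" indexed="true" stored="true"/>
  <field name="subgroup" type="string" indexed="true" stored="true"/>

and then at query time something like q=kittens +group:A +subgroup:A2.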

ryan


Rishabh Joshi wrote:
 Hi,

 I have a requirement and was wondering if someone could help me with how to go 
 about it. We have to index about 8-9 million documents, and their size can be 
 anywhere from a few KBs to a couple of MBs. These documents are categorized 
 into many 'groups' and 'sub-groups'. I wanted to know if we can create 
 multiple indexes based on 'groups' and then on 'sub-groups' in Solr? If yes, 
 then how do we go about it? I tried going through the section on 
 'Collections' in the Solr wiki, but could not make much use of it.


 Regards,
 Rishabh Joshi