truncating indexed docs

2009-04-16 Thread CIF Search
Is it possible to truncate large documents once they are indexed? (Can this
be done without re-indexing?)

Regards,
CI


response time

2009-04-07 Thread CIF Search
Hi,

I have around 10 Solr servers, each running an index of around 80-85 GB
with 16,000,000 docs. When I use distributed search for querying, I am not
getting a satisfactory response time. My response time is around 4-5
seconds. Any suggestions to improve the response time for queries (to bring
it below 1 second)? Is the response slow due to the size of the index? I
have already gone through the pointers provided at:
http://wiki.apache.org/solr/SolrPerformanceFactors

Regards,
CI
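
For reference, a distributed request is just a normal query with a `shards`
parameter listing every core to be merged; a sketch of building one (the host
names and field list here are hypothetical, not taken from the setup above):

```python
# Build a distributed Solr query URL; host names and fields are made up.
from urllib.parse import urlencode

shards = ",".join(f"solr{i}:8983/solr" for i in range(1, 11))  # 10 shards
params = {
    "q": "title:lucene",
    "shards": shards,          # comma-separated shard list for distrib search
    "rows": 10,                # keep page size small; deep paging is costly
    "fl": "id,title,score",    # return only the fields you actually need
}
url = "http://solr1:8983/solr/select?" + urlencode(params)
print(url)
```

Trimming `fl` to the needed fields and keeping `rows` small are the usual
first levers, since every shard response must be fetched and merged.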


Re: custom reranking

2009-04-07 Thread CIF Search
Would it not be a good idea to provide ranking as a Solr plugin, in which
users can write their custom ranking algorithms and reorder the results
returned by Solr in whichever way they need? It would also help Solr users
incorporate learning (from search-user feedback, such as click logs) and
reorder the results returned by Solr accordingly, rather than depending
purely on relevance as we do today.

Regards,
CI
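
A client-side sketch of what such a plugin could do, blending Solr's
relevance score with a click-log signal; the field names, the click data, and
the weighting scheme are all invented for illustration:

```python
import math

def rerank(docs, clicks, alpha=0.7):
    """Blend Solr relevance with a click-log signal (hypothetical weights).

    docs   -- list of dicts from a Solr response, each with 'id' and 'score'
    clicks -- dict mapping doc id -> click count mined from search logs
    """
    def blended(doc):
        # log-damp the click count so popular docs don't swamp relevance
        click_score = math.log1p(clicks.get(doc["id"], 0))
        return alpha * doc["score"] + (1 - alpha) * click_score
    return sorted(docs, key=blended, reverse=True)

docs = [{"id": "a", "score": 1.0}, {"id": "b", "score": 0.9}]
clicks = {"b": 50}
print([d["id"] for d in rerank(docs, clicks)])  # → ['b', 'a']
```

In a real plugin this logic would sit server-side in a search component, but
the blending itself is the same either way.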


On Fri, Feb 27, 2009 at 5:21 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Feb 26, 2009, at 11:16 PM, CIF Search wrote:

  I believe the query component will generate the query in such a way that I
 get the results that I want, but not process the returned results; is that
 correct? Is there a way in which I can group the returned results, rank
 each group separately, and return the results together? In other words,
 which component do I need to write to reorder the returned results as per
 my requirements.


 I'd have a look at what I did for the Clustering patch, i.e. SOLR-769.  It
 may even be the case that you can simply plug in your own SolrClusterer or
 whatever it's called.  Or, if it doesn't quite fit your needs, give me
 feedback/patch and we can update it.  I'm definitely open to ideas on it.




  Also, the deduplication patch seems interesting, but it doesn't appear to
 be expected to work across multiple shards.


 Yeah, that does seem a bit tricky.  Since Solr doesn't support distributed
 indexing, it would be tricky to support just yet.



  Regards,
 CI

 On Thu, Feb 26, 2009 at 8:03 PM, Grant Ingersoll gsing...@apache.org
 wrote:


 On Feb 26, 2009, at 6:04 AM, CIF Search wrote:

 We have a distributed index consisting of several shards. There could be

 some documents repeated across shards. We want to remove the duplicate
 records from the documents returned from the shards, and re-order the
  results by grouping them on the basis of a clustering algorithm and
  reranking the documents within a cluster on the basis of the log of a
  particular returned field value.



 I think you would have to implement your own QueryComponent.  However,
 you
 may be able to get away with implementing/using Solr's FunctionQuery
 capabilities.

 FieldCollapsing is also a likely source of inspiration/help (
 http://www.lucidimagination.com/search/?q=Field+Collapsing#/
 s:email,issues)

 As a side note, have you looked at
 http://issues.apache.org/jira/browse/SOLR-769 ?

  You might also have a look at the de-duplication patch that is working
 its way through dev: http://wiki.apache.org/solr/Deduplication



   How do we go about achieving this? Should we write this logic by
  implementing QueryResponseWriter? Also, if we remove duplicate records,
  the total number of records actually returned is less than what was
  asked for in the query.

 Regards,
 CI


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search



 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search




Re: response time

2009-04-07 Thread CIF Search
Yes, non-cached. If I repeat a query, the response is fast since the results
are cached.

2009/4/7 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@gmail.com

 are these the numbers for non-cached requests?

 On Tue, Apr 7, 2009 at 11:46 AM, CIF Search cifsea...@gmail.com wrote:
  Hi,
 
  I have around 10 Solr servers, each running an index of around 80-85 GB
  with 16,000,000 docs. When I use distributed search for querying, I am
  not getting a satisfactory response time. My response time is around 4-5
  seconds. Any suggestions to improve the response time for queries (to
  bring it below 1 second)? Is the response slow due to the size of the
  index? I have already gone through the pointers provided at:
  http://wiki.apache.org/solr/SolrPerformanceFactors
 
  Regards,
  CI
 



 --
 --Noble Paul



Re: input XSLT

2009-03-13 Thread CIF Search
There is a fundamental problem with the 'pull' approach used by DIH.
Normally people want delta imports, which are done using a timestamp field.
Now it may not always be possible for application servers to sync their
timestamps (given protocol restrictions due to security reasons). Due to
this, the Solr application is likely to miss a few records occasionally.
Such a problem does not arise if applications themselves identify their
records and post them. Should we not have such a feature in Solr, which
would allow users to push data onto the index in whichever format they
wish? This would also facilitate plugging Solr in seamlessly with all kinds
of applications.

Regards,
CI

On Wed, Mar 11, 2009 at 11:52 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

  On Tue, Mar 10, 2009 at 12:17 PM, CIF Search cifsea...@gmail.com wrote:
  Just as you have an XSLT response writer to convert the Solr XML response
  to make it compatible with any application, on the input side do you have
  an XSLT module that will transform XML documents to Solr format before
  posting them to the Solr indexer? I have gone through DataImportHandler,
  but it works in data-'pull' mode, i.e. Solr pulls data from the given
  location. I would still want to work with applications 'posting' documents
  to the Solr indexer as and when they want.
 It is a limitation of DIH, but if you can put your XML in a file behind an
 HTTP server, then you can fire a command to DIH to pull data from the URL
 quite easily.
 
  Regards,
  CI
 



 --
 --Noble Paul



Re: input XSLT

2009-03-13 Thread CIF Search
But these documents have to be converted to a particular format before being
posted; an arbitrary XML document cannot be posted to Solr (with the XSLT
handled by Solr internally).
DIH handles any XML format, but it operates in pull mode.


On Fri, Mar 13, 2009 at 11:45 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Fri, Mar 13, 2009 at 11:36 AM, CIF Search cifsea...@gmail.com wrote:

   There is a fundamental problem with the 'pull' approach used by DIH.
   Normally people want delta imports, which are done using a timestamp
   field. Now it may not always be possible for application servers to sync
   their timestamps (given protocol restrictions due to security reasons).
   Due to this, the Solr application is likely to miss a few records
   occasionally. Such a problem does not arise if applications themselves
   identify their records and post them. Should we not have such a feature
   in Solr, which would allow users to push data onto the index in whichever
   format they wish? This would also facilitate plugging Solr in seamlessly
   with all kinds of applications.
 

 You can of course push your documents to Solr using the XML/CSV update (or
 using the SolrJ client). It's just that you can't push documents with DIH.

 http://wiki.apache.org/solr/#head-98c3ee61c5fc837b09e3dfe3fb420491c9071be3

 --
 Regards,
 Shalin Shekhar Mangar.
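
As a rough illustration of that push path, here is a sketch that renders
documents in Solr's XML update message format and POSTs them over plain
HTTP; the host, core path, and field names are made up:

```python
# Render docs as Solr <add> XML and push them to the update handler.
import urllib.request
import xml.sax.saxutils as su

def to_add_xml(docs):
    """Render a list of dicts in Solr's <add><doc>...</doc></add> format."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (su.quoteattr(name), su.escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_update(xml, url="http://localhost:8983/solr/update"):
    """POST an update message; follow with a <commit/> when done."""
    req = urllib.request.Request(
        url, data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    return urllib.request.urlopen(req)

print(to_add_xml([{"id": "1", "title": "hello"}]))
```

The application decides when to post, so no timestamp sync is needed.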



input XSLT

2009-03-10 Thread CIF Search
Just as you have an XSLT response writer to convert the Solr XML response to
make it compatible with any application, on the input side do you have an
XSLT module that will transform XML documents to Solr format before posting
them to the Solr indexer? I have gone through DataImportHandler, but it
works in data-'pull' mode, i.e. Solr pulls data from the given location. I
would still want to work with applications 'posting' documents to the Solr
indexer as and when they want.

Regards,
CI
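
In the absence of such a server-side transform, the mapping can be done on
the client before posting; a minimal stdlib sketch, where the source schema
(`catalog`/`book`) and the Solr field names are invented for illustration:

```python
# Map an application-specific XML document onto Solr's <add> input format.
import xml.etree.ElementTree as ET

SOURCE = """<catalog>
  <book isbn="123"><name>Lucene in Action</name></book>
  <book isbn="456"><name>Solr 1.4 Guide</name></book>
</catalog>"""

def to_solr_add(source_xml):
    """Transform a custom XML document into Solr's <add><doc> format."""
    add = ET.Element("add")
    for book in ET.fromstring(source_xml).iter("book"):
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="id").text = book.get("isbn")
        ET.SubElement(doc, "field", name="title").text = book.findtext("name")
    return ET.tostring(add, encoding="unicode")

print(to_solr_add(SOURCE))
```

The result can then be POSTed to Solr's update handler whenever the
application chooses.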


custom reranking

2009-02-26 Thread CIF Search
We have a distributed index consisting of several shards. There could be
some documents repeated across shards. We want to remove the duplicate
records from the documents returned from the shards, re-order the results
by grouping them on the basis of a clustering algorithm, and rerank the
documents within a cluster on the basis of the log of a particular returned
field value.
How do we go about achieving this? Should we write this logic by
implementing QueryResponseWriter? Also, if we remove duplicate records, the
total number of records actually returned is less than what was asked for
in the query.

Regards,
CI
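
Client-side, the post-processing described above might be sketched like this;
the cluster assignment and the `popularity` field are placeholders, and a
real clustering algorithm would replace the `cluster_of` function:

```python
import math
from itertools import groupby

def dedupe(docs, key="id"):
    """Keep the first (highest-ranked) copy of each duplicated document."""
    seen, out = set(), []
    for d in docs:
        if d[key] not in seen:
            seen.add(d[key])
            out.append(d)
    return out

def rerank(docs, cluster_of):
    """Group docs by cluster, then sort each cluster by log of a
    returned field value ('popularity' is a placeholder name)."""
    result = []
    for _, group in groupby(sorted(docs, key=cluster_of), key=cluster_of):
        result.extend(sorted(group,
                             key=lambda d: math.log1p(d["popularity"]),
                             reverse=True))
    return result

docs = [
    {"id": "a1", "popularity": 3},
    {"id": "a2", "popularity": 50},
    {"id": "a2", "popularity": 50},   # duplicate from another shard
    {"id": "b1", "popularity": 90},
]
ranked = rerank(dedupe(docs), cluster_of=lambda d: d["id"][0])
print([d["id"] for d in ranked])  # → ['a2', 'a1', 'b1']
```

Note the count problem mentioned above: after `dedupe` there may be fewer
rows than requested, so the client would need to over-fetch from each shard.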


Re: custom reranking

2009-02-26 Thread CIF Search
I believe the query component will generate the query in such a way that I
get the results that I want, but not process the returned results; is that
correct? Is there a way in which I can group the returned results, rank each
group separately, and return the results together? In other words, which
component do I need to write to reorder the returned results as per my
requirements.

Also, the deduplication patch seems interesting, but it doesn't appear to be
expected to work across multiple shards.

Regards,
CI

On Thu, Feb 26, 2009 at 8:03 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Feb 26, 2009, at 6:04 AM, CIF Search wrote:

 We have a distributed index consisting of several shards. There could be
 some documents repeated across shards. We want to remove the duplicate
 records from the documents returned from the shards, and re-order the
  results by grouping them on the basis of a clustering algorithm and
  reranking the documents within a cluster on the basis of the log of a
  particular returned field value.



 I think you would have to implement your own QueryComponent.  However, you
 may be able to get away with implementing/using Solr's FunctionQuery
 capabilities.

 FieldCollapsing is also a likely source of inspiration/help (
 http://www.lucidimagination.com/search/?q=Field+Collapsing#/
 s:email,issues)

 As a side note, have you looked at
 http://issues.apache.org/jira/browse/SOLR-769 ?

  You might also have a look at the de-duplication patch that is working its
 way through dev: http://wiki.apache.org/solr/Deduplication



  How do we go about achieving this? Should we write this logic by
 implementing QueryResponseWriter? Also, if we remove duplicate records, the
 total number of records actually returned is less than what was asked for
 in the query.

 Regards,
 CI


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search