no segments* file found

2007-11-12 Thread SDIS M. Beauchamp
I'm using Solr to index our file servers (480K files).

If I don't optimize, I get a "too many open files" error at about 450K files and a 3 GB index.

If I optimize, I get this stacktrace during the commit of every subsequent update:
 
<result status="1">java.io.FileNotFoundException: no segments* file
found in
org.apache.lucene.store.FSDirectory@/root/trunk/example/solr/data/index:
files: _7xr.tis _7xt.fdt _7o1.tii _7xq.tis _7xn.nrm _7ws.fdt _7xt.prx
_7xp.nrm _7ws.nrm _7xo.nrm _7ws.tis _7xs.fdt _7vc.fnm _7u6.tis _7vx.fnm
_7vx.frq _7xs.nrm _7xn.tis _7xq.frq _7xs.tis _7xq.prx _7vx.fdx _7ur.tii
_7ur.frq _7xq.fnm _7xr.nrm _7vc.fdt _7xt.frq _7xp.fdx _7ws.prx _7xs.frq
_7xo.prx _7xq.nrm _7vx.tii _7vx.prx _7xq.tii _7xs.fnm _7xs.tii _7ws.tii
_7xt.fdx _7vc.nrm _7vc.prx _7vc.tis _7xq.fdt _7ur.prx _7xn.fdx _7xp.frq
_7vx.nrm _7ur.fdt _7xr.fnm _7ws.fdx _7u6.tii _7xr.tii _7vc.frq _7vx.tis
_7xp.fdt _7xr.frq_7ur.tis _7xp.prx _7xr.fdx _7xt.fnm _7xn.tii _7vc.fdx
_7xo.fdt _7u6.fnm _7xn.frq _7xp.tis _7o1.frq _7xn.prx _7ur.fdx _7ur.fnm
_7o1.fdx _7xs.fdx _7xn.fdt _7xt.tis _7xp.fnm _7xo.fnm _7xn.fnm _7u6.prx
_7xq.fdx _7xo.tii _7ws.fnm _7vc.tii _7o1.prx _7xr.fdt _7o1.fdt _7ur.nrm
_7ws.frq _7u6.nrm _7o1.nrm _7vx.fdt _7xt.tii _7u6.fdx _7xo.frq _7u6.frq
_7xs.prx _7xr.prx _7o1.tis _7xt.nrm _7xp.tii _7xo.tis _7u6.fdt _7xo.fdx
_7o1.fnm segments.gen
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:516)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:243)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:616)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:410)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:97)
    at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:121)
    at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:189)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:267)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
    at org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:57)
   
 
</result>
 
If I restart Solr, I get a NullPointerException in DispatchFilter.
 
Tested with Solr 1.2 and 1.3; the behaviour is the same.
 
Regards
 
Florent BEAUCHAMP


Re: no segments* file found

2007-11-12 Thread Venkatraman S
are you using embedded solr?

I had stumbled on a similar error :
http://www.mail-archive.com/solr-user@lucene.apache.org/msg06085.html

-V





--


Re: Trim filter active for solr.StrField?

2007-11-12 Thread Jörg Kiegeland



what is your specific SolrQuery?

calling:
 query.setQuery("  stuff with spaces");

does not call trim(), but some other calls do.

My query looks like, e.g.:

(myField:_T8sY05EAEdyU7fJs63mvdA OR myField:_T8sY0ZEAEdyU7fJs63mvdA 
OR myField:_T8sY0pEAEdyU7fJs63mvdA) AND NOT 
myField:_T8sY1JEAEdyU7fJs63mvdA



So I want to find all documents where the field myField contains any of 
some UUIDs and does not contain another set of UUIDs.


The only other thing I do is set the result limit:

   solrQuery.setRows(resultLimit);

The actual strings which are truncated are in other fields of the returned 
documents.


Any idea?




RE: no segments* file found

2007-11-12 Thread SDIS M. Beauchamp
No, I'm using a custom indexer, written in C#, which submits content using 
POST requests.

I let Lucene manage the index on its own.

Florent BEAUCHAMP

-----Original Message-----
From: Venkatraman S [mailto:[EMAIL PROTECTED] 
Sent: Monday, 12 November 2007 10:19
To: solr-user@lucene.apache.org
Subject: Re: no segments* file found

are you using embedded solr?

I had stumbled on a similar error :
http://www.mail-archive.com/solr-user@lucene.apache.org/msg06085.html

-V





--


RE: Multiple indexes

2007-11-12 Thread Pierre-Yves LANDRON

Hello,

Until now, I've used two instances of Solr, one for each of my collections; it 
works fine, but I wonder
if there is an advantage to using multiple indexes in one instance over several 
instances with one index each?
Note that the two indexes have different schema.xml files.

Thanks.
PL

 Date: Thu, 8 Nov 2007 18:05:43 -0500
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Multiple indexes
 
 Hi,
 
 I am looking for a way to use multiple indexes in a single Solr
 instance.
 I saw that there is patch 215 available and would like to ask someone
 who knows how to use multiple indexes.
 
 Thanks,
 
 Jae Joo

_
Discover the new Windows Vista
http://search.msn.com/results.aspx?q=windows+vista&mkt=en-US&form=QBRE

solr range query

2007-11-12 Thread Heba Farouk
Hello,

 

I would like to use Solr to return ranges of searches on an integer
field. If I write offset:[0 TO 10] in the URL, it returns documents
with offset values 0, 1 and 10 only, but I want it to return the range
0, 1, 2, 3, 4, ..., 10. How can I do that with Solr?

 

Thanks in advance

 

 Best regards,

 

Heba Farouk

Software Engineer

Bibliotheca Alexandrina



Re: I18N with SOLR?

2007-11-12 Thread Ed Summers
I'd say yes. Solr supports Unicode and ships with language specific
analyzers, and allows you to provide your own custom analyzers if you
need them. This allows you to create different fieldType definitions
for the languages you want to support. For example here is an example
field type for French text which uses a French stopword list and
French stemming.

<fieldType
  name="text_french"
  class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter
      class="solr.FrenchStopFilterFactory"
      ignoreCase="true"
      words="stopwords_french.txt"/>
    <filter
      class="solr.FrenchPorterFilterFactory"
      protected="protwords_french.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Then you can create a dynamicField definition that allows you to
index and query your documents using the correct field type:

<dynamicField
  name="*_french"
  type="text_french"
  indexed="true"
  stored="true"/>

This means that when you index you need to know what language your
data is in so that you know what field names to use in your document
(e.g. title_french). And at search time you need to know what language
you are in so you know which fields to search.  Most user interfaces
are in a single language context so from the query perspective you'll
most likely know the language they want to search in. If you don't
know the language context in either case you could try to guess using
something like org.apache.nutch.analysis.lang.LanguageIdentifier.

I hope this helps. We used this technique (without the guessing) quite
effectively at the Library of Congress recently for a prototype
application that needed to provide search functionality in 7 different
languages.

//Ed

On Nov 12, 2007 1:56 AM, Dilip.TS [EMAIL PROTECTED] wrote:
 Hello,

   Does SOLR support I18N (with multiple language support)?
   Thanks in advance.

 Regards,
 Dilip TS




Re: Multiple indexes

2007-11-12 Thread Ryan McKinley
The advantages of a multi-core setup are configuration flexibility and 
dynamically changing available options (without a full restart).


For high-performance production Solr servers, I don't think there is 
much reason for it.  You may want to split the two indexes onto two 
machines.  You may want to run each index in a separate JVM (so if one 
crashes, the other does not).


Maintaining 2 indexes is pretty easy, if that was a larger number or you 
need to create indexes for each user in a system then it would be worth 
investigating the multi-core setup (it is still in development)


ryan


Pierre-Yves LANDRON wrote:

Hello,

Until now, i've used two instance of solr, one for each of my collections ; it 
works fine, but i wonder
if there is an advantage to use multiple indexes in one instance over several 
instances with one index each ?
Note that the two indexes have different schema.xml.

Thanks.
PL


Date: Thu, 8 Nov 2007 18:05:43 -0500
From: [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Subject: Multiple indexes

Hi,

I am looking for the way to utilize the multiple indexes for signle sole
instance.
I saw that there is the patch 215  available  and would like to ask someone
who knows how to use multiple indexes.

Thanks,

Jae Joo






Re: solr range query

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 8:02 AM, Heba Farouk [EMAIL PROTECTED] wrote:
 I would like to use solr  to return ranges of searches on an integer
 field, if I wrote in the url  offset:[0 TO 10], it returns documents
 with offset values 0, 1, 10 only  but I want to return the range 0,1,2,
 3, 4 ,10. How can I do that with solr

Use fieldType=sint (sortable int... see the schema.xml), and reindex.

-Yonik
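For reference, the sortable type Yonik mentions is declared in the example schema.xml roughly like this (a sketch from memory of the 1.2-era example schema; the exact attributes in your copy may differ):

```xml
<!-- "sint" encodes ints so that string order matches numeric order,
     which makes range queries like offset:[0 TO 10] behave numerically -->
<fieldType name="sint" class="solr.SortableIntField" omitNorms="true"/>

<!-- declare the field with the sortable type, then reindex -->
<field name="offset" type="sint" indexed="true" stored="true"/>
```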


Query and heap Size

2007-11-12 Thread Jae Joo
In my system, the heap size (old generation) keeps growing, caused by
heavy traffic.
I have adjusted the size of the young generation, but it does not help.

Does anyone have any recommendation regarding this issue? - Solr
configuration and/or web.xml ...etc...

Thanks,

Jae


Re: Multiple indexes

2007-11-12 Thread Jae Joo
Here is my situation.

I have 6 million articles indexed and am adding about 10K articles every day.
If I maintain only one index, whenever the daily feed is running it
consumes the heap and causes full GCs.
I am thinking of a way to have multiple indexes: one for the ongoing query
service and one for updates. Once an update is done, the index is switched
automatically and/or by my application.

Thanks,

Jae joo


On Nov 12, 2007 8:48 AM, Ryan McKinley [EMAIL PROTECTED] wrote:

 The advantages of a multi-core setup are configuration flexibility and
 dynamically changing available options (without a full restart).

 For high-performance production solr servers, I don't think there is
 much reason for it.  You may want to split the two indexes on to two
 machines.  You may want to run each index in a separate JVM (so if one
 crashes, the other does not)

 Maintaining 2 indexes is pretty easy, if that was a larger number or you
 need to create indexes for each user in a system then it would be worth
 investigating the multi-core setup (it is still in development)

 ryan






Re: Multiple indexes

2007-11-12 Thread Ryan McKinley


Just use the standard collection distribution stuff.  That is what it is 
made for! http://wiki.apache.org/solr/CollectionDistribution

Alternatively, open up two indexes using the same config/dir -- do your 
indexing on one and the searching on the other.  When indexing is done 
(or finishes a big chunk), send a <commit/> to the 'searching' one and it 
will see the new stuff.
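The commit here is just a small XML message POSTed to the update handler of the 'searching' instance; assuming the stock example URL, the body is simply:

```xml
<!-- POST this body to http://localhost:8983/solr/update
     with Content-Type: text/xml -->
<commit/>
```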


ryan



Jae Joo wrote:

Here is my situation.

I have 6 millions articles indexed and adding about 10k articles everyday.
If I maintain only one index, whenever the daily feeding is running, it
consumes the heap area and causes FGC.
I am thinking the way to have multiple indexes - one is for ongoing querying
service and one is for update. Once update is done, switch the index by
automatically and/or my application.

Thanks,

Jae joo










Re: Best way to create multiple indexes

2007-11-12 Thread Ryan McKinley
For starters, do you need to be able to search across groups or 
sub-groups (in one query?)


If so, then you have to stick everything in one index.

You can add a field to each document saying what 'group' or 'sub-group' 
it is in and then limit it at query time


 q=kittens +group:A

The advantage to splitting it into multiple indexes is that you could 
put each index on independent hardware.  Depending on your queries and 
index size that may make a big difference.
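A sketch of what that schema addition could look like (the field names here are illustrative, not from the actual schema being discussed):

```xml
<!-- one keyword field per document recording its group / sub-group -->
<field name="group"    type="string" indexed="true" stored="true"/>
<field name="subgroup" type="string" indexed="true" stored="true"/>
<!-- query time: q=kittens +group:A  (or restrict on subgroup the same way) -->
```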


ryan


Rishabh Joshi wrote:

Hi,

I have a requirement and was wondering if someone could help me in how to go 
about it. We have to index about 8-9 million documents and their size can be 
anywhere from a few KBs to a couple of MBs. These documents are categorized 
into many 'groups' and 'sub-groups'. I wanted to know if we can create multiple 
indexes based on 'groups' and then on 'sub-groups' in Solr? If yes, then how do 
we go about it? I tried going through the section on 'Collections' in the Solr 
Wiki, but could not make much use of it.

Regards,
Rishabh Joshi









Re: Best way to create multiple indexes

2007-11-12 Thread Dwarak R

Hi Guys

How do we add Word / PDF / text / etc. documents to Solr? How is the 
content of the files stored or indexed? Are the documents stored 
as XML in the filesystem?


Regards

Dwarak R
- Original Message - 
From: Ryan McKinley [EMAIL PROTECTED]

To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 7:43 PM
Subject: Re: Best way to create multiple indexes


For starters, do you need to be able to search across groups or sub-groups 
(in one query?)


If so, then you have to stick everything in one index.

You can add a field to each document saying what 'group' or 'sub-group' it 
is in and then limit it at query time


 q=kittens +group:A

The advantage to splitting it into multiple indexes is that you could put 
each index on independent hardware.  Depending on your queries and index 
size that may make a big difference.


ryan













This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise private information. If you have received it in error, 
please notify the sender[EMAIL PROTECTED]  immediately and delete the 
original. Any other use of the email by you is prohibited.


solr workflow ?

2007-11-12 Thread Dwarak R
Hi Guys

How do we add Word / PDF / text / etc. documents to Solr? How is the 
content of the files stored or indexed? Are these documents stored as XML 
in the Solr filesystem?

Regards

Dwarak R



RE: Best way to create multiple indexes

2007-11-12 Thread Rishabh Joshi

Ryan,

We currently have 8-9 million documents to index, and this number will grow in 
the future. Also, we will never have a query that searches across groups, 
but we will certainly have queries that search across sub-groups.
Keeping this in mind, we were thinking we could have multiple indexes at 
the 'group' level at least.
Also, can multiple indexes be created dynamically? For example, if my 
application creates a 'logical group', then an index should be created for 
that group.

Rishabh

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Monday, November 12, 2007 7:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to create multiple indexes

For starters, do you need to be able to search across groups or
sub-groups (in one query?)

If so, then you have to stick everything in one index.

You can add a field to each document saying what 'group' or 'sub-group'
it is in and then limit it at query time

  q=kittens +group:A

The advantage to splitting it into multiple indexes is that you could
put each index on independent hardware.  Depending on your queries and
index size that may make a big difference.

ryan










Re: no segments* file found

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 3:46 AM, SDIS M. Beauchamp [EMAIL PROTECTED] wrote:
 If I don't optimize, I 've got a too many files open at about 450K files
 and 3 Gb index

You may need to increase the number of filedescriptors in your system.
If you're using Linux, see this:
http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html
Check the system wide limit and the per-process limit.

-Yonik
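On Linux the checks look roughly like this (a sketch; the exact limits and the right value to raise them to depend on your system):

```shell
# per-process soft limit on open file descriptors
ulimit -n
# system-wide limit (Linux)
cat /proc/sys/fs/file-max 2>/dev/null || true
# to raise the per-process limit in the shell that starts Solr
# (needs a sufficiently high hard limit; see ulimit -Hn):
#   ulimit -n 65536
```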


Does SOLR support multiple instances within the same web application?

2007-11-12 Thread Dilip.TS
Hello,

  Does SOLR support multiple instances within the same web application? If
so, how is this achieved?

  Thanks in advance.

Regards,
Dilip TS



leading wildcards

2007-11-12 Thread Traut
Hi
 I found the thread about enabling leading wildcards in
Solr as additional option in config file. I've got nightly Solr build
and I can't find any options connected with leading wildcards in
config files.

 How I can enable leading wildcard queries in Solr? Thank you


-- 
Best regards,
Traut


Re: solr workflow ?

2007-11-12 Thread Traut
rtfm :)
http://lucene.apache.org/solr/tutorial.html

On Nov 12, 2007 4:33 PM, Dwarak R [EMAIL PROTECTED] wrote:
 Hi Guys

 How do we add word documents / pdf / text / etc documents in solr ?. How do 
 the content of the files are stored or indexed ?. Are these documents stored 
 as XML in the SOLR filesystem ?

 Regards

 Dwarak R





-- 
Best regards,
Traut


Re: Does SOLR support multiple instances within the same web application?

2007-11-12 Thread Ryan McKinley

Dilip.TS wrote:

Hello,

  Does SOLR support multiple instances within the same web application? If
so, how is this achieved?



If you want multiple indices, you can run multiple web-apps.

If you need multiple indices in the same web-app, check SOLR-350 -- it 
is still in development, and make sure you *really* need it before going 
that route.


ryan


Re: solr workflow ?

2007-11-12 Thread Venkatraman S
Highly unfortunate!

On Nov 12, 2007 9:07 PM, Traut [EMAIL PROTECTED] wrote:

 rtfm :)
 http://lucene.apache.org/solr/tutorial.html

 On Nov 12, 2007 4:33 PM, Dwarak R [EMAIL PROTECTED] wrote:
  Hi Guys
 
  How do we add word documents / pdf / text / etc documents in solr ?. How
 do the content of the files are stored or indexed ?. Are these documents
 stored as XML in the SOLR filesystem ?
 
  Regards
 
  Dwarak R
 
 



 --
 Best regards,
 Traut




--


Re: leading wildcards

2007-11-12 Thread Traut
Seems like there is no way to enable leading wildcard queries except
editing the code and repacking the files. :(

On 11/12/07, Bill Au [EMAIL PROTECTED] wrote:
 The related bug is still open:

 http://issues.apache.org/jira/browse/SOLR-218

 Bill

 On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote:
  Hi
   I found the thread about enabling leading wildcards in
  Solr as additional option in config file. I've got nightly Solr build
  and I can't find any options connected with leading wildcards in
  config files.
 
   How I can enable leading wildcard queries in Solr? Thank you
 
 
  --
  Best regards,
  Traut
 



-- 
Best regards,
Traut


Re: leading wildcards

2007-11-12 Thread Michael Kimsal
Vote for that issue and perhaps it'll gain some more traction.  A former
colleague of mine contributed the patch in SOLR-218, and it would be nice
to have that configuration option 'standard' (even if off by default) in
the next Solr release.


On Nov 12, 2007 11:18 AM, Traut [EMAIL PROTECTED] wrote:

 Seems like there is no way to enable leading wildcard queries except
 code editing and files repacking. :(

 On 11/12/07, Bill Au [EMAIL PROTECTED] wrote:
  The related bug is still open:
 
  http://issues.apache.org/jira/browse/SOLR-218
 
  Bill
 
  On Nov 12, 2007 10:25 AM, Traut [EMAIL PROTECTED] wrote:
   Hi
I found the thread about enabling leading wildcards in
   Solr as additional option in config file. I've got nightly Solr build
   and I can't find any options connected with leading wildcards in
   config files.
  
How I can enable leading wildcard queries in Solr? Thank
 you
  
  
   --
   Best regards,
   Traut
  
 


 --
 Best regards,
 Traut




-- 
Michael Kimsal
http://webdevradio.com


Re: Multiple indexes

2007-11-12 Thread Jae Joo
I have built the master Solr instance and indexed some files. Once I run
snapshooter -- snapshooter -d data/index (in the solr/bin directory) -- it
complains with the error below. Did I miss something?

++ date '+%Y/%m/%d %H:%M:%S'
+ echo 2007/11/12 12:38:40 taking snapshot
/solr/master/solr/data/index/snapshot.20071112123840
+ [[ -n '' ]]
+ mv /solr/master/solr/data/index/temp-snapshot.20071112123840 /solr/master/solr/data/index/snapshot.20071112123840
mv: cannot access /solr/master/solr/data/index/temp-snapshot.20071112123840
Jae

On Nov 12, 2007 9:09 AM, Ryan McKinley [EMAIL PROTECTED] wrote:


 just use the standard collection distribution stuff.  That is what it is
 made for! http://wiki.apache.org/solr/CollectionDistribution

 Alternatively, open up two indexes using the same config/dir -- do your
 indexing on one and the searching on the other.  when indexing is done
 (or finishes a big chunk) send commit/ to the 'searching' one and it
 will see the new stuff.

 ryan







RE: Solr + autocomplete

2007-11-12 Thread Park, Michael
Thanks Ryan,

This looks like the way to go.  However, when I set up my schema I get,
"Error loading class 'solr.EdgeNGramFilterFactory'".  For some reason
the class is not found.  I tried the stable 1.2 build and even tried the
nightly build.  I'm using <filter class="solr.EdgeNGramFilterFactory"
minGramSize="1" maxGramSize="20"/>.

Any suggestions?

Thanks,
Mike

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Monday, October 15, 2007 4:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr + autocomplete

 
 I would imagine there is a library to set up an autocomplete search
with
 Solr.  Does anyone have any suggestions?  Scriptaculous has a
JavaScript
 autocomplete library.  However, the server must return an unordered
 list.
 

Solr does not provide an autocomplete UI, but it can return JSON that a 
JS library can use to populate an autocomplete.

Depending on your index size / query speed, you may be fine with a 
standard faceting prefix filter.  If the index is large, you may want to
index using the EdgeNGramFilterFactory.

Check the last comment in:
https://issues.apache.org/jira/browse/SOLR-357
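For small indexes, the prefix-filter approach Ryan mentions is an ordinary facet query with facet.prefix set to whatever the user has typed so far. A hedged Python sketch of building such a request; the field name "name" and the parameter choices are assumptions for illustration:

```python
# Build the query-string for a facet.prefix autocomplete request.
# rows=0 because we only want facet counts back, not documents.
from urllib.parse import urlencode

def autocomplete_params(user_input, field="name", limit=10):
    return urlencode({
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "facet.prefix": user_input.lower(),  # match the indexed casing
        "wt": "json",
    })

# e.g. GET http://localhost:8983/solr/select?<autocomplete_params("nok")>
```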

ryan




RE: Solr + autocomplete

2007-11-12 Thread Chris Hostetter

: Error loading class 'solr.EdgeNGramFilterFactory'.  For some reason

EdgeNGramFilterFactory didn't exist when Solr 1.2 was released, but the 
EdgeNGramTokenizerFactory did.  (the javadocs that come with each release 
list all of the various factories in that release)
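Since only the tokenizer variant exists in 1.2, a schema fragment along these lines should work there. This is a sketch: the fieldType name is made up, and the attribute names should be verified against the release javadocs mentioned above:

```xml
<!-- hypothetical autocomplete field type for Solr 1.2 -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- tokenizer (not filter) form, available in 1.2 -->
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```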


-Hoss



DISTINCT ON functionality in Solr?

2007-11-12 Thread Jörg Kiegeland
Is there a way to define a query such that the search result 
contains only one representative of every set of documents that are 
equal on a given field (it is not important which representative 
document), i.e. to have the DISTINCT ON concept from relational 
databases in Solr?


If this cannot be done with the search API of Lucene, may be one can use 
Solr server side hooks or filters to achieve this? How?


The reason why I do not want to do this filtering manually is that 
I want to have as many matches as possible with respect to my defined 
result limit for the query (and filtering the search result on the client 
side may push me far away from this limit).


Thanks..


Phrase-based (vs. Word-Based) Proximity Search

2007-11-12 Thread Chris Harris
I gather that the standard Solr query parser uses the same syntax for
proximity searches as Lucene, and that Lucene syntax is described at

  http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches

This syntax lets me look for terms that are within x words of each
other. Their example is that

  "jakarta apache"~10

will find documents where "jakarta" and "apache" occur within 10 words
of one another.

What I would like to do is find documents where *phrases*, not just
terms, are within x words of each other. I want to be able to say
things like

  Find the documents where the phrases "apache jakarta" and "sun
microsystems" occur within ten words
  of one another.

If I gave such a search, I would *not* want it to count as a match if,
for instance, "apache" appeared near "microsystems" but "apache"
wasn't followed immediately by "jakarta", or "microsystems" wasn't
preceded immediately by "sun". I would also not want it to match if
"apache jakarta" appeared, but "sun microsystems" did not appear.

Is there any way to do such a search currently? I suppose it might work to say

  "apache jakarta sun microsystems"~10 +"apache jakarta" +"sun microsystems"

but that seems like an unfortunate hack. In any case it's not really
something I can expect my users to be able to type in by themselves.
In our current query language (which I'm hoping to wean our users off
of), they can type

  "apache jakarta" near/10 "sun microsystems"

which I believe is more intuitive.

Any ideas?

Chris


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Erick Erickson
DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be
useful

For your line number, page number, etc. perspective, it is possible to index
special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum
data, along with SpanQueries, to figure this out at search time. For
instance, coincident with the last term in each line, index the token "$".
Coincident with the last token of every paragraph, index the token "#".
If you get the offsets of the matching terms, you can quite quickly count
the number of line and paragraph tokens using TermDocs/TermEnums and
correlate hits to lines and paragraphs. The trick is to index your special
tokens with a position increment of 0 (see SynonymAnalyzer in Lucene in
Action for more on this).


Another possibility is to add to each document a special field holding the
offsets of each end-of-sentence and end-of-paragraph (stored, not indexed).
Again, given the offsets, you can read in this field and figure out what
line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high
volume applications.

Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so
don't even *think* of asking me how to really make this work in SOLR <G>.

Best
Erick
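The counting step of the first approach above can be sketched in plain Python: given the positions of the "$" and "#" marker tokens and a matching term's position, the line and paragraph numbers fall out of a count. This illustrates the arithmetic only, not the Lucene API:

```python
# Markers sit at the same position as the last real term of each line /
# paragraph (position increment 0), so counting markers strictly before
# the hit gives the number of completed lines / paragraphs before it.
def line_and_paragraph(line_markers, para_markers, hit_position):
    line = sum(1 for p in line_markers if p < hit_position) + 1
    para = sum(1 for p in para_markers if p < hit_position) + 1
    return line, para
```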

On Nov 11, 2007 12:44 AM, David Neubert [EMAIL PROTECTED] wrote:

 Ryan (and others who need something to put them to sleep :) )

 Wow -- the light-bulb finally went off -- the Analyzer admin page is very
 cool -- I just was not at all thinking the SOLR/Lucene way.

 I need to rethink my whole approach now that I understand (from reviewing
 the schema.xml closer and playing with the Analyser) how compatible index
 and query policies can be applied automatically on a field by field basis by
 SOLR at both index and query time.

 I still may have a stumper here, but I need to give it some thought, and
 may return again with another question:

 The problem is that my text is book text (fairly large) that looks very
 much like one would expect:
 <book>
 <chapter>
 <para><sen>...</sen><sen>...</sen></para>
 <para><sen>...</sen><sen>...</sen></para>
 <para><sen>...</sen><sen>...</sen></para>
 </chapter>
 </book>

 The search results need to return exact sentences or paragraphs with their
 exact page:line numbers (which is available in the embedded markup in the
 text).

 There were previous responses by others, suggesting I look into payloads,
 but I did not fully understand that -- I may have to re-read those e-mails
 now that I am getting a clearer picture of SOLR/Lucene.

 However, the reason I resorted to indexing each paragraph as a single
 document, and then redundantly indexing each sentence as a single document,
 is because I was planning on pre-parsing the text myself (outside of SOLR)
 -- and feeding separate <doc> elements to the <add> because in that way I
 could produce the page:line reference in the pre-parsing (again outside of
 SOLR) and feed it in as an explicit field in the <doc> elements of the <add>
 requests.  Therefore at query time, I will have the exact page:line
 corresponding to the start of the paragraph or sentence.

 But I am beginning to suspect, I was planning to do a lot of work that
 SOLR can do for me.

 I will continue to study this and respond when I am a bit clearer, but the
 closer I could get to just submitting the books a chapter at a time -- and
 letting SOLR do the work, the better (cause I have all the books in well
 formed xml at chapter levels).  However, I don't  see yet how I could get
 par/sen granular search result hits, along with their exact page:line
 coordinates unless I approach it by explicitly indexing the pars and sens as
 single documents, not chapters hits, and also return the entire text of the
 sen or par, and highlight the keywords within (for the search result hit).
  Once a search result hit is selected, it would then act as expected and
 position into the chapter, at the selected reference, highlight again the
 key words, but this time in the context of an entire chapter (the whole
 document to the user's mind).

 Even with my new understanding you (and others) have given me, which I can
 use to certainly improve my approach -- it still seems to me that because
 multi-valued fields concatenate text -- even if you use the
 positionIncrementGap feature to prohibit unwanted phrase matches, how do you
 produce a well defined search result hit, bounded by the exact sen or par,
 unless you index them as single documents?

 Should I still read up on the payload discussion?

 Dave




 - Original Message 
 From: Ryan McKinley [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Saturday, November 10, 2007 5:00:43 PM
 Subject: Re: Redundant indexing * 4 only solution (for par/sen and case
 sensitivity)


 David Neubert wrote:
  Ryan,
 
  Thanks for your response.  I infer from your response that you can
  have a different analyzer for each field

 yes!  

Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik,

Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant 
by a "case" field, but I am not clear about your wording "per-book setting 
attached at index time" -- would you mind elaborating on that, so I am clear?

Dave

- Original Message 
From: Erik Hatcher [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 11, 2007 5:21:45 AM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


Solr query syntax is documented here: http://wiki.apache.org/solr/ 
SolrQuerySyntax

What Yonik is referring to is creating your own "case" field with the  
per-book setting attached at index time.

Erik


On Nov 11, 2007, at 12:55 AM, David Neubert wrote:

 Yonik (or anyone else)

 Do you know where on-line documentation on the +case: syntax is  
 located?  I can't seem to find it.

 Dave

 - Original Message 
 From: Yonik Seeley [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Saturday, November 10, 2007 4:56:40 PM
 Subject: Re: Redundant indexing * 4 only solution (for par/sen and  
 case sensitivity)


 On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote:
 So if I am hitting multiple fields (in the same search request) that
  invoke different Analyzers -- am I at a dead end, and have to resort
  to consecutive multiple queries instead?

 Solr handles that for you automatically.

 The app that I am replacing (and trying to enhance) has the ability
  to search multiple books at once
 with sen/par and case sensitivity settings individually selectable
  per book

 You could easily select case sensitivity or not *per query* across
 all
  books.
 You should step back and see what the requirements actually are (i.e.
 the reasons why one needs to be able to select case
 sensitive/insensitive on a book level... it doesn't make sense to me
 at first blush).

 It could be done on a per-book level in solr with a more complex
 query
 structure though...

 (+case:sensitive +(normal relevancy query on the case sensitive
 fields
 goes here)) OR (+case:insensitive +(normal relevancy query on the
 case
 insensitive fields goes here))

 -Yonik





 __
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around
 http://mail.yahoo.com






__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

RE: Solr + autocomplete

2007-11-12 Thread Park, Michael
Will I need to use Solr 1.3 with the EdgeNGramFilterFactory in order to
get the autosuggest feature?

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 12, 2007 1:05 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr + autocomplete





Re: Phrase-based (vs. Word-Based) Proximity Search

2007-11-12 Thread Ken Krugler

Hi Chris,


I gather that the standard Solr query parser uses the same syntax for
proximity searches as Lucene, and that Lucene syntax is described at

http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches

This syntax lets me look for terms that are within x words of each
other. Their example is that

  "jakarta apache"~10

will find documents where "jakarta" and "apache" occur within 10 words
of one another.

What I would like to do is find documents where *phrases*, not just
terms, are within x words of each other. I want to be able to say
things like

  Find the documents where the phrases "apache jakarta" and "sun
microsystems" occur within ten words
  of one another.


[snip]

I'd thought that span queries would allow you to do this type of 
thing, but they're not supported (currently) by the standard query 
parser.


E.g. check out the SpanNearQuery support in (recent) Lucene releases:

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/spans/SpanNearQuery.html

I would recommend re-posting this on the Lucene user list.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


Re: Associating pronouns instances to proper nouns?

2007-11-12 Thread David Neubert
Attempting to answer my own question, which I should probably just try, 
assuming I can doctor the indexed text -- I suppose I could do something like 
change all instances of I, he, etc. that refer to one person to IJBA, HEJBA, 
HIMJBA (making sure they would never equal a normal word) -- then use the 
synonym feature to link IJBA, HEJBA, HIMJBA, Joe Book Author, J.B.Author 
(although, even if this were a good approach, I don't know if you can link 
synonyms for phrases as opposed to a single word). And of course this would 
require a correlative translation mechanism at display time to render I, he, 
him, instead of the indexed acronym.

- Original Message 
From: David Neubert [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:54:11 PM
Subject: Associating pronouns instances to proper nouns?


All,

I am working with very exact text and search over permanent documents (books).  
It would be great to associate pronouns like he, she, him, her, I, my, etc. 
with the actual author or person the pronoun refers to.  I can see how I could 
get pretty darn close with the synonym feature in Lucene.  Unfortunately 
though, as I understand it, this would associate all instances of I, he, she, 
etc. instead of particular instances.

I have come up with a crude mechanism, adding the initials for the referred 
person immediately after the pronoun ... him{DGN}, but this of course 
complicates word counts and potential phrase lookups, etc. (which I could 
probably live with and work around).

But after understanding how easy it is to add synonyms for any particular
 word in a document, is there any standard practical way to add synonyms to a 
particular word instance within a document?  That would really do the trick?

Dave





__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik - thanks, I am considering this approach versus explicit redundant 
indexing -- and am also considering Lucene -- problem is, I am one week into 
both technologies (though have years in the search space) -- wish I could go to 
Hong Kong -- any discounts available anywhere :)

Dave

- Original Message 
From: Erick Erickson [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote:
 Erik - thanks, I am considering this approach versus explicit redundant 
 indexing -- and am also considering Lucene -

There's not a well defined solution in either IMO.

 - problem is, I am one week into both technologies (though have years in the 
 search space) -- wish I could
 go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled.

-Yonik


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Chris Hostetter

:  - problem is, I am one week into both technologies (though have years in 
the search space) -- wish I could
:  go to Hong Kong -- any discounts available anywhere :)
: 
: Unfortunately the OS Summit has been canceled.

Or rescheduled to 2008 ... depending on whether you are a half-empty / 
half-full kind of person.

And let's not forget Atlanta ... starting today and all...

http://us.apachecon.com/us2007/



-Hoss



Re: Associating pronouns instances to proper nouns?

2007-11-12 Thread David Neubert
All

I have found (from using the Admin/Analysis page) that if I were to append 
unique initials (that didn't match any other word or acronym) to each pronoun 
(e.g. I-WCN, she-WCN, my-WCN etc.) that the default parsing and tokenization 
for the text field in SOLR might actually do the trick -- it parses down to I, 
wcn, IWCN, i, idgn -- all at the same word position -- so that is perfect.  I 
haven't exhaustively tested all capitalization nuances, but am not too worried 
about that.

If I want to do an exhaustive search for person WCN, I just have to enter 
his/her initials and then can get all references including pronouns?

Anybody see any holes in this?  (sounds alarmingly easy so far)?

Dave

- Original Message 
From: David Neubert [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 3:04:20 PM
Subject: Re: Associating pronouns instances to proper nouns?


Attempting to answer my own question, which I should probably just try, 
assuming I can doctor the indexed text ---I suppose I could do something like 
change all instances or I, he, etc that refer to one person to IJBA HEJBA, 
HIMJBA (making sure they would never equal a normal word) -- then use the 
synonym feature to link IJBA, HEJBA, HIMJBA, Joe Book Author, J.B.Author 
(although, even if this were a good approach)  I don't know if you can link 
synonyms for phrases as opposed to a single word. And of course this would 
require a correlative translation mechanism at display time to render I, he, 
him, instead of the indexed acronym.

- Original Message 
From: David Neubert [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:54:11 PM
Subject: Associating pronouns instances to proper nouns?


All,

I am working with very exact text and search over permament documents (books).  
It would be great to associate pronouns like he, she, him, her, I, my, etc. 
with the acutal author or person the pronoun refers to.  I can see how I could 
get pretty darn close with the synonym feature in Lucene.  Unfortunately 
though, as I understand it, this would associate all instances or I, he, she, 
etc. instead of particular instances.

I have come up with a crude mechanism, adding the initials for the referred 
person, immediately after the pronoun ... him{DGN}, but this of course 
complicates word counts and potential prhase lookups, etc. (which I could 
probably live with and work around).

But after understanding how easy it is to add synonymns for any particular
 word in a document, is there any standard practical way to add synonymns to a 
particular word instance within a document?  That would really do the trick?

Dave





__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Faceting over limited result set

2007-11-12 Thread Pieter Berkel
On 13/11/2007, Chris Hostetter [EMAIL PROTECTED] wrote:


 can you elaborate on your use case ... the only time i've ever seen people
 ask about something like this it was because true facet counts were too
 expensive to compute, so they were doing sampling of the first N
 results.

 In Solr, Sampling like this would likely be just as expensive as getting
 the full count.


It's not really a performance-related issue; the primary goal is to use the
facet information to determine the most relevant product category related to
the particular search being performed.

Generally the facets returned by simple, generic queries are fine for this
purpose (e.g. a search for "nokia" will correctly return "Mobile / Cell
Phone" as the most frequent facet), however facet data for more specific
searches are not as clear-cut (e.g. "samsung tv", where TVs will appear at
the top of the search results, but will also match other Samsung products
like mobile phones and mp3 players - obviously I could tweak the 'mm' parameter
to fix this particular case, but it wouldn't really solve my problem).

The theory is that facet information generated from the first 'x' (let's say
100) matches to a query (ordered by score / relevance) will be more accurate
(for the above purpose) than facets obtained over the entire result set.  So
ideally, it would be useful to be able to constrain the size of the DocSet
somehow (as you mention below).


matching occurs in increasing order of docid, so even if there was a hook
 to say "stop matching after N docs" those N wouldn't be a good
 representative sample, they would be biased towards older documents
 (based on when they were indexed, not on any particular date field)

 if what you are interested in is stats on the first N docs according to a
 specific sort (score or otherwise) then you could write a custom request
 handler that executed a search with a limit of N, got the DocList,
 iterated over it to build a DocSet, and then used that DocSet to do
  faceting ... but that would probably take even longer than just using the
 full DocSet matching the entire query.



I was hoping to avoid having to write a custom request handler but your
suggestion above sounds like it would do the trick.  I'm also debating
whether to extract my own facet info from a result set on the client side,
but this would be even slower.
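The client-side variant mentioned above (tallying a category field over only the top-ranked matches) is simple enough to sketch in Python; the field name and document shape are assumptions for illustration:

```python
# Fetch the top n results sorted by score, then tally a category field
# yourself and take the most frequent value as the "best" category.
from collections import Counter

def top_category(docs, field="category", n=100):
    counts = Counter(d[field] for d in docs[:n] if field in d)
    return counts.most_common(1)[0][0] if counts else None
```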

Thanks for your suggestions so far,
Piete


Re: DISTINCT ON functionality in Solr?

2007-11-12 Thread Pieter Berkel
Currently this functionality is not available in Solr out-of-the-box,
however there is a patch implementing Field Collapsing
http://issues.apache.org/jira/browse/SOLR-236 which might be similar to what
you are trying to achieve.

Piete
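Until the SOLR-236 patch is applied, a client-side approximation of DISTINCT ON is to over-fetch results and keep only the first (highest-ranked) document per distinct field value; a hedged sketch:

```python
# Collapse a ranked result list on `field`: keep the first doc per
# distinct value, stopping once `limit` representatives are collected.
# Over-fetch from Solr so the post-filtered list can still fill `limit`.
def collapse(docs, field, limit):
    seen, out = set(), []
    for doc in docs:
        key = doc.get(field)
        if key in seen:
            continue
        seen.add(key)
        out.append(doc)
        if len(out) == limit:
            break
    return out
```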



On 13/11/2007, Jörg Kiegeland [EMAIL PROTECTED] wrote:

 Is there a way to define a query in that way that a search result
 contains only one representative of every set of documents which are
 equal on a given field (it is not important which representative
 document), i.e. to have the DINTINCT ON-concept from relational
 databases in Solr?

 If this cannot be done with the search API of Lucene, may be one can use
 Solr server side hooks or filters to achieve this? How?

 The reason why I do not want to do this filtering manually, is, because
 I want to have as many matches as possible with respect to my defined
 result limit for the query (and filtering the search result on client
 side may really kick me off from this limit far away).

 Thanks..



Re: Faceting over limited result set

2007-11-12 Thread Chris Hostetter

: It's not really a performance-related issue, the primary goal is to use the
: facet information to determine the most relevant product category related to
: the particular search being performed.

ah ... ok, i understand now.  the order does matter, you want the top N 
documents sorted by some criteria (either score, or maybe popularity i 
would imagine) and then you want to pick the categories based on that.

i had to build this for CNET back before solr went open source, but yes - 
i did it using a custom subclass of dismax similar to what i 
described before.

one thing to watch out for is that you probably want to use a consistent 
sort independent of the user's sort -- if the user re-sorts by price it 
can be disconcerting for them if that changes the navigation links.


-Hoss



Re: Does SOLR support multiple instances within the same web application?

2007-11-12 Thread James liu
If I understand correctly, you just do it like this (I use PHP):

$data1 = file_get_contents($url1);  // query the first Solr instance
$data2 = file_get_contents($url2);  // query the second Solr instance

You just run multiple Solr instances and fetch the data from each one.


On Nov 12, 2007 11:15 PM, Dilip.TS [EMAIL PROTECTED] wrote:

 Hello,

  Does SOLR support multiple instances within the same web application? If
 so, how is this achieved?

  Thanks in advance.

 Regards,
 Dilip TS




-- 
regards
jl