Re: Index size vs. number of documents

2008-08-14 Thread Chris Hostetter

: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in "stored" fields is significantly greater than the data
in "indexed" fields, the index size almost never grows linearly with the
number of documents -- it's the number of unique terms that tends to
primarily influence the size of the index.

At some point someone on the java-user list who really understood the file 
formats wrote a great formula for estimating the size of the index 
assuming some ratios of unique terms per doc, but i can't find it now.
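In lieu of that lost formula, here is a rough sketch of why the growth is sublinear. Vocabulary growth is commonly approximated by Heaps' law, V(n) = K * n^b, with K and b fit empirically per corpus (b is typically around 0.4-0.6 for English text); the constants below are illustrative assumptions, not measurements:

```python
# Rough estimate of unique-term (dictionary) growth via Heaps' law.
# K=44 and b=0.49 are illustrative English-text values, not fitted to
# any particular index.

def unique_terms(total_tokens: int, k: float = 44.0, b: float = 0.49) -> float:
    """Estimated vocabulary size for a corpus of total_tokens tokens."""
    return k * total_tokens ** b

def dictionary_size(docs: int, tokens_per_doc: int = 500) -> float:
    """Estimated unique-term count after indexing `docs` documents."""
    return unique_terms(docs * tokens_per_doc)

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9,} docs -> ~{dictionary_size(n):,.0f} unique terms")
```

Under this model, 1000x more documents produces only about 30x more unique terms, which is why the term-dictionary portion of the index grows far more slowly than document count.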


-Hoss



Re: IndexOutOfBoundsException

2008-08-14 Thread Yonik Seeley
Since this looks like more of a lucene issue, I've replied in
[EMAIL PROTECTED]

-Yonik

On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
> I seem to be able to reproduce this very easily and the data is
> medline (so I am sure I can share it if needed with a quick email to
> check).
>
> - I am using fedora:
> %uname -a
> Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
> 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> %java -version
> java version "1.7.0"
> IcedTea Runtime Environment (build 1.7.0-b21)
> IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
> - single core (will use shards but each machine just has one HDD so
> didn't see how cores would help, but I am new at this)
> - next run I will keep the output to check for earlier errors
> - very reproducible, and I can share code + data if that will help
>
> On Thu, Aug 14, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> Yikes... not good.  This shouldn't be due to anything you did wrong
>> Ian... it looks like a lucene bug.
>>
>> Some questions:
>> - what platform are you running on, and what JVM?
>> - are you using multicore? (I fixed some index locking bugs recently)
>> - are there any exceptions in the log before this?
>> - how reproducible is this?
>>
>> -Yonik
>>
>> On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I have rebuilt my index a few times (it should get up to about 4
>>> Million but around 1 Million it starts to fall apart).
>>>
>>> Exception in thread "Lucene Merge Thread #0"
>>> org.apache.lucene.index.MergePolicy$MergeException:
>>> java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>>at 
>>> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
>>>at 
>>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
>>> Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>>at java.util.ArrayList.rangeCheck(ArrayList.java:572)
>>>at java.util.ArrayList.get(ArrayList.java:350)
>>>at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>>>at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
>>>at 
>>> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
>>>at 
>>> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
>>>at 
>>> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
>>>at 
>>> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
>>>at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
>>>at 
>>> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
>>>at 
>>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
>>>
>>>
>>> When this happens, the disk usage goes right up and the indexing
>>> really starts to slow down. I am using a Solr build from about a week
>>> ago - so my Lucene is at 2.4 according to the war files.
>>>
>>> Has anyone seen this error before? Is it possible to tell which Array
>>> is too large? Would it be an Array I am sending in or another internal
>>> one?
>>>
>>> Regards,
>>> Ian Connor
>>>
>>
>
>
>
> --
> Regards,
>
> Ian Connor
>


Re: IndexOutOfBoundsException

2008-08-14 Thread Ian Connor
I seem to be able to reproduce this very easily and the data is
medline (so I am sure I can share it if needed with a quick email to
check).

- I am using fedora:
%uname -a
Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
%java -version
java version "1.7.0"
IcedTea Runtime Environment (build 1.7.0-b21)
IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
- single core (will use shards but each machine just has one HDD so
didn't see how cores would help, but I am new at this)
- next run I will keep the output to check for earlier errors
- very reproducible, and I can share code + data if that will help

On Thu, Aug 14, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Yikes... not good.  This shouldn't be due to anything you did wrong
> Ian... it looks like a lucene bug.
>
> Some questions:
> - what platform are you running on, and what JVM?
> - are you using multicore? (I fixed some index locking bugs recently)
> - are there any exceptions in the log before this?
> - how reproducible is this?
>
> -Yonik
>
> On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I have rebuilt my index a few times (it should get up to about 4
>> Million but around 1 Million it starts to fall apart).
>>
>> Exception in thread "Lucene Merge Thread #0"
>> org.apache.lucene.index.MergePolicy$MergeException:
>> java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>at 
>> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
>>at 
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>at java.util.ArrayList.rangeCheck(ArrayList.java:572)
>>at java.util.ArrayList.get(ArrayList.java:350)
>>at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>>at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
>>at 
>> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
>>at 
>> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
>>at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
>>at 
>> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
>>at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
>>at 
>> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
>>at 
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
>>
>>
>> When this happens, the disk usage goes right up and the indexing
>> really starts to slow down. I am using a Solr build from about a week
>> ago - so my Lucene is at 2.4 according to the war files.
>>
>> Has anyone seen this error before? Is it possible to tell which Array
>> is too large? Would it be an Array I am sending in or another internal
>> one?
>>
>> Regards,
>> Ian Connor
>>
>



-- 
Regards,

Ian Connor


Re: Simple Searching Question

2008-08-14 Thread Jake Conk
Rob,

Actually I am copying *_facet to text. I have the following for
copyField in my schema:

 
 


This is my field configuration in my schema:

 
   
   
   
   


   
   
   
   
   
   
   
   
   

   
   
   
   
   
   
 

Thanks,
- Jake



On Thu, Aug 14, 2008 at 5:49 PM, Rob Casson <[EMAIL PROTECTED]> wrote:
> you're likely not copyField-ing *_facet to text, and we'd need to see
> what type of field it is to see how it will be analyzed at both
> search/index time.
>
> the default schema.xml file is pretty well documented, so you might
> want to spend some time looking thru it, and reading the
> comments... lots of good info in there.
>
> cheers,
> rob
>
> On Thu, Aug 14, 2008 at 7:17 PM, Jake Conk <[EMAIL PROTECTED]> wrote:
>> Hi Shalin,
>>
>> "foobar_facet" is a dynamic field. It's defined in my schema like this:
>>
>> 
>>
>> I have the default search field set to text. Can I use more than one
>> default search field?
>>
>> text
>>
>> Thanks,
>> - Jake
>>
>>
>> On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar
>> <[EMAIL PROTECTED]> wrote:
>>> Hi Jake,
>>>
>>> What is the type of the foobar_facet field in your schema.xml ?
>>> Did you add foobar_facet as the default search field?
>>>
>>> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>>>
 Hello,

 I inserted the following documents into Solr:


 ---

 
 
  124
  Jake Conk
 
 
  125
  Jake Conk
 
 


 ---

 id is the only required integer field.
 foobar_facet is a dynamic string field.

 When I try to search for anything with the word Jake in it the
 following ways I get no results.


 select?q=Jake
 select?q=Jake*


 I thought one of those two should work but the only way I got it to
 work was by specifying which field "Jake" is in along with a wild
 card.


 select?q=foobar_facet:Jake*


 1) Does this mean for each field I would like to search if Jake exists
 I would have to add each field like I did above to the query?

 2) How would I search if I want to find the name Jake anywhere in the
 string? The documentation
 (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
 that I cannot use a wildcard as the first character such as *Jake*

 Thanks,
 - Jake

>>>
>>>
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>>
>


Re: Simple Searching Question

2008-08-14 Thread Rob Casson
you're likely not copyField-ing *_facet to text, and we'd need to see
what type of field it is to see how it will be analyzed at both
search/index time.
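
for example, a wildcard copyField like this in schema.xml (the "text"
destination here assumes a catch-all field like the one in the example
schema; your field names may differ):

```xml
<!-- copy all dynamic *_facet fields into the indexed catch-all field -->
<copyField source="*_facet" dest="text"/>
```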

the default schema.xml file is pretty well documented, so you might
want to spend some time looking thru it, and reading the
comments... lots of good info in there.

cheers,
rob

On Thu, Aug 14, 2008 at 7:17 PM, Jake Conk <[EMAIL PROTECTED]> wrote:
> Hi Shalin,
>
> "foobar_facet" is a dynamic field. It's defined in my schema like this:
>
> 
>
> I have the default search field set to text. Can I use more than one
> default search field?
>
> text
>
> Thanks,
> - Jake
>
>
> On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar
> <[EMAIL PROTECTED]> wrote:
>> Hi Jake,
>>
>> What is the type of the foobar_facet field in your schema.xml ?
>> Did you add foobar_facet as the default search field?
>>
>> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> I inserted the following documents into Solr:
>>>
>>>
>>> ---
>>>
>>> 
>>> 
>>>  124
>>>  Jake Conk
>>> 
>>> 
>>>  125
>>>  Jake Conk
>>> 
>>> 
>>>
>>>
>>> ---
>>>
>>> id is the only required integer field.
>>> foobar_facet is a dynamic string field.
>>>
>>> When I try to search for anything with the word Jake in it the
>>> following ways I get no results.
>>>
>>>
>>> select?q=Jake
>>> select?q=Jake*
>>>
>>>
>>> I thought one of those two should work but the only way I got it to
>>> work was by specifying which field "Jake" is in along with a wild
>>> card.
>>>
>>>
>>> select?q=foobar_facet:Jake*
>>>
>>>
>>> 1) Does this mean for each field I would like to search if Jake exists
>>> I would have to add each field like I did above to the query?
>>>
>>> 2) How would I search if I want to find the name Jake anywhere in the
>>> string? The documentation
>>> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
>>> that I cannot use a wildcard as the first character such as *Jake*
>>>
>>> Thanks,
>>> - Jake
>>>
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>


Re: Best way to index without diacritics

2008-08-14 Thread Norberto Meijome
On Thu, 14 Aug 2008 11:34:47 -0400
"Steven A Rowe" <[EMAIL PROTECTED]> wrote:

[...]
> The kind of filter Walter is talking about - a generalized language-aware 
> character normalization Solr/Lucene filter - does not yet exist.  My guess is 
> that if/when it does materialize, both the Solr and the Lucene projects will 
> want to have it.  Historically, most functionality shared by Solr and Lucene 
> is eventually hosted by Lucene, since Solr has a Lucene dependency, but not 
> vice-versa.
> 
> So, yes, Solr would be responsible for hosting configuration for such a 
> filter, but the responsibility for doing something with the configuration 
> would be Lucene's responsibility, assuming that Lucene would (eventually) 
> host the filter and Solr would host a factory over the filter.
> 
> Steve

thanks for the thorough explanation, Steve.
B

_
{Beto|Norberto|Numard} Meijome

"Throughout the centuries there were [people] who took first steps down new 
paths armed only with their own vision."
   Ayn Rand

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: More files in index directory than expected

2008-08-14 Thread Yonik Seeley
On Thu, Aug 14, 2008 at 6:31 PM, Chris Harris <[EMAIL PROTECTED]> wrote:
> (The only time a
> segment will be modified is if you delete documents from it, and that will
> only alter the segment's .del file, leaving .tis and friends alone.)

Actually, these days .del files are even versioned.

> I don't know exactly what happened, but I restarted Solr once or twice
> and then when I started adding documents again, Solr started deleting
> segment files, and brought things down from like 500GB to like 18GB. I
> feel like I read somewhere that Solr sometimes has trouble deleting
> segment files when running on Windows. (I'm on Windows right now.) I
> wonder if that's related.

Right... file deletion just tends to be delayed a bit longer on Windows.
For example, if you do an optimize, all the segments will be merged
into a single segment, but because a reader is still holding open the
old index, you will see both sets of files.  The old set of files will
be deleted when a new IndexWriter is opened after the old reader is
closed.

-Yonik


Re: Simple Searching Question

2008-08-14 Thread Jake Conk
Hi Shalin,

"foobar_facet" is a dynamic field. Its defined in my schema like this:



I have the default search field set to text. Can I use more than one
default search field?

text

Thanks,
- Jake


On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar
<[EMAIL PROTECTED]> wrote:
> Hi Jake,
>
> What is the type of the foobar_facet field in your schema.xml ?
> Did you add foobar_facet as the default search field?
>
> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I inserted the following documents into Solr:
>>
>>
>> ---
>>
>> 
>> 
>>  124
>>  Jake Conk
>> 
>> 
>>  125
>>  Jake Conk
>> 
>> 
>>
>>
>> ---
>>
>> id is the only required integer field.
>> foobar_facet is a dynamic string field.
>>
>> When I try to search for anything with the word Jake in it the
>> following ways I get no results.
>>
>>
>> select?q=Jake
>> select?q=Jake*
>>
>>
>> I thought one of those two should work but the only way I got it to
>> work was by specifying which field "Jake" is in along with a wild
>> card.
>>
>>
>> select?q=foobar_facet:Jake*
>>
>>
>> 1) Does this mean for each field I would like to search if Jake exists
>> I would have to add each field like I did above to the query?
>>
>> 2) How would I search if I want to find the name Jake anywhere in the
>> string? The documentation
>> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
>> that I cannot use a wildcard as the first character such as *Jake*
>>
>> Thanks,
>> - Jake
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: More files in index directory than expected

2008-08-14 Thread Mark Miller



> The main thing that bugs me about this index now is that the latest
> version of Luke (0.8.1) won't open it. ("Unknown format version: -6")
> The Solr Luke handler works fine with it, though.
Luke probably comes with a released version of Lucene, while Solr is 
using a later version. You have to start Luke with the Solr Lucene jar 
on the classpath. There are directions for doing this type of thing on 
the Luke webpage.


- Mark


Re: More files in index directory than expected

2008-08-14 Thread Chris Harris
On Thu, Aug 14, 2008 at 2:01 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
> Chris Harris <[EMAIL PROTECTED]> wrote:
>> It's my understanding that if my mergeFactor is 10, then there
>> shouldn't be more than 11 segments in my index directory (10 segments,
>> plus an additional segment if a merge is in progress).
>
> Actually, mergeFactor 10 means each *level* will have <= 10 segments,
> where a level is roughly 10X the size of the previous level.
>
> EG after 10 segments (level 0) are flushed, they get merged into a
> single level 1 segment.  Another 10 produces another level 1 segment.
> Etc.  Until you have 10 level 1 segments, which then get merged into a
> single level 2 segment.
>
> The number of levels you have is logarithmic in the size of your index.

Thanks, that undoes a lot of my confusion.

As for segment creation, is it accurate to say the following: Solr
will write a new level 0 segment to disk each time an additional
ramBufferSizeMB (default=32MB) worth of data has been added to the
index. Furthermore, once that 32MB worth of data has been written to
disk, that segment's files will never be modified. (The only time a
segment will be modified is if you delete documents from it, and that
will only alter the segment's .del file, leaving .tis and friends alone.)

>> I just discovered that one of my other indexes has over 11,000 tis
>> files. That's disturbing. I'm not sure if it would have the same
>> underlying cause.
>
> That does NOT sound right.  Can you provide more details how this
> index is created/maintained?

I don't know exactly what happened, but I restarted Solr once or twice
and then when I started adding documents again, Solr started deleting
segment files, and brought things down from like 500GB to like 18GB. I
feel like I read somewhere that Solr sometimes has trouble deleting
segment files when running on Windows. (I'm on Windows right now.) I
wonder if that's related.

The main thing that bugs me about this index now is that the latest
version of Luke (0.8.1) won't open it. ("Unknown format version: -6")
The Solr Luke handler works fine with it, though.


Re: Highlighting returns incorrect text on some results?

2008-08-14 Thread Mark Miller
A question mark huh? You sure there are no character encoding issues 
going on?


Otis Gospodnetic wrote:

Paul, we had many highlighter-related changes since 1.2, so I suggest you try 
the nightly.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
  

From: pdovyda2 <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, August 14, 2008 2:56:42 PM
Subject: Highlighting returns incorrect text on some results?


This is kind of a strange issue, but when I submit a query and ask for
highlighting back, sometimes the highlighted text includes a question mark
at the beginning, although a question mark character does not appear in the
field that the highlighted text is taken from.

I've put some sample XML output on the web at
http://ucair.cs.uiuc.edu/pdovyda2/problem.xml
If you look at the first and third highlights, you'll see what I'm talking
about.  


Besides looking a bit odd, it is causing my application to break because the
highlighted field is multivalued, and I was doing text matching to determine
which of the values was chosen for highlighting.

Is this actually a bug, or have I just misconfigured something?  By the way,
I am using the 1.2 release, I have not yet tried out a nightly build to see
if this is an old problem.

Thanks,
Paul
--
View this message in context: 
http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html

Sent from the Solr - User mailing list archive at Nabble.com.



  




Re: Simple Searching Question

2008-08-14 Thread Shalin Shekhar Mangar
Hi Jake,

What is the type of the foobar_facet field in your schema.xml ?
Did you add foobar_facet as the default search field?

On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I inserted the following documents into Solr:
>
>
> ---
>
> 
> 
>  124
>  Jake Conk
> 
> 
>  125
>  Jake Conk
> 
> 
>
>
> ---
>
> id is the only required integer field.
> foobar_facet is a dynamic string field.
>
> When I try to search for anything with the word Jake in it the
> following ways I get no results.
>
>
> select?q=Jake
> select?q=Jake*
>
>
> I thought one of those two should work but the only way I got it to
> work was by specifying which field "Jake" is in along with a wild
> card.
>
>
> select?q=foobar_facet:Jake*
>
>
> 1) Does this mean for each field I would like to search if Jake exists
> I would have to add each field like I did above to the query?
>
> 2) How would I search if I want to find the name Jake anywhere in the
> string? The documentation
> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
> that I cannot use a wildcard as the first character such as *Jake*
>
> Thanks,
> - Jake
>



-- 
Regards,
Shalin Shekhar Mangar.


Simple Searching Question

2008-08-14 Thread Jake Conk
Hello,

I inserted the following documents into Solr:

---



 124
 Jake Conk


 125
 Jake Conk



---

id is the only required integer field.
foobar_facet is a dynamic string field.

When I try to search for anything with the word Jake in it the
following ways I get no results.


select?q=Jake
select?q=Jake*


I thought one of those two should work but the only way I got it to
work was by specifying which field "Jake" is in along with a wild
card.


select?q=foobar_facet:Jake*


1) Does this mean for each field I would like to search if Jake exists
I would have to add each field like I did above to the query?

2) How would I search if I want to find the name Jake anywhere in the
string? The documentation
(http://lucene.apache.org/java/docs/queryparsersyntax.html) states
that I cannot use a wildcard as the first character such as *Jake*

Thanks,
- Jake


Re: QueryResultsCache and DocSet filter

2008-08-14 Thread Kevin Osborn
The DocSet isn't part of the cache key. The key is usually just a simple string 
(e.g. companyId). They just return a DocSet. I think the user caches are fine. 
This DocSet is then used as a filter for the actual query. I believe it is this 
step that is slow.

However, I am guessing that the solution is still to have the user caches 
return a Query object so that I can supply a List to SolrIndexSearcher, 
causing it to use the QueryResultsCache. Correct?



- Original Message 
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, August 14, 2008 1:41:50 PM
Subject: Re: QueryResultsCache and DocSet filter

On Thu, Aug 14, 2008 at 3:15 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote:
> The problem here is that the calls in SolrIndexSearcher don't appear to use 
> the QueryResultsCache if the filter is a DocSet rather than a List.

Right... using a DocSet as part of the cache key would be pretty slow
(key comparisons) and more memory intensive.

> Is it recommended to re-work the caches to return Query objects and use that 
> as my filterList?

Yes, that should work.

-Yonik



  

Re: Highlighting returns incorrect text on some results?

2008-08-14 Thread Otis Gospodnetic
Paul, we had many highlighter-related changes since 1.2, so I suggest you try 
the nightly.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: pdovyda2 <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, August 14, 2008 2:56:42 PM
> Subject: Highlighting returns incorrect text on some results?
> 
> 
> This is kind of a strange issue, but when I submit a query and ask for
> highlighting back, sometimes the highlighted text includes a question mark
> at the beginning, although a question mark character does not appear in the
> field that the highlighted text is taken from.
> 
> I've put some sample XML output on the web at
> http://ucair.cs.uiuc.edu/pdovyda2/problem.xml
> If you look at the first and third highlights, you'll see what I'm talking
> about.  
> 
> Besides looking a bit odd, it is causing my application to break because the
> highlighted field is multivalued, and I was doing text matching to determine
> which of the values was chosen for highlighting.
> 
> Is this actually a bug, or have I just misconfigured something?  By the way,
> I am using the 1.2 release, I have not yet tried out a nightly build to see
> if this is an old problem.
> 
> Thanks,
> Paul
> -- 
> View this message in context: 
> http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: More files in index directory than expected

2008-08-14 Thread Michael McCandless
Chris Harris <[EMAIL PROTECTED]> wrote:
> It's my understanding that if my mergeFactor is 10, then there
> shouldn't be more than 11 segments in my index directory (10 segments,
> plus an additional segment if a merge is in progress).

Actually, mergeFactor 10 means each *level* will have <= 10 segments,
where a level is roughly 10X the size of the previous level.

EG after 10 segments (level 0) are flushed, they get merged into a
single level 1 segment.  Another 10 produces another level 1 segment.
Etc.  Until you have 10 level 1 segments, which then get merged into a
single level 2 segment.

The number of levels you have is logarithmic in the size of your index.
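
That level structure can be sketched with a toy model. This is an idealization of the tiered log-merge behavior Mike describes (real Lucene merge policies also weigh segment sizes and pending deletes):

```python
# Toy model of tiered merging: whenever merge_factor segments accumulate
# at one level, they merge into a single segment at the next level. In
# this idealization, the per-level counts are just the base-merge_factor
# digits of the flush count, so levels grow logarithmically with index size.

MERGE_FACTOR = 10

def segment_levels(flushes: int, merge_factor: int = MERGE_FACTOR) -> dict:
    """Map level -> live segment count after `flushes` level-0 flushes."""
    levels, level = {}, 0
    while flushes:
        if flushes % merge_factor:
            levels[level] = flushes % merge_factor
        flushes //= merge_factor
        level += 1
    return levels

if __name__ == "__main__":
    print(segment_levels(10))   # ten flushes merged into one level-1 segment
    print(segment_levels(235))  # 5 at level 0, 3 at level 1, 2 at level 2
```

So with mergeFactor 10, a million flushed segments still leaves at most 9 live segments per level across roughly 6 levels, never 11,000 files.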

> I'm noticing that _2pk, _2pl, _2pm, _2pn, _2po are sequential file
> names, alphabetically speaking, and their last modified times are very
> close to one another. Does this mean they're actually part of the same
> segment, even though they are in separate files?

No, these are different segments, just flushed shortly after one
another in time.

> I just discovered that one of my other indexes has over 11,000 tis
> files. That's disturbing. I'm not sure if it would have the same
> underlying cause.

That does NOT sound right.  Can you provide more details how this
index is created/maintained?

Mike


Re: QueryResultsCache and DocSet filter

2008-08-14 Thread Yonik Seeley
On Thu, Aug 14, 2008 at 3:15 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote:
> The problem here is that the calls in SolrIndexSearcher don't appear to use 
> the QueryResultsCache if the filter is a DocSet rather than a List.

Right... using a DocSet as part of the cache key would be pretty slow
(key comparisons) and more memory intensive.

> Is it recommended to re-work the caches to return Query objects and use that 
> as my filterList?

Yes, that should work.

-Yonik


Re: IndexOutOfBoundsException

2008-08-14 Thread Yonik Seeley
Yikes... not good.  This shouldn't be due to anything you did wrong
Ian... it looks like a lucene bug.

Some questions:
- what platform are you running on, and what JVM?
- are you using multicore? (I fixed some index locking bugs recently)
- are there any exceptions in the log before this?
- how reproducible is this?

-Yonik

On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have rebuilt my index a few times (it should get up to about 4
> Million but around 1 Million it starts to fall apart).
>
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>at java.util.ArrayList.rangeCheck(ArrayList.java:572)
>at java.util.ArrayList.get(ArrayList.java:350)
>at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
>at 
> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
>at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
>at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
>at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
>
>
> When this happens, the disk usage goes right up and the indexing
> really starts to slow down. I am using a Solr build from about a week
> ago - so my Lucene is at 2.4 according to the war files.
>
> Has anyone seen this error before? Is it possible to tell which Array
> is too large? Would it be an Array I am sending in or another internal
> one?
>
> Regards,
> Ian Connor
>


NOTICE - solrj MultiCore{Params/Request/Response} have been renamed CoreAdmin{Params/Request/Response}

2008-08-14 Thread Ryan McKinley
In the effort to clean up confusion around MultiCore usage, we have 
renamed the classes that handle runtime core administration from 
"MultiCoreX" to "CoreAdminX".  Additionally, the path that the default 
MultiCoreRequest expects to hit is /admin/cores rather than 
/admin/multicore -- if you have an existing solr.xml file, you may want 
to update the "adminPath" attribute.


for a detailed change log, see:
http://svn.apache.org/viewvc?view=rev&revision=685989

ryan


QueryResultsCache and DocSet filter

2008-08-14 Thread Kevin Osborn
We have a bunch of user caches that return DocSet objects. So, we intersect 
them and send a DocSet filter and the actual query to getDocListAndSet or 
getDocList. The problem here is that the calls in SolrIndexSearcher don't 
appear to use the QueryResultsCache if the filter is a DocSet rather than a 
List.

So, the end result is that our new version of Solr is slower than our old 
version. Our old version cached Query objects instead of DocSets. However, it 
had quite a few other problems. I thought it would be an improvement to use 
DocSets. Is it recommended to re-work the caches to return Query objects and use 
that as my filterList?

Our index size is about million documents. Our typical result set size is about 
15, but can occasionally be in the range of 5,000-15,000.



  

Re: term list

2008-08-14 Thread Grant Ingersoll
Assuming you mean significant in the traditional IR sense, I would 
start with MoreLikeThis.  See http://wiki.apache.org/solr/MoreLikeThisHandler 
-- in particular the mlt.interestingTerms option.

As for phrases, that is a bit harder.  You could try playing around 
with token-based n-grams (called Shingles) and MoreLikeThis together, 
for starters, I think.

If you have some other notion of "significant" in relation to language 
in general, then you've got quite a bit more work to do, most of which 
is way beyond the scope of Solr (although it could plug in to Solr 
nicely).

HTH,
Grant
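
For concreteness, the interestingTerms option can be exercised with a request like the one built below. The host, handler path, and field name are assumptions for a default local setup; mlt.fl and mlt.interestingTerms are the parameters described on the MoreLikeThisHandler wiki page:

```python
# Build a MoreLikeThis request that asks Solr to return the "interesting"
# terms it extracted for a document. Host, handler path, and field name
# are assumed defaults, not taken from any particular deployment.
from urllib.parse import urlencode

def mlt_url(doc_id, field="text", base="http://localhost:8983/solr/mlt"):
    params = {
        "q": f"id:{doc_id}",                # select the source document
        "mlt.fl": field,                    # field(s) to mine for terms
        "mlt.interestingTerms": "details",  # return the terms with boosts
    }
    return base + "?" + urlencode(params)

print(mlt_url(124))
```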


On Aug 14, 2008, at 2:43 PM, Jack Tuhman wrote:

> Humm, I am new to the world of search...
>
> I am looking for something that will give me a list of significant
> words or phrases extracted from a document stored in solr.
>
> Jack
>
> On Fri, Aug 8, 2008 at 9:33 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
>> See https://issues.apache.org/jira/browse/SOLR-651.  I've got some of this
>> coded up and hope to have a patch soon.
>>
>> Or, do you mean, is there a way to get the terms the MLT uses to generate
>> the new query?
>>
>> On Aug 5, 2008, at 8:41 PM, Jack Tuhman wrote:
>>
>>> Hi all,
>>>
>>> is there a way to get key terms from an item?  If each item has an id,
>>> can I pass that ID to a search and get back the key terms like you can
>>> with the mlt filter.
>>>
>>> Does this make sense?
>>>
>>> Jack

Re: spellcheck collation

2008-08-14 Thread Doug Steigerwald
Right before I sent the message.  Did a 'svn up src/; ant clean; ant 
dist' and it failed.  Seems to work fine now.


On Aug 14, 2008, at 2:38 PM, Ryan McKinley wrote:


have you updated recently?

isEnabled() was removed last night...


On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote:


I'd try, but the build is failing from (guessing) Ryan's last commit:

compile:
  [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core
  [javac] Compiling 337 source files to /Users/dsteiger/Desktop/java/solr/build/core
  [javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol

  [javac] symbol  : method isEnabled()
  [javac] location: class org.apache.solr.core.CoreContainer
  [javac]   multicore.isEnabled() ) {

Doug

On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote:

I believe I just fixed this on SOLR-606 (thanks to Stefan's  
patch).  Give it a try and let us know.


-Grant

On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote:

I've noticed a few things with the new spellcheck component that  
seem a little strange.


Here's my document:


5
wii blackberry blackjack creative labs zen  
ipod video nano



Some sample queries:

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true

When spellchecking 'blackberri wi', the collation returned is  
'blackberry wii'.  When spellchecking 'blackberr wi', the  
collation returned is 'blackberrywii'.  'blackber wi' returns  
'blackberrwiiwi'.


Doug
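The missing spaces suggest the collation string is being assembled from token offsets that are not adjusted when a suggestion is longer than the word it replaces. For contrast, a token-based rebuild (a sketch of what the output should look like, not Solr's actual code; whitespace tokenization assumed) keeps the separators intact:

```java
import java.util.Map;

public class SpellCollation {
    // Rebuild the query token by token, substituting the top suggestion
    // for each misspelled token and re-joining with single spaces.
    public static String collate(String query, Map<String, String> suggestions) {
        String[] tokens = query.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(suggestions.getOrDefault(tokens[i], tokens[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> s = Map.of("blackberri", "blackberry", "wi", "wii");
        System.out.println(collate("blackberri wi", s)); // blackberry wii
    }
}
```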








Highlighting returns incorrect text on some results?

2008-08-14 Thread pdovyda2

This is kind of a strange issue, but when I submit a query and ask for
highlighting back, sometimes the highlighted text includes a question mark
at the beginning, although a question mark character does not appear in the
field that the highlighted text is taken from.

I've put some sample XML output on the web at
http://ucair.cs.uiuc.edu/pdovyda2/problem.xml
If you look at the first and third highlights, you'll see what I'm talking
about.  

Besides looking a bit odd, it is causing my application to break because the
highlighted field is multivalued, and I was doing text matching to determine
which of the values was chosen for highlighting.

Is this actually a bug, or have I just misconfigured something?  By the way,
I am using the 1.2 release, I have not yet tried out a nightly build to see
if this is an old problem.

Thanks,
Paul
-- 
View this message in context: 
http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html
Sent from the Solr - User mailing list archive at Nabble.com.



IndexOutOfBoundsException

2008-08-14 Thread Ian Connor
Hi,

I have rebuilt my index a few times (it should get up to about 4
Million but around 1 Million it starts to fall apart).

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
at java.util.ArrayList.rangeCheck(ArrayList.java:572)
at java.util.ArrayList.get(ArrayList.java:350)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
at 
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)


When this happens, the disk usage goes right up and the indexing
really starts to slow down. I am using a Solr build from about a week
ago - so my Lucene is at 2.4 according to the war files.

Has anyone seen this error before? Is it possible to tell which Array
is too large? Would it be an Array I am sending in or another internal
one?

Regards,
Ian Connor


Re: term list

2008-08-14 Thread Jack Tuhman
Humm, I am new to the world of search

I am looking for something that will give me a list of significant words or
phrases extracted from a document stored in solr.
Jack


On Fri, Aug 8, 2008 at 9:33 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> See https://issues.apache.org/jira/browse/SOLR-651.  I've got some of this
> coded up and hope to have a patch soon.
>
> Or, do you mean, is there a way to get the terms the MLT uses to generate
> the new query?
>
>
> On Aug 5, 2008, at 8:41 PM, Jack Tuhman wrote:
>
>  Hi all,
>>
>> is there a way to get key terms from an item?  If each item has an id, can
>> I
>> pass that ID to a search and get back the key terms like you can with the
>> mlt filter.
>>
>> Does this make sense?
>>
>> Jack
>>
>
>
>


Re: spellcheck collation

2008-08-14 Thread Ryan McKinley

have you updated recently?

isEnabled() was removed last night...


On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote:


I'd try, but the build is failing from (guessing) Ryan's last commit:

compile:
   [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core
   [javac] Compiling 337 source files to /Users/dsteiger/Desktop/java/solr/build/core
   [javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol

   [javac] symbol  : method isEnabled()
   [javac] location: class org.apache.solr.core.CoreContainer
   [javac]   multicore.isEnabled() ) {

Doug

On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote:

I believe I just fixed this on SOLR-606 (thanks to Stefan's  
patch).  Give it a try and let us know.


-Grant

On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote:

I've noticed a few things with the new spellcheck component that  
seem a little strange.


Here's my document:


5
wii blackberry blackjack creative labs zen  
ipod video nano



Some sample queries:

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true

When spellchecking 'blackberri wi', the collation returned is  
'blackberry wii'.  When spellchecking 'blackberr wi', the  
collation returned is 'blackberrywii'.  'blackber wi' returns  
'blackberrwiiwi'.


Doug








More files in index directory than expected

2008-08-14 Thread Chris Harris
It's my understanding that if my mergeFactor is 10, then there
shouldn't be more than 11 segments in my index directory (10 segments,
plus an additional segment if a merge is in progress). It would seem
to follow that there shouldn't be more than 11 fdt files, 11 tis
files, etc.. However, I'm looking at one of my indexes now, and this
doesn't seem to be the case. Here are the tis files for this index,
for instance:

07/22/2008  07:49 PM77,925,180 _1je.tis
07/23/2008  02:57 AM65,988,651 _256.tis
07/23/2008  04:18 AM13,159,578 _29t.tis
07/23/2008  05:08 AM10,146,941 _2cw.tis
07/23/2008  05:39 AM 6,749,665 _2el.tis
07/23/2008  06:24 AM12,274,012 _2he.tis
07/23/2008  07:01 AM14,069,531 _2kh.tis
07/23/2008  07:53 AM13,795,213 _2nu.tis
07/23/2008  08:20 AM 6,284,902 _2p0.tis
07/23/2008  08:27 AM 1,980,945 _2p9.tis
07/23/2008  08:36 AM 1,674,640 _2pk.tis
07/23/2008  08:37 AM   311,483 _2pl.tis
07/23/2008  08:38 AM   285,881 _2pm.tis
07/23/2008  08:39 AM   245,138 _2pn.tis
07/23/2008  08:40 AM   116,881 _2po.tis
07/17/2008  11:22 PM69,635,905 _rp.tis
07/18/2008  12:59 AM15,883,866 _xu.tis

There are 17 of these files. (File sizes are in bytes.) When I open up
the index in Luke, it says all of them are "In Use" and it doesn't
list any of them as "Deletable". This seems to rule out the
possibility that Solr/Lucene somehow "forget" to clean up files that
were no longer in use.

I'm noticing that _2pk, _2pl, _2pm, _2pn, _2po are sequential file
names, alphabetically speaking, and their last modified times are very
close to one another. Does this mean they're actually part of the same
segment, even though they are in separate files? If those files are
indeed part of a single segment, then the number of segments
represented by these files would really be 17-4=13. But that's still
more than the expected 11 segments.

I just discovered that one of my other indexes has over 11,000 tis
files. That's disturbing. I'm not sure if it would have the same
underlying cause.

Any ideas?
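One caveat on the premise: as I understand the merge policy, mergeFactor bounds the segments per size *tier*, not in total. Each tier can accumulate up to mergeFactor - 1 merged segments before they are merged into the next tier, so several tiers together can legitimately hold more than mergeFactor + 1 segments. A back-of-the-envelope model (my reading only, not authoritative; the exact bound depends on the merge policy):

```java
public class SegmentBound {
    // Rough model: with `levels` size tiers, each tier can hold up to
    // (mergeFactor - 1) settled segments, plus one segment currently
    // being written or merged.
    public static int maxSegments(int mergeFactor, int levels) {
        return (mergeFactor - 1) * levels + 1;
    }

    public static void main(String[] args) {
        // 10-way merges across 3 size tiers can leave ~28 live segments
        System.out.println(maxSegments(10, 3)); // 28
    }
}
```

That would make 17 .tis files unsurprising; 11,000 of them would not fit any such model and points at a real problem (e.g., files not being deleted).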


Re: spellcheck collation

2008-08-14 Thread Doug Steigerwald

I'd try, but the build is failing from (guessing) Ryan's last commit:

compile:
[mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core
[javac] Compiling 337 source files to /Users/dsteiger/Desktop/java/solr/build/core
[javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol

[javac] symbol  : method isEnabled()
[javac] location: class org.apache.solr.core.CoreContainer
[javac]   multicore.isEnabled() ) {

Doug

On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote:

I believe I just fixed this on SOLR-606 (thanks to Stefan's patch).   
Give it a try and let us know.


-Grant

On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote:

I've noticed a few things with the new spellcheck component that  
seem a little strange.


Here's my document:


5
wii blackberry blackjack creative labs zen ipod  
video nano



Some sample queries:

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true

When spellchecking 'blackberri wi', the collation returned is  
'blackberry wii'.  When spellchecking 'blackberr wi', the collation  
returned is 'blackberrywii'.  'blackber wi' returns 'blackberrwiiwi'.


Doug






Re: spellcheck collation

2008-08-14 Thread Grant Ingersoll
I believe I just fixed this on SOLR-606 (thanks to Stefan's patch).   
Give it a try and let us know.


-Grant

On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote:

I've noticed a few things with the new spellcheck component that  
seem a little strange.


Here's my document:


 5
 wii blackberry blackjack creative labs zen ipod  
video nano



Some sample queries:

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true

http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true

When spellchecking 'blackberri wi', the collation returned is  
'blackberry wii'.  When spellchecking 'blackberr wi', the collation  
returned is 'blackberrywii'.  'blackber wi' returns 'blackberrwiiwi'.


Doug





Duplicate Data Across Fields

2008-08-14 Thread wojtekpia

I have 2 fields which will sometimes contain the same data. When they do
contain the same data, am I paying the same performance cost as when they
contain unique data? I think the real question here is: does Lucene index
values per field, or per document?
-- 
View this message in context: 
http://www.nabble.com/Duplicate-Data-Across-Fields-tp18986515p18986515.html
Sent from the Solr - User mailing list archive at Nabble.com.
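For what it's worth, Lucene's term dictionary is keyed per field: a term is the pair (field, text), so identical text in two indexed fields produces two distinct dictionary entries and is analyzed and indexed twice. A toy model of that bookkeeping (the field names here are made up):

```java
import java.util.HashSet;
import java.util.Set;

public class TermModel {
    // A Lucene term is identified by (field, text); model it as "field:text".
    private final Set<String> dictionary = new HashSet<>();

    public void index(String field, String text) {
        for (String token : text.trim().split("\\s+")) {
            dictionary.add(field + ":" + token);
        }
    }

    public int uniqueTerms() {
        return dictionary.size();
    }

    public static void main(String[] args) {
        TermModel idx = new TermModel();
        idx.index("title", "solr search");  // hypothetical field names
        idx.index("body", "solr search");   // same text, different field
        System.out.println(idx.uniqueTerms()); // 4, not 2
    }
}
```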



Re: Synonyms help in 1.3-HEAD?

2008-08-14 Thread Matthew Runo
Thank you for your suggestion, I really don't see anything 'wrong'  
with the longer lists.. I entered https://issues.apache.org/jira/browse/SOLR-702 
 for this issue, and attached relevant files. If you need anything  
more, don't hesitate to contact me!


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833

On Aug 14, 2008, at 10:16 AM, Yonik Seeley wrote:


There should be no limit, so you may have uncovered a bug.  Could you
open a JIRA issue?  If it's a real bug, it should get fixed before
1.3.

-Yonik

On Thu, Aug 14, 2008 at 12:35 PM, Matthew Runo <[EMAIL PROTECTED]>  
wrote:

Hello folks!

Having a heck of a time trying to get a synonyms file to work  
properly. It
seems that something's wrong with the way it's been set up, but,  
honestly, I

can't see anything wrong with it. Some samples...

This works...
zutanoapparel => zutano

But this does not...
aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis,  
adaidas,
adaies, addedas, addedis, addidaas, addidads, addidais, addidas,  
addidascom,
addiddas, addides, addidis, adeadas, adedas, adeddas, adedias,  
adiada,
adiadas, adiadis, adiads, adida, adidaas, adidas1, adidass, adidaz,  
adidda,

adiddas, adiddias, adidias, adidis, adiidas, aditas, adudas, afidas,
aididas, wwwadidascom => adidas

This works...
liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian,  
lumiana,

lumianai, lumiari, luminani, lumini, luminiani => lumiani

But this does not...
clegerie, cleregie, clergerie, clergie, robertclaregie, robert  
claregie,

robertclargeries, robert clargeries, robertclegerie, robert clegerie,
robertcleregie, robert cleregie, robertclergeic, robert clergeic,
robertclergerie, robertclergi, robert clergi, robertclergie, robert  
clergie,

robertclergoe, robert clergoe, robertclerige, robert clerige,
robertclerterie, robert clerterie => Robert Clergerie

This is how they're set up in my schema..


Is there a limit to the number of terms in the list of options? It  
seems
that the ones that are shorter work, while the longer lists don't.  
I'm at a

loss as to why though..

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833








Re: Synonyms help in 1.3-HEAD?

2008-08-14 Thread Yonik Seeley
There should be no limit, so you may have uncovered a bug.  Could you
open a JIRA issue?  If it's a real bug, it should get fixed before
1.3.

-Yonik

On Thu, Aug 14, 2008 at 12:35 PM, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Hello folks!
>
> Having a heck of a time trying to get a synonyms file to work properly. It
> seems that something's wrong with the way it's been set up, but, honestly, I
> can't see anything wrong with it. Some samples...
>
> This works...
> zutanoapparel => zutano
>
> But this does not...
> aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis, adaidas,
> adaies, addedas, addedis, addidaas, addidads, addidais, addidas, addidascom,
> addiddas, addides, addidis, adeadas, adedas, adeddas, adedias, adiada,
> adiadas, adiadis, adiads, adida, adidaas, adidas1, adidass, adidaz, adidda,
> adiddas, adiddias, adidias, adidis, adiidas, aditas, adudas, afidas,
> aididas, wwwadidascom => adidas
>
> This works...
> liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian, lumiana,
> lumianai, lumiari, luminani, lumini, luminiani => lumiani
>
> But this does not...
> clegerie, cleregie, clergerie, clergie, robertclaregie, robert claregie,
> robertclargeries, robert clargeries, robertclegerie, robert clegerie,
> robertcleregie, robert cleregie, robertclergeic, robert clergeic,
> robertclergerie, robertclergi, robert clergi, robertclergie, robert clergie,
> robertclergoe, robert clergoe, robertclerige, robert clerige,
> robertclerterie, robert clerterie => Robert Clergerie
>
> This is how they're set up in my schema..
>  ignoreCase="true" expand="true"/>
>
> Is there a limit to the number of terms in the list of options? It seems
> that the ones that are shorter work, while the longer lists don't. I'm at a
> loss as to why though..
>
> Thanks for your time!
>
> Matthew Runo
> Software Engineer, Zappos.com
> [EMAIL PROTECTED] - 702-943-7833
>
>


Synonyms help in 1.3-HEAD?

2008-08-14 Thread Matthew Runo

Hello folks!

Having a heck of a time trying to get a synonyms file to work  
properly. It seems that something's wrong with the way it's been set  
up, but, honestly, I can't see anything wrong with it. Some samples...


This works...
zutanoapparel => zutano

But this does not...
aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis,  
adaidas, adaies, addedas, addedis, addidaas, addidads, addidais,  
addidas, addidascom, addiddas, addides, addidis, adeadas, adedas,  
adeddas, adedias, adiada, adiadas, adiadis, adiads, adida, adidaas,  
adidas1, adidass, adidaz, adidda, adiddas, adiddias, adidias, adidis,  
adiidas, aditas, adudas, afidas, aididas, wwwadidascom => adidas


This works...
liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian,  
lumiana, lumianai, lumiari, luminani, lumini, luminiani => lumiani


But this does not...
clegerie, cleregie, clergerie, clergie, robertclaregie, robert  
claregie, robertclargeries, robert clargeries, robertclegerie, robert  
clegerie, robertcleregie, robert cleregie, robertclergeic, robert  
clergeic, robertclergerie, robertclergi, robert clergi, robertclergie,  
robert clergie, robertclergoe, robert clergoe, robertclerige, robert  
clerige, robertclerterie, robert clerterie => Robert Clergerie


This is how they're set up in my schema..
ignoreCase="true" expand="true"/>


Is there a limit to the number of terms in the list of options? It  
seems that the ones that are shorter work, while the longer lists  
don't. I'm at a loss as to why though..
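One way to rule out file-format problems is to parse the mapping lines outside Solr. A simplified sketch of the rule format (comma-separated terms, optional '=>' separating inputs from replacements; this only approximates the real SynonymFilterFactory parsing):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SynonymLine {
    public final List<String> inputs;
    public final List<String> outputs;

    public SynonymLine(String line) {
        String[] sides = line.split("=>", 2);
        inputs = split(sides[0]);
        // With no "=>", the terms form an equivalence class (expansion form).
        outputs = sides.length > 1 ? split(sides[1]) : inputs;
    }

    private static List<String> split(String side) {
        return Arrays.stream(side.split(","))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SynonymLine s = new SynonymLine("zutanoapparel => zutano");
        System.out.println(s.inputs + " -> " + s.outputs);
        // [zutanoapparel] -> [zutano]
    }
}
```

Running each failing line through something like this at least confirms the term lists come back with the expected sizes before blaming the filter itself.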


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833



Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber



Erick Erickson wrote:

I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
your MaxFieldLength? By default only the first 10,000 tokens are added
to a field per document. If you haven't set this higher, that could account
for it.


We set it to a very large number so we index the entire document.



As far as I know, optimization shouldn't really affect the index size if you
are not deleting documents, but I'm no expert in that area.

I've indexed OCR data and it's no fun for the reasons you cite. We had
better results searching if we cleaned the data at index time. By "cleaning"
I mean we took out all of the characters that *couldn't* be indexed. What
*can't* be indexed depends upon your requirements, but in our case we
could just use the low ASCII characters by folding all the accented
characters
into their low-ascii counterparts because we had no need for native-
language support. And we also replaced most non-printing characters
with spaces. A legitimate question is whether indexing single characters
makes sense (in our case, genealogy, it actually does. Sggghhh)


Fortunately, non-printing characters are not a problem, but we need native 
language query support, so limiting to US-ASCII will not work for us. 
One possibility is to identify the dominant language in the document and 
use dictionaries to remove junk; however, proper names are a big problem 
with that approach.  Another might be to use heuristics like removing 
"words" with numbers in the middle of them.  Whatever we do will have to 
be fast.




In a mixed-language environment, this provided surprisingly good results
given how crude the transformations were. Of course it's totally
unacceptable to so mangle non-English text this crudely if you must
support native-language searching.


yes.



I'd be interested in how this changes your index size if you do decide
to try it. There's nothing like having somebody else do research for
me.


Best
Erick

On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:


We're indexing the ocr for a large number of books.  Our experimental
schema is simple and id field and an ocr text field (not stored).

Currently we just have two data points:

3005 documents = 723 MB index
174237 documents = 51460 MB index

These indexes are not optimized.

If the index size were a linear function of number of documents, based on
just these two data points, you'd expect the index for 174237 docs to be
approximately 57.98 times larger than 723 MB, or about 41921 MB. Actually
it's 51460 or about 22% bigger.

I suspect the non-linear increase is due to dirty ocr that continually
increases the number of unique words that need to be indexed.

Another possibility is that the larger index has a higher proportion of
documents containing characters from non-Latin alphabets thereby increasing
the number of unique words.  I can't verify that at this point.

Are these reasonable assumptions or am I missing other factors that could
contribute to the non-linear growth in index size?

Regards,

Phil
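One way to test the dirty-OCR hypothesis is to track vocabulary growth directly: Heaps' law predicts unique terms V ≈ K·n^β over n tokens, with β around 0.4-0.6 for clean text and drifting toward 1 for noisy OCR. A sketch that fits β from two measured checkpoints (the numbers below are purely illustrative, not from this index):

```java
public class HeapsLaw {
    // Heaps' law: vocabulary V ~= K * n^beta for n tokens.
    // Fit beta from two measured (tokens, uniqueTerms) points.
    public static double fitBeta(double n1, double v1, double n2, double v2) {
        return Math.log(v2 / v1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Illustrative numbers only -- measure these from your own index,
        // e.g. via Luke's term counts at two checkpoints.
        double beta = fitBeta(1e6, 50_000, 1e8, 1_200_000);
        System.out.printf("beta = %.2f%n", beta);
        // A value creeping toward 1.0 would support the dirty-OCR theory.
    }
}
```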






RE: List of available facet fields returned with the query results

2008-08-14 Thread Barry Harding
Hi Shalin,

As there is certainly the potential for several thousand different
attribute types across all of our categories, I guess I will have to
manage them myself (I was hoping for a short-cut or that I was missing
a trick), but no problem. Solr still seems to outperform the commercial
package we are using.

Thanks

barry

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: 14 August 2008 16:06
To: solr-user@lucene.apache.org
Subject: Re: List of available facet fields returned with the query
results

Hi Barry,

If each category has an exclusive set of fields on which you want to
facet, then you can simply facet on all facet-able fields (across all
categories). The ones which are not present for the selected category
will
show up with zero facets which your front-end can suppress. However if
the
total number of such fields are very large, you may be better off
managing
the mappings yourself for performance reasons. But, as always, first
measure
then optimize :)


On Thu, Aug 14, 2008 at 7:12 PM, Barry Harding
<[EMAIL PROTECTED]>wrote:

> Hi,
> I have Solr set up to index technical data for a number of different
> types of products, and this means that different products have
> different
> facet fields available.
>
> For example here would be a small example of the sort of data we are
> indexing, in reality there are between 10 and 20 facet fields per
> product dependent upon its category, but a user could perform a search
> across more than one category.
>
> A typical search would be:
>
> Stage 1
> 1) User types a keyword
> 2) The matched categories from that search are displayed
> 3) User chooses a category
> 4) All results that match that category and keyword are displayed and
> it's at this point I would like to display the available facets and
> values.
>
>
>
>
> Name: Microsoft Optical Mouse for PC
> DPI: 1200
> Type: Laser
> Category: PC
> Price: 45.00
>
> Name: Eee PC
> Storage Type: Flash
> Screen Size: 17
> Category: Netbook
> Price: 200.00
>
> And so on.
>
> So if I search for "PC" and then category "Netbook" I would like solr
to
> be able to tell me what facet fields are available without me
resorting
> to a database to store which facet fields are available to search for
> which products, is there any way to get SOLR to return as part of the
> results a list of the facets available for the current search results
or
> even better could I get it to automatically do a facet query for each
of
> those fields to allow drill-down queries.
>
> The current commercial tool we use that we are hoping solr can replace
> is called "FactFinder" and does exactly this, but I do have to have
> drilled down a number of times before this occurs, to stop the search
> attempting to return facets for every item in the index.
>
> I suspect I am missing a trick here or making this more complicated
than
> needed, any help or ideas much appreciated.
>
>
> Thanks
>
> Barry H
>
>

> Misco is a division of Systemax Europe Ltd.  Registered in Scotland
Number
> 114143.  Registered Office: Caledonian Exchange, 19a Canning Street,
> Edinburgh EH3 8EG.  Telephone +44 (0)1933 686000.
>



-- 
Regards,
Shalin Shekhar Mangar.




RE: Best way to index without diacritics

2008-08-14 Thread Steven A Rowe
Hi Norberto,

On 08/14/2008 at 8:10 AM, Norberto Meijome wrote:
> > On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote:
> > 
> > > Hi Norberto,
> > > 
> > > https://issues.apache.org/jira/browse/LUCENE-1343
> 
> hi Steve,
> thanks for the pointer. this is a Lucene entry... I thought the 
> Latin-filter was a SOLR feature? I, for one, definitely meant a SOLR filter.

A fair portion of Solr is a set of wrappers over Lucene functionality.  
ISOLatin1FilterFactory, for example, wraps Lucene's ISOLatin1AccentFilter.  
Here is the entirety of the Solr code:

public class ISOLatin1AccentFilterFactory extends BaseTokenFilterFactory {
  public ISOLatin1AccentFilter create(TokenStream input) {
    return new ISOLatin1AccentFilter(input);
  }
}

Of course, BaseTokenFilterFactory brings more to the party, but my point is 
that adding Lucene filters to Solr is generally a trivial exercise - a Solr 
...FilterFactory around LUCENE-1343 would not be much longer than the four 
lines listed above, since the configuration aspects are already handled by 
BaseTokenFilterFactory.
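For reference, the language-blind part of such a filter can be sketched in plain Java with java.text.Normalizer: decompose to NFD, then strip the combining marks. This is only an illustration, not ISOLatin1AccentFilter's actual implementation, and per Walter's earlier point it is wrong for German ('ö' should become 'oe') and cannot touch 'ß', which has no decomposition:

```java
import java.text.Normalizer;

public class AccentFolder {
    // Decompose accented characters (e.g. 'é' -> 'e' + combining acute),
    // then drop the combining marks. Deliberately language-blind.
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Mötley Crüe")); // Motley Crue
        System.out.println(fold("straße"));      // straße -- 'ß' is unchanged
    }
}
```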

> Given what Walter rightly pointed out about differences in language, I suspect
> it would be a SOLR-level thing - fieldType name="textDE" language="DE" would
> apply the filter of unicode chars to {ascii?} with the appropriate mapping
> for German, etc.
> 
> Or is this that Lucene would / should take care of ?

The kind of filter Walter is talking about - a generalized language-aware 
character normalization Solr/Lucene filter - does not yet exist.  My guess is 
that if/when it does materialize, both the Solr and the Lucene projects will 
want to have it.  Historically, most functionality shared by Solr and Lucene is 
eventually hosted by Lucene, since Solr has a Lucene dependency, but not 
vice-versa.

So, yes, Solr would be responsible for hosting configuration for such a filter, 
but acting on that configuration would be Lucene's responsibility, assuming 
that Lucene would (eventually) host the filter and Solr would host a factory 
over the filter.

Steve


Re: List of available facet fields returned with the query results

2008-08-14 Thread Shalin Shekhar Mangar
Hi Barry,

If each category has an exclusive set of fields on which you want to facet,
then you can simply facet on all facet-able fields (across all
categories). The ones which are not present for the selected category will
show up with zero facets which your front-end can suppress. However if the
total number of such fields are very large, you may be better off managing
the mappings yourself for performance reasons. But, as always, first measure
then optimize :)
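The front-end suppression step is a one-liner over the returned counts; a sketch (the facet values below are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetFilter {
    // Drop zero-count facet values before rendering, preserving order.
    public static Map<String, Integer> suppressZeros(Map<String, Integer> counts) {
        Map<String, Integer> out = new LinkedHashMap<>();
        counts.forEach((value, count) -> {
            if (count > 0) out.put(value, count);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Netbook", 12);  // hypothetical facet values
        counts.put("DPI", 0);
        System.out.println(suppressZeros(counts)); // {Netbook=12}
    }
}
```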


On Thu, Aug 14, 2008 at 7:12 PM, Barry Harding <[EMAIL PROTECTED]>wrote:

> Hi,
> I have Solr set up to index technical data for a number of different
> types of products, and this means that different products have different
> facet fields available.
>
> For example here would be a small example of the sort of data we are
> indexing, in reality there are between 10 and 20 facet fields per
> product dependent upon its category, but a user could perform a search
> across more than one category.
>
> A typical search would be:
>
> Stage 1
> 1) User types a keyword
> 2) The matched categories from that search are displayed
> 3) User chooses a category
> 4) All results that match that category and keyword are displayed and
> it's at this point I would like to display the available facets and
> values.
>
>
>
>
> Name: Microsoft Optical Mouse for PC
> DPI: 1200
> Type: Laser
> Category: PC
> Price: 45.00
>
> Name: Eee PC
> Storage Type: Flash
> Screen Size: 17
> Category: Netbook
> Price: 200.00
>
> And so on.
>
> So if I search for "PC" and then category "Netbook" I would like solr to
> be able to tell me what facet fields are available without me resorting
> to a database to store which facet fields are available to search for
> which products, is there any way to get SOLR to return as part of the
> results a list of the facets available for the current search results or
> even better could I get it to automatically do a facet query for each of
> those fields to allow drill-down queries.
>
> The current commercial tool we use that we are hoping solr can replace
> is called "FactFinder" and does exactly this, but I do have to have
> drilled down a number of times before this occurs, to stop the search
> attempting to return facets for every item in the index.
>
> I suspect I am missing a trick here or making this more complicated than
> needed, any help or ideas much appreciated.
>
>
> Thanks
>
> Barry H
>
> 
>



-- 
Regards,
Shalin Shekhar Mangar.


RE: Exception during Solr startup

2008-08-14 Thread Kashyap, Raghu
Hi Yonik & Erik,

  Thanks to both of you. It seems like our container had some issues and
was causing this problem. 

Thanks,
Raghu

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, August 13, 2008 10:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Exception during Solr startup

On Wed, Aug 13, 2008 at 10:55 AM, Kashyap, Raghu
<[EMAIL PROTECTED]> wrote:
> SEVERE: java.lang.UnsupportedClassVersionError: Bad version number in
> .class file

This is normally a mismatch between java compiler and runtime (like
using Java6 to compile and Java5 to try and run).

-Yonik
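For reference, "Bad version number" means the class file's major version (bytes 6-7 of the file) is newer than the running JVM supports. The mapping for the relevant releases is easy to table up (a quick lookup sketch):

```java
import java.util.Map;

public class ClassFileVersion {
    // Class-file major version -> minimum JRE that can load it.
    private static final Map<Integer, String> MAJOR_TO_JRE = Map.of(
        46, "Java 1.2",
        47, "Java 1.3",
        48, "Java 1.4",
        49, "Java 5",
        50, "Java 6"
    );

    public static String requiredJre(int major) {
        return MAJOR_TO_JRE.getOrDefault(major, "unknown");
    }

    public static void main(String[] args) {
        // A class compiled with -target 1.6 has major version 50 and fails
        // to load on a Java 5 runtime with UnsupportedClassVersionError.
        System.out.println(requiredJre(50)); // Java 6
    }
}
```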


Re: Administrative questions

2008-08-14 Thread Jason Rennie
On Wed, Aug 13, 2008 at 1:52 PM, Jon Drukman <[EMAIL PROTECTED]> wrote:

> Duh.  I should have thought of that.  I'm a big fan of djbdns so I'm quite
> familiar with daemontools.
>
> Thanks!
>

:)  My pleasure.  Was nice to hear recently that DJB is moving toward more
flexible licensing terms.  For anyone unfamiliar w/ daemontools, here's
DJB's explanation of why they rock compared to inittab, ttys, init.d, and
rc.local:

http://cr.yp.to/daemontools/faq/create.html#why

Jason


List of available facet fields returned with the query results

2008-08-14 Thread Barry Harding
Hi,
I have Solr set up to index technical data for a number of different
types of products, and this means that different products have different
facet fields available.

For example here would be a small example of the sort of data we are
indexing, in reality there are between 10 and 20 facet fields per
product dependent upon its category, but a user could perform a search
across more than one category.

A typical search would be:

Stage 1
1) User types a keyword
2) The matched categories from that search are displayed
3) User chooses a category
4) All results that match that category and keyword are displayed and
it's at this point I would like to display the available facets and
values.




Name: Microsoft Optical Mouse for PC
DPI: 1200
Type: Laser
Category: PC
Price: 45.00

Name: Eee PC
Storage Type: Flash
Screen Size: 17
Category: Netbook
Price: 200.00

And so on.

So if I search for "PC" and then category "Netbook", I would like Solr
to tell me which facet fields are available, without my resorting to a
database to store which facet fields apply to which products. Is there
any way to get Solr to return, as part of the results, a list of the
facets available for the current search results? Or, even better,
could I get it to automatically do a facet query on each of those
fields to allow drill-down queries?

The current commercial tool we are hoping Solr can replace, called
"FactFinder", does exactly this, though I do have to drill down a
number of times before it kicks in, to stop the search from attempting
to return facets for every item in the index.

I suspect I am missing a trick here or making this more complicated
than needed; any help or ideas much appreciated.


Thanks

Barry H


Misco is a division of Systemax Europe Ltd.  Registered in Scotland Number 
114143.  Registered Office: Caledonian Exchange, 19a Canning Street, Edinburgh 
EH3 8EG.  Telephone +44 (0)1933 686000.
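
[Editorial note: Solr 1.3 has no built-in per-result-set facet-field discovery; the client supplies the facet.field parameters itself, and the Luke request handler (/admin/luke?numTerms=0) can report which fields actually exist in the index if you want to discover them dynamically. A sketch of the drill-down request once the category is known — the host, port, and underscored field names here are assumptions based on the example data above:

http://localhost:8983/solr/select?q=PC
    &fq=Category:Netbook
    &facet=true
    &facet.mincount=1
    &facet.field=Storage_Type
    &facet.field=Screen_Size

With facet.mincount=1, only facet values that actually occur in the filtered result set come back, which approximates the "show only applicable facets" behaviour described.]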


Re: Best way to index without diacritics

2008-08-14 Thread Norberto Meijome
( 2 in 1 reply) 
On Wed, 13 Aug 2008 09:59:21 -0700
Walter Underwood <[EMAIL PROTECTED]> wrote:

> Stripping accents doesn't quite work. The correct translation
> is language-dependent. In German, o-dieresis should turn into
> "oe", but in English, it should be "o" (as in "coöperate" or
> "Mötley Crüe"). In Swedish, it should not be converted at all.

Hi Walter,
understood. This goes back to the question of language-specific field
definitions / parsers... more on this below.

> 
> There are other character-to-string conversions: ae-ligature
> to "ae", "ß" to "ss", and so on. Luckily, those are independent
> of language.
> 
> wunder
> 
> On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote:
> 
> > Hi Norberto,
> > 
> > https://issues.apache.org/jira/browse/LUCENE-1343

hi Steve,
thanks for the pointer. this is a Lucene entry... I thought the Latin-filter
was a SOLR feature? I, for one, definitely meant a SOLR filter. 

Given what Walter rightly pointed out about differences in language, I suspect
it would be a SOLR-level thing - fieldType name="textDE" language="DE" would
apply the filter of unicode chars to {ascii?} with the appropriate mapping for
German, etc. 

Or is this that Lucene would / should take care of ?

B
_
{Beto|Norberto|Numard} Meijome

"I've dirtied my hands writing poetry, for the sake of seduction; that is,  for
the sake of a useful cause." Dostoevsky

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
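
[Editorial note: the Solr-side filter this thread refers to is solr.ISOLatin1AccentFilterFactory. A sketch of a schema.xml fieldType wiring it in — the fieldType name and tokenizer choice here are illustrative, and per Walter's point this does the blunt language-independent folding (ö → o), not a per-language mapping:

```xml
<fieldType name="textFolded" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Folds ISO Latin-1 accented characters to their unaccented forms -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A language-aware scheme like the proposed textDE would need a different filter (or mapping) per language rather than this single factory.]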


Re: Spellcheker and Dismax both

2008-08-14 Thread Norberto Meijome
On Thu, 14 Aug 2008 12:21:13 +0530
"Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote:

> The SpellCheckerRequestHandler is now deprecated with Solr 1.3 and it has
> been replaced by SpellCheckComponent.
> 
> http://wiki.apache.org/solr/SpellCheckComponent


which works quite well with dismax.
B

_
{Beto|Norberto|Numard} Meijome

Never attribute to malice what can adequately be explained by incompetence.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
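
[Editorial note: a sketch of the Solr 1.3 wiring for SpellCheckComponent alongside dismax, as described on the wiki page above — the handler name, dictionary field, and qf fields here are assumptions:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- Field the dictionary is built from (assumed to exist in the schema) -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name description</str>
    <str name="spellcheck">true</str>
  </lst>
  <!-- Runs the spellcheck component after the dismax query -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Because the component runs as a last-component of a plain SearchHandler, the same dismax query parameters apply and suggestions come back in the same response.]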