RE: search with wildcard

2013-11-21 Thread Scott Schneider
I know it's documented that Lucene/Solr doesn't apply filters to queries with 
wildcards, but this seems to trip up a lot of users.  I can also see why 
wildcards break a number of filters, but a number of filters (e.g. mapping 
charsets) could mostly or entirely work.  The N-gram filter is another one that 
would be great to still run when there are wildcards.  If you indexed 4-grams and 
the query is *testp*, you currently won't get any results; but the N-gram 
filter could have a wildcard mode that, in this case, would return just the 
first 4-gram as a token.
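
Roughly what I have in mind, as a sketch (hypothetical helper, not real Lucene 
code): for the pattern body "testp" and 4-grams, the filter would emit just 
"test":

    // For an infix pattern like *testp* over indexed 4-grams, any
    // matching document must also contain the first 4-gram of the
    // pattern body.
    static String firstGram(String patternBody, int gramSize) {
      if (patternBody.length() < gramSize) {
        throw new IllegalArgumentException("pattern shorter than gram size");
      }
      return patternBody.substring(0, gramSize);
    }

(Matching on a single gram admits false positives -- documents containing 
"test" but not "testp" -- so the results would be a superset, but that beats 
returning nothing.)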

Is this something you've considered?  It would have to be enabled in the core 
code, but disabled by default for existing filters; then it could be enabled 
one-by-one for existing filters.  Apologies if the dev list is a better place for 
this.

Scott


 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: Thursday, November 21, 2013 8:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: search with wildcard
 
 Hi Andreas,
 
 If you don't want to use wildcards at query time, an alternative is to
 use NGrams at indexing time. This will produce a lot of tokens. For
 example, the 4-grams of Supertestplan are: supe uper pert erte rtes
 *test* estp stpl tpla plan
 
 
 Is that what you want? By the way, why do you want to search inside of words?
 
 <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>
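 
 A fuller sketch of such a field type (name hypothetical), applying the
 NGrams only at index time so that plain query terms are matched against
 the stored grams:
 
 <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 
 Note that a query term longer than maxGramSize will not match anything
 this way; either the grams must go up to the longest expected query
 term, or query terms must be cut down to gram size.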
 
 
 
 
 On Thursday, November 21, 2013 5:23 PM, Andreas Owen a...@conx.ch
 wrote:
 
 I suppose I have to create another field with different tokenizers and
 set the boost very low so it doesn't really mess with my ranking,
 because the word is now in 2 fields. What kind of tokenizer can do the
 job?
 
 
 
 From: Andreas Owen [mailto:a...@conx.ch]
 Sent: Donnerstag, 21. November 2013 16:13
 To: solr-user@lucene.apache.org
 Subject: search with wildcard
 
 
 
 I am querying "test" in Solr 4.3.1 over the field below and it's not
 finding all occurrences. It seems that if it is a substring of a word
 like "Supertestplan" it isn't found unless I use wildcards: *test*.
 This is right because of my tokenizer, but does someone know a way
 around this? I don't want to add wildcards because that messes up
 queries with multiple words.
 
 
 
 <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_de.txt" format="snowball"
             enablePositionIncrements="true"/> <!-- remove common words -->
     <filter class="solr.GermanNormalizationFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
     <!-- remove noun/adjective inflections like plural endings -->
   </analyzer>
 </fieldType>


RE: fq efficiency

2013-11-07 Thread Scott Schneider
Thanks, that link is very helpful, especially the section "Leapfrog, anyone?"  
This actually seems quite slow for my use case.  Suppose we have 10,000 users 
and 1,000,000 documents.  We search for "hello" for a particular user and let's 
assume that the fq set for the user is cached.  "hello" is a common word and 
perhaps 10,000 documents will match.  If the user has 100 documents, then 
finding the intersection requires checking each list ~100 times.  If the user 
has 1,000 documents, we check each list ~1,000 times.  That doesn't scale well.

My searches are usually in one user's data.  How can I take advantage of that?  
I could have a separate index for each user, but loading so many indexes at 
once seems infeasible; and dynamically loading & unloading indexes is a pain.

Or I could create a filter that takes tokens and prepends them with the user 
id.  That seems like a good solution, since my keyword searches always include 
a user id (and usually just 1 user id).  Though I wonder if there is a downside 
I haven't thought of.
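
As a sketch, the filter itself would be trivial (Lucene TokenFilter API; the 
delimiter and the plumbing for getting the user id into the analyzer are 
hand-waved here):

    // Rough sketch: prepends a per-user prefix to every token, so that
    // "hello" for user 42 is indexed as "42|hello".
    public final class UserPrefixFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final String prefix;

      public UserPrefixFilter(TokenStream input, String userId) {
        super(input);
        this.prefix = userId + "|";
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        termAtt.setEmpty().append(prefix).append(term);
        return true;
      }
    }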

Thanks,
Scott


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Tuesday, November 05, 2013 4:35 PM
 To: solr-user@lucene.apache.org
 Subject: Re: fq efficiency
 
 On 11/5/2013 3:36 PM, Scott Schneider wrote:
  I'm wondering if filter queries are efficient enough for my use
 cases.  I have lots and lots of users in a big, multi-tenant, sharded
 index.  To run a search, I can use an fq on the user id and pass in the
 search terms.  Does this scale well with the # users?  I suppose that,
 since user id is indexed, generating the filter data (which is cached)
 will be fast.  And looking up search terms is fast, of course.  But if
 the search term is a common one that many users have in their
 documents, then Solr may have to perform an intersection between two
 large sets:  docs from all users with the search term and all of the
 current user's docs.
 
  Also, how about auto-complete and searching with a trailing wildcard?
 As I understand it, these work well in a single-tenant index because
 keywords are sorted in the index, so it's easy to get all the search
 terms that match "foo*".  In a multi-tenant index, all users' keywords
 are stored together.  So if Lucene were to look at all the keywords
 from "foo" to "fooz" (I'm not sure if it actually does this), it
 would skip over a large majority of keywords that don't belong to this
 user.
 
  From what I understand, there's not really a whole lot of difference
 between queries and filter queries when they are NOT cached, except
 that
 the main query and the filter queries are executed in parallel, which
 can save time.
 
 When filter queries are found in the filterCache, it's a different
 story.  They get applied *before* the main query, which means that the
 main query won't have to work as hard.  The filterCache stores
 information about which documents in the entire index match the filter.
 By storing it as a bitset, the amount of space required is relatively
 low.  Applying filterCache results is very efficient.
 
 There are also advanced techniques, like assigning a cost to each
 filter
 and creating postfilters:
 
 http://yonik.com/posts/advanced-filter-caching-in-solr/
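 
 For example, the local-param syntax from that post looks like this
 (field names and values illustrative):
 
   fq={!cache=false cost=50}user_id:12345
   fq={!frange l=1 u=10 cache=false cost=150}log(popularity)
 
 With cache=false, filters are applied in order of increasing cost, and
 a cost of 100 or more makes a filter run as a postfilter after the
 main query, for query types that support it (frange does).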
 
 Thanks,
 Shawn



RE: fq efficiency

2013-11-07 Thread Scott Schneider
Digging a bit more, I think I have answered my own questions.  Can someone 
please say if this sounds right?

http://wiki.apache.org/solr/LotsOfCores looks like a pretty good solution.  If 
I give each user his own shard, each query can be run in only one shard.  The 
effect of the filter query will basically be to find that shard.  The 
requirements listed on the wiki suggest that performance will be good.  But in 
Solr 3.x, this won't scale with the # users/shards.

Prepending a user id to indexed keywords using an analyzer will break wildcard 
search.  If there is a wildcard, the query analyzer doesn't run filters, so it 
won't prepend the user id.  I could prepend the user id myself before calling 
Solr, but that seems... bad.

Scott



 -Original Message-
 From: Scott Schneider [mailto:scott_schnei...@symantec.com]
 Sent: Thursday, November 07, 2013 2:03 PM
 To: solr-user@lucene.apache.org
 Subject: RE: fq efficiency
 
 Thanks, that link is very helpful, especially the section "Leapfrog,
 anyone?"  This actually seems quite slow for my use case.  Suppose we
 have 10,000 users and 1,000,000 documents.  We search for "hello" for a
 particular user and let's assume that the fq set for the user is
 cached.  "hello" is a common word and perhaps 10,000 documents will
 match.  If the user has 100 documents, then finding the intersection
 requires checking each list ~100 times.  If the user has 1,000
 documents, we check each list ~1,000 times.  That doesn't scale well.
 
 My searches are usually in one user's data.  How can I take advantage
 of that?  I could have a separate index for each user, but loading so
 many indexes at once seems infeasible; and dynamically loading &
 unloading indexes is a pain.
 
 Or I could create a filter that takes tokens and prepends them with the
 user id.  That seems like a good solution, since my keyword searches
 always include a user id (and usually just 1 user id).  Though I wonder
 if there is a downside I haven't thought of.
 
 Thanks,
 Scott
 
 
  -Original Message-
  From: Shawn Heisey [mailto:s...@elyograg.org]
  Sent: Tuesday, November 05, 2013 4:35 PM
  To: solr-user@lucene.apache.org
  Subject: Re: fq efficiency
 
  On 11/5/2013 3:36 PM, Scott Schneider wrote:
   I'm wondering if filter queries are efficient enough for my use
  cases.  I have lots and lots of users in a big, multi-tenant, sharded
  index.  To run a search, I can use an fq on the user id and pass in
 the
  search terms.  Does this scale well with the # users?  I suppose
 that,
  since user id is indexed, generating the filter data (which is
 cached)
  will be fast.  And looking up search terms is fast, of course.  But
 if
  the search term is a common one that many users have in their
  documents, then Solr may have to perform an intersection between two
  large sets:  docs from all users with the search term and all of the
  current user's docs.
  
   Also, how about auto-complete and searching with a trailing
 wildcard?
  As I understand it, these work well in a single-tenant index because
  keywords are sorted in the index, so it's easy to get all the search
  terms that match "foo*".  In a multi-tenant index, all users'
 keywords
  are stored together.  So if Lucene were to look at all the keywords
  from "foo" to "fooz" (I'm not sure if it actually does this), it
  would skip over a large majority of keywords that don't belong to
 this
  user.
 
   From what I understand, there's not really a whole lot of difference
  between queries and filter queries when they are NOT cached, except
  that
  the main query and the filter queries are executed in parallel, which
  can save time.
 
  When filter queries are found in the filterCache, it's a different
  story.  They get applied *before* the main query, which means that
 the
  main query won't have to work as hard.  The filterCache stores
  information about which documents in the entire index match the
 filter.
  By storing it as a bitset, the amount of space required is relatively
  low.  Applying filterCache results is very efficient.
 
  There are also advanced techniques, like assigning a cost to each
  filter
  and creating postfilters:
 
  http://yonik.com/posts/advanced-filter-caching-in-solr/
 
  Thanks,
  Shawn



fq efficiency

2013-11-05 Thread Scott Schneider
Hi all,

I'm wondering if filter queries are efficient enough for my use cases.  I have 
lots and lots of users in a big, multi-tenant, sharded index.  To run a search, 
I can use an fq on the user id and pass in the search terms.  Does this scale 
well with the # users?  I suppose that, since user id is indexed, generating 
the filter data (which is cached) will be fast.  And looking up search terms is 
fast, of course.  But if the search term is a common one that many users have 
in their documents, then Solr may have to perform an intersection between two 
large sets:  docs from all users with the search term and all of the current 
user's docs.

Also, how about auto-complete and searching with a trailing wildcard?  As I 
understand it, these work well in a single-tenant index because keywords are 
sorted in the index, so it's easy to get all the search terms that match 
"foo*".  In a multi-tenant index, all users' keywords are stored together.  So 
if Lucene were to look at all the keywords from "foo" to "fooz" (I'm not 
sure if it actually does this), it would skip over a large majority of keywords 
that don't belong to this user.

Thanks,
Scott



RE: Problem loading my codec sometimes

2013-09-26 Thread Scott Schneider
Ok, I created SOLR-5278.  Thanks again!

Scott


 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Wednesday, September 25, 2013 10:15 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Problem loading my codec sometimes
 
 
 : Ah, I fixed it.  I wasn't properly including the
 : org.apache.lucene.codecs.Codec file in my jar.  I wasn't sure if it
 was
 : necessary in Solr, since I specify my factory in solrconfig.xml.  I
 : think that's why I could create a new index, but not load an existing
 : one.
 
 Ah interesting.
 
 yes, you definitely need the SPI registration in the jar file so that
 it can resolve codec files found on disk when opening them -- the
 configuration in solrconfig.xml tells Solr which codec to use when
 writing new segments, but it must respect the codec information in
 segments found on disk when opening them (that's how the index
 back-compat works), and those are looked up via SPI.
 
 Can you do me a favor please and still file an issue with these
 details.  The attachments I asked about before would still be handy,
 but probably not necessary -- at a minimum could you show us the jar tf
 output of your plugin jar when you were having the problem.
 
 Even if the codec factory code can find the configured codec on
 startup, we should probably throw a very loud error right away if that
 same codec can't be found by name using SPI, to prevent people from
 running into confusing problems when making mistakes like this.
 
 
 
 -Hoss


RE: Problem loading my codec sometimes

2013-09-25 Thread Scott Schneider
Thanks for your quick response!  My jar was in solr/lib.  I removed all the 
<lib> directives from solrconfig.xml, but I still get the error.  My solr.xml 
doesn't have sharedLib.

By the way, I am running Solr 4.4.0 with most of the default example files 
(including solr.xml).  My schema.xml and solrconfig.xml are from another 
project using Solr 3.6.  I modified them a bit to fix any obvious errors.

I still wonder why it can create a new index using my codec, but not load an 
index previously created with my codec.  In solrconfig.xml, I specify the 
CodecFactory along with the package name, whereas the codec name that is read 
from the index file has no package name.  Could that be the problem?  I think 
that's the way it's supposed to be.  Could it be that Solr has my jar in the 
classpath, but SPI is not registering my codec class from the jar?  I'm not 
familiar with SPI.

What else can I try?

Thanks,
Scott


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Tuesday, September 24, 2013 5:51 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem loading my codec sometimes
 
 On 9/24/2013 6:32 PM, Scott Schneider wrote:
  I created my own codec and Solr can find it sometimes and not other
 times.  When I start fresh (delete the data folder and run Solr), it
 all works fine.  I can add data and query it.  When I stop Solr and
 start it again, I get:
 
  Caused by: java.lang.IllegalArgumentException: A SPI class of type
 org.apache.lucene.codecs.Codec with name 'MyCodec' does not exist. You
 need to add the corresponding JAR file supporting this SPI to your
 classpath. The current classpath supports the following names:
 [SimpleText, Appending, Lucene40, Lucene3x, Lucene41, Lucene42]
 
  I added the JAR to the path and I'm pretty sure Java sees it, or else
 it would not be using my codec when I start fresh.  (I've looked at the
 index files and verified that it's using my codec.)  I suppose Solr is
 asking SPI for my codec based on the codec class name stored in the
 index files, but I don't see why this would fail when a fresh start
 works.
 
 What I always recommend for those who want to use custom and contrib
 jars is that they put all such jars (and their dependencies) into
 ${solr.solr.home}/lib, don't use any <lib> directives in
 solrconfig.xml,
 and don't put the sharedLib attribute into solr.xml.  Doing it in any
 other way has a tendency to trigger bugs or causes jars to get loaded
 more than once.
 
 The ${solr.solr.home} property defaults to $CWD/solr (CWD is current
 working directory for those who don't already know) and is the location
 of the solr.xml file.  Note that depending on the exact version of Solr
 and which servlet container you are using, there may actually be two
 solr.xml files, one which loads solr into your container and one that
 configures Solr.  I am referring to the latter.
 
 If you are using the solr example and its directory layout, the
 directory you would need to put all jars into is example/solr/lib ...
 which is a directory that doesn't exist and has to be created.
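 
 With the stock example layout, that would look like this (jar name
 hypothetical):
 
   example/solr/solr.xml
   example/solr/lib/my-codec-plugin.jar
   example/solr/collection1/conf/solrconfig.xml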
 
 http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29
 http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond
 
 Thanks,
 Shawn



RE: Problem loading my codec sometimes

2013-09-25 Thread Scott Schneider
Ah, I fixed it.  I wasn't properly including the org.apache.lucene.codecs.Codec 
file in my jar.  I wasn't sure if it was necessary in Solr, since I specify my 
factory in solrconfig.xml.  I think that's why I could create a new index, but 
not load an existing one.
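
For anyone else who hits this: the missing piece is a plain-text SPI 
registration file inside the jar,

  META-INF/services/org.apache.lucene.codecs.Codec

whose content is one line with the codec's fully-qualified class name (the 
name below is just an example):

  com.example.MyCodec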

Scott


 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Wednesday, September 25, 2013 9:49 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Problem loading my codec sometimes
 
 
 : I still wonder why it can create a new index using my codec, but not
 : load an index previously created with my codec.  In solrconfig.xml, I
 : specify the CodecFactory along with the package name, whereas the
 codec
 : name that is read from the index file has no package name.  Could
 that
 : be the problem?  I think that's the way it's supposed to be.  Could
 it
 : be that Solr has my jar in the classpath, but SPI is not registering
 my
 : codec class from the jar?  I'm not familiar with SPI.
 
 it's very possible that there is a classloader / SPI runtime race
 condition in looking up the codec names found in segment files.  This
 sort
 of classpath related runtime issue is extremely hard to write tests
 for.
 
 Could you please file a bug and include...
 
  * the source of your codec (or a simple sample codec that you can
also use to reproduce the problem)
  * a zipped up copy of your entire solr home directory, including
the jar file containing your codec so we can verify the SPI files
are in there properly
 - no need to include an actual index here
  * some simple sample documents in xml or json that we can index
with the schema you are using
 
 
 
 -Hoss


RE: Querying a non-indexed field?

2013-09-24 Thread Scott Schneider
Ok, thanks for your answers!

Scott


 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Wednesday, September 18, 2013 5:36 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Querying a non-indexed field?
 
 Moreover, you may be trying to save/optimize in the wrong place. Maybe
 these additional indexed fields are not so costly. Maybe you can
 optimize in some other part of your setup.
 
 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Sep 18, 2013 5:47 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:
 
 
  : Subject: Re: Querying a non-indexed field?
  :
  : No.  --wunder
 
  To elaborate just a bit...
 
  : query on a few indexed fields, getting a small # of results.  I
 want to
  : restrict this further based on values from non-indexed, stored
 fields.
  : I can obviously do this myself, but it would be nice if Solr could
 do
 
  ...you could implement this in a custom SearchComponent, or a custom
  qparser that would generate PostFilter-compatible queries that looked
  at the stored field values -- but it's extremely unlikely that you
  would ever convince any of the lucene/solr devs to agree to commit a
  general purpose version of this type of logic into the code base,
  because in the general case (arbitrary unknown number of documents
  matching the main query) it would be extremely inefficient and would
  encourage bad user behavior.
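  
  As a very rough sketch of what that could look like (Solr 4.x APIs,
  untested; equals/hashCode and the qparser plumbing are omitted, and
  the class name is hypothetical):
  
    // Post filter that keeps only documents whose stored field has a
    // given value.  cache=false and cost >= 100 make Solr run it after
    // the main query and all cheaper filters.
    public class StoredFieldPostFilter extends ExtendedQueryBase implements PostFilter {
      private final String field;
      private final String requiredValue;
  
      public StoredFieldPostFilter(String field, String requiredValue) {
        this.field = field;
        this.requiredValue = requiredValue;
      }
  
      @Override
      public boolean getCache() { return false; }  // post filters must not be cached
  
      @Override
      public int getCost() { return 100; }
  
      @Override
      public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
          @Override
          public void collect(int doc) throws IOException {
            // loading the stored document for every candidate hit is
            // exactly the inefficiency described above
            Document d = context.reader().document(doc);
            if (requiredValue.equals(d.get(field))) {
              super.collect(doc);
            }
          }
        };
      }
    }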
 
  -Hoss
 


Problem loading my codec sometimes

2013-09-24 Thread Scott Schneider
Hello,

I created my own codec and Solr can find it sometimes and not other times.  
When I start fresh (delete the data folder and run Solr), it all works fine.  I 
can add data and query it.  When I stop Solr and start it again, I get:

Caused by: java.lang.IllegalArgumentException: A SPI class of type 
org.apache.lucene.codecs.Codec with name 'MyCodec' does not exist. You need to 
add the corresponding JAR file supporting this SPI to your classpath. The 
current classpath supports the following names: [SimpleText, Appending, 
Lucene40, Lucene3x, Lucene41, Lucene42]

I added the JAR to the path and I'm pretty sure Java sees it, or else it would 
not be using my codec when I start fresh.  (I've looked at the index files and 
verified that it's using my codec.)  I suppose Solr is asking SPI for my codec 
based on the codec class name stored in the index files, but I don't see why 
this would fail when a fresh start works.

Any thoughts?

Thanks,
Scott



Querying a non-indexed field?

2013-09-17 Thread Scott Schneider
Hello,

Is it possible to restrict query results using a non-indexed, stored field?  
e.g. I might index fewer fields to reduce the index size.  I query on a few 
indexed fields, getting a small # of results.  I want to restrict this further 
based on values from non-indexed, stored fields.  I can obviously do this 
myself, but it would be nice if Solr could do this for me.

Thanks,
Scott



Solr substring search

2013-09-05 Thread Scott Schneider
Hello,

I'm trying to find out how Solr runs a query for *foo*.  Google tells me that 
you need to use NGramFilterFactory for that kind of substring search, but I 
find that even with very simple fieldTypes, it just works.  (Perhaps because 
I'm testing on very small data sets, Solr is willing to look through all the 
keywords.)  This works on the tutorial example, for instance.

Can someone tell me exactly how this works and/or point me to the Lucene code 
that implements this?

Thanks,
Scott