Performance Drop from 1.3 to 1.4

2009-08-31 Thread Ilan Rabinovitch

Hello,

We recently began migrating a few of our applications from 1.3 to 1.4 in 
order to take advantage of the replication and performance improvements.



In practice, however, we are noticing that our instances which make use 
of LocalSolr have experienced some performance degradation from 1.3 to 
1.4.  This mostly appears to be due to dramatically different GC 
patterns between the two, despite using the same JDK, data set, GC 
parameters, and Solr cache sizes.



Versions
==
LocalSolr: HEAD as of August 27
Solr: 1.4-DEV from 6/10/09
JDK:  Sun Hotspot 1.6.0_14


Has anyone else seen this type of behavior with Solr/LocalSolr queries?

Some graphs of our results (transactions per second) between both 
versions are available at the following link.  While performance is 
relatively similar when using UseParallelGC rather than 
UseConcMarkSweepGC, we do still notice some relatively long pause times 
in 1.4 when compared with 1.3 during GC windows.



https://dl.getdropbox.com/u/162474/performance.png


The queries run for this test make use of LocalSolr sorting and 
filtering based on distances between 2 sets of lat/long coordinates.
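(For reference, the queries have roughly the following shape. This is only a
sketch: the "geo" handler name and the lat/long/radius parameter names are
assumptions based on a typical LocalSolr setup, and the coordinates are made
up.)

  http://localhost:8983/solr/select?qt=geo&q=*:*&lat=34.0522&long=-118.2437&radius=10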



We are currently using LocalSolr HEAD with the 6/10/09 revision of Solr 1.4. 
We selected that revision because it is the last one LocalSolr was 
released against. We do, however, plan to spend some time this week 
testing newer builds of Solr 1.4 to see if similar behavior exists.




Regards,
Ilan





--
Ilan Rabinovitch
i...@fonz.net



Re: solr and approximate string matching

2009-08-31 Thread Ryszard Szopa
Hi,

On Sun, Aug 30, 2009 at 9:32 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 The best way to debug these kind of problems is to look at analysis.jsp
 and/or use debugQuery=on on the query to see exactly how it is being parsed.

 Can you post the output of your query with debugQuery=on?

Thanks a lot for your answer. Fortunately, I've managed to deal with
the problem by myself, and it turned out to be mostly unrelated to
the schema. I was using AND as the default operator, and that didn't
play nicely with ngrams.
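For anyone hitting the same interaction, here is a minimal sketch of the
kind of setup involved (the field type name is made up):

  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="3"/>
    </analyzer>
  </fieldType>

With defaultOperator="AND" every term the user types has to match, so a
misspelled term whose grams only partially overlap the index still fails the
query as a whole; with OR, partial gram overlap still scores and the
near-matches come back.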

  -- RS

-- 
http://gryziemy.net
http://robimy.net


sql server indexing using dih problem

2009-08-31 Thread rameshgalla

Hi,

I am trying to index a SQL Server table using DIH.

My data-config.xml configuration:

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
      type="JdbcDataSource"
      url="jdbc:sqlserver://10.232.6.38:1433;databaseName=Rames"
      user="sa" password="password-1" />
  <document name="customers">
    <entity name="customers" query="select
        CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode
        from customers">
      <field column="CustomerID" name="CustomerID"/>
      <field column="Title" name="Title"/>
      <field column="Forename" name="Forename"/>
      <field column="Surname" name="Surname"/>
      <field column="Address_1" name="Address_1"/>
      <field column="Address_2" name="Address_2"/>
      <field column="Town" name="Town"/>
      <field column="Postcode" name="Postcode"/>
    </entity>
  </document>
</dataConfig>

When I tried to debug, I got the following error:

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">29672</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <null name="documents" />
  <lst name="verbose-output">
    <lst name="entity:customers">
      <lst name="document#1">
        <str name="query">select
CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode from
customers</str>
        <str name="EXCEPTION">org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: select
CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode from
customers Processing Document # 1 at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:186)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:143)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:43)
at
org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:74)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:190)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513) at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP
connection to the host 10.232.6.38, port 1433 has failed. Error: Connection
refused: connect. Verify the connection properties, check that an instance
of SQL Server is running on the host and accepting TCP/IP connections at the
port, and that no firewall is blocking TCP connections to the port.. at
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1049)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:833)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:716)
at

How to set similarity to catch more results ?

2009-08-31 Thread Kaoul
Hello,

I'm new to Solr and can't find in the documentation how to set
similarity. I want matching to be more flexible, so that if I make a
mistake with some letters, results are still found, like with Google.

Thank you in advance.


Re: How to set similarity to catch more results ?

2009-08-31 Thread rajan chandi
There are fuzzy searches which might be able to help a bit.
There could be more, but I am just a newbie.
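A minimal example of the fuzzy syntax (the field name is made up; the
optional number after ~ is the minimum similarity, between 0 and 1):

  q=title:simlarity~0.6

This matches indexed terms that are within a small edit distance of the
misspelled word.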

Regards
Rajan

On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote:

 Hello,

 I'm new to Solr and can't find in the documentation how to set
 similarity. I want matching to be more flexible, so that if I make a
 mistake with some letters, results are still found, like with Google.

 Thank you in advance.



Re: sql server indexing using dih problem

2009-08-31 Thread Shalin Shekhar Mangar
On Mon, Aug 31, 2009 at 3:25 PM, rameshgalla ramesh.ga...@cognizant.com wrote:


 Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP
 connection to the host 10.232.6.38, port 1433 has failed. Error:
 Connection
 refused: connect. Verify the connection properties, check that an instance
 of SQL Server is running on the host and accepting TCP/IP connections at
 the
 port, and that no firewall is blocking TCP connections to the port.. at

 com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170)
 at

 com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1049)
 at

 com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:833)
 at



The reason is given in the exception itself. The driver is not able to
connect to your server at port 1433. Either your server is down, the
host/port is incorrect, or a firewall is blocking access.
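A quick way to narrow it down from the machine running Solr (host and port
taken from the exception):

  telnet 10.232.6.38 1433

If the connection is refused, check that the SQL Server instance actually
has TCP/IP enabled (it is often disabled by default) and that it is
listening on port 1433.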

-- 
Regards,
Shalin Shekhar Mangar.


Hierarchical schema design

2009-08-31 Thread Pooja Verlani
Hi all,
Is it possible to have a hierarchical schema in Solr, meaning can we
have objects nested under objects?
For example, for a doc like:
<doc>
  <a1>
    <b1/>
    <b2/>
    <b3/>
  </a1>
  <a2>
    <b1/>
    <b2/>
    <b3/>
  </a2>
  ...
</doc>

I need to make a schema with 3 types of such objects, each having
different field values.

Please reply if there exists such a possibility.

Regards.
Pooja


Re: filtering facets

2009-08-31 Thread Mike Topper
Hi Olivier,

Are the facet counts on the urls you don't want 0?

If so, you can use facet.mincount to only return results with counts greater than 0.
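For example, against the query you posted (the extra parameters are the
standard faceting ones):

  q=article_outlinks:http*wikipedia.org*&facet=true&facet.field=article_outlinks&facet.mincount=1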

-Mike

Olivier H. Beauchesne wrote:
 Hi,

 Long time lurker, first time poster.

 I have a multi-valued field, let's call it article_outlinks containing
 all outgoing urls from a document. I want to get all matching urls
 sorted by counts.

 For example, I want to get all outgoing Wikipedia urls in my documents
 sorted by counts.

 So I execute a query like this:
 q=article_outlinks:http*wikipedia.org*  and I facet on article_outlinks

 But I get facets containing the other urls in the documents. I can get
 something close by using facet.prefix=http://en.wikipedia.org but I
 want to include other subdomains on wikipedia (ex: fr.wikipedia.org).

 Is there a way to do a search and get facets matching only my query?

 I know facet.prefix isn't a query, but is there a way to get that
 behavior?

 Is it easy to extend solr to do something like that?

 Thank you,

 Olivier

 Sorry for my English.



-- 
my public key can be found by:

gpg --keyserver pgp.mit.edu  --recv-keys 26A5C87F




Re: Performance Drop from 1.3 to 1.4

2009-08-31 Thread Yonik Seeley
I don't know exactly how the local solr stuff currently works (it's
not currently part of Solr), but it's possible to get worse memory
performance if you're not careful.  Solr and Lucene now do per-segment
searching and sorting in a single index... and that means field cache
entries are populated at the segment level instead of at the top-level
multireader.  It's possible/probable that some elements of LocalSolr
use a top-level reader and other elements use per-segment readers (via
Lucene or Solr) for geo fields, thus doubling the memory footprint from
before.
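A rough sketch of how the doubling can happen (Lucene 2.9-era API; the
field name and which component does what are assumptions):

  IndexReader top = searcher.getIndexReader();

  // A component written against the top-level reader populates one
  // big FieldCache array for the whole index:
  double[] latTop = FieldCache.DEFAULT.getDoubles(top, "lat");

  // Per-segment search (new in Lucene 2.9) populates a separate
  // entry per segment for the very same data:
  for (IndexReader seg : top.getSequentialSubReaders()) {
      FieldCache.DEFAULT.getDoubles(seg, "lat");
  }
  // Net effect: roughly 2x the FieldCache memory for "lat",
  // and correspondingly more GC pressure.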

-Yonik
http://www.lucidimagination.com

On Mon, Aug 31, 2009 at 3:24 AM, Ilan Rabinovitch i...@fonz.net wrote:
 Hello,

 We recently began migrating a few of our applications from 1.3 to 1.4 in
 order to take advantage of the replication and performance improvements.


 In practice, however, we are noticing that our instances which make use of
 LocalSolr have experienced some performance degradation from 1.3 to 1.4.
  This mostly appears to be due to dramatically different GC patterns
 between the two, despite using the same JDK, data set, GC parameters, and
 Solr cache sizes.


 Versions
 ==
 LocalSolr: HEAD as of August 27
 Solr: 1.4-DEV from 6/10/09
 JDK:  Sun Hotspot 1.6.0_14


 Has anyone else seen this type of behavior with Solr/LocalSolr queries?

 Some graphs of our results (transactions per second) between both versions
 are available at the following link.  While performance is relatively
 similar when using UseParallelGC rather than UseConcMarkSweepGC, we do still
 notice some relatively long pause times in 1.4 when compared with 1.3 during
 GC windows.


 https://dl.getdropbox.com/u/162474/performance.png


 The queries run for this test make use of LocalSolr sorting and filtering
 based on distances between 2 sets of lat/long coordinates.


 We are currently using LocalSolr HEAD with the 6/10/09 revision of Solr 1.4.
  We selected that revision because it is the last one LocalSolr was released
 against. We do, however, plan to spend some time this week testing newer
 builds of Solr 1.4 to see if similar behavior exists.



 Regards,
 Ilan





 --
 Ilan Rabinovitch
 i...@fonz.net




Re: filtering facets

2009-08-31 Thread Olivier H. Beauchesne

Hi Mike,

No, my problem is that the field article_outlinks is multivalued, so it 
contains several urls not related to my search. I would like to facet 
only urls matching my query.


For example (only one document shown, but my search targets over 1M docs):

Doc1:
article_url:
url1.com/1
url2.com/2
url1.com/1
url1.com/3

And my query is: article_url:url1.com* and I facet by article_url and I 
want it to give me:

url1.com/1 (2)
url1.com/3 (1)

But right now, because url2.com/2 is contained in a multivalued field 
with the matching urls, I get this:

url1.com/1 (2)
url1.com/3 (1)
url2.com/2 (1)

I can use facet.prefix to filter, but it's not very flexible if my url 
contains a subdomain, as facet.prefix doesn't support wildcards.


Thank you,

Olivier

Mike Topper wrote:

Hi Olivier,

Are the facet counts on the urls you don't want 0?

If so, you can use facet.mincount to only return results with counts greater than 0.

-Mike

Olivier H. Beauchesne wrote:
  

Hi,

Long time lurker, first time poster.

I have a multi-valued field, let's call it article_outlinks containing
all outgoing urls from a document. I want to get all matching urls
sorted by counts.

For example, I want to get all outgoing Wikipedia urls in my documents
sorted by counts.

So I execute a query like this:
q=article_outlinks:http*wikipedia.org*  and I facet on article_outlinks

But I get facets containing the other urls in the documents. I can get
something close by using facet.prefix=http://en.wikipedia.org but I
want to include other subdomains on wikipedia (ex: fr.wikipedia.org).

Is there a way to do a search and get facets matching only my query?

I know facet.prefix isn't a query, but is there a way to get that
behavior?

Is it easy to extend solr to do something like that?

Thank you,

Olivier

Sorry for my English.





  


Help! Issue with tokens in custom synonym filter

2009-08-31 Thread Lajos

Hi all,

I've been writing some custom synonym filters and have run into an issue 
with returning a list of tokens. I have a synonym filter that uses the 
WordNet database to extract synonyms. My problem is how to define the 
offsets and position increments in the new Tokens I'm returning.


For an input token, I get a list of synonyms from the WordNet database. 
I then create a List<Token> of those results. Each Token is created with 
the same startOffset, endOffset and positionIncrement of the input 
Token. Is this correct? My understanding from looking at the Lucene 
codebase is that the startOffset/endOffset should be the same, as we are 
referring to the same term in the original text. However, I don't quite 
get the positionIncrement. I understand that it is relative to the 
previous term ... does this mean all my synonyms should have a 
positionIncrement of 0? But whether I use 0 or the positionIncrement of 
the original input Token, Solr seems to ignore the returned tokens ...


This is a summary of what is in my filter:

*

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
  if (output != null) {
    // Here we are just outputting matched synonyms
    // that we previously created from the input token.
    // The input token has already been returned.
    if (output.hasNext()) {
      return output.next();
    } else {
      return null;
    }
  }

  synonyms = new ArrayList<Token>();

  Token t = input.next(in);
  if (t == null) return null;

  // Lowercased term text, used to look up synonyms
  String value = new String(t.termBuffer(), 0,
      t.termLength()).toLowerCase();

  // Get list of WordNet synonyms (code removed)
  // Iterate thru WordNet synonyms
  for (String wordNetSyn : wordNetSyns) {
    // Each synonym spans the same original text as the input token
    Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
    synonym.setPositionIncrement(t.getPositionIncrement());
    synonym.setTermBuffer(wordNetSyn.toCharArray(), 0,
        wordNetSyn.length());
    synonyms.add(synonym);
  }

  output = synonyms.iterator();

  // Return the original word, we want it
  return t;
}


Re: Help! Issue with tokens in custom synonym filter

2009-08-31 Thread AHMET ARSLAN
 I've been writing some custom synonym filters and have run
 into an issue with returning a list of tokens. I have a
 synonym filter that uses the WordNet database to extract
 synonyms. My problem is how to define the offsets and
 position increments in the new Tokens I'm returning.
 
 For an input token, I get a list of synonyms from the
 WordNet database. I then create a List<Token> of those
 results. Each Token is created with the same startOffset,
 endOffset and positionIncrement of the input Token. Is this
 correct? My understanding from looking at the Lucene
 codebase is that the startOffset/endOffset should be the
 same, as we are referring to the same term in the original
 text. However, I don't quite get the positionIncrement. I
 understand that it is relative to the previous term ... does
 this mean all my synonyms should have a positionIncrement of
 0? But whether I use 0 or the positionIncrement of the
 original input Token, Solr seems to ignore the returned
 tokens ...

You can look at the source code of SynonymTokenFilter[1] and SynonymMap[2] in 
Lucene.

[1] 
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymTokenFilter.html
[2] 
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymMap.html


  


Is caching worth it when my whole index is in RAM?

2009-08-31 Thread Michael
Hi,
If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I
have a need for the document cache?  Or should I set it to 0 items, because
pulling field values from an index in RAM is so fast that the document cache
would be a duplication of effort?

Are there any other caches that I should turn off if I can get my entire
index in RAM?  Filter cache, query results cache, etc?

Thanks!
Michael


Re: Help! Issue with tokens in custom synonym filter

2009-08-31 Thread Smiley, David W.
Although this is not a direct answer to your question, you may want to consider 
generating a synonyms file from wordnet.  Then, you can use the standard 
synonym filter in Solr.  The only downside to this is that the synonym file 
might be pretty large... but you've probably got some large file for the wordnet 
data anyway.

~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server



On 8/31/09 10:32 AM, Lajos la...@protulae.com wrote:

Hi all,

I've been writing some custom synonym filters and have run into an issue
with returning a list of tokens. I have a synonym filter that uses the
WordNet database to extract synonyms. My problem is how to define the
offsets and position increments in the new Tokens I'm returning.

For an input token, I get a list of synonyms from the WordNet database.
I then create a List<Token> of those results. Each Token is created with
the same startOffset, endOffset and positionIncrement of the input
Token. Is this correct? My understanding from looking at the Lucene
codebase is that the startOffset/endOffset should be the same, as we are
referring to the same term in the original text. However, I don't quite
get the positionIncrement. I understand that it is relative to the
previous term ... does this mean all my synonyms should have a
positionIncrement of 0? But whether I use 0 or the positionIncrement of
the original input Token, Solr seems to ignore the returned tokens ...

This is a summary of what is in my filter:

*

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
  if (output != null) {
    // Here we are just outputting matched synonyms
    // that we previously created from the input token.
    // The input token has already been returned.
    if (output.hasNext()) {
      return output.next();
    } else {
      return null;
    }
  }

  synonyms = new ArrayList<Token>();

  Token t = input.next(in);
  if (t == null) return null;

  String value = new String(t.termBuffer(), 0,
      t.termLength()).toLowerCase();

  // Get list of WordNet synonyms (code removed)
  // Iterate thru WordNet synonyms
  for (String wordNetSyn : wordNetSyns) {
    Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
    synonym.setPositionIncrement(t.getPositionIncrement());
    synonym.setTermBuffer(wordNetSyn.toCharArray(), 0,
        wordNetSyn.length());
    synonyms.add(synonym);
  }

  output = synonyms.iterator();

  // Return the original word, we want it
  return t;
}



Re: filtering facets

2009-08-31 Thread Michael
You could post-process the response and remove urls that don't match your
domain pattern.

On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:

 Hi Mike,

 No, my problem is that the field article_outlinks is multivalued, so it
 contains several urls not related to my search. I would like to facet only
 urls matching my query.

 For example (only one document shown, but my search targets over 1M docs):

 Doc1:
 article_url:
 url1.com/1
 url2.com/2
 url1.com/1
 url1.com/3

 And my query is: article_url:url1.com* and I facet by article_url and I
 want it to give me:
 url1.com/1 (2)
 url1.com/3 (1)

 But right now, because url2.com/2 is contained in a multivalued field with
 the matching urls, I get this:
 url1.com/1 (2)
 url1.com/3 (1)
 url2.com/2 (1)

 I can use facet.prefix to filter, but it's not very flexible if my url
 contains a subdomain, as facet.prefix doesn't support wildcards.

 Thank you,

 Olivier

 Mike Topper wrote:

  Hi Olivier,

 Are the facet counts on the urls you don't want 0?

 If so, you can use facet.mincount to only return results with counts greater than 0.

 -Mike

 Olivier H. Beauchesne wrote:


 Hi,

 Long time lurker, first time poster.

 I have a multi-valued field, let's call it article_outlinks containing
 all outgoing urls from a document. I want to get all matching urls
 sorted by counts.

 For example, I want to get all outgoing Wikipedia urls in my documents
 sorted by counts.

 So I execute a query like this:
 q=article_outlinks:http*wikipedia.org*  and I facet on article_outlinks

 But I get facets containing the other urls in the documents. I can get
 something close by using facet.prefix=http://en.wikipedia.org but I
 want to include other subdomains on wikipedia (ex: fr.wikipedia.org).

 Is there a way to do a search and get facets matching only my query?

 I know facet.prefix isn't a query, but is there a way to get that
 behavior?

 Is it easy to extend solr to do something like that?

 Thank you,

 Olivier

 Sorry for my English.










Re: Hierarchical schema design

2009-08-31 Thread Uri Boness

Hi,

The search index is flat: there are no hierarchies in there. Now, I'm 
not sure what you're referring to with "this type of objects", but if 
you mean having different types of documents in one index (and 
schema), that's certainly possible. You can define all the fields that 
you expect across all the different document types in one schema, and have 
one special field (called type) to distinguish between the document 
types (it will hold a unique value for each document type). The only 
drawback of this solution is that you cannot (in most cases) define the 
fields as required. Another solution would be to deploy the different 
document types on different cores, where each core has its own schema (and 
index). The drawback there, however, is that you will not be able to search 
across the different document types.
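A sketch of the single-schema option (field names are made up):

  <field name="type" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <!-- fields used by only one document type simply stay empty
       on the other types -->
  <field name="isbn" type="string" indexed="true" stored="true"/>

At query time you restrict to one document type with a filter query,
e.g. q=title:foo&fq=type:book.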


Cheers,
Uri

Pooja Verlani wrote:

Hi all,
Is it possible to have a hierarchical schema in Solr, meaning can we
have objects nested under objects?
For example, for a doc like:

<doc>
  <a1>
    <b1/>
    <b2/>
    <b3/>
  </a1>
  <a2>
    <b1/>
    <b2/>
    <b3/>
  </a2>
  ...
</doc>

I need to make a schema with 3 types of such objects, each having
different field values.

Please reply if there exists such a possibility.

Regards.
Pooja

  


Re: How to set similarity to catch more results ?

2009-08-31 Thread Aakash Dharmadhikari
hi Kaoul,

  There are multiple ways that you can use to get the desired results.


   - Stemming - this makes all forms of a word (e.g. Run, Running, Runner)
   match to its stem or root word Run.
   - Synonyms - this will take a list of synonyms from you and would match
   veg => vegetarian and even tiger => lion if you map so.
   - PhoneticFilterFactory - As the name suggests, it would do all your
   soundex matches.

  Apart from these FilterFactories, using a StandardTokenizer would match
mickey mouse to mouse mickey, as you would expect from Google.

  There are still tens of other Filters and Tokenizers that you can use,
depending on your need. I would suggest you to go through
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to get more
understanding of available options.
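As a sketch, a field type combining a couple of these might look like the
following (the type name is made up; synonyms.txt is whatever mapping file
you provide):

  <fieldType name="text_loose" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <!-- or, instead of stemming, phonetic matching:
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
              inject="true"/> -->
    </analyzer>
  </fieldType>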

regards,
aakash

On Mon, Aug 31, 2009 at 3:33 PM, rajan chandi chandi.ra...@gmail.com wrote:

 There are fuzzy searches which might be able to help a bit.
 There could be more, but I am just a newbie.

 Regards
 Rajan

 On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote:

  Hello,
 
  I'm new to Solr and can't find in the documentation how to set
  similarity. I want matching to be more flexible, so that if I make a
  mistake with some letters, results are still found, like with Google.
 
  Thank you in advance.
 



Re: Release Date Solr 1.4

2009-08-31 Thread Yonik Seeley
Many of you probably know that Lucene went into code-freeze last
Thursday... which puts a probable Lucene release date at the end of
this week.

My day-job colleagues and I are all traveling this week (company
get-together) so that may slow things down a bit for some of us, and
perhaps cause the goal of releasing Solr 1 week after Lucene to slip a
little.  Still, if there are any issues that are assigned to you, and
that you can't get to this week (including the weekend) please
un-assign yourself as a signal that someone else should try and take
it up.

-Yonik
http://www.lucidimagination.com

ps: I've extended my stay so I can make the Lucene/Solr meetup this
Thursday... hope to see some of you there!


On Fri, Aug 21, 2009 at 11:34 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 FYI, I'm on vacation in Ocean City MD starting tomorrow - but I will
 have internet access.
 The goal of releasing a week after 2.9 still seems very realistic - we
 just need to decide to finish all open issues one week from Lucene's
 code freeze.  And all of a sudden, Lucene went from 0 open issues,
 back to 16... but most of those may be resolved rapidly.

 -Yonik
 http://www.lucidimagination.com

 On Tue, Aug 18, 2009 at 1:37 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:
 On Tue, Aug 18, 2009 at 9:02 AM, Mark Miller markrmil...@gmail.com wrote:
 The last note I saw said we hope to release 1.4 a week or so after Lucene
 2.9 (though of course a week may not end up being enough).

 Yep, I think this is still doable.

 -Yonik
 http://www.lucidimagination.com



Re: Dismax Wildcard Queries

2009-08-31 Thread Smiley, David W.
Hi Kurt.  I'm the author of those JIRA issues.  I'm glad you have interest in 
them.  Please vote for them if you have not done so already.  I updated 
SOLR-758 and I hope it works out okay for you.  If you have further questions, 
please comment on the relevant issues.

~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server


On 8/30/09 3:21 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Tue, Aug 25, 2009 at 3:00 AM, Kurt N. kurt.nordst...@unt.edu wrote:


 Hello all.

 We have a situation in the requirements for our project that makes it
 desirable to be able to perform a DisMax query with wildcard (* and ?)
 characters in it.

 We are using the standard release (not nightly) of Solr 1.3.

 Our first thought was to apply the SOLR-756 patch
 (http://issues.apache.org/jira/browse/SOLR-756), which aimed to make
 DisMax
 support all query types.  The patch was installed, and Solr recompiled
 without trouble.  Upon passing a query with the * wildcard in it, we did
 not
 get a result set that indicated that the query was working.

 Our next thought was to apply SOLR-758
 (http://issues.apache.org/jira/browse/SOLR-758), to see if that solved our
 problem.  In doing so, we had to install the SOLR-757
 (http://issues.apache.org/jira/browse/SOLR-757) patch as well.
 Unfortunately, at this point, Solr refused to compile.  From the error
 messages that Ant gave, it seemed that the new code from SOLR-758 was
 looking for a function called getNonLocalParams(), which, after grep'ing
 the source, doesn't seem to exist in the Solr codebase.


I can't find QParser ever having a method named getNonLocalParams. However,
looking at the way that method is being used in the patch, using params
instead of getNonLocalParams() should work.

Questions on patches are best asked on the respective issue.

--
Regards,
Shalin Shekhar Mangar.



Re: Why can't have & sign in the text?

2009-08-31 Thread AHMET ARSLAN
 I use text as my field type, but whenever my field has
 '&' sign, the post.jar will error out. What can I do to work around this?
 Thanks.

The files - that you are posting - must be valid xml. Escape special xml 
characters, e.g. replace & with &amp;



  


Re: WordDelimiterFilter to QueryParser to MultiPhraseQuery?

2009-08-31 Thread jOhn
This is mostly my misunderstanding of catenateAll=1, as I thought it would
break down into an OR using the full concatenated word.

Thus:

Jokers Wild -> { jokers, wild } OR { jokerswild }

But really it becomes: { jokers, {wild, jokerswild} }, which will not match.

And if you have a mistyped camel case like:

jOkerswild -> { j, {okerswild, jokerswild} }, again no match.

So it really requires some way to append the full word as an OR, so: { j,
{okerswild} } OR { jokerswild }

severalTokensAtSamePosition=true is in the source code (QueryParser) as a
boolean flag which always ends up true in these cases and triggers creating
a MultiPhraseQuery as in the examples above.

To really get this right I'll need to do a custom QueryParser IMO.
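For reference, the filter configuration under discussion looks roughly like
this (the non-catenate attributes are typical values, not taken from an
actual config):

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0"
          catenateAll="1" splitOnCaseChange="1"/>

Per the parses above, the catenated token gets stacked on the last sub-word's
position instead of being added as an independent OR alternative, which is
what produces the MultiPhraseQuery shapes.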


Date Faceting and Double Counting

2009-08-31 Thread Stephen Duncan Jr
If we do date faceting and start at 2009-01-01T00:00:00Z, end at
2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at
exactly 2009-01-02T00:00:00Z will be included in both the returned counts
(2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z).  At the moment, this is
quite bad for us, as we only index the day-level, so all of our documents
are exactly on the line between each facet-range.

Because we know our data is indexed as being exactly at midnight each day, I
think we can simply always start from 1 second prior and get the results we
want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think
this problem would affect everyone, even if usually more subtly (instead of
all documents being counted twice, only a few on the fencepost between
ranges).
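Concretely, the shifted request would carry parameters like these (the field
name is made up; note the + in +1DAY must be URL-escaped as %2B):

  facet=true
  &facet.date=pub_date
  &facet.date.start=2008-12-31T23:59:59Z
  &facet.date.end=2009-01-02T23:59:59Z
  &facet.date.gap=%2B1DAY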

Is this a known behavior people are happy with, or should I file an issue
asking for ranges in date-facets to be constructed to subtract one second
from the end of each range (so that the effective range queries for my case
would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] &
[2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])?

Alternatively, is there some other suggested way of using the date faceting
to avoid this problem?

-- 
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Help! Issue with tokens in custom synonym filter

2009-08-31 Thread Lajos

Hi David & Ahmet,

I hadn't seen the SynonymTokenFilter from Lucene, so that helped. 
Ultimately, however, it seems I was pretty much doing the right thing, 
although my token type might have been wrong.


Unfortunately, while the tokens are being returned properly (AFAIK), 
when I do a query using one of the synonyms, I can't get any results. 
This is not the case if I just directly code in the synonym into the 
synonyms file with the standard solr synonym filter.


So I'll have to keep on hacking away ;)

Regarding generating the file from WordNet, we'd considered that but our 
requirements essentially mean we have to do the heavy lifting within the 
filter itself. Not that I'm opposed, it is just that I'm apparently 
missing something simple still.


Thanks for the replies.

Lajos


Smiley, David W. wrote:

Although this is not a direct answer to your question, you may want to consider 
generating a synonyms file from wordnet.  Then, you can use the standard 
synonym filter in Solr.  The only downside to this is that the synonym file 
might be pretty large... but you've probably got some large file for the wordnet 
data anyway.

~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server



On 8/31/09 10:32 AM, Lajos la...@protulae.com wrote:

Hi all,

I've been writing some custom synonym filters and have run into an issue
with returning a list of tokens. I have a synonym filter that uses the
WordNet database to extract synonyms. My problem is how to define the
offsets and position increments in the new Tokens I'm returning.

For an input token, I get a list of synonyms from the WordNet database.
I then create a List<Token> of those results. Each Token is created with
the same startOffset, endOffset and positionIncrement of the input
Token. Is this correct? My understanding from looking at the Lucene
codebase is that the startOffset/endOffset should be the same, as we are
referring to the same term in the original text. However, I don't quite
get the positionIncrement. I understand that it is relative to the
previous term ... does this mean all my synonyms should have a
positionIncrement of 0? But whether I use 0 or the positionIncrement of
the original input Token, Solr seems to ignore the returned tokens ...

This is a summary of what is in my filter:

*

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
  if (output != null) {
    // Here we are just outputting matched synonyms
    // that we previously created from the input token.
    // The input token has already been returned.
    if (output.hasNext()) {
      return output.next();
    } else {
      return null;
    }
  }

  synonyms = new ArrayList<Token>();

  Token t = input.next(in);
  if (t == null) return null;

  String value = new String(t.termBuffer(), 0,
      t.termLength()).toLowerCase();

  // Get list of WordNet synonyms (code removed)
  // Iterate thru WordNet synonyms
  for (String wordNetSyn : wordNetSyns) {
    Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
    synonym.setPositionIncrement(t.getPositionIncrement());
    synonym.setTermBuffer(wordNetSyn.toCharArray(), 0,
        wordNetSyn.length());
    synonyms.add(synonym);
  }

  output = synonyms.iterator();

  // Return the original word, we want it
  return t;
}











Why can't have & sign in the text?

2009-08-31 Thread Elaine Li
Hi,

I use text as my field type, but whenever my field has '&' sign, the
post.jar will error out. What can I do to work around this? Thanks.

solr returned an error:
comctcwstxexcWstxLazyException_Unexpected_character___code_32_missing_name__
at_javaxxmlstreamSerializableLocation587f587f__comctcwstxexcWstxLazyException_comctcwstxexcWstxUnexpectedCharException_
Unexpected_character___code_32_missing_name__
at_javaxxmlstreamSerializableLocation587f587f__
at_comctcwstxexcWstxLazyExceptionthrowLazilyWstxLazyExceptionjava45__
at_comctcwstxsrStreamScannerthrowLazyErrorStreamScannerjava729__
at_comctcwstxsrBasicStreamReadersafeFinishTokenBasicStreamReaderjava3659__
at_comctcwstxsrBasicStreamReadergetTextBasicStreamReaderjava809__
at_orgapachesolrhandlerXmlUpdateRequestHandlerreadDocXmlUpdateRequestHandlerjava327__
.

<fieldType name="mytext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Please advise.

Elaine


Re: Why can't have & sign in the text?

2009-08-31 Thread Elaine Li
Thanks a lot! Really helped.



On Mon, Aug 31, 2009 at 2:21 PM, AHMET ARSLAN iori...@yahoo.com wrote:
 I use text as my field type, but whenever my field has
 '&' sign, the post.jar will error out. What can I do to work around this?
 Thanks.

 The files - that you are posting - must be valid xml. Escape special xml 
 characters, e.g. replace & with &amp;







Re: filtering facets

2009-08-31 Thread Olivier H. Beauchesne
Yeah, but then I would have to retrieve *a lot* of facets. I think for 
now I'll retrieve all the subdomains with facet.prefix and then merge 
those queries. Not ideal, but when I have more motivation, I will 
submit a patch to Solr :-)


Michael wrote:

You could post-process the response and remove urls that don't match your
domain pattern.

On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:

  

Hi Mike,

No, my problem is that the field article_outlinks is multivalued, so it
contains several urls not related to my search. I would like to facet only
urls matching my query.

For example (only one document shown, but my search targets over 1M docs):

Doc1:
article_url:
url1.com/1
url2.com/2
url1.com/1
url1.com/3

And my query is: article_url:url1.com* and I facet by article_url and I
want it to give me:
url1.com/1 (2)
url1.com/3 (1)

But right now, because url2.com/2 is contained in a multivalued field with
the matching urls, I get this:
url1.com/1 (2)
url1.com/3 (1)
url2.com/2 (1)

I can use facet.prefix to filter, but it's not very flexible if my url
contains a subdomain, as facet.prefix doesn't support wildcards.

Thank you,

Olivier

Mike Topper wrote:

 Hi Olivier,


Are the facet counts on the urls you don't want 0?

If so, you can use facet.mincount to only return results with counts greater than 0.

-Mike

Olivier H. Beauchesne wrote:


  

Hi,

Long time lurker, first time poster.

I have a multi-valued field, let's call it article_outlinks containing
all outgoing urls from a document. I want to get all matching urls
sorted by counts.

For example, I want to get all outgoing Wikipedia urls in my documents
sorted by counts.

So I execute a query like this:
q=article_outlinks:http*wikipedia.org*  and I facet on article_outlinks

But I get facets containing the other urls in the documents. I can get
something close by using facet.prefix=http://en.wikipedia.org but I
want to include other subdomains on wikipedia (ex: fr.wikipedia.org).

Is there a way to do a search and get facets matching only my query?

I know facet.prefix isn't a query, but is there a way to get that
behavior?

Is it easy to extend solr to do something like that?

Thank you,

Olivier

Sorry for my English.







  


  


Re: Sorting by Unindexed Fields

2009-08-31 Thread Isaac Foster
Hi Erik,

Sorry it took me a while to get back to your response. I appreciate any help
I can get.

The number of documents will start out small, but if we do well we'll have a
lot. The fields would all be numeric (we'll map categorical fields to
integers), and I would imagine the number of fields will be between 2 and 5,
but we're not going to limit it.

I think for this particular issue we may try to keep the solution in the
database so that that particular information can live in as few places as
possible.

Thanks as always for the help.

iSac




On Wed, Aug 26, 2009 at 9:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 Solr sorts on indexed fields only, currently.  And only a single value per
 document per sort field (careful with analyzed fields, and no multiValued
 fields).

 Unwise and impossible - of course this depends on the scale you're speaking
 of.  How many documents?  What types of fields?  How small is a "fairly small
 number of fields"?

Erik



 On Aug 26, 2009, at 6:33 PM, Isaac Foster wrote:

 Hi,

 I have a situation where a particular kind of document can be categorized
 in
 different ways, and depending on the categories it is in it will have
 different fields that describe it (in practice the number of fields will
 be
 fairly small, but whatever). These documents will each have a full-text
 field that Solr is perfect for, and it seems like Solr's dynamic fields
 ability makes it an even more perfect solution.

 I'd like to be able to sort by any of the fields, but indexing them all
 seems somewhere between unwise and impossible. Will Solr sort by fields
 that
 are unindexed?

 iSac





Re: Field names with whitespaces

2009-08-31 Thread Jay Hill
This seems to work:

?q=field\ name:something

Probably not a good idea to have field names with whitespace though.

-Jay

2009/8/28 Marcin Kuptel marcinkup...@gmail.com

 Hi,

 Is there a way to query solr about fields whose names contain whitespace?
 Indexing such data does not cause any problems, but I have been unable to
 retrieve it.

 Regards,
 Marcin Kuptel



Re: How to set similarity to catch more results ?

2009-08-31 Thread Avlesh Singh

 I want matching to be more flexible, so that if I make a mistake with some
 letters, results are still found, like with Google.

You are talking about spelling mistakes?
http://wiki.apache.org/solr/SpellCheckComponent
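A minimal request against that component looks like this (it assumes a
spellcheck component and dictionary are already configured in
solrconfig.xml):

  q=similarty&spellcheck=true&spellcheck.count=5&spellcheck.collate=true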

Cheers
Avlesh

On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote:

 Hello,

 I'm new to Solr and can't find in the documentation how to set
 similarity. I want matching to be more flexible, so that if I make a
 mistake with some letters, results are still found, like with Google.

 Thank you in advance.



Re: Date Faceting and Double Counting

2009-08-31 Thread Avlesh Singh
I don't think this behavior needs to be fixed; it is justified for the data
you have indexed.
Date minus 1 second should definitely work for you.

Cheers
Avlesh

On Mon, Aug 31, 2009 at 11:37 PM, Stephen Duncan Jr 
stephen.dun...@gmail.com wrote:

 If we do date faceting and start at 2009-01-01T00:00:00Z, end at
 2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at
 exactly 2009-01-02T00:00:00Z will be included in both the returned counts
 (2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z).  At the moment, this is
 quite bad for us, as we only index the day-level, so all of our documents
 are exactly on the line between each facet-range.

 Because we know our data is indexed as being exactly at midnight each day,
 I
 think we can simply always start from 1 second prior and get the results we
 want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think
 this problem would affect everyone, even if usually more subtly (instead of
 all documents being counted twice, only a few on the fencepost between
 ranges).

 Is this a known behavior people are happy with, or should I file an issue
 asking for ranges in date-facets to be constructed to subtract one second
 from the end of each range (so that the effective range queries for my case
 would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] &
 [2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])?

 Alternatively, is there some other suggested way of using the date faceting
 to avoid this problem?

 --
 Stephen Duncan Jr
 www.stephenduncanjr.com



Re: Hierarchical schema design

2009-08-31 Thread Avlesh Singh
As Uri has already replied, there is no concept of a hierarchical schema
in Solr.
My gut feeling says you might be talking about Multiple cores
<http://www.google.co.in/search?q=multiple+core+solr&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a>.

Cheers
Avlesh

On Mon, Aug 31, 2009 at 5:26 PM, Pooja Verlani pooja.verl...@gmail.comwrote:

 Hi all,
 Is it possible to have a hierarchical schema in Solr, meaning can we
 have objects nested under objects?
 For example, for a doc like:

 <doc>
   <a1>
     <b1/>
     <b2/>
     <b3/>
   </a1>
   <a2>
     <b1/>
     <b2/>
     <b3/>
   </a2>
   ...
 </doc>

 I need to make a schema with 3 types of such objects, each having
 different field values.

 Please reply if there exists such a possibility.

 Regards.
 Pooja



Re: Is caching worth it when my whole index is in RAM?

2009-08-31 Thread Avlesh Singh
Good question!
The application-level caches, say the filter cache, would still help because
they not only cache values but also the underlying computation. Even with all
the data in your RAM, you will still end up doing the computations every
time.
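If you do want to experiment, the knob is in solrconfig.xml (the sizes below
are just the stock example values); removing or commenting out the element
disables that cache entirely:

  <documentCache class="solr.LRUCache"
                 size="512" initialSize="512" autowarmCount="0"/>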

Looking for responses from the more knowledgeable.

Cheers
Avlesh

On Mon, Aug 31, 2009 at 8:25 PM, Michael solrco...@gmail.com wrote:

 Hi,
 If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I
 have a need for the document cache?  Or should I set it to 0 items, because
 pulling field values from an index in RAM is so fast that the document
 cache
 would be a duplication of effort?

 Are there any other caches that I should turn off if I can get my entire
 index in RAM?  Filter cache, query results cache, etc?

 Thanks!
 Michael



Re: filtering facets

2009-08-31 Thread Avlesh Singh

 when I have more motivation, I will submit a patch to Solr :-)

You want to add more here? https://issues.apache.org/jira/browse/SOLR-1387

Cheers
Avlesh

On Tue, Sep 1, 2009 at 2:51 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:

 Yeah, but then I would have to retrieve *a lot* of facets. I think for now
 I'll retrieve all the subdomains with facet.prefix and then merge those
 queries. Not ideal, but when I have more motivation, I will submit a
 patch to Solr :-)

 Michael wrote:

  You could post-process the response and remove urls that don't match your
 domain pattern.

 On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne oliv...@olihb.com
 wrote:



 Hi Mike,

 No, my problem is that the field article_outlinks is multivalued, so it
 contains several urls not related to my search. I would like to facet only
 urls matching my query.

 For example (only one document shown, but my search targets over 1M docs):

 Doc1:
 article_url:
 url1.com/1
 url2.com/2
 url1.com/1
 url1.com/3

 And my query is: article_url:url1.com* and I facet by article_url and I
 want it to give me:
 url1.com/1 (2)
 url1.com/3 (1)

 But right now, because url2.com/2 is contained in a multivalued field
 with
 the matching urls, I get this:
 url1.com/1 (2)
 url1.com/3 (1)
 url2.com/2 (1)

 I can use facet.prefix to filter, but it's not very flexible if my url
 contains a subdomain, as facet.prefix doesn't support wildcards.

 Thank you,

 Olivier

 Mike Topper wrote:

  Hi Olivier,


 Are the facet counts on the urls you don't want 0?

 If so, you can use facet.mincount to only return results with counts greater than 0.

 -Mike

 Olivier H. Beauchesne wrote:




 Hi,

 Long time lurker, first time poster.

 I have a multi-valued field, let's call it article_outlinks containing
 all outgoing urls from a document. I want to get all matching urls
 sorted by counts.

 For example, I want to get all outgoing Wikipedia urls in my documents
 sorted by counts.

 So I execute a query like this:
 q=article_outlinks:http*wikipedia.org*  and I facet on
 article_outlinks

 But I get facets containing the other urls in the documents. I can get
 something close by using facet.prefix=http://en.wikipedia.org but I
 want to include other subdomains on wikipedia (ex: fr.wikipedia.org).

 Is there a way to do a search and get facets matching only my
 query?

 I know facet.prefix isn't a query, but is there a way to get that
 behavior?

 Is it easy to extend solr to do something like that?

 Thank you,

 Olivier

 Sorry for my English.