Re: C++ being filtered (please help)

2010-02-03 Thread Ahmet Arslan
 I have a field which may take the form C++,PHP 
 MySql,C#
 now i want to tokenize it based on comma or white space and other word
 delimiting characters only. Not on the plus sign. So that the result after
 tokenization should be
 C++
 PHP
 MySql
 C#
 
 But the result I am getting is
 c
 php
 mysql
 c
 Please give me some pointers as to which analyzer and
 tokenizer to use
 

You can use this analyzer:

<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

With this mappings.txt file:

"," => " "

You can add more characters to mappings.txt that you want to break words at.
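The chain is easiest to see end to end. Below is a rough plain-Python simulation (not Solr code) of the three stages above - char filter, whitespace tokenizer, lowercase filter - applied to the original example; the `analyze` helper is purely illustrative.

```python
# Rough simulation of the analyzer chain: MappingCharFilter replaces
# commas with spaces, WhitespaceTokenizer splits on whitespace, and
# LowerCaseFilter lowercases each token.
MAPPINGS = {",": " "}  # contents of mappings.txt

def analyze(text):
    for src, dst in MAPPINGS.items():
        text = text.replace(src, dst)          # char filter stage
    return [t.lower() for t in text.split()]   # tokenizer + lowercase stages

print(analyze("C++,PHP MySql,C#"))  # ['c++', 'php', 'mysql', 'c#']
```

Because the plus signs and hash are never touched by any of the three stages, "C++" and "C#" survive intact, unlike with StandardTokenizer.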


  


RE: Solr response extremely slow

2010-02-03 Thread Doddamani, Prakash
Hey 
Can anyone say which is the latest stable version?
We are using 1.2

Solr Specification Version: 1.2.0
Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12
Lucene Specification Version: 2007-05-20_00-04-53
Lucene Implementation Version: build 2007-05-20
Current Time: Wed Feb 03 03:45:56 EST 2010


Regards
Prakash

-Original Message-
From: Vijayant Kumar [mailto:vijay...@websitetoolbox.com] 
Sent: Wednesday, February 03, 2010 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr response extremely slow

Hi Rajat,

You can find the version of solr by

http://localhost:8983/solr/admin/registry.jsp

-- 

Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211

 Java version is -

 java version "1.5.0_18"
 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_18-b02) 
 Java HotSpot(TM) Server VM (build 1.5.0_18-b02, mixed mode)

 Not sure how to find solr version. Can you tell me how to look it up?

 Also, i don't have a dedicated server to run this on.
 --
 View this message in context:
 http://old.nabble.com/Solr-response-extremely-slow-tp27432229p27432419.html
 Sent from the Solr - User mailing list archive at Nabble.com.







Re: Solr response extremely slow

2010-02-03 Thread Shalin Shekhar Mangar
On Wed, Feb 3, 2010 at 2:18 PM, Doddamani, Prakash 
prakash.doddam...@corp.aol.com wrote:

 Hey
 Can any one say which is the latest and stable version,
 We are using 1.2

Solr Specification Version: 1.2.0
Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12
Lucene Specification Version: 2007-05-20_00-04-53
Lucene Implementation Version: build 2007-05-20
Current Time: Wed Feb 03 03:45:56 EST 2010


Solr 1.4 is the latest stable release.

In future, please don't reply to an unrelated email thread. Start a new
thread instead.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Deploying Solr 1.3 in JBoss 5

2010-02-03 Thread Luca Molteni
Apparently, that worked! I'd never realized that the order of the
elements in XML is significant; nice to see.

As always, problems lead to other problems, so now I'm facing a
Xerces ClassCastException with JDK 6.

org.jboss.xb.binding.JBossXBRuntimeException: Failed to create a new SAX parser
at 
org.jboss.xb.binding.UnmarshallerFactory$UnmarshallerFactoryImpl.newUnmarshaller(UnmarshallerFactory.java:100)
at 
org.jboss.web.tomcat.service.deployers.JBossContextConfig.processContextConfig(JBossContextConfig.java:549)
at 
org.jboss.web.tomcat.service.deployers.JBossContextConfig.init(JBossContextConfig.java:536)
at 
org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:279)
at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at 
org.apache.catalina.core.StandardContext.init(StandardContext.java:5436)
at 
org.apache.catalina.core.StandardContext.start(StandardContext.java:4148)
at 
org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeployInternal(TomcatDeployment.java:310)
at 
org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeploy(TomcatDeployment.java:142)
at 
org.jboss.web.deployers.AbstractWarDeployment.start(AbstractWarDeployment.java:461)
at org.jboss.web.deployers.WebModule.startModule(WebModule.java:118)
at org.jboss.web.deployers.WebModule.start(WebModule.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:157)
at org.jboss.mx.server.Invocation.dispatch(Invocation.java:96)
at org.jboss.mx.server.Invocation.invoke(Invocation.java:88)
at 
org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264)
at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:668)
at 
org.jboss.system.microcontainer.ServiceProxy.invoke(ServiceProxy.java:206)
at $Proxy38.start(Unknown Source)
at 
org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:42)
at 
org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:37)
at 
org.jboss.dependency.plugins.action.SimpleControllerContextAction.simpleInstallAction(SimpleControllerContextAction.java:62)
at 
org.jboss.dependency.plugins.action.AccessControllerContextAction.install(AccessControllerContextAction.java:71)
at 
org.jboss.dependency.plugins.AbstractControllerContextActions.install(AbstractControllerContextActions.java:51)
at 
org.jboss.dependency.plugins.AbstractControllerContext.install(AbstractControllerContext.java:348)
at 
org.jboss.system.microcontainer.ServiceControllerContext.install(ServiceControllerContext.java:297)
at 
org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:1633)
at 
org.jboss.dependency.plugins.AbstractController.incrementState(AbstractController.java:935)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:1083)
at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:985)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:823)
at 
org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:553)
at 
org.jboss.system.ServiceController.doChange(ServiceController.java:688)
at org.jboss.system.ServiceController.start(ServiceController.java:460)
at 
org.jboss.system.deployers.ServiceDeployer.start(ServiceDeployer.java:163)
at 
org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:99)
at 
org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:46)
at 
org.jboss.deployers.spi.deployer.helpers.AbstractSimpleRealDeployer.internalDeploy(AbstractSimpleRealDeployer.java:62)
at 
org.jboss.deployers.spi.deployer.helpers.AbstractRealDeployer.deploy(AbstractRealDeployer.java:50)
at 
org.jboss.deployers.plugins.deployers.DeployerWrapper.deploy(DeployerWrapper.java:171)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doDeploy(DeployersImpl.java:1440)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doInstallParentFirst(DeployersImpl.java:1158)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.doInstallParentFirst(DeployersImpl.java:1179)
at 
org.jboss.deployers.plugins.deployers.DeployersImpl.install(DeployersImpl.java:1099)
at 

how to stress test solr

2010-02-03 Thread James liu
Before stress testing, should I disable the SolrCache?

Which tool do you use?

How do I do a stress test correctly?

Any pointers?

-- 
regards
j.L ( I live in Shanghai, China)


Re: DataImportHandler - convertType attribute

2010-02-03 Thread Erik Hatcher
One thing I find awkward about convertType is that it is  
JdbcDataSource specific, rather than field-specific.  Isn't the  
current implementation far too broad?


Erik

On Feb 3, 2010, at 1:16 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



Implicit conversion can cause problems when Transformers are applied.
It is hard for the user to guess the type of the field by looking at the
schema.xml. In Solr, String is the most commonly used type. If you
wish to do numeric operations on a field, convertType will cause
problems.
If it is explicitly set, the user knows why the type got changed.

On Tue, Feb 2, 2010 at 6:38 PM, Alexey Serba ase...@gmail.com wrote:

Hello,

I encountered a blob indexing problem and found the convertType solution in the
FAQ: http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5



I was wondering why it is not enabled by default and found the
following comment in the mailing list:
http://www.lucidimagination.com/search/document/169e6cc87dad5e67/dataimporthandler_and_blobs#169e6cc87dad5e67

"We used to attempt type conversion from the SQL type to the field's given
type. We found that it was error prone and switched to using the
ResultSet#getObject for all columns (making the old behavior a configurable
option – convertType in JdbcDataSource)."

Why is it error prone? Is it safe enough to enable convertType for all jdbc

data sources by default? What are the side effects?

Thanks in advance,
Alex





--
-
Noble Paul | Systems Architect| AOL | http://aol.com




wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith
Hi,

I am wondering if there is some way to maintain a stopword list with wildcards:

ignoring anything that starts with foo:
foo*

i am doing some funky hackery inside DIH via javascript to make my autosuggest 
work. i basically split phrases and store them together with the full phrase:

the phrase:
Foo Bar

becomes:

Foo Bar
foo bar
{foo}Foo_Bar
{bar}Foo_Bar

the phrase:
Foo-Bar

becomes:

Foo-Bar
foo-bar
{foo}Foo-Bar
{bar}Foo-Bar

However if bar is a stop word, i would like to simply ignore all tokens that 
start with {bar}

Obviously I could have this logic inside my DIH script, but then I would need
to read in the stopword.txt file in the script, which I would like to avoid;
then again, it would probably be the more efficient approach.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: DataImportHandler - convertType attribute

2010-02-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Feb 3, 2010 at 3:31 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 One thing I find awkward about convertType is that it is JdbcDataSource
 specific, rather than field-specific.  Isn't the current implementation far
 too broad?
it is a feature of JdbcDataSource and no other dataSource offers it. We
offer it because JDBC drivers have a mechanism to do type conversion.

What do you mean by "it is too broad"?


        Erik

 On Feb 3, 2010, at 1:16 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 implicit conversion can cause problem when Transformers are applied.
 It is hard for user to guess the type of the field by looking at the
 schema.xml. In Solr, String is the most commonly used type. if you
 wish to do numeric operations on a field convertType will cause
 problems.
 If it is explicitly set, user knows why the type got changed.

 On Tue, Feb 2, 2010 at 6:38 PM, Alexey Serba ase...@gmail.com wrote:

 Hello,

 I encountered blob indexing problem and found convertType solution in

 FAQhttp://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

 I was wondering why it is not enabled by default and found the
 following comment

 http://www.lucidimagination.com/search/document/169e6cc87dad5e67/dataimporthandler_and_blobs#169e6cc87dad5e67in
 mailing list:

 We used to attempt type conversion from the SQL type to the field's
 given
 type. We
 found that it was error prone and switched to using the
 ResultSet#getObject
 for all columns (making the old behavior a configurable option –
 convertType in JdbcDataSource).

 Why it is error prone? Is it safe enough to enable convertType for all
 jdbc
 data sources by default? What are the side effects?

 Thanks in advance,
 Alex




 --
 -
 Noble Paul | Systems Architect| AOL | http://aol.com





-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Lucene User Group Meetup in Amsterdam

2010-02-03 Thread Uri Boness

Hi All,

On 17th February we'll host the first Dutch Lucene User Group Meetup. 
This meet-up will be split into two parts:


- The first part will be dedicated to the user group itself. We'll have 
an introduction to the members and have an open discussion about the 
goals of the user group and the expectations from it.


- In the second part, Anne Veling (http://www.beyondtrees.com) will give 
a session about his latest experiences with large scale Solr deployments.


Of course, you will not only get food for thought, but also food for your
stomach - we'll have a pizza break between the parts and of course beer
during & after.


Date: 17th February 2010
Time: 17:00
Location:   Frederiksplein 1
 1017XK Amsterdam
 The Netherlands

For more information or questions, please visit: 
http://www.lucene-nl.org/first_meetup


Hope to see you there!

Cheers,
Uri


Re: DataImportHandler - convertType attribute

2010-02-03 Thread Erik Hatcher


On Feb 3, 2010, at 5:36 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:
On Wed, Feb 3, 2010 at 3:31 PM, Erik Hatcher  
erik.hatc...@gmail.com wrote:
One thing I find awkward about convertType is that it is  
JdbcDataSource
specific, rather than field-specific.  Isn't the current  
implementation far

too broad?

it is feature of JdbcdataSource and no other dataSource offers it. we
offer it because JDBC drivers have mechanism to do type conversion

What do you mean by it is too broad?


I mean the convertType flag is not field-specific (or at least field  
overridable).  Conversions occur on a per-field basis, but the setting  
is for the entire data source and thus all fields.


Erik



RE: Basic indexing question

2010-02-03 Thread Stefan Maric
Thanks that was it - I've now configured a dismax requesthandler that suits
my needs



-Original Message-
From: Joe Calderon [mailto:calderon@gmail.com]
Sent: 03 February 2010 00:20
To: solr-user@lucene.apache.org
Subject: Re: Basic indexing question


see http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field for
details on default field, most people use the dismax handler when
handling queries from user
see http://wiki.apache.org/solr/DisMaxRequestHandler for more details,
if you don't have many fields you can write your own query using the
lucene query parser as i mentioned before; the syntax can be found at
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
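As a rough illustration of the "query all fields" alternative (assuming the default local Solr URL, which may differ in your setup), such a request could be built like this:

```python
from urllib.parse import urlencode

# Hypothetical host/port/core - adjust to your deployment.
params = {
    "q": "name:(ore) OR description:(ore)",  # query each field explicitly
    "rows": 10,
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

With dismax you would instead send `q=ore` plus a `qf` parameter listing the fields to search.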

hope this helps


--joe
On Tue, Feb 2, 2010 at 3:59 PM, Stefan Maric sma...@ntlworld.com wrote:
 Thanks for the quick reply
 I will have to see if the default query mechanism will suffice for most of
 my needs

 I have skimmed through most of the Solr documentation and didn't see
 anything describing

 I can easily change my DB View so that I only source Solr with a single
 string plus my id field
 (as my application making the search will have to collate associated
 information into a presentable screen anyhow - so I'm not too worried
about
 info being returned by Solr as such)

 Would that be a reasonable way of using Solr




 -Original Message-
 From: Joe Calderon [mailto:calderon@gmail.com]
 Sent: 02 February 2010 23:42
 To: solr-user@lucene.apache.org
 Subject: Re: Basic indexing question


 by default solr will only search the default fields, you have to
 either query all fields field1:(ore) or field2:(ore) or field3:(ore)
 or use a different query parser like dismax

 On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric sma...@ntlworld.com wrote:
 I have got a basic configuration of Solr up and running and have loaded
 some data to experiment with
  When I run a query for 'ore' I get 3 results when I'm expecting 4
 Dataimport is pulling the expected number of rows in from my DB view

  In my schema.xml I have
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="atomId" type="string" indexed="true" stored="true" required="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true"/>

 and the defaults
 <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
 <copyField source="name" dest="text"/>

  From an SQL point of view - I am expecting a search for 'ore' to
retrieve
 4 results (which the following does)
 select * from v_sm_search_sectors where description like '% ore%' or name
 like '% ore%';
 121 B0.010.010      Mining and quarrying
 Mining of metal ore, stone, sand, clay, coal and other solid minerals
 1000144 E0.030              Metal and metal ores wholesale
 (null)
 1000145 E0.030.010      Metal and metal ores wholesale
 (null)
 1000146 E0.030.020      Metal and metal ores wholesale agents   (null)

 From a Solr query for 'ore' - I get the following
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">ore</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="atomId">E0.030</str>
      <str name="id">1000144</str>
      <str name="name">Metal and metal ores wholesale</str>
    </doc>
    <doc>
      <str name="atomId">E0.030.010</str>
      <str name="id">1000145</str>
      <str name="name">Metal and metal ores wholesale</str>
    </doc>
    <doc>
      <str name="atomId">E0.030.020</str>
      <str name="id">1000146</str>
      <str name="name">Metal and metal ores wholesale agents</str>
    </doc>
  </result>
</response>


      So I don't retrieve the document where 'ore' is in the description
 field (and NOT the name field)

      It would seem that Solr is ONLY returning me results based on what
 has been put into the "text" field by the <copyField source="name" dest="text"/>

      Any hints as to what I've missed ??

      Regards
      Stefan Maric

 No virus found in this incoming message.
 Checked by AVG - www.avg.com
 Version: 8.5.435 / Virus Database: 271.1.1/2663 - Release Date: 02/02/10
 07:35:00


No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.5.435 / Virus Database: 271.1.1/2664 - Release Date: 02/02/10
19:35:00



Another basic question

2010-02-03 Thread Stefan Maric
I have got a basic configuration of Solr up and running and have loaded some
data to experiment with
Dataimport is pulling the expected number of rows in from my DB view

If I query for Beekeeping I get one result returned (as expected)

If I query for bee - I get no results
similarly for Bee
etc

What areas of Solr configuration do I need to look into

Thanks
Stefan Maric



Re: Another basic question

2010-02-03 Thread Ahmet Arslan
 I have got a basic configuration of
 Solr up and running and have loaded some
 data to experiment with
 Dataimport is pulling the expected number of rows in from
 my DB view
 
 If I query for Beekeeping i get one result returned (as
 expected)
 
 If I query for bee - I get no results
 similarly for Bee
 etc

Do you want the query (bee) to return documents containing "beekeeping"?

You can use the prefix query bee*, but I think DisMax does not support it.

Alternatively you can use index-time synonym expansion:
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
ignoreCase="true" expand="true"/>

with index_synonyms.txt:
beekeeping, bee keeping, bee-keeping
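A rough sketch of why index-time expansion makes the "bee" query match (plain Python, not the actual SynonymFilter, which works on token streams):

```python
# Each line of index_synonyms.txt with expand=true forms an equivalence
# group: indexing any member also indexes all the others, so a query for
# one word of a multi-word synonym can then match the document.
SYNONYMS = [{"beekeeping", "bee keeping", "bee-keeping"}]

def expand(token):
    for group in SYNONYMS:
        if token in group:
            return set(group)
    return {token}

def matches(query, indexed_terms):
    # a query term matches if it equals an entry or one of its words
    return any(query == entry or query in entry.split()
               for entry in indexed_terms)

print(matches("bee", expand("beekeeping")))  # True
```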


  


query all filled field?

2010-02-03 Thread Frederico Azeiteiro
Hi all,

 

Is it possible to query some field in order to get only documents where
that field is not empty?

 

All documents where field x is filled?

 

Thanks,

Frederico

 

 

 

 



Re: DataImportHandler - convertType attribute

2010-02-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Feb 3, 2010 at 4:16 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 On Feb 3, 2010, at 5:36 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 On Wed, Feb 3, 2010 at 3:31 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:

 One thing I find awkward about convertType is that it is JdbcDataSource
 specific, rather than field-specific.  Isn't the current implementation
 far
 too broad?

 it is feature of JdbcdataSource and no other dataSource offers it. we
 offer it because JDBC drivers have mechanism to do type conversion

 What do you mean by it is too broad?

 I mean the convertType flag is not field-specific (or at least field
 overridable).  Conversions occur on a per-field basis, but the setting is
 for the entire data source and thus all fields.
Yes, it is true.
First of all, this is not very widely used, so fine-tuning did not make sense.

        Erik





-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


How can I make my solr admin Password Protected

2010-02-03 Thread Vijayant Kumar
Hi,

Can anyone help me: how can I make my solr admin password protected so
that only authorized persons can access it?


-- 

Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211



Re: How can I make my solr admin Password Protected

2010-02-03 Thread Erik Hatcher

There's some basic info for Jetty and Resin here: 
http://wiki.apache.org/solr/SolrSecurity

Keep in mind the various URLs that Solr exposes though, so if you
aren't protecting /solr completely you'll want to be aware that
/update can add/update/delete anything, and so on.


Erik


On Feb 3, 2010, at 6:40 AM, Vijayant Kumar wrote:


Hi,

Can anyone help me: how can I make my solr admin password protected so
that only authorized persons can access it?


--

Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211





Re: Indexing an oracle warehouse table

2010-02-03 Thread Alexey Serba
 What would be the right way to point out which field contains the term 
 searched for.
I would use highlighting for all of these fields and then post-process the
Solr response in order to check the highlighting tags. But I don't usually
have that many fields, and I don't know if it's possible to configure Solr
to highlight fields using '*' as with dynamic fields.

On Wed, Feb 3, 2010 at 2:43 AM, caman aboxfortheotherst...@gmail.com wrote:

 Thanks all. I am on track.
 Another question:
 What would be the right way to point out which field contains the term
 searched for.
 e.g. If I search for SOLR and if the term exist in field788 for a document,
 how do I pinpoint that which field has the term.
 I copied all the fields in field called 'body' which makes searching easier
 but would be nice to show the field which has that exact term.

 thanks

 caman wrote:

 Hello all,

 hope someone can point me to right direction. I am trying to index an
 oracle warehouse table (TableA) with 850 columns. Of these, about 800
 fields are CLOBs and are good candidates for full-text
 searching. I also have a few columns which have relational links to other
 tables. I am clear on how to create a root entity and then pull data from
 other relational links as child entities. Most columns in TableA are named
 field1, field2 ... field800.
 Now my question is how to organize the schema efficiently:
 First option:
 if my query is 'select * from TableA', do I define <field name="attr1"
 column="FIELD1"/> for each of those 800 columns? Seems cumbersome. Maybe
 I can write a script to generate the XML instead of handwriting it in both
 data-config.xml and schema.xml.
 OR
 Don't define any <field name="attr1" column="FIELD1"/>, so that the column in
 SOLR will be the same as in the database table. But the questions are: 1) How
 do I define a unique field in this scenario? 2) How do I copy all the text
 fields to a common field for easy searching?

 Any helpful is appreciated. Please feel free to suggest any alternative
 way.

 Thanks







 --
 View this message in context: 
 http://old.nabble.com/Indexing-an-oracle-warehouse-table-tp27414263p27429352.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: wildcards in stopword list

2010-02-03 Thread Ahmet Arslan
 I am wondering if there is some way to maintain a stopword
 list with wildcards:
 
 ignoring anything that starts with foo:
 foo*

A custom TokenFilterFactory derived from StopFilterFactory can remove a token
if it matches a java.util.regex.Pattern. A list of patterns can be loaded from
a file in a similar fashion to stopwords.txt.
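Solr doesn't ship such a filter out of the box (as of 1.4), but the token-dropping logic the custom factory would need is small. Here is a hedged Python sketch of it, with a hypothetical pattern file whose foo*-style wildcards are translated to regexes:

```python
import re

# Hypothetical pattern file contents, one wildcard-style pattern per line,
# translated to anchored regexes (e.g. foo* -> ^foo.*$).
pattern_lines = ["{bar}*", "foo*"]
patterns = [re.compile("^" + re.escape(p).replace(r"\*", ".*") + "$")
            for p in pattern_lines]

def filter_tokens(tokens):
    # drop any token matching one of the stop patterns
    return [t for t in tokens
            if not any(rx.match(t) for rx in patterns)]

print(filter_tokens(["Foo_Bar", "{bar}Foo_Bar", "baz"]))  # ['Foo_Bar', 'baz']
```

A real implementation would do the same test inside `incrementToken()` of a TokenFilter, skipping matching tokens.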

 i am doing some funky hackery inside DIH via javascript to
 make my autosuggest work. i basically split phrases and
 store them together with the full phrase:
 
 the phrase:
 Foo Bar
 
 becomes:
 
 Foo Bar
 foo bar
 {foo}Foo_Bar
 {bar}Foo_Bar

What is the benefit of storing {foo}Foo_Bar and {bar}Foo_Bar?
Then how are you querying this to auto-suggest?


  


RE: query all filled field?

2010-02-03 Thread Frederico Azeiteiro
Ok, if anyone needs it:

I tried fieldX:[* TO *]
I think this is correct.

In my case I found out that I was not indexing this field correctly
because they are all empty. :)



-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: quarta-feira, 3 de Fevereiro de 2010 11:34
To: solr-user@lucene.apache.org
Subject: query all filled field?

Hi all,

 

Is it possible to query some field in order to get only not empty
documents?

 

All documents where field x is filled?

 

Thanks,

Frederico

 

 

 

 



Re: wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith

On 03.02.2010, at 13:07, Ahmet Arslan wrote:

 i am doing some funky hackery inside DIH via javascript to
 make my autosuggest work. i basically split phrases and
 store them together with the full phrase:
 
 the phrase:
 Foo Bar
 
 becomes:
 
 Foo Bar
 foo bar
 {foo}Foo_Bar
 {bar}Foo_Bar
 
 What is the benefit of storing {foo}Foo_Bar and {bar}Foo_Bar?
 Then how are you querying this to auto-suggest?


This way I can do a prefix facet search for the term "foo" or "bar", and in both
cases I can show the user "Foo Bar" with a bit of frontend logic to split off
the payload, aka the original data.
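The frontend logic mentioned here is essentially string surgery; a minimal sketch (assuming the {prefix}Original_Phrase token format described above):

```python
def display_form(token):
    # tokens look like "{foo}Foo_Bar": the braced part is the lowercased
    # prefix used for matching, the rest is the original phrase with
    # spaces replaced by underscores
    if token.startswith("{"):
        token = token.split("}", 1)[1]
    return token.replace("_", " ")

print(display_form("{foo}Foo_Bar"))  # Foo Bar
```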

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: query all filled field?

2010-02-03 Thread Ahmet Arslan

 Is it possible to query some field in order to get only not
 empty
 documents?
 
  
 
 All documents where field x is filled?

Yes. q=x:[* TO *] will bring documents that has non-empty x field.


  


Re: wildcards in stopword list

2010-02-03 Thread Ahmet Arslan
 this way i can do a prefix facet search for the term foo
 or bar and in both cases i can show the user Foo Bar
 with a bit of frontend logic to split off the payload aka
 original data.

So you have a list of phrases (pre-extracted) to be used for auto-suggest? Or
are you using bi-gram shingles?


  


Re: wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith

On 03.02.2010, at 13:41, Ahmet Arslan wrote:

 this way i can do a prefix facet search for the term foo
 or bar and in both cases i can show the user Foo Bar
 with a bit of frontend logic to split off the payload aka
 original data.
 
 So you have a list of phrases (pre-extracted) to be used for auto-suggest? Or 
 you are using bi-gram shingles?


For the actual search I am using bi-gram shingles for phrase boosting.
However for autosuggest this is not practical.

The issue is that I have multiple fields of data (names, address etc) that 
should all be relevant for the auto suggest. Furthermore a phrase entered can 
either match on one field or any combination of fields. Phrase in this context 
means separated by spaces or dash. For this I found the above approach the only 
feasible solution. 

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





RE: query all filled field?

2010-02-03 Thread Frederico Azeiteiro
Hmm, strange... I reindexed some docs with the field corrected.

Now I'm sure the field is filled because:

fieldX:(*a*) returns docs.

But fieldX:[* TO *] is returning the same as *:* (all results)

I tried with -fieldX:[* TO *] and I get no results at all.

I wonder if someone has tried this before with success?

The field is indexed as string, indexed=true and stored=true.

Thanks,
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: quarta-feira, 3 de Fevereiro de 2010 11:48
To: solr-user@lucene.apache.org
Subject: Re: query all filled field?


 Is it possible to query some field in order to get only not
 empty
 documents?
 
  
 
 All documents where field x is filled?

Yes. q=x:[* TO *] will bring documents that has non-empty x field.


  


Re: wildcards in stopword list

2010-02-03 Thread Ahmet Arslan

 Actually I plan to write a bigger blog post about the
 approach. In order to match the different fields I actually
 have a separate core with an index dedicated to auto suggest
 alone where I merge all fields together via some javascript
 code:
 
 This way I can then use terms for a single word entered and
 a facet prefix search with the last term as the prefix and
 the rest as the query for multi term entries into the auto
 suggest box.
 
 The idea is that I can then enter any part of any of the
 fields, but I will then be suggested the entire phrase in
 that field:
 
 So if I have a field:
 Foo Bar Ding Dong
 
 and I enter ding into the search box, I would get a
 suggestion of Foo Bar Ding Dong

If I am not wrong you have a list of suggestion candidates indexed in a 
separate core dedicated to auto suggest alone.

I think you can use this field type for suggestion.

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this field type, the query "ding" or "din" or "di" would return "Foo Bar
Ding Dong".

Also you do not need to index all combinations like:
{foo} Foo Bar Ding Dong
{bar} Foo Bar Ding Dong
{ding}Foo Bar Ding Dong
{dong}Foo Bar Ding Dong
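To make the behavior concrete, here is a rough simulation (plain Python, not Lucene) of what the index-time EdgeNGramFilter produces for "Foo Bar Ding Dong" and why a prefix typed by the user then matches as a plain term query:

```python
def edge_ngrams(token, min_size=1, max_size=20):
    # index-time expansion performed by EdgeNGramFilterFactory:
    # every leading prefix of the token, within the size bounds
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]

def index_terms(phrase):
    # whitespace tokenizer + lowercase filter, then n-gram expansion
    terms = set()
    for tok in phrase.lower().split():
        terms.update(edge_ngrams(tok))
    return terms

terms = index_terms("Foo Bar Ding Dong")
print("din" in terms, "ding" in terms, "xyz" in terms)  # True True False
```

Since the query analyzer omits the n-gram filter, a typed prefix like "din" is matched directly against the stored prefixes.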


  


Re: wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith

On 03.02.2010, at 14:34, Ahmet Arslan wrote:

 
 Actually I plan to write a bigger blog post about the
 approach. In order to match the different fields I actually
 have a separate core with an index dedicated to auto suggest
 alone where I merge all fields together via some javascript
 code:
 
 This way I can then use terms for a single word entered and
 a facet prefix search with the last term as the prefix and
 the rest as the query for multi term entries into the auto
 suggest box.
 
 The idea is that I can then enter any part of any of the
 fields, but I will then be suggested the entire phrase in
 that field:
 
 So if I have a field:
 Foo Bar Ding Dong
 
 and I enter ding into the search box, I would get a
 suggestion of Foo Bar Ding Dong
 
 If I am not wrong you have a list of suggestion candidates indexed in a 
 separate core dedicated to auto suggest alone.
 
 I think you can use this field type for suggestion.

First up:
I very much appreciate your input!

 fieldType name=prefix_token class=solr.TextField 
 positionIncrementGap=1
 analyzer type=index
  tokenizer class=solr.WhitespaceTokenizerFactory / 
  filter class=solr.LowerCaseFilterFactory / 
  filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=20 
 / 
  /analyzer
 analyzer type=query
  tokenizer class=solr.WhitespaceTokenizerFactory / 
  filter class=solr.LowerCaseFilterFactory / 
  /analyzer
  /fieldType
 
 With this field type, the query ding or din or di would return Foo Bar 
 Ding Dong. 


hmm, wouldn't it return "foo bar ding dong"?
obviously i have to decide how important it is for me to get the original
mixed-case string for auto suggest, but it does matter a bit more over here in
Europe than in the US, for example.

if i were to index both the original mixed case and the lower-case version and
remove the solr.LowerCaseFilterFactory in both analyzer sections, then it
should work, however, as long as terms usually start with an upper-case letter
if they do contain upper-case letters.

let me try this out ..

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: ContentStreamUpdateRequest addFile fails to close Stream

2010-02-03 Thread Mark Miller
Hey Christoph,

Could you give the patch at
https://issues.apache.org/jira/browse/SOLR-1744 a try and let me know
how it works out for you?

-- 
- Mark

http://www.lucidimagination.com





Mark Miller wrote:
 Christoph Brill wrote:
   
 I tried to fix it in CommonsHttpSolrServer but I wasn't sure how to do
 it. I tried to close the stream after the method got executed, but
 somehow getContent() always returned null (see attached patch against
 solr 1.4 for my non-working attempt).

 Who's responsible for closing a stream? CommonsHttpSolrServer? The
 caller? FileStream? I'm unsure because I don't know solrj in depth.

 Regards,
   Chris

 Am 02.02.2010 14:37, schrieb Mark Miller:
   
 
 Broken by design?

 How about we just fix BinaryUpdateRequestHandler (and possibly
 CommonsHttpSolrServer) to close the stream it gets?
 
   
   
 
 That class is a little messy for following - but I'd try just assigning
 the stream to a local stream thats available through the whole method,
 and then at the very bottom finally block, if stream != null, close it.
 I think we also want to close it if the exception that causes a retry
 happens:

 catch( NoHttpResponseException r ) {
   // This is generally safe to retry on
   method.releaseConnection();
   method = null;
   // If out of tries then just rethrow (as normal error).
   if( ( tries < 1 ) ) {
     throw r;
   }
   //log.warn( "Caught: " + r + ". Retrying..." );
 }

   





Re: wildcards in stopword list

2010-02-03 Thread Ahmet Arslan
  With this field type, the query ding or din or
 di would return Foo Bar Ding Dong. 
 
 hmm wouldnt it return foo bar ding dong ?

No, it will return original string. In this method you are not using faceting 
anymore. You are just querying and requesting a field.  

   q=suggest_field:di&fl=suggest_field


  


Re: wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith

On 03.02.2010, at 15:19, Ahmet Arslan wrote:

 With this field type, the query ding or din or
 di would return Foo Bar Ding Dong. 
 
 hmm wouldnt it return foo bar ding dong ?
 
 No, it will return original string. In this method you are not using faceting 
 anymore. You are just querying and requesting a field.  
 
   q=suggest_field:di&fl=suggest_field

Yeah, I just realized that while I was trying it out. :-)
Still testing ..

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: how to stress test solr

2010-02-03 Thread Marc Sturlese

I like to use JMeter with a large queries file. This way you can measure
response times with lots of requests at the same time. Having JConsole
opened at the same time you can check the memory status
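As a quick sanity check before setting up JMeter, the same idea — replaying a queries file concurrently and collecting response times — can be sketched in Python (the `fetch` callable standing in for the actual HTTP request to Solr is an assumption, not part of JMeter):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stress(fetch, queries, workers=8):
    """Replay queries concurrently; return one latency (seconds) per query."""
    def timed(q):
        start = time.time()
        fetch(q)  # e.g. urllib.request.urlopen("http://localhost:8983/solr/select?q=" + q)
        return time.time() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed, queries))
```

Feed it the same large queries file you would give JMeter and look at the latency distribution, with JConsole open alongside to watch memory as suggested above.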

James liu-2 wrote:
 
 before stressing test, Should i close SolrCache?
 
 which tool u use?
 
 How to do stress test correctly?
 
 Any pointers?
 
 -- 
 regards
 j.L ( I live in Shanghai, China)
 
 

-- 
View this message in context: 
http://old.nabble.com/how-to-stress-test-solr-tp27433733p27437524.html
Sent from the Solr - User mailing list archive at Nabble.com.



Any idea what could be wrong with this fq value?

2010-02-03 Thread javaxmlsoapdev

Following is my solr URL. 

http://hostname:port/solr/entities/select/?version=2.2&start=0&indent=on&qt=dismax&rows=60&fq=statusName:(Open
OR Cancelled)&debugQuery=true&q=dev&fq=groupName:"Infrastructure" 

“groupName” is one of the attributes I create fq (filterQuery) on. This
field(groupName) is being indexed and stored. 

When I search for anything other than “Infrastructure” in the groupName fq,
Solr brings me back correct results. When I pass in “Infrastructure” in the
fq=groupName:"Infrastructure" it never brings me anything back. If I remove
“fq” completely it will bring me all results, including records with
groupName:"Infrastructure". Something is wrong only with this
“Infrastructure” value in the fq. 

Any idea what could be going wrong? Clearly this is only related to the
value "Infrastructure" in the filter query.

Thanks,

-- 
View this message in context: 
http://old.nabble.com/Any-idea-what-could-be-wrong-with-this-fq-value--tp27437723p27437723.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Any idea what could be wrong with this fq value?

2010-02-03 Thread Erik Hatcher
is groupName a string field?  If not, it probably should be.  My  
hunch is that you're analyzing that field and it is lowercased in the  
index, and maybe even stemmed.


Try q=*:*&facet=on&facet.field=groupName to see all the *indexed*  
values of the groupName field.
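A toy illustration of the hunch above — the index analyzer transforms the stored term, so an exact, case-sensitive lookup misses it (the "stemmer" below is a made-up stand-in, not Solr's actual analysis chain):

```python
def index_analyze(term):
    # crude stand-in for a "text" field's index chain: lowercase + strip a suffix
    token = term.lower()
    for suffix in ("ure", "ing"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

# The "text" field stores a transformed token, not the raw value...
indexed_token = index_analyze("Infrastructure")
assert indexed_token == "infrastruct"
# ...so looking up the verbatim term finds nothing:
assert "Infrastructure" != indexed_token
# A "string" field skips analysis entirely: the stored term is the raw
# value, and an exact filter query on it matches.
```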


Erik

On Feb 3, 2010, at 10:05 AM, javaxmlsoapdev wrote:



Following is my solr URL.

http://hostname:port/solr/entities/select/? 
version=2.2&start=0&indent=on&qt=dismax&rows=60&fq=statusName:(Open

OR Cancelled)&debugQuery=true&q=dev&fq=groupName:"Infrastructure"

“groupName” is one of the attributes I create fq (filterQuery) on.  
This

field(groupName) is being indexed and stored.

When I search for anything else other than “Infrastructure” in fq  
groupName
Solr brings me back correct results. When I pass in “Infrastructure”  
in the
fq=groupName:"Infrastructure" it never brings me anything back. If I  
remove

“fq” completely it will bring me all results including records with
groupName:"Infrastructure". Something is wrong only with this
“Infrastructure” value in the fq.

Any idea what wrong could be happening. Clearly this is only related  
to

value "Infrastructure" in the filter query.

Thanks,

--
View this message in context: 
http://old.nabble.com/Any-idea-what-could-be-wrong-with-this-fq-value--tp27437723p27437723.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: wildcards in stopword list

2010-02-03 Thread Lukas Kahwe Smith

On 03.02.2010, at 14:34, Ahmet Arslan wrote:

 
 Actually I plan to write a bigger blog post about the
 approach. In order to match the different fields I actually
 have a separate core with an index dedicated to auto suggest
 alone where I merge all fields together via some javascript
 code:
 
 This way I can then use terms for a single word entered and
 a facet prefix search with the last term as the prefix and
 the rest as the query for multi term entries into the auto
 suggest box.
 
 The idea is that I can then enter any part of any of the
 fields, but I will then be suggested the entire phrase in
 that field:
 
 So if I have a field:
 Foo Bar Ding Dong
 
 and I enter ding into the search box, I would get a
 suggestion of Foo Bar Ding Dong
 
 If I am not wrong you have a list of suggestion candidates indexed in a 
 separate core dedicated to auto suggest alone.
 
 I think you can use this field type for suggestion.
 
 <fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>


hmm .. not sure yet if i like this approach better.
it seems i cannot use dismax here, at least its not finding matches, which 
means i need to parse the query to prevent people from going crazy stuff.
also in the old approach i was suggesting single words first and then slowly 
help people towards a full phrase. with this approach i immediately end up with 
full phrases, which severely limits the usefulness.

at the same time i am not sure if the index will really be significantly 
smaller with this approach than with my hack. and since there can also be 
matches inside words that have no real meaning i am also not sure if this 
really gets me better quality on this level either.

will play around with this some more tough.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: Any idea what could be wrong with this fq value?

2010-02-03 Thread javaxmlsoapdev

thanks Erik for the pointer. I had this field as text and after changing it
to string it started working as expected. 

I am still not sure why this particular value ("Infrastructure") was failing
to bring back results. Other values like "Network" and "Information" worked
fine when the field was of type text as well.

I tried (when groupName was of type text) 
q=*:*&facet=on&facet.field=groupName and it brought back 
"Infrastructure" correctly.

Can you explain internally how solr indexed this attribute differently and
changing to string from text started working?

Thanks,

Erik Hatcher-4 wrote:
 
 is groupName a string field?  If not, it probably should be.  My  
 hunch is that you're analyzing that field and it is lowercased in the  
 index, and maybe even stemmed.
 
 Try q=*:*&facet=on&facet.field=groupName to see all the *indexed*  
 values of the groupName field.
 
   Erik
 
 On Feb 3, 2010, at 10:05 AM, javaxmlsoapdev wrote:
 

 Following is my solr URL.

 http://hostname:port/solr/entities/select/? 
 version=2.2&start=0&indent=on&qt=dismax&rows=60&fq=statusName:(Open
 OR Cancelled)&debugQuery=true&q=dev&fq=groupName:"Infrastructure"

 “groupName” is one of the attributes I create fq (filterQuery) on.  
 This
 field(groupName) is being indexed and stored.

 When I search for anything else other than “Infrastructure” in fq  
 groupName
 Solr brings me back correct results. When I pass in “Infrastructure”  
 in the
 fq=groupName:"Infrastructure" it never brings me anything back. If I  
 remove
 “fq” completely it will bring me all results including records with
 groupName:"Infrastructure". Something is wrong only with this
 “Infrastructure” value in the fq.

 Any idea what wrong could be happening. Clearly this is only related  
 to
 value "Infrastructure" in the filter query.

 Thanks,

 -- 
 View this message in context:
 http://old.nabble.com/Any-idea-what-could-be-wrong-with-this-fq-value--tp27437723p27437723.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 

-- 
View this message in context: 
http://old.nabble.com/Any-idea-what-could-be-wrong-with-this-fq-value--tp27437723p27439279.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Indexing an oracle warehouse table

2010-02-03 Thread caman

Thanks. I will give this a shot.

Alexey-34 wrote:
 
 What would be the right way to point out which field contains the term
 searched for.
 I would use highlighting for all of these fields and then post process
 Solr response in order to check highlighting tags. But I don't have so
 many fields usually and don't know if it's possible to configure Solr
 to highlight fields using '*' as dynamic fields.
 
 On Wed, Feb 3, 2010 at 2:43 AM, caman aboxfortheotherst...@gmail.com
 wrote:

 Thanks all. I am on track.
 Another question:
 What would be the right way to point out which field contains the term
 searched for.
 e.g. If I search for SOLR and if the term exist in field788 for a
 document,
 how do I pinpoint that which field has the term.
 I copied all the fields in field called 'body' which makes searching
 easier
 but would be nice to show the field which has that exact term.

 thanks

 caman wrote:

 Hello all,

 hope someone can point me to right direction. I am trying to index an
 oracle warehouse table(TableA) with 850 columns. Out of the structure
 about 800 fields are CLOBs and are good candidate to enable full-text
 searching. Also have few columns which has relational link to other
 tables. I am clean on how to create a root entity and then pull data
 from
 other relational link as child entities.  Most columns in TableA are
 named
 as field1,field2...field800.
 Now my question is how to organize the schema efficiently:
 First option:
 if my query is 'select * from TableA', Do I  define field name=attr1
 column=FIELD1 / for each of those 800 columns?   Seems cumbersome.
 May
 be can write a script to generate XML instead of handwriting both in
 data-config.xml and schema.xml.
 OR
 Dont define any field name=attr1 column=FIELD1 / so that column in
 SOLR will be same as in the database table. But questions are 1)How do I
 define unique field in this scenario? 2) How to copy all the text fields
 to a common field for easy searching?

 Any helpful is appreciated. Please feel free to suggest any alternative
 way.

 Thanks







 --
 View this message in context:
 http://old.nabble.com/Indexing-an-oracle-warehouse-table-tp27414263p27429352.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://old.nabble.com/Indexing-an-oracle-warehouse-table-tp27414263p27439611.html
Sent from the Solr - User mailing list archive at Nabble.com.



SOLR Performance Tuning: Fuzzy Search

2010-02-03 Thread Fuad Efendi
I was lucky to contribute an excellent solution: 
http://issues.apache.org/jira/browse/LUCENE-2230

Even 2nd edition of Lucene in Action advocates to use fuzzy search only in
exceptional cases.


Another solution would be 2-step indexing (it may work for many use cases),
but it is not spellchecker

1. Create a regular index
2. Create a dictionary of terms
3. For each term, find nearest terms (for instance, stick with distance=2)
4. Use copyField in SOLR, or smth similar to synonym dictionary; or, for
instance, generate specific Query Parser...
5. Of course, custom request handler
and etc.

It may work well (but only if query contains term from dictionary; it can't
work as a spellchecker)

Combination 2 algos can boost performance extremely...
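Steps 2–3 of the sketch above (build a dictionary of terms, then precompute each term's neighbours within distance 2) might look like this — a naive scan over the dictionary, purely illustrative:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def near_terms(term, dictionary, max_dist=2):
    """Nearest dictionary terms for one indexed term (step 3 above)."""
    return sorted(t for t in dictionary
                  if t != term and levenshtein(term, t) <= max_dist)
```

The precomputed neighbours could then feed a synonym-style expansion (copyField or a custom query parser), as the post suggests.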


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search






Re: distributed search and failed core

2010-02-03 Thread Ian Connor
My only suggestion is to put haproxy in front of two replicas and then have
haproxy do the failover. If a shard fails, the whole search will fail unless
you do something like this.
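The same failover idea can also live in the client; a minimal sketch (the `fetch` callable doing the actual HTTP request is an assumption, and a real setup would still want haproxy-style health checks):

```python
def query_with_failover(replicas, fetch):
    """Try each replica of a shard in turn; raise only if all of them fail."""
    last_error = None
    for url in replicas:
        try:
            return fetch(url)
        except IOError as err:  # whatever your HTTP client raises on a down host
            last_error = err
    raise last_error
```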

On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.comwrote:

 hello *, in distributed search when a shard goes down, an error is
 returned and the search fails, is there a way to avoid the error and
 return the results from the shards that are still up?

 thx much

 --joe




-- 
Regards,

Ian Connor


Re: Solr response extremely slow

2010-02-03 Thread Rajat Garg

Here you go -

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12
11:06:47
Lucene Specification Version: 2.4-dev
Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16


-- 
View this message in context: 
http://old.nabble.com/Solr-response-extremely-slow-tp27432229p27441205.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: distributed search and failed core

2010-02-03 Thread Yonik Seeley
On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.com wrote:
 hello *, in distributed search when a shard goes down, an error is
 returned and the search fails, is there a way to avoid the error and
 return the results from the shards that are still up?

The SolrCloud branch has load-balancing capabilities for distributed
search amongst shard replicas.
http://wiki.apache.org/solr/SolrCloud

-Yonik
http://www.lucidimagination.com


Re: distributed search and failed core

2010-02-03 Thread Joe Calderon
thx guys, i ended up using a mix of code from the solr-1143 and
solr-1537 patches. now whenever there is an exception there is a
section in the results indicating the result is partial and it also lists
the failed core(s). we've added some monitoring to check for that
output as well to alert us when a shard has failed

On Wed, Feb 3, 2010 at 10:55 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.com wrote:
 hello *, in distributed search when a shard goes down, an error is
 returned and the search fails, is there a way to avoid the error and
 return the results from the shards that are still up?

 The SolrCloud branch has load-balancing capabilities for distributed
 search amongst shard replicas.
 http://wiki.apache.org/solr/SolrCloud

 -Yonik
 http://www.lucidimagination.com



autosuggest via solr.EdgeNGramFilterFactory (was: Re: wildcards in stopword list)

2010-02-03 Thread Lukas Kahwe Smith
Hi Ahmet,

Well after some more testing I am now convinced that you rock :)
I like the solution because its obviously way less hacky and more importantly I 
expect this to be a lot faster and less memory intensive, since instead of a 
facet prefix or terms search, I am doing an equality comparison on tokens 
(albeit a fair number of them, but each much smaller). I can also have more 
control over the ordering of the results. I can also make full use of the 
stopword filter, which again should improve the sort order (like if I have a 
stopword ag and a word starts with ag it will not be overpowered by tons 
of strings containing ag as a single word). Obviously there is one limitation 
if people enter search terms longer than 20, but I think I can safely ignore 
this case. Even with long german words 15 letters should be enough to find what 
the user is looking for. and if a word needs more characters, then its probably 
a meaningless post fix like versicherungsgesellschaft which just means 
insurance agency and the user is just being stupid.
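For reference, what an EdgeNGramFilterFactory with minGramSize=1/maxGramSize=20 emits per token, and where the 20-character cap bites, sketched in Python (not Solr's implementation):

```python
def edge_ngrams(token, min_size=1, max_size=20):
    """Front-edge n-grams of a single token, like solr.EdgeNGramFilterFactory."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]

grams = edge_ngrams("versicherungsgesellschaft")  # a 25-character token
assert "v" in grams and "versicherung" in grams   # every prefix up to...
assert "versicherungsgesells" in grams            # ...the 20-character cap
assert "versicherungsgesellsc" not in grams       # 21 chars: no longer indexed
```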

I do lose the nice numbers telling the user how often a given term matched, 
which has some merit for street/city names, less so for the names of people and 
close to none for company names. There is also a minor niggle with how the data 
is returned which I discuss at the end of the email.

I am using the following in my schema.xml

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
      maxGramSize="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
      splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

   <field name="name" type="prefix_token" indexed="true" stored="true" />
   <field name="firstname" type="prefix_token" indexed="true" stored="true" />
   <field name="email" type="prefix_token" indexed="true" stored="true" />
   <field name="city" type="prefix_token" indexed="true" stored="true" />
   <field name="street" type="prefix_token" indexed="true" stored="true" />
   <field name="telefon" type="prefix_token" indexed="true" stored="true" />
   <field name="id" type="string" indexed="true" stored="true" required="true" />

and finally the following in my solrconfig.xml

  <requestHandler name="auto" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="qf">name firstname email^0.5 telefon^0.5 city^0.6 street^0.6</str>
      <str name="fl">name,firstname,telefon,email,city,street</str>
    </lst>
  </requestHandler>

This all works well. There is just one minor uglyness, which might still be 
solveable inside solr, but I fixed it in the php frontend logic. The issue is 
that I obviously get all the fields for each document returned and I need to 
figure out for which I actually had a match to be presented in the autosuggest. 
Is there some Solr magic that will do this work for me?

$query = new SolrQuery($searchstring);
$response = $this->solrClientAuto->query($query);
$numFound = empty($response->response->numFound) ? 0 :
$response->response->numFound;
$data = array('results' => array(), 'numFound' => $numFound);

if (!empty($response->response->docs)) {
    $p = str_replace('"', '', substr($searchstring,
strpos($searchstring, ' ')));
    foreach ($response->response->docs as $doc) {
        foreach ((array)$doc as $value) {
            if (stripos($value, $p) === 0 || stripos($value, ' '.$p)) {
                $data['results'][$value] = 1;
            }
        }
    }
}

Then again I have to review with the UI guys if we will always just show the 
name anyways and replace the entire user entered term with the name which 
should be sufficiently unique in most cases to get a small enough result set.

regards,
Lukas

The Riddle of the Underscore and the Dollar Sign

2010-02-03 Thread Christopher Ball
I am perplexed by the behavior I am seeing of the Solr Analyzer and Filters
with regard to Underscores.

 

1) I am trying to get rid of them when shingling, but seem unable to do so
with a Stopwords Filter.

 

And yet they are being removed when I am not even trying to by the
WordDelimiter Filter.

 

2) Conversely, I would like to retain '$' symbols when they are adjacent to
numbers, but seem unable to without having to accept all forms of other
syntax. 

 

My simple example configuration and test data and results are below.

 

Most grateful for any guidance,

 

Christopher

 

 

Test Data:

 

<doc>

<field name="id">StopWordTestData</field>
<field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord
ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma,
beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000
$3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>

</doc>

 


 


Field 1 - Delimited_text:


Index Analyzer: org.apache.solr.analysis.TokenizerChain

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

1. org.apache.solr.analysis.WordDelimiterFilterFactory
   args:{splitOnCaseChange: 1, generateNumberParts: 0, catenateWords: 1,
   generateWordParts: 0, catenateAll: 1, catenateNumbers: 1}
2. org.apache.solr.analysis.LowerCaseFilterFactory args:{}


 


Field 1 - Resulting Index Terms:


 



Term                  #
100                   2
1000                  2
200                   2
3                     2
300                   2
3m                    2
a                     2
about                 2
also                  2
and                   2
beforeacollan         2
beforeacomma          2
beforeaperiod         2
dont                  2
m                     2
no                    2
peter                 2
preshingled           2
s                     2
thisisalsoastopword   2
thisisastopword       2
thisisnotastopword    2
underscore            2
x                     2
yes                   2
1                     2


Field2 - Shingled_Text:


Index Analyzer: org.apache.solr.analysis.TokenizerChain 

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

1. org.apache.solr.analysis.WordDelimiterFilterFactory
   args:{splitOnCaseChange: 1, generateNumberParts: 0, catenateWords: 1,
   stemEnglishPossessive: 0, generateWordParts: 0, catenateAll: 0,
   catenateNumbers: 1}
2. org.apache.solr.analysis.StopFilterFactory
   args:{words: StopWords-PreShingled.txt, ignoreCase: true,
   enablePositionIncrements: true}
3. org.apache.solr.analysis.LowerCaseFilterFactory args:{}
4. org.apache.solr.analysis.ShingleFilterFactory
   args:{outputUnigrams: false, maxShingleSize: 5}


 


File: StopWords-PreShingled.txt


s
_
PreShingled
__
ThisIsAStopWord
ThisIsAlsoAStopWord


 


Field2 - Resulting Index Terms (Sample):


 



Term                                          #
_ 100                                         1
_ 100 1 1000                                  1
_ _                                           1
_ _ beforeaperiod beforeacomma                1
_ beforeaperiod                               1
_ beforeaperiod beforeacomma beforeacollan    1
_ thisisnotastopword                          1
_ thisisnotastopword _ _                      1




 

 

 



Re: Search wihthout diacritics

2010-02-03 Thread Grant Ingersoll

On Feb 2, 2010, at 8:53 PM, Olala wrote:

 
 Hi all!
 
 I have problem with Solr, and I hope everyboby in there can help me :)
 
 I want to search text without diacritic but Solr will response diacritic
 text and without diacritic text.
 
 For example, I query "solr index", it will response "solr index", "sôlr
 index", "sòlr index", "sólr indèx",...
 
 I was tried ASCIIFoldingFilter and ISOLatin1AccentFilterFactory but it is
 not correct :(

What's not correct?  Can you provide more detail?  Is it not indexed correctly? 
 You might look at the Analysis tool under the Solr admin area to see how it is 
processing your content during indexing and searching.

 
 My schema config:
 
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
       generateWordParts="0" generateNumberParts="0" catenateWords="0"
       catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
       protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
       ignoreCase="true" expand="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
       generateWordParts="0" generateNumberParts="0" catenateWords="0"
       catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
       protected="protwords.txt"/>
   </analyzer>
 </fieldType>

You probably should strip diacritics during query time, too.
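A rough Python approximation of what the folding filter does (simplified — the real ASCIIFoldingFilterFactory also maps characters that have no Unicode decomposition, such as ø or æ); applying the same step in the query analyzer is what makes the two sides meet:

```python
import unicodedata

def fold(text):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert fold("sôlr indèx") == "solr index"
assert fold("sòlr") == fold("sólr") == "solr"
```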

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Guidance on Solr errors

2010-02-03 Thread Grant Ingersoll
Inline below.

On Feb 2, 2010, at 8:40 PM, Vauthrin, Laurent wrote:

 Hello,
 
 
 
 I'm trying to troubleshoot a problem that occurred on a few Solr slave
 Tomcat instances and wanted to run it by the list to see if I'm on the
 right track.
 
 
 
 The setup involves 1 master replicating to three slaves (I don't know
 what the replication interval is at this time).  These instances have
 been running fine for a while (from what I understand) but ran into
 problems just today during peak site usage.
 
 
 
 The following two exceptions were observed (partially stripped stack
 traces):
 
 
 
 WARNING: [] Error opening new searcher. exceeded limit of
 maxWarmingSearchers=2, try again later.
 
 Feb 1, 2010 10:00:31 AM org.apache.solr.common.SolrException log
 
 SEVERE: org.apache.solr.common.SolrException: Error opening new
 searcher. exceeded limit of maxWarmingSearchers=2, try again later.
 
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:941)
 
at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.
 java:368)
 
at
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpd
 ateProcessorFactory.java:77)
 
 
 
 Feb 1, 2010 10:29:36 AM org.apache.solr.common.SolrException log
 
 SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
 
at
 org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:734)
 
at
 org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.termDocs(MultiS
 egmentReader.java:612)
 
at
 org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.termDocs(MultiS
 egmentReader.java:605)
 
at
 org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.read(MultiSegme
 ntReader.java:570)
 
at org.apache.lucene.search.TermScorer.next(TermScorer.java:106)
 
at
 org.apache.lucene.search.DisjunctionSumScorer.initScorerDocQueue(Disjunc
 tionSumScorer.java:105)
 
at
 org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.
 java:144)
 
at
 org.apache.lucene.search.BooleanScorer2.next(BooleanScorer2.java:352)
 
at
 org.apache.lucene.search.DisjunctionSumScorer.initScorerDocQueue(Disjunc
 tionSumScorer.java:105)
 
at
 org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.
 java:144)
 
at
 org.apache.lucene.search.BooleanScorer2.next(BooleanScorer2.java:352)
 
at
 org.apache.lucene.search.ConjunctionScorer.init(ConjunctionScorer.java:8
 0)
 
at
 org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:4
 8)
 
at
 org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:319)
 
at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:137)
 
at org.apache.lucene.search.Searcher.search(Searcher.java:126)
 
at org.apache.lucene.search.Searcher.search(Searcher.java:105)
 
at
 org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.
 java:920)
 
at
 org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.j
 ava:838)
 
at
 org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:2
 69)
 
 
 
 Here's the config for the caches:
 
 
 
 filterCache: size=15000 initialSize=5000 autowarmCount=5000
 
 queryResultCache: size=15000 initialSize=5000 autowarmCount=15000
 
 documentCache: size=15000 initialSize=5000
 
 
 
 From what I understand, the first exception indicates that multiple
 replications are being processed at the same time.  Is that correct or
 could it be something else?

You are probably committing/replicating faster than Solr can open up the new 
index and warm the new searcher.  

 
 Does the second exception indicate that Solr is having problems handling
 the query load (possibly due to a commit happening at the same time)?

This is likely caused by the first problem b/c you are running out of memory

 
 
 
 Does anyone have any insight that might help here?  I sort of suspect
 that the autowarm counts are too large but I may be off there.  I can
 provide more details (as I get them) about this if needed.

You probably should start smaller, yes.  Bigger is not always better when it 
comes to caches, especially when GC is factored in.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Slow QueryComponent.process() when queries have numbers in them

2010-02-03 Thread Simon Wistow
According to my logs 

org.apache.solr.handler.component.QueryComponent.process()

takes a significant amount of time (5s, but I've seen up to 15s) when a 
query has an odd pattern of numbers in it, e.g.

neodymium megagauss-oersteds (MGOe) (1 MG·Oe = 7,958·10³ T·A/m = 7,958 
kJ/m³

myers 8e psychology chapter 9

JOHN PIPER 1 TIMOTEO 3:1?

lab 2.6.2: using wireshark to view protocol data units

malha de aço 3x3 6mm - peso m2

or even looks like it could be a query

An experiment has two outcomes, A and A. If A is three time as likely to occur 
as , what is P(A)?

other params were

fl:
*,score
fq:
+num_pages:[2 TO *] AND +language:1
hl:
true
hl.fl:
content title description
hl.simple.post:
/strong
hl.simple.pre:
strong
hl.snippets:
2
qf:
title^1.5 content^0.8
qs:
2
qt:
dismax
rows:
10
sort:
score desc
start:
0
wt:
json 


is this just something I'm going to have to put up with? Or is there 
something I can do to mitigate it. If it's a bug any suggestions on how 
to start patching it?




need help with feb 3/2010 trunk and solr-236

2010-02-03 Thread gdeconto

I got latest trunk (feb3/2010) using svn and applied solr-236.

did an ant clean and it seems to build fine with no errors or warnings.

however, when I start solr I get an error (here is a snippet):

SEVERE: org.apache.solr.common.SolrException: Error loading class
'org.apache.so
lr.handler.component.CollapseComponent'
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.
java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:422)
at
org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:444)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1499)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1493)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1526)
at
org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:824)
...

I can see CollapseComponent.class in apache-solr-1.5-dev.war inside
apache-solr-core-1.5-dev.jar\org\apache\solr\handler\component, so it seems
to be finding and building the java file ok.

any ideas?
-- 
View this message in context: 
http://old.nabble.com/need-help-with-feb-3-2010-trunk-and-solr-236-tp27446001p27446001.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: need help with feb 3/2010 trunk and solr-236

2010-02-03 Thread Koji Sekiguchi

gdeconto wrote:

I got latest trunk (feb3/2010) using svn and applied solr-236.

  

I tried the latest patch with Solr trunk yesterday, no problems there.


did an ant clean and it seems to build fine with no errors or warnings.

  

Did you run ant example? I think ant clean will delete everything...

Koji

--
http://www.rondhuit.com/en/



Using solr to store data

2010-02-03 Thread AJ Asver
Hi all,

I work on search at Scoopler.com, a real-time search engine which uses Solr.
We currently use Solr for indexing but then fetch data from our CouchDB
cluster using the IDs Solr returns. We are now considering storing a larger
portion of the data in Solr's index itself so we don't have to hit the DB too.
Assuming that we are still storing data in the DB (for backend and backup
purposes), are there any significant disadvantages to using Solr as a data
store too?

We currently run a master-slave setup on EC2 using x-large slave instances
to allow the disk cache to use as much memory as possible. I imagine we
would definitely have to add more slave instances to accommodate the extra
data we're storing (and make sure it stays in memory).

Any tips would be really helpful.
--
AJ Asver
Co-founder, Scoopler.com

+44 (0) 7834 609830 / +1 (415) 670 9152
a...@scoopler.com


Follow me on Twitter: http://www.twitter.com/_aj
Add me on Linkedin: http://www.linkedin.com/in/ajasver
or YouNoodle: http://younoodle.com/people/ajmal_asver

My Blog: http://ajasver.com


Re: Solr response extremely slow

2010-02-03 Thread Erik Hatcher


On Feb 3, 2010, at 1:38 PM, Rajat Garg wrote:

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47


There's the problem right there... that grantingersoll guy :)

(kidding)


Sounds like you're just hitting cache warming which can take a while.

Have you tried Solr 1.4?  Faceting performance, for example, is  
dramatically improved, among many other improvements.


Erik



Re: Solr usage with Auctions/Classifieds?

2010-02-03 Thread Lance Norskog
This field type (ExternalFileField) lets you supply a float value for a
field from an external file. You can only use it in function queries.
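
A sketch of how such an external-file-backed field could be declared in
schema.xml for the "last bid" case (field and type names here are my own
illustrative assumptions; check the ExternalFileField reference for your
Solr version for the exact attributes):

```xml
<!-- schema.xml: values come from a file named external_currentBid in the
     index directory, keyed by the uniqueKey field; the file can be
     regenerated and reloaded without reindexing any documents -->
<fieldType name="extBid" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="currentBid" type="extBid" indexed="false" stored="false"/>
```

The trade-off is exactly what Lance notes: the field is invisible to normal
queries, facets, and sorting, and can only be referenced through function
queries.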

On Sat, Jan 30, 2010 at 7:05 AM, Jan Høydahl / Cominvent
jan@cominvent.com wrote:
 A follow-up on the auction use case.

 How do you handle the need for frequent updates of only one field, such as 
 the last bid field (needed for sort on price, facets or range)?
 For high traffic sites, the document update rate becomes very high if you 
 re-send the whole document every time the bid price changes.

 --
 Jan Høydahl  - search architect
 Cominvent AS - www.cominvent.com

 On 10. des. 2009, at 19.52, Grant Ingersoll wrote:


 On Dec 8, 2009, at 6:37 PM, regany wrote:


 hello!

 just wondering if anyone is using Solr as their search for an auction /
 classified site, and if so how have you managed your setup in general? ie.
 searching against listings that may have expired etc.


 I know several companies using Solr for classifieds/auctions.  Some remove 
 the old listings while others leave them in and filter them or even allow 
 users to see old stuff (but often for reasons other than users finding them, 
 i.e. SEO).  For those that remove, it's typically a batch operation that 
 takes place at night.

 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
 Solr/Lucene:
 http://www.lucidimagination.com/search






-- 
Lance Norskog
goks...@gmail.com


Solr 1.4: Full import FileNotFoundException

2010-02-03 Thread ranjitr

Hello,

I have noticed that when I run concurrent full-imports using DIH in Solr
1.4, the index ends up getting corrupted. I see the following in the log
files (a snippet):


<record>
  <date>2010-02-03T17:54:24</date>
  <millis>1265248464553</millis>
  <sequence>764</sequence>
  <logger>org.apache.solr.handler.dataimport.SolrWriter</logger>
  <level>SEVERE</level>
  <class>org.apache.solr.handler.dataimport.SolrWriter</class>
  <method>commit</method>
  <thread>25</thread>
  <message>Exception while solr commit.</message>
  <exception>
    <message>java.io.FileNotFoundException:
/solrserver/apache-solr-1.3.0/example/multicore/RET/data/index/_5.cfs (No such file or directory)</message>
    <frame>
      <class>java.io.RandomAccessFile</class>
      <method>open</method>
    </frame>
    <frame>
      <class>java.io.RandomAccessFile</class>
      <method>&lt;init&gt;</method>
      <line>212</line>
    </frame>
    <frame>
      <class>org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor</class>
      <method>&lt;init&gt;</method>
      <line>552</line>
    </frame>
    <frame>
      <class>org.apache.lucene.store.FSDirectory$FSIndexInput</class>
      <method>&lt;init&gt;</method>
      <line>582</line>
    </frame>
    <frame>
      <class>org.apache.lucene.store.FSDirectory</class>
      <method>openInput</method>
      <line>488</line>
    </frame>
    <frame>
      <class>org.apache.lucene.index.CompoundFileReader</class>
      <method>&lt;init&gt;</method>
      <line>70</line>
    </frame>
    <frame>
      <class>org.apache.lucene.index.SegmentReader</class>
      <method>initialize</method>
      <line>319</line>
    </frame>
    <frame>
      <class>org.apache.lucene.index.SegmentReader</class>
      <method>get</method>
      <line>304</line>
    </frame>
    <frame>
      <class>org.apache.lucene.index.SegmentReader</class>
      <method>get</method>
      <line>234</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DataImporter$1</class>
      <method>run</method>
      <line>377</line>
    </frame>
  </exception>
</record>


Could this be because the concurrent full-imports are stepping on each
other's toes? It seems like one full-import request ends up deleting
another's segment files.

Is there a way to avoid this? Perhaps a config option? I would like to
retain the flexibility to issue concurrent full-import requests.

I found some documentation on this issue at:
http://old.nabble.com/FileNotFoundException-on-index-td25717530.html

But I looked at:
http://old.nabble.com/dataimporthandler-and-multiple-delta-import-td19160129.html

and was under the impression that this issue was fixed in Solr 1.4.

Kindly advise.

Ranjit.
-- 
View this message in context: 
http://old.nabble.com/Solr-1.4%3A-Full-import-FileNotFoundException-tp27446982p27446982.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: weird behabiour when setting negative boost with bq using dismax

2010-02-03 Thread Lance Norskog
In the standard query parser, this means remove all entries in which
field_a = 54.

 bq=-field_a:54^1

Generally speaking, by convention boosts in Lucene have unity at 1.0,
not 0.0. So, a negative boost is usually done with boosts between 0
and 1. For this case, maybe a boost of 0.1 is what you want?
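
Conceptually, bq in dismax is additive: documents matching the boost query
get extra score on top of their base score, while documents that don't
simply miss the bonus. A toy model (all numbers invented for illustration)
of why a positive boost on everything except field_a:54, rather than a
negative clause, demotes those docs without excluding them:

```python
# Toy model of dismax scoring with bq=(*:* -field_a:54)^0.5:
# the bq clause matches every doc EXCEPT those with field_a:54,
# so field_a:54 docs get no bonus and sink in the ranking,
# but they are never removed from the result set.
def final_score(base_score: float, has_field_a_54: bool, bq_boost: float = 0.5) -> float:
    bonus = 0.0 if has_field_a_54 else bq_boost
    return base_score + bonus

doc_matching = final_score(1.2, has_field_a_54=True)   # no bonus: 1.2
doc_other = final_score(1.0, has_field_a_54=False)     # bonus:    1.5
print(doc_other > doc_matching)
```

This is only a sketch of the scoring intuition, not Solr's actual scoring
code; the exact contribution of bq depends on the query and similarity.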

On Mon, Feb 1, 2010 at 8:04 AM, Marc Sturlese marc.sturl...@gmail.com wrote:

 I already asked about this long ago but the answer doesn't seem to work...
 I am trying to set a negative query boost to send the results that match
 field_a: 54 to a lower position. I have tried it in 2 different ways:

 bq=(*:* -field_a:54^1)
 bq=-field_a:54^1

 None of them seem to work. What seems to happen is that results that match
 field_a:54 are excluded. Just like doing:

 fq=-field_a:54

 Any idea what could be happening? Has anyone experienced this behaviour
 before?
 Thanks in advance
 --
 View this message in context: 
 http://old.nabble.com/weird-behabiour-when-setting-negative-boost-with-bq-using-dismax-tp27406614p27406614.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: ClassCastException setting date.formats in ExtractingRequestHandler

2010-02-03 Thread Lance Norskog
Please file a bug for this in the JIRA:

http://issues.apache.org/jira/browse/SOLR

Please add all details.

Thanks!

-- Forwarded message --
From: Christoph Brill christoph.br...@chamaeleon.de
Date: Tue, Feb 2, 2010 at 4:11 AM
Subject: ClassCastException setting date.formats in ExtractingRequestHandler
To: solr-user@lucene.apache.org


Hi list,

I tried to add the following to my solrconfig.xml (to the
<requestHandler name="/update/extract" ...> block):

 <lst name="date.formats">
   <str>yyyy-MM-dd</str>
 </lst>

which is described on the wiki page of the ExtractingRequestHandler[1].
After doing so I always get a ClassCastException once the lazy init of
the handler is happening. This is a stock solr 1.4 with no
modifications. The exception is like this:

org.apache.solr.common.util.NamedList$1$1 cannot be cast to java.lang.String

Is this a known bug? Or am I doing something wrong?

Thanks in advance,
 Chris

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Configuration



-- 
Lance Norskog
goks...@gmail.com


Re: Search wihthout diacritics

2010-02-03 Thread Lance Norskog
You need to add AsciiFoldingFilter to the query path as well as the
indexing path.

The solr/admin/analysis.jsp page lets you explore how these analysis
stacks work.
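
Conceptually, what ASCIIFoldingFilter does to terms can be approximated in
a few lines of Python (a rough sketch, not the actual Lucene
implementation, which covers many more character mappings):

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    # Decompose accented characters (e.g. "e" with grave -> "e" + combining
    # grave accent), then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Both the indexed text and the query must be folded the same way, which is
# why the filter has to appear in BOTH the index and query analyzer chains.
print(fold_to_ascii("sôlr indèx"))
```

If folding happens only at index time (as in the schema quoted below), an
accented query term never matches the folded index terms.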

On Tue, Feb 2, 2010 at 5:53 PM, Olala hthie...@gmail.com wrote:

 Hi all!

 I have problem with Solr, and I hope everyboby in there can help me :)

 I want to search text without diacritics, but Solr responds with both
 diacritic text and text without diacritics.

 For example, I query solr index, it will response solr index, sôlr
 index, sòlr index, sólr indèx,...

 I tried ASCIIFoldingFilter and ISOLatin1AccentFilterFactory but it is
 not correct :(

 My schema config:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
             protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
             protected="protwords.txt"/>
   </analyzer>
 </fieldType>


 --
 View this message in context: 
 http://old.nabble.com/Search-wihthout-diacritics-tp27430345p27430345.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: Using solr to store data

2010-02-03 Thread Lance Norskog
If you're happy with disk sizes and indexing & search performance, there
are still holes:

Documents update instead of fields, so when you have a million
documents that say German and should say French, you have to
reindex a million documents.

There are no tools for managing distributed indexes, so you're on your own.

Distributed TF/IDF is coming, but will never be perfect. So managing
your own distributed relevance strategies is a must.

On Wed, Feb 3, 2010 at 5:41 PM, AJ Asver a...@scoopler.com wrote:
 Hi all,

 I work on search at Scoopler.com, a real-time search engine which uses Solr.
  We current use solr for indexing but then fetch data from our couchdb
 cluster using the IDs solr returns.  We are now considering storing a larger
 portion of data in Solr's index itself so we don't have to hit the DB too.
  Assuming that we are still storing data on the db (for backend and back up
 purposes) are there any significant disadvantages to using solr as a data
 store too?

 We currently run a master-slave setup on EC2 using x-large slave instances
 to allow for the disk cache to use as much memory as possible.  I imagine we
 would definitely have to add more slave instances to accomodate the extra
 data we're storing (and make sure it stays in memory).

 Any tips would be really helpful.
 --
 AJ Asver
 Co-founder, Scoopler.com

 +44 (0) 7834 609830 / +1 (415) 670 9152
 a...@scoopler.com


 Follow me on Twitter: http://www.twitter.com/_aj
 Add me on Linkedin: http://www.linkedin.com/in/ajasver
 or YouNoodle: http://younoodle.com/people/ajmal_asver

 My Blog: http://ajasver.com




-- 
Lance Norskog
goks...@gmail.com


Re: Solr response extremely slow

2010-02-03 Thread Lance Norskog
Is it possible that the virtual machine does not give clean system
millisecond numbers?

On Wed, Feb 3, 2010 at 5:43 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 On Feb 3, 2010, at 1:38 PM, Rajat Garg wrote:

 Solr Specification Version: 1.3.0
 Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47

 There's the problem right there... that grantingersoll guy :)

 (kidding)


 Sounds like you're just hitting cache warming which can take a while.

 Have you tried Solr 1.4?  Faceting performance, for example, is dramatically
 improved, among many other improvements.

        Erik





-- 
Lance Norskog
goks...@gmail.com


Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-03 Thread Lance Norskog
I just tested this with a DIH that does not use database input.

If the DataImportHandler JDBC code does not support a schema that has
optional fields, that is a major weakness. Noble/Shalin, is this true?

On Tue, Feb 2, 2010 at 8:50 AM, Sascha Szott sz...@zib.de wrote:
 Hi,

 since some of the fields used in your DIH configuration aren't mandatory
 (e.g., keywords and tags are defined as nullable in your db table schema),
 add a default value to all optional fields in your schema configuration
 (e.g., default=""). Note that Solr does not understand the db-related
 concept of null values.
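
A minimal sketch of such a declaration (field names taken from the table
schema below; the type and other attributes are assumptions to adapt):

```xml
<!-- schema.xml: nullable db columns get an explicit empty default so
     documents missing those columns still index cleanly -->
<field name="keywords" type="text" indexed="true" stored="true" default=""/>
<field name="tags"     type="text" indexed="true" stored="true" default=""/>
```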

 Solr's log output

 SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
 & Gabbana D&G Neckties designer Tie for men 543},
 productID=productID(1.0)={220213}}]

 indicates that there aren't any tags or descriptions stored for the item
 with productId 220213. Since no default value is specified, Solr raises an
 error when creating the index document.

 -Sascha

 Jean-Michel Philippon-Nadeau wrote:

 Hi,

 Thanks for the reply.

 On Tue, 2010-02-02 at 16:57 +0100, Sascha Szott wrote:

 * the output of MySQL's describe command for all tables/views referenced
 in your DIH configuration

 mysql> describe products;
 +----------------+------------------+------+-----+---------+----------------+
 | Field          | Type             | Null | Key | Default | Extra          |
 +----------------+------------------+------+-----+---------+----------------+
 | productID      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | skuCode        | varchar(320)     | YES  | MUL | NULL    |                |
 | upcCode        | varchar(320)     | YES  | MUL | NULL    |                |
 | name           | varchar(320)     | NO   |     | NULL    |                |
 | description    | text             | NO   |     | NULL    |                |
 | keywords       | text             | YES  |     | NULL    |                |
 | disqusThreadID | varchar(50)      | NO   |     | NULL    |                |
 | tags           | text             | YES  |     | NULL    |                |
 | createdOn      | int(10) unsigned | NO   |     | NULL    |                |
 | lastUpdated    | int(10) unsigned | NO   |     | NULL    |                |
 | imageURL       | varchar(320)     | YES  |     | NULL    |                |
 | inStock        | tinyint(1)       | YES  | MUL | 1       |                |
 | active         | tinyint(1)       | YES  |     | 1       |                |
 +----------------+------------------+------+-----+---------+----------------+
 13 rows in set (0.00 sec)

 mysql> describe product_soldby_vendor;
 +-----------------+------------------+------+-----+---------+-------+
 | Field           | Type             | Null | Key | Default | Extra |
 +-----------------+------------------+------+-----+---------+-------+
 | productID       | int(10) unsigned | NO   | MUL | NULL    |       |
 | productVendorID | int(10) unsigned | NO   | MUL | NULL    |       |
 | price           | double           | NO   |     | NULL    |       |
 | currency        | varchar(5)       | NO   |     | NULL    |       |
 | buyURL          | varchar(320)     | NO   |     | NULL    |       |
 +-----------------+------------------+------+-----+---------+-------+
 5 rows in set (0.00 sec)

 mysql> describe products_vendors_subcategories;
 +----------------------------+------------------+------+-----+---------+----------------+
 | Field                      | Type             | Null | Key | Default | Extra          |
 +----------------------------+------------------+------+-----+---------+----------------+
 | productVendorSubcategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | productVendorCategoryID    | int(10) unsigned | NO   |     | NULL    |                |
 | labelEnglish               | varchar(320)     | NO   |     | NULL    |                |
 | labelFrench                | varchar(320)     | NO   |     | NULL    |                |
 +----------------------------+------------------+------+-----+---------+----------------+
 4 rows in set (0.00 sec)

 mysql> describe products_vendors_categories;
 +-------------------------+------------------+------+-----+---------+----------------+
 | Field                   | Type             | Null | Key | Default | Extra          |
 +-------------------------+------------------+------+-----+---------+----------------+
 | productVendorCategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | labelEnglish            | varchar(320)     | NO   |     | NULL    |                |
 | labelFrench             | varchar(320)     | NO   |     | NULL    |                |
 +-------------------------+------------------+------+-----+---------+----------------+
 3 rows in set (0.00 sec)

 mysql> describe product_vendor_in_subcategory;
 +-------------------+------------------+------+-----+---------+-------+
 | Field             | Type             | Null | Key | Default | Extra |
 +-------------------+------------------+------+-----+---------+-------+
 | productVendorID   | int(10) unsigned | NO   | MUL | NULL    |       |
 | productCategoryID | int(10) unsigned | NO   | MUL | NULL    |       |
 

Re: Using solr to store data

2010-02-03 Thread Tommy Chheng

Hey AJ,
For simplicity's sake, I am using Solr to serve as both storage and search for 
http://researchwatch.net.
The dataset is 110K NSF grants from 1999 to 2009. The faceting is all 
dynamic fields, and I use a catch-all to copy all fields to a default 
text field. All fields are also stored and used for the individual grant view.
The performance seems fine for my purposes. I haven't done any extensive 
benchmarking with it. The site was built using a light RoR/rsolr layer 
on a small EC2 instance.


Feel free to bang against the site with jmeter if you want to stress 
test a sample server to failure.  :)


--
Tommy Chheng
Developer & UC Irvine Graduate Student
http://tommy.chheng.com

On 2/3/10 5:41 PM, AJ Asver wrote:

Hi all,

I work on search at Scoopler.com, a real-time search engine which uses Solr.
  We current use solr for indexing but then fetch data from our couchdb
cluster using the IDs solr returns.  We are now considering storing a larger
portion of data in Solr's index itself so we don't have to hit the DB too.
  Assuming that we are still storing data on the db (for backend and back up
purposes) are there any significant disadvantages to using solr as a data
store too?

We currently run a master-slave setup on EC2 using x-large slave instances
to allow for the disk cache to use as much memory as possible.  I imagine we
would definitely have to add more slave instances to accomodate the extra
data we're storing (and make sure it stays in memory).

Any tips would be really helpful.
--
AJ Asver
Co-founder, Scoopler.com

+44 (0) 7834 609830 / +1 (415) 670 9152
a...@scoopler.com


Follow me on Twitter: http://www.twitter.com/_aj
Add me on Linkedin: http://www.linkedin.com/in/ajasver
or YouNoodle: http://younoodle.com/people/ajmal_asver

My Blog: http://ajasver.com

   


Re: query all filled field?

2010-02-03 Thread Lance Norskog
Queries that start with minus or NOT don't work. You have to do this:
 *:* AND -fieldX:[* TO *]
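
As a quick sketch of issuing that workaround (the URL assumes the stock
example server on port 8983; adjust for your deployment):

```python
from urllib.parse import urlencode

# A pure negative query must be anchored to the match-all *:* clause.
# This asks for documents where fieldX is missing or empty.
params = urlencode({"q": "*:* AND -fieldX:[* TO *]", "wt": "json"})
url = "http://localhost:8983/solr/select?" + params
print(url)
```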

On Wed, Feb 3, 2010 at 5:04 AM, Frederico Azeiteiro
frederico.azeite...@cision.com wrote:
 Hum, strange.. I reindexed some docs with the field corrected.

 Now I'm sure the field is filled because:

 fieldX:(*a*) returns docs.

 But fieldX:[* TO *] is returning the same as *.* (all results)

 I tried with -fieldX:[* TO *] and I get no results at all.

 I wonder if someone has tried this before with success?

 The field is indexed as string, indexed=true and stored=true.

 Thanks,
 Frederico

 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: quarta-feira, 3 de Fevereiro de 2010 11:48
 To: solr-user@lucene.apache.org
 Subject: Re: query all filled field?


 Is it possible to query some field in order to get only not
 empty
 documents?



 All documents where field x is filled?

 Yes. q=x:[* TO *] will bring documents that has non-empty x field.







-- 
Lance Norskog
goks...@gmail.com


Re: The Riddle of the Underscore and the Dollar Sign

2010-02-03 Thread Lance Norskog
Please reframe how you present the various fields and tests - it's hard
to follow in this email.

On Wed, Feb 3, 2010 at 12:50 PM, Christopher Ball
christopher.b...@metaheuristica.com wrote:
 I am perplexed by the behavior I am seeing of the Solr Analyzer and Filters
 with regard to Underscores.



 1) I am trying to get rid of them when shingling, but seem unable to do so
 with a Stopwords Filter.



 And yet they are being removed when I am not even trying to by the
 WordDelimiter Filter.



 2) Conversely, I would like to retain '$' symbols when they adjacent to
 numbers, but seem unable to without having to accept all forms of other
 syntax.



 My simple example configuration and test data and results are below.



 Most grateful for any guidance,



 Christopher





 Test Data:



 <doc>
   <field name="id">StopWordTestData</field>
   <field name="conSubSec-text_dc">PreShingled ThisIsNotAStopWord
 ThisIsAStopWord ThisIsAlsoAStopWord beforeaperiod. beforeacomma,
 beforeacollan: under_Score don't Peter's s $1.00 $1 $1,000 $200 $3,000,000
 $3m - # -#- --#-- Yes X No _ __ ___ a and also about</field>
 </doc>







 Field 1 - Delimited_text:

 Index Analyzer: org.apache.solr.analysis.TokenizerChain
 Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
 Filters:
 1. org.apache.solr.analysis.WordDelimiterFilterFactory
    args: {splitOnCaseChange: 1, generateNumberParts: 0, catenateWords: 1,
    generateWordParts: 0, catenateAll: 1, catenateNumbers: 1}
 2. org.apache.solr.analysis.LowerCaseFilterFactory args: {}





 Field 1 - Resulting Index Terms:

 Term                  #
 100                   2
 1000                  2
 200                   2
 3                     2
 300                   2
 3m                    2
 a                     2
 about                 2
 also                  2
 and                   2
 beforeacollan         2
 beforeacomma          2
 beforeaperiod         2
 dont                  2
 m                     2
 no                    2
 peter                 2
 preshingled           2
 s                     2
 thisisalsoastopword   2
 thisisastopword       2
 thisisnotastopword    2
 underscore            2
 x                     2
 yes                   2
 1                     2


 Field 2 - Shingled_Text:

 Index Analyzer: org.apache.solr.analysis.TokenizerChain
 Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
 Filters:
 1. org.apache.solr.analysis.WordDelimiterFilterFactory
    args: {splitOnCaseChange: 1, generateNumberParts: 0, catenateWords: 1,
    stemEnglishPossessive: 0, generateWordParts: 0, catenateAll: 0,
    catenateNumbers: 1}
 2. org.apache.solr.analysis.StopFilterFactory
    args: {words: StopWords-PreShingled.txt, ignoreCase: true,
    enablePositionIncrements: true}
 3. org.apache.solr.analysis.LowerCaseFilterFactory args: {}
 4. org.apache.solr.analysis.ShingleFilterFactory
    args: {outputUnigrams: false, maxShingleSize: 5}

 File: StopWords-PreShingled.txt

 s
 _
 PreShingled
 __
 ThisIsAStopWord
 ThisIsAlsoAStopWord





 Field 2 - Resulting Index Terms (Sample):

 Term                                          #
 _ 100                                         1
 _ 100 1 1000                                  1
 _ _                                           1
 _ _ beforeaperiod beforeacomma                1
 _ beforeaperiod                               1
 _ beforeaperiod beforeacomma beforeacollan    1
 _ thisisnotastopword                          1
 _ thisisnotastopword _ _                      1














-- 
Lance Norskog
goks...@gmail.com


Re: java.lang.NullPointerException with MySQL DataImportHandler

2010-02-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Feb 4, 2010 at 10:50 AM, Lance Norskog goks...@gmail.com wrote:
 I just tested this with a DIH that does not use database input.

 If the DataImportHandler JDBC code does not support a schema that has
 optional fields, that is a major weakness. Noble/Shalin, is this true?
The problem is obviously not with DIH. DIH blindly passes on all the
fields it can obtain from the DB; if some field is missing, DIH does
not do anything with it.

 On Tue, Feb 2, 2010 at 8:50 AM, Sascha Szott sz...@zib.de wrote:
 Hi,

 since some of the fields used in your DIH configuration aren't mandatory
 (e.g., keywords and tags are defined as nullable in your db table schema),
 add a default value to all optional fields in your schema configuration
 (e.g., default=""). Note that Solr does not understand the db-related
 concept of null values.

 Solr's log output

 SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce
 & Gabbana D&G Neckties designer Tie for men 543},
 productID=productID(1.0)={220213}}]

 indicates that there aren't any tags or descriptions stored for the item
 with productId 220213. Since no default value is specified, Solr raises an
 error when creating the index document.

 -Sascha

 Jean-Michel Philippon-Nadeau wrote:

 Hi,

 Thanks for the reply.

 On Tue, 2010-02-02 at 16:57 +0100, Sascha Szott wrote:

 * the output of MySQL's describe command for all tables/views referenced
 in your DIH configuration

 mysql> describe products;
 +----------------+------------------+------+-----+---------+----------------+
 | Field          | Type             | Null | Key | Default | Extra          |
 +----------------+------------------+------+-----+---------+----------------+
 | productID      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | skuCode        | varchar(320)     | YES  | MUL | NULL    |                |
 | upcCode        | varchar(320)     | YES  | MUL | NULL    |                |
 | name           | varchar(320)     | NO   |     | NULL    |                |
 | description    | text             | NO   |     | NULL    |                |
 | keywords       | text             | YES  |     | NULL    |                |
 | disqusThreadID | varchar(50)      | NO   |     | NULL    |                |
 | tags           | text             | YES  |     | NULL    |                |
 | createdOn      | int(10) unsigned | NO   |     | NULL    |                |
 | lastUpdated    | int(10) unsigned | NO   |     | NULL    |                |
 | imageURL       | varchar(320)     | YES  |     | NULL    |                |
 | inStock        | tinyint(1)       | YES  | MUL | 1       |                |
 | active         | tinyint(1)       | YES  |     | 1       |                |
 +----------------+------------------+------+-----+---------+----------------+
 13 rows in set (0.00 sec)

 mysql> describe product_soldby_vendor;
 +-----------------+------------------+------+-----+---------+-------+
 | Field           | Type             | Null | Key | Default | Extra |
 +-----------------+------------------+------+-----+---------+-------+
 | productID       | int(10) unsigned | NO   | MUL | NULL    |       |
 | productVendorID | int(10) unsigned | NO   | MUL | NULL    |       |
 | price           | double           | NO   |     | NULL    |       |
 | currency        | varchar(5)       | NO   |     | NULL    |       |
 | buyURL          | varchar(320)     | NO   |     | NULL    |       |
 +-----------------+------------------+------+-----+---------+-------+
 5 rows in set (0.00 sec)

 mysql> describe products_vendors_subcategories;
 +----------------------------+------------------+------+-----+---------+----------------+
 | Field                      | Type             | Null | Key | Default | Extra          |
 +----------------------------+------------------+------+-----+---------+----------------+
 | productVendorSubcategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | productVendorCategoryID    | int(10) unsigned | NO   |     | NULL    |                |
 | labelEnglish               | varchar(320)     | NO   |     | NULL    |                |
 | labelFrench                | varchar(320)     | NO   |     | NULL    |                |
 +----------------------------+------------------+------+-----+---------+----------------+
 4 rows in set (0.00 sec)

 mysql> describe products_vendors_categories;
 +-------------------------+------------------+------+-----+---------+----------------+
 | Field                   | Type             | Null | Key | Default | Extra          |
 +-------------------------+------------------+------+-----+---------+----------------+
 | productVendorCategoryID | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
 | labelEnglish            | varchar(320)     | NO   |     | NULL    |                |
 | labelFrench             | varchar(320)     | NO   |     | NULL    |                |
 +-------------------------+------------------+------+-----+---------+----------------+
 3 rows in set (0.00 sec)

 mysql> describe product_vendor_in_subcategory;
 +-------------------+------------------+------+-----+---------+-------+
 | Field             | Type             | Null | Key | Default | Extra |