timezone DIH and dataimport.properties

2011-04-26 Thread stockii
Hello.

How can I set the Java timezone in my Java properties?

My problem is that dataimport.properties contains the wrong timezone, and I
don't know how to set the correct one. Thanks!

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100,000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/timezone-DIH-and-dataimport-properties-tp2864928p2864928.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to concatenate two nodes of xml with xpathentityprocessor

2011-04-26 Thread Stefan Matheis
Vishal,

I don't really understand what you're trying to achieve. What are you
indexing (complete/sample documents, valid if possible)? And what exactly
do you expect as a result?

Regards
Stefan

On Mon, Apr 25, 2011 at 5:01 PM, vrpar...@gmail.com vrpar...@gmail.com wrote:
 hello,

 I am using XPathEntityProcessor to index XML files.

 Below is my XML file:

 <Full>
   <Customer name="a" id="1" .. other attributes >CustomerA</Customer>
   <Customer name="b" id="2" .. other attributes >ThisB</Customer>
   <Customer name="c" id="3" .. other attributes >AnyC</Customer>
 </Full>

 Now I want to concatenate them in the index, so that a search returns a
 result like the following:

 CDATA with the id attribute, e.g. <str id="1">CustomerA</str><str
 id="2">ThisB</str>, or something like that.

 Is this possible with RegexTransformer or TemplateTransformer? I googled a
 little for both but could not find an exact/useful solution.

 Thanks

 Vishal Parekh

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-to-concatenate-two-nodes-of-xml-with-xpathentityprocessor-tp2861260p2861260.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: timezone DIH and dataimport.properties

2011-04-26 Thread Stefan Matheis
java -Duser.timezone=UTC -jar start.jar ?

On Tue, Apr 26, 2011 at 9:54 AM, stockii stock.jo...@googlemail.com wrote:



Problem with autogeneratePhraseQueries

2011-04-26 Thread Solr Beginner
Hi,

I'm new to solr. My solr instance version is:

Solr Specification Version: 3.1.0
Solr Implementation Version: 3.1.0 1085815 - grantingersoll - 2011-03-26
18:00:07
Lucene Specification Version: 3.1.0
Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58
Current Time: Tue Apr 26 08:01:09 CEST 2011
Server Start Time:Tue Apr 26 07:59:05 CEST 2011

I have the following definition for the textgen type:

<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100"
    autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"
        side="front" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I'm using this type for the name field in my index. As you can see, I'm
using autoGeneratePhraseQueries="false", but for the query sony vaio 4gb I'm
getting the following query in the debug output:

<lst name="debug">
  <str name="rawquerystring">sony vaio 4gb</str>
  <str name="querystring">sony vaio 4gb</str>
  <str name="parsedquery">+name:sony +name:vaio +MultiPhraseQuery(name:"(4gb 4) gb")</str>
  <str name="parsedquery_toString">+name:sony +name:vaio +name:"(4gb 4) gb"</str>
</lst>

Do you have any idea how I can avoid this MultiPhraseQuery?

Best Regards,
solr_beginner


Re: Problem with autogeneratePhraseQueries

2011-04-26 Thread Robert Muir
What do you have in solrconfig.xml for luceneMatchVersion?

If you don't set this, then it's going to default to Lucene 2.9
emulation so that old Solr 1.4 configs work the same way. I tried your
example and it worked fine here, so I'm guessing this is probably
what's happening.

the default in the example/solrconfig.xml looks like this:

<!-- Controls what version of Lucene various components of Solr
     adhere to.  Generally, you want to use the latest version to
     get all bug fixes and improvements. It is highly recommended
     that you fully re-index after changing this setting as it can
     affect both how text is indexed and queried.
  -->
<luceneMatchVersion>LUCENE_31</luceneMatchVersion>

On Tue, Apr 26, 2011 at 6:51 AM, Solr Beginner solr_begin...@onet.pl wrote:



Re: Query regarding solr plugin.

2011-04-26 Thread Erick Erickson
Sorry, but there's too much here to debug remotely. I strongly advise you
to back way up. Undo (but save) all your changes. Start by doing
the simplest thing you can: just get a dummy class in place and
get it called. Perhaps create a really dumb logger method that
opens a text file, writes a message, and closes the file. Inefficient,
I know, but this is just to find the problem. Debugging by println is
an ancient technique...

Once you're certain the dummy class is called, gradually build it up
to the complex class you eventually want.

One problem here is that you've changed a bunch of moving parts, copied
jars around (it's unclear whether you have two copies of solr-core in your
classpath, for instance). So knowing exactly which one of those is the issue
is very difficult, especially since you may have forgotten one of the things
you did. I know when I've been trying to do something for days, lots of
details get lost.

Try to avoid changing the underlying Solr code; can you do what you want
by subclassing instead and calling your new class? That would avoid
a bunch of problems. If you can't subclass, copy the whole thing,
rename it to something new, and call *that* rather than re-using
SynonymFilterFactory. The only jar you should copy to the lib directory
is the one containing your new class.

I can't emphasize strongly enough that you'll save yourself lots of grief if
you start with a fresh install and build up gradually rather than try to
unravel the current code. It feels wasteful, but winds up being faster in
my experience...

Good Luck!
Erick

On Tue, Apr 26, 2011 at 12:41 AM, rajini maski rajinima...@gmail.com wrote:
 Thanks Erick. I have added my replies to the points you mentioned. I am
 going wrong somewhere. Do I need to combine both jars or something? If yes,
 how do I do that? I don't have much knowledge of Java and jar files.
 Please guide me here.

 A couple of things to try.

 1> When you do a 'jar -tfv yourjar', you should see
 output like:
  1183 Sun Jun 06 01:31:14 EDT 2010
 org/apache/lucene/analysis/sinks/TokenTypeSinkTokenizer.class
 and your filter statement may need the whole path, in this example...
 <filter class="org.apache.lucene.analysis.sinks.TokenTypeSink"/> (note, this
 is just an example of the pathing, this class has nothing to do with
 your filter)...

 I could see this output..

 2 But I'm guessing your path is actually OK, because I'd expect to be
 seeing a
 class not found error. So my guess is that your class depends on
 other jars that
 aren't packaged up in your jar and if you find which ones they are and copy
 them
 to your lib directory you'll be OK. Or your code is throwing an error
 on load. Or
 something like that...

 There is a jar, apache-solr-core-1.4.1.jar, which has the
 BaseTokenFilterFactory class and the SynonymFilterFactory class. I made the
 changes in the second class file and created it as new. Then I created a jar
 of that java file and placed it in solr home/lib, and also placed the
 apache-solr-core-1.4.1.jar file in the lib folder of solr home. [solr home -
 c:\orch\search\solr, lib path - c:\orch\search\solr\lib]

 3 to try to understand what's up, I'd back up a step. Make a really
 stupid class
 that doesn't do anything except derive from BaseTokenFilterFactory and see
 if
 you can load that. If you can, then your process is OK and you need to
 find out what classes your new filter depend on. If you still can't, then we
 can
 see what else we can come up with..


 I am perhaps doing the same. In the SynonymFilterFactory class, there is a
 function parseRules which takes the delimiter as one of its input parameters.
 Here I changed the comma ',' to the tilde '~', and that's it.


 Regards,
 Rajani


 On Mon, Apr 25, 2011 at 6:26 PM, Erick Erickson 
 erickerick...@gmail.com wrote:

 Looking at things more carefully, it may be one of your dependent classes
 that's not being found.


Re: Problem with autogeneratePhraseQueries

2011-04-26 Thread Solr Beginner
Thank you very much for the answer.

You were right. There was no luceneMatchVersion in the solrconfig.xml of our dev
core. We thought that values not present in the core configuration are inherited
from the main solrconfig.xml. I will investigate whether our administrators did
something wrong during the upgrade to 3.1.

On Tue, Apr 26, 2011 at 1:35 PM, Robert Muir rcm...@gmail.com wrote:

 What do you have in solrconfig.xml for luceneMatchVersion?

 If you don't set this, then its going to default to Lucene 2.9
 emulation so that old solr 1.4 configs work the same way. I tried your
 example and it worked fine here, and I'm guessing this is probably
 whats happening.

 the default in the example/solrconfig.xml looks like this:

 !-- Controls what version of Lucene various components of Solr
 adhere to.  Generally, you want to use the latest version to
 get all bug fixes and improvements. It is highly recommended
 that you fully re-index after changing this setting as it can
 affect both how text is indexed and queried.
  --
 luceneMatchVersionLUCENE_31/luceneMatchVersion

 On Tue, Apr 26, 2011 at 6:51 AM, Solr Beginner solr_begin...@onet.pl
 wrote:



Re: how to concatenate two nodes of xml with xpathentityprocessor

2011-04-26 Thread vrpar...@gmail.com
Thanks Stefan,

currently the XPathEntityProcessor part of my data-config file is:

<entity name="x" processor="XPathEntityProcessor" forEach="/FULL"
        url="D:\Files\${_FileName}" dataSource="FD">
  <field column="id" xpath="/FULL/Customer/@id"/>
  <field column="Customer" xpath="/Full/Customer"/>
</entity>

and when I search I get the following result:

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="Customer">
      <str>CustomerA</str>
      <str>AnyC</str>
    </arr>
    <arr name="id">
      <str>1</str>
      <str>3</str>
    </arr>
  </doc>
</result>

but I want the following result:

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="Combine">
      <str>1,CustomerA</str>
      <str>3,AnyC</str>
    </arr>
  </doc>
</result>

OR

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="Combine">
      <str id="1">CustomerA</str>
      <str id="3">AnyC</str>
    </arr>
  </doc>
</result>

or any other format, but I want both combined.
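One way this might be done (an untested sketch, not a confirmed solution for this setup): DIH's TemplateTransformer can build a new column out of existing columns. The "Combine" column name is invented here and would also need a matching field in schema.xml:

```xml
<entity name="x" processor="XPathEntityProcessor" forEach="/FULL"
        url="D:\Files\${_FileName}" dataSource="FD"
        transformer="TemplateTransformer">
  <field column="id" xpath="/FULL/Customer/@id"/>
  <field column="Customer" xpath="/Full/Customer"/>
  <!-- hypothetical combined column built from the two columns above -->
  <field column="Combine" template="${x.id},${x.Customer}"/>
</entity>
```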

Thanks

Vishal Parekh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-concatenate-two-nodes-of-xml-with-xpathentityprocessor-tp2861260p2865508.html
Sent from the Solr - User mailing list archive at Nabble.com.


What initialize new searcher?

2011-04-26 Thread Solr Beginner
Hi,

I'm reading the Solr cache documentation at
http://wiki.apache.org/solr/SolrCaching. It says: "The current
Index Searcher serves requests and when a new searcher is opened ...".
Could you explain when a new searcher is opened? Does it have something
to do with index commits?

Best Regards,
Solr Beginner


TermsCompoment + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread mdz-munich
Hi!

We've got one index split into 4 shards of about 70,000 records each of large
full-text data from (very dirty) OCR, so we have a lot of unique terms.
Now we are trying to obtain the 400 most common words for the CommonGramsFilter
via the TermsComponent, but the request always runs out of memory. The VM is
equipped with 32 GB of RAM, with 16-26 GB allocated to the JVM.

Any ideas how to get the most common terms without increasing the VM's memory?
 
Thanks  best regards,

Sebastian 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html
Sent from the Solr - User mailing list archive at Nabble.com.


org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'

2011-04-26 Thread vrpar...@gmail.com
Hello,

I got the following error:

org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.dataimport.DataImportHandler'
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
  at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:423)
  at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:459) ...

This error only occurs in Solr 3.1; in Solr 1.4.1 it works fine.

How can I solve this problem?

Thanks

Vishal Parekh


--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Error-loading-class-org-apache-solr-handler-dataimport-DataImpo-tp2865625p2865625.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'

2011-04-26 Thread Stefan Matheis
http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/
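Background on the linked post: in Solr 3.1 the DataImportHandler was moved out of the Solr war into a separate contrib jar (apache-solr-dataimporthandler-*.jar), so the class has to be put on the classpath explicitly. A minimal sketch of the usual fix; the dir value is an assumption, adjust it to where the jar lives in your install:

```xml
<!-- in solrconfig.xml -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar"/>
```

Copying the jar into the core's lib directory should work as well.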

On Tue, Apr 26, 2011 at 3:34 PM, vrpar...@gmail.com vrpar...@gmail.com wrote:



Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Robert Muir
On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 But somehow this feels bad (well, so does sticking word variations in what's
 supposed to be a synonyms file), partly because it means that the person 
 adding
 new synonyms would need to know what they stem to (or always check it against
 Solr before editing the file).

When creating the synonym map from your input file, the factory
currently uses only your Tokenizer to pre-process the synonyms file.

One idea would be to use the token stream up to the SynonymFilter
itself (including filters). This way, if you put a stemmer before the
SynonymFilter, it would stem your synonyms file, too.

I haven't totally thought the whole thing through to see if there's a
big reason why this wouldn't work (the SynonymFilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isn't so obvious, since in the
default configuration the SynonymFilter comes directly after the
tokenizer.
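To make the inconsistency concrete, consider a chain like the following sketch (the stemmer choice is just an example): the stemmer runs on indexed and queried text, but, as described above, it is currently not applied when synonyms.txt itself is parsed:

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- stems query tokens, but (currently) not the entries in synonyms.txt -->
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
```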


WhitespaceTokenizer and scoring(field length)

2011-04-26 Thread roySolr
Hello,

I have a problem with the WhitespaceTokenizer and scoring. An example:

id  Title
1   Manchester united
2   Manchester

With the WhitespaceTokenizer, "Manchester united" will be split into
"Manchester" and "united". When I search for "manchester" I get ids 1 and 2
in my results. What I want is for id 2 to score higher (field length).
How can I fix this?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread Burton-West, Tom
Don't know your use case, but if you just want a list of the 400 most common
words you can use the Lucene contrib class HighFreqTerms.java with the -t flag.
You have to point it at your Lucene index. You also probably want Solr not to
be running, and you'll want to give the JVM running HighFreqTerms a lot of memory.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log

Tom
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de] 
Sent: Tuesday, April 26, 2011 9:29 AM
To: solr-user@lucene.apache.org
Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE



Apache Solr 3.1.0

2011-04-26 Thread Wodek Siebor
I'm trying to tokenize email and IP addresses using
StandardTokenizerFactory. It correctly tokenizes IP addresses, but it divides
an email address into two tokens: one with the value before the '@' and the
other with the value after it.

This works correctly under Solr 1.4.1.

Has anybody else tried a similar thing on Solr 3.1.0 successfully, or is this a
potential bug?

Thanks,
Wlodek S.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-3-1-0-tp2866007p2866007.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Apache Solr 3.1.0

2011-04-26 Thread Steven A Rowe
Hi Wodek,

UAX29URLEmailTokenizer includes all of StandardTokenizer's rules and adds rules 
to tokenize URLs and Emails:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory
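For illustration, a minimal field type using that tokenizer might look like the following sketch (the type name is invented):

```xml
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keeps e-mail addresses and URLs as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```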

Steve

 -Original Message-
 From: Wodek Siebor [mailto:siebor_wlo...@bah.com]
 Sent: Tuesday, April 26, 2011 11:29 AM
 To: solr-user@lucene.apache.org
 Subject: Apache Solr 3.1.0
 


Problems with Spellchecker in 3.1

2011-04-26 Thread Bob Sandiford
Hi, all.

Sorry for any duplication - seems like what I sent yesterday never made it 
through...


We're having some troubles with the Solr Spellcheck Response.  We're running 
version 3.1.



Overview:  If we search for something really ugly like:



  kljhklsdjahfkljsdhf book rck



then when we get back the response, there's a suggestions list for 'rck', but 
no suggestions list for the other two words.  For 'book', that's fine, because 
it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be 
any suggestions.  For the ugly thing, though, there aren't any hits.



The problem is that when we're handling the result, we can't tell the 
difference between no suggestions for a 'correctly spelled' term, and no 
suggestions for something that's odd like this.



(Now - this is happening with searches that aren't as obviously garbage - i.e.
real words that just don't show up in the index and have no suggestions - this
was just to illustrate the point.)



Our setup:

We're running multiple shards, which may be part of the issue.  For example, 
'book' might be found in one of the shards, but not another.



I don't *think* this has anything to do with our schema, since it's really how 
the Search Suggestions are being returned to us.  But, here are some bits and 
pieces:

From schema.xml:

   <!-- Text field for spell checking -->
   <field name="textSpell" type="text" indexed="true"
          stored="false" multiValued="true" omitNorms="true"/>

From solrconfig.xml:

   <!-- The spell check component can return a list of alternative spelling
        suggestions. -->
   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

     <str name="queryAnalyzerFieldType">textSpell</str>

     <lst name="spellchecker">
       <str name="name">default</str>
       <str name="field">textSpell</str>
       <str name="spellcheckIndexDir">./spellchecker</str>
     </lst>

   </searchComponent>



What we'd really like to see is the response coming back with an indication 
that a word wasn't found / had no suggestions.  We've hacked around in the code 
a little bit to do this, but were wondering if anyone has come across this, and 
what approaches you've taken.



We created new classes which extend IndexBasedSpellChecker and 
SpellCheckComponent, as follows (package and imports excluded for (sort of) 
brevity).  The methods are as taken from the overridden classes, with changes 
noted by SD type comments...





/**
 * This has a slight modification of Solr's
 * AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the
 * suggestion.  This modification, working in tandem with the
 * SirsiDynixSpellCheckComponent, allows words with no suggestions to be
 * returned from the spell check component even in a sharded search.
 * Changes are marked with SD in the comments.
 */
public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker {

  @Override
  public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
    boolean shardRequest = false;
    SolrParams params = options.customParams;
    if (params != null) {
      shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
    }
    SpellingResult result = new SpellingResult(options.tokens);
    IndexReader reader = determineReader(options.reader);
    Term term = field != null ? new Term(field, "") : null;
    float theAccuracy = (options.accuracy == Float.MIN_VALUE)
        ? spellChecker.getAccuracy() : options.accuracy;

    int count = Math.max(options.count,
        AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);
    for (Token token : options.tokens) {
      String tokenText = new String(token.buffer(), 0, token.length());
      String[] suggestions = spellChecker.suggestSimilar(tokenText,
          count,
          field != null ? reader : null, // workaround LUCENE-1295
          field,
          options.onlyMorePopular, theAccuracy);
      if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
        // These are spelled the same, continue on
        List<String> suggList = Arrays.asList(suggestions); // SD added
        result.add(token, suggList);                        // SD added
        continue;
      }

      if (options.extendedResults == true && reader != null && field != null) {
        term = term.createTerm(tokenText);
        result.add(token, reader.docFreq(term));
        int countLimit = Math.min(options.count, suggestions.length);
        if (countLimit > 0) {
          for (int i = 0; i < countLimit; i++) {
            term = term.createTerm(suggestions[i]);
            result.add(token, suggestions[i], reader.docFreq(term));
          }
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      } else {
        if

Ebay Kleinanzeigen and Auto Suggest

2011-04-26 Thread Eric Grobler
Hi

Someone told me that eBay is using Solr.
I was looking at their auto-suggest implementation, and I guess they are
using shingles and the TermsComponent.

I managed to get a satisfactory implementation, but I have a problem with
category-specific filtering.
eBay suggestions are sensitive to categories like Cars and Pets.

As far as I understand, it is not possible to use filters with a terms
query.
Unless one uses multiple fields or special prefixes for the indexed words, I
cannot think how to implement this.

Is there perhaps a workaround for this limitation?

Best Regards
EricZ

---

I have a shingle type like:

<fieldType name="shingle_text" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
        maxShingleSize="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and a query like:

http://localhost:8983/solr/terms?q=*%3A*&terms.fl=suggest_text&terms.sort=count&terms.prefix=audi
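One sketch of the "multiple fields" workaround mentioned above (the field names are invented; the indexing client decides which per-category field each document's suggest text goes into):

```xml
<!-- schema.xml: one suggest field per category -->
<field name="suggest_cars" type="shingle_text" indexed="true" stored="false"/>
<field name="suggest_pets" type="shingle_text" indexed="true" stored="false"/>
```

A category-specific request then simply targets the matching field, e.g. terms.fl=suggest_cars&terms.prefix=audi.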


Solr Newbie: Starting embedded server with multicore

2011-04-26 Thread Simon, Richard T

I'm just starting with Solr. I'm using Solr 3.1.0, and I want to use
EmbeddedSolrServer with a multicore setup, even though I currently have only
one core (various documents I read suggest starting that way even if you have
one core, to get the better administrative tools supported by multicore).

I have two questions:

1.   Does the first code sample below start the server with multicore or
not?

2.   Why does the first sample work while the second does not?

My solr.xml looks like this:

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="mycore" sharedLib="lib">
    <core name="mycore" instanceDir="mycore"/>
  </cores>
</solr>

It's in a directory called solrhome in war/WEB-INF.

I can get the server to come up cleanly if I follow an example in the Packt 
Solr book (p. 231), but I'm not sure if this enables multi-core or not:


      File solrXML = new File("war/WEB-INF/solrhome/solr.xml");

      String solrHome = solrXML.getParentFile().getAbsolutePath();
      String dataDir = solrHome + "/data";

      coreContainer = new CoreContainer(solrHome);

      SolrConfig solrConfig = new SolrConfig(solrHome, "solrconfig.xml", null);

      CoreDescriptor coreDescriptor = new CoreDescriptor(coreContainer, "mycore",
              solrHome);

      SolrCore solrCore = new SolrCore("mycore",
              dataDir + "/" + "mycore", solrConfig, null, coreDescriptor);

      coreContainer.register(solrCore, false);

      embeddedSolr = new EmbeddedSolrServer(coreContainer, "mycore");


The documentation on the Solr wiki says I should configure the 
EmbeddedSolrServer for multicore like this:

  File home = new File( "/path/to/solr/home" );
  File f = new File( home, "solr.xml" );
  CoreContainer container = new CoreContainer();
  container.load( "/path/to/solr/home", f );

  EmbeddedSolrServer server = new EmbeddedSolrServer( container, "core name
  as defined in solr.xml" );


When I try to do this, I get an error saying that it cannot find solrconfig.xml:


  File solrXML = new File("war/WEB-INF/solrhome/solr.xml");

  String solrHome = solrXML.getParentFile().getAbsolutePath();

  coreContainer = new CoreContainer();

  coreContainer.load(solrHome, solrXML);

  embeddedSolr = new EmbeddedSolrServer(coreContainer,
      "mycore");



The message says it is looking in an odd place (I removed my user name from 
this). Why is it looking in solrhome/mycore/conf for solrconfig.xml? Both that 
and my schema.xml are in solrhome/conf. How can I point it at the right place? 
I tried adding 
REMOVED\workspace-Solr\institution-webapp\war\WEB-INF\solrhome\conf to the 
classpath, but got the same result:


SEVERE: java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in 
classpath or 
'REMOVED\workspace-Solr\institution-webapp\war\WEB-INF\solrhome\mycore\conf/',
 cwd=REMOVED\workspace-Solr\institution-webapp
  at 
org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
  at 
org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:234)
  at org.apache.solr.core.Config.<init>(Config.java:141)
  at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:132)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:430)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
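The error path is actually consistent with the solr.xml above: instanceDir="mycore" makes CoreContainer.load resolve each core's conf/ under that core's directory, not under the shared solr home. A layout like the following should satisfy the loader (a sketch assembled from the snippets in this thread):

```
war/WEB-INF/solrhome/
  solr.xml
  mycore/
    conf/
      solrconfig.xml
      schema.xml
    data/
```

i.e. moving solrconfig.xml and schema.xml from solrhome/conf into solrhome/mycore/conf.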





RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread mdz-munich
Thanks for your suggestion. It seems to be the combination of shards and
TermsComponent. Now we simply request shard-by-shard, without the
shards and shards.qt params, and merge the results via XSLT.

Sebastian 



 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2866499.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What initialize new searcher?

2011-04-26 Thread Erick Erickson
You're on the right track. In a system where the indexing process and
search process are on the same machine, commits by the index
process cause a new searcher to opened.

In a master/slave situation (assuming you are indexing on the
master and searching on the slave), then the searchers are
reopened on the slaves after a replication. Replications happen
after 1) a commit happens on the master and 2) the slave
polls the master and pulls down the new commits.

Hope that helps
Erick

On Tue, Apr 26, 2011 at 8:50 AM, Solr Beginner solr_begin...@onet.pl wrote:
 Hi,

 I'm reading solr cache documentation -
 http://wiki.apache.org/solr/SolrCaching I found there: "The current
 Index Searcher serves requests and when a new searcher is opened".
 Could you explain when a new searcher is opened? Does it have something
 to do with an index commit?

 Best Regards,
 Solr Beginner



Re: WhitespaceTokenizer and scoring(field length)

2011-04-26 Thread Erick Erickson
First, you can give us some more data to work with <G>...

In particular, attach debugQuery=on to your http request and post
the results. That will show how the documents got their score.

Also, show us the fieldType definition and field definition for the field
in question.

Best
Erick

On Tue, Apr 26, 2011 at 10:27 AM, roySolr royrutten1...@gmail.com wrote:
 Hello,

 I have a problem with the whitespaceTokenizer and scoring. An example:

 id                     Title
 1                      Manchester united
 2                      Manchester

 With the WhitespaceTokenizer "Manchester united" will be split into
 "Manchester" and "united". When
 I search for "manchester" I get ids 1 and 2 in my results. What I want is
 that id 2 scores higher (shorter field length).
 How can I fix this?


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/WhitespaceTokenizer-and-scoring-field-length-tp2865784p2865784.html
 Sent from the Solr - User mailing list archive at Nabble.com.
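One thing worth checking for the behavior roySolr wants: if the title field (or its type) has omitNorms="true", the length normalization that would favor the shorter title is discarded at index time. A sketch of the relevant schema line — the field name here is illustrative, not from the thread:

```xml
<!-- schema.xml: keep norms so Lucene's lengthNorm can favor shorter fields.
     With norms enabled, "Manchester" (1 token) outscores
     "Manchester united" (2 tokens) for the query "manchester". -->
<field name="titel" type="text" indexed="true" stored="true" omitNorms="false"/>
```

Reindexing is required after changing omitNorms.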



Question on Batch process

2011-04-26 Thread Charles Wardell
I am sure that this question has been asked a few times, but I can't seem to 
find the sweet spot for indexing.

I have about 100,000 files each containing 1,000 xml documents ready to be 
posted to Solr. My desire is to have it index as quickly as possible and then 
once completed the daily stream of ADDs will be small in comparison.

The individual documents are small. Essentially web postings from the net. 
Title, postPostContent, date. 

What would be the ideal configuration? For RamBufferSize, mergeFactor, 
MaxbufferedDocs, etc..

My machine is a quad core with hyper-threading, so it shows up as 8 CPUs in top.
I have 16GB of available RAM.


Thanks in advance.
Charlie

Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Suppose your analysis stack includes lower-casing, but your synonyms are 
only supposed to apply to upper-case tokens.  For example, "PET" might 
be a synonym of "positron emission tomography", but "pet" wouldn't be.


-Mike

On 04/26/2011 09:51 AM, Robert Muir wrote:

On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com  wrote:

   

But somehow this feels bad (well, so does sticking word variations in what's
supposed to be a synonyms file), partly because it means that the person adding
new synonyms would need to know what they stem to (or always check it against
Solr before editing the file).
 

when creating the synonym map from your input file, currently the
factory actually uses your Tokenizer only to pre-process the synonyms
file.

One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.

I haven't totally thought the whole thing through to see if there's a
big reason why this wouldn't work (the synonym filter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isn't so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.
   


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Robert Muir
Mike, thanks a lot for your example: the idea here would be you would
put the lowercasefilter after the synonymfilter, and then you get this
exact flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter - no lowercasing of tokens are done as it analyzes
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter - the synonyms are lowercased, as it analyzes
synonyms with the tokenizer+filter

its already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... its just
arbitrary that they are only being analyzed with the tokenizer.

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov soko...@ifactory.com wrote:
 Suppose your analysis stack includes lower-casing, but your synonyms are
 only supposed to apply to upper-case tokens.  For example, PET might be a
 synonym of positron emission tomography, but pet wouldn't be.

 -Mike

 On 04/26/2011 09:51 AM, Robert Muir wrote:

 On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com  wrote:



 But somehow this feels bad (well, so does sticking word variations in
 what's
 supposed to be a synonyms file), partly because it means that the person
 adding
 new synonyms would need to know what they stem to (or always check it
 against
 Solr before editing the file).


 when creating the synonym map from your input file, currently the
 factory actually uses your Tokenizer only to pre-process the synonyms
 file.

 One idea would be to use the tokenstream up to the synonymfilter
 itself (including filters). This way if you put a stemmer before the
 synonymfilter, it would stem your synonyms file, too.

 I haven't totally thought the whole thing through to see if theres a
 big reason why this wouldn't work (the synonymsfilter is complicated,
 sorry). But it does seem like it would produce more consistent
 results... and perhaps the inconsistency isnt so obvious since in the
 default configuration the synonymfilter is directly after the
 tokenizer.
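Written out as schema fragments, the two orderings Robert describes would look roughly like this (the field type names are invented for illustration):

```xml
<!-- Synonyms consulted before lowercasing: "PET" can expand
     while "pet" passes through untouched. -->
<fieldType name="text_syn_first" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Lowercasing first: both "PET" and "pet" reach the synonym
     filter as "pet", so both expand. -->
<fieldType name="text_syn_last" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
  </analyzer>
</fieldType>
```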




Re: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'

2011-04-26 Thread Scott Bigelow
I experienced the same issue. With Solr 1.x, I was copying out the
'example' directory to make my solr installation. However, for the
Solr 3.x distributions, the DataImportHandler class exists in a
directory that is at the same level as 'example': 'dist', not a
directory within it.

You'll either want to take the entire apache 3.1 directory, or modify
solrconfig to point to the new place you've copied it:

  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />



On Tue, Apr 26, 2011 at 6:38 AM, Stefan Matheis
matheis.ste...@googlemail.com wrote:
 http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/

 On Tue, Apr 26, 2011 at 3:34 PM, vrpar...@gmail.com vrpar...@gmail.com 
 wrote:
 Hello,

 I get the following error:

 org.apache.solr.common.SolrException: Error loading class
 'org.apache.solr.handler.dataimport.DataImportHandler' at
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:423) at
 org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:459) .

 actually this error only occurs in Solr 3.1; in Solr 1.4.1 it works fine

 how to solve this problem?

 Thanks

 Vishal Parekh


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Error-loading-class-org-apache-solr-handler-dataimport-DataImpo-tp2865625p2865625.html
 Sent from the Solr - User mailing list archive at Nabble.com.




RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Robert Petersen
OK this is even more weird... everything is working much better except
for one thing: I was testing use cases with our top query terms to make
sure the below query settings wouldn't break any existing behavior, and
got this most unusual result.  The analyzer stack completely eliminated
the word McAfee from the query terms!  I'm like huh?  Here is the
analyzer page output for that search term:

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1
term text   McAfee
term type   word
source start,end        0,6
payload 
org.apache.solr.analysis.SynonymFilterFactory
{synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
term position   1
term text   McAfee
term type   word
source start,end        0,6
payload 
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
term position   1
term text   McAfee
term type   word
source start,end        0,6
payload 
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
generateNumberParts=0, catenateWords=0, generateWordParts=0,
catenateAll=0, catenateNumbers=0}
term position
term text
term type
source start,end
payload
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position
term text
term type
source start,end
payload
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
{protected=protwords.txt}
term position
term text
term type
source start,end
payload
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position
term text
term type
source start,end
payload



-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Monday, April 25, 2011 11:27 AM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: term position question from analyzer stack for
WordDelimiterFilterFactory

Aha!  I knew something must be awry, but when I looked at the analysis
page output, well it sure looked like it should match.  :)

OK here is the query side WDF that finally works, I just turned
everything off.  (yay)  First I tried just completely removing WDF from
the query side analyzer stack but that didn't work.  So anyway I suppose
I should turn off the catenate all plus the preserve original settings,
reindex, and see if I still get a match huh?  (PS  thank you very much
for the help!!!)

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0"
          generateNumberParts="0"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
  />  



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: Monday, April 25, 2011 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com
wrote:
 The search and index analyzer stack are the same.

Ahhh, they should not be!
Using both generate and catenate in WDF at query time is a no-no.
Same reason you can't have multi-word synonyms at query time:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

I'd recommend going back to the WDF settings in the solr example
server as a starting point.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
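For comparison, the query-side WDF in the stock example schema that Yonik points to generates word parts but does not catenate — roughly the following (quoted from memory of the 1.4/3.1 example schema, so verify against your distribution):

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="1"/>
```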


Re: What initialize new searcher?

2011-04-26 Thread Otis Gospodnetic
Hi,

Yes, typically after your index has been replicated from master to a slave a 
commit will be issued and the new searcher will be opened.  Before being 
exposed to regular clients it's a good practice to warm things up.
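That warm-up is typically wired into solrconfig.xml as a newSearcher event listener; a minimal sketch (the queries themselves are placeholders to adapt to your own traffic):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- run a few representative queries to populate the caches
         before the new searcher is exposed -->
    <lst><str name="q">some popular query</str></lst>
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
  </arr>
</listener>
```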

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Solr Beginner solr_begin...@onet.pl
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 8:50:21 AM
 Subject: What initialize new searcher?
 
 Hi,
 
 I'm reading solr cache documentation -
 http://wiki.apache.org/solr/SolrCaching I found there The current
 Index  Searcher serves requests and when a new searcher is opened
 Could you  explain when new searcher is opened? Does it have something
 to do with index  commit?
 
 Best Regards,
 Solr Beginner
 


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Yes, I see.  Makes sense.  It is a bit hard to see a bad case for your 
proposal in that light. Here is one other example; I'm not sure whether 
it presents difficulties or not, and may be a bit contrived, but hey, 
food for thought at least:


Say you have set up synonyms between names and commonly-used pseudonyms 
or alternate names that should not be stemmed:


Malcolm X = Malcolm Little
Prince = Rogers Nelson Prince
Little Kim = Kimberly Denise Jones
Biggy Smalls etc.

You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to 
match anything. And "Princely" shouldn't bring up the artist.


But you also have regular linguistic synonyms (not names) that *should* 
be stemmed (as in the original example).  So "little = small" should 
imply "littler = smaller" and so on via stemming.


Ideally  you could put one SynonymFilter before the stemming and the 
other one after.  In that case do the SynonymFilters get composed?  I 
can't think of a believable example where that would cause a problem, 
but maybe you can?


-Mike


On 04/26/2011 04:25 PM, Robert Muir wrote:

Mike, thanks a lot for your example: the idea here would be you would
put the lowercasefilter after the synonymfilter, and then you get this
exact flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter -  no lowercasing of tokens are done as it analyzes
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter -  the synonyms are lowercased, as it analyzes
synonyms with the tokenizer+filter

its already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... its just
arbitrary that they are only being analyzed with the tokenizer.

On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolovsoko...@ifactory.com  wrote:
   

Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens.  For example, PET might be a
synonym of positron emission tomography, but pet wouldn't be.

-Mike

On 04/26/2011 09:51 AM, Robert Muir wrote:
 

On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.comwrote:


   

But somehow this feels bad (well, so does sticking word variations in
what's
supposed to be a synonyms file), partly because it means that the person
adding
new synonyms would need to know what they stem to (or always check it
against
Solr before editing the file).

 

when creating the synonym map from your input file, currently the
factory actually uses your Tokenizer only to pre-process the synonyms
file.

One idea would be to use the tokenstream up to the synonymfilter
itself (including filters). This way if you put a stemmer before the
synonymfilter, it would stem your synonyms file, too.

I haven't totally thought the whole thing through to see if theres a
big reason why this wouldn't work (the synonymsfilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isnt so obvious since in the
default configuration the synonymfilter is directly after the
tokenizer.

   
 


Re: Ebay Kleinanzeigen and Auto Suggest

2011-04-26 Thread Otis Gospodnetic
Hi Eric,

Before using the terms component, allow me to point out:

* http://sematext.com/products/autocomplete/index.html (used on 
http://search-lucene.com/ for example)

* http://wiki.apache.org/solr/Suggester


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Eric Grobler impalah...@googlemail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 1:11:11 PM
 Subject: Ebay Kleinanzeigen and Auto Suggest
 
 Hi
 
 Someone told me that ebay is using solr.
 I was looking at their  Auto Suggest implementation and I guess they are
 using Shingles and the  TermsComponent.
 
 I managed to get a satisfactory implementation but I have  a problem with
 category specific filtering.
 Ebay suggestions are sensitive  to categories like Cars and Pets.
 
 As far as I understand it is not  possible to using filters with a term
 query.
 Unless one uses multiple  fields or special prefixes for the words to index I
 cannot think how to  implement this.
 
 Is their perhaps a workaround for this  limitation?
 
 Best  Regards
 EricZ
 
 ---
 
 I am have  a shingle type like:
 fieldType name=shingle_text  class=solr.TextField
 positionIncrementGap=100
 analyzer
tokenizer class=solr.StandardTokenizerFactory/
filter  class=solr.ShingleFilterFactory minShingleSize=2
 maxShingleSize=4  /
filter class=solr.LowerCaseFilterFactory /
/analyzer
 /fieldType
 
 
 
 and a query like
http://localhost:8983/solr/terms?q=*%3A*&terms.fl=suggest_text&terms.sort=count&terms.prefix=audi
 


SynonymFilterFactory case changes

2011-04-26 Thread Robert Petersen
So if there is a hit in the synonym filter factory, do I need to put the
various case changes for a term so that the following
WordDelimiterFilter analyzer can do its 'split on case changes' work?
Here we see SynonymFilterFactory makes all terms lowercase because this
is what is in my synonyms.txt file and I have ignoreCase=true:
macafee, mcafee 

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1
term text   McAfee
term type   word
source start,end0,6
payload 
org.apache.solr.analysis.SynonymFilterFactory
{synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
term position   1
term text   macafee
mcafee
term type   word
word
source start,end0,6
0,6
payload 



Re: Question on Batch process

2011-04-26 Thread Otis Gospodnetic
Charlie,

How's this:
* -Xmx2g
* ramBufferSizeMB 512
* mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
* ignore/delete maxBufferedDocs - not used if you use ramBufferSizeMB
* use StreamingUpdateSolrServer (with params matching your number of CPU cores) 
or send batches of say 1000 docs with the other SolrServer impl using N threads 
(N = # of your CPU cores)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Charles Wardell charles.ward...@bcsolution.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 2:32:29 PM
 Subject: Question on Batch process
 
 I am sure that this question has been asked a few times, but I can't seem to  
find the sweetspot for indexing.
 
 I have about 100,000 files each  containing 1,000 xml documents ready to be 
posted to Solr. My desire is to have  it index as quickly as possible and then 
once completed the daily stream of ADDs  will be small in comparison.
 
 The individual documents are small.  Essentially web postings from the net. 
Title, postPostContent, date. 

 
 What would be the ideal configuration? For RamBufferSize, mergeFactor,  
MaxbufferedDocs, etc..
 
 My machine is a quad core hyper-threaded. So it  shows up as 8 cpu's in TOP
 I have 16GB of available ram.
 
 
 Thanks in  advance.
 Charlie
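Expressed as configuration, Otis's suggestions would look roughly like this (values copied from the advice above, not tuned against this exact workload):

```xml
<!-- solrconfig.xml, index-time settings -->
<indexDefaults>
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```

On the client side, SolrJ's StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 8) would queue documents and feed them with 8 threads, matching the 8 logical CPUs.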


Re: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Otis Gospodnetic
Hi Robert,

I'm no WDFF expert, but all these zeros look suspicious:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
generateNumberParts=0, catenateWords=0, generateWordParts=0,
catenateAll=0, catenateNumbers=0}

A quick visit to 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
 makes me think you want:

splitOnCaseChange=1  (if you want Mc Afee for some reason?)
generateWordParts=1 (if you want Mc Afee for some reason?)
preserveOriginal=1


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Sent: Tue, April 26, 2011 4:39:49 PM
 Subject: RE: term position question from analyzer stack for 
WordDelimiterFilterFactory
 
 OK this is even more weird... everything is working much better except
 for  one thing: I was testing use cases with our top query terms to make
 sure the  below query settings wouldn't break any existing behavior, and
 got this most  unusual result.  The analyzer stack completely eliminated
 the word  McAfee from the query terms!  I'm like huh?  Here is the
 analyzer  page output for that search term:
 
 Query  Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term  position 1
 term text McAfee
 term  type word
 source start,end  0,6
 payload 
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=query_synonyms.txt,  expand=true, ignoreCase=true}
 term position 1
 term  text McAfee
 term type word
 source  start,end 0,6
 payload 
 org.apache.solr.analysis.StopFilterFactory  {words=stopwords.txt,
 ignoreCase=true}
 term position  1
 term text McAfee
 term type  word
 source start,end 0,6
 payload 
 org.apache.solr.analysis.WordDelimiterFilterFactory  {preserveOriginal=0,
 generateNumberParts=0, catenateWords=0,  generateWordParts=0,
 catenateAll=0, catenateNumbers=0}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.LowerCaseFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload
 
 
 
 -Original Message-
 From: Robert  Petersen [mailto:rober...@buy.com] 
 Sent: Monday, April 25,  2011 11:27 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject:  RE: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 Aha!  I knew something must be  awry, but when I looked at the analysis
 page output, well it sure looked like  it should match.  :)
 
 OK here is the query side WDF that finally  works, I just turned
 everything off.  (yay)  First I tried just  completely removeing WDF from
 the query side analyzer stack but that didn't  work.  So anyway I suppose
 I should turn off the catenate all plus the  preserve original settings,
 reindex, and see if I still get a match  huh?  (PS  thank you very much
 for the help!!!)
 
filter  class=solr.WordDelimiterFilterFactory
  generateWordParts=0
  generateNumberParts=0
  catenateWords=0
  catenateNumbers=0
  catenateAll=0
  preserveOriginal=0
  /
 
 
 
 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of  Yonik
 Seeley
 Sent: Monday, April 25, 2011 9:24 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 On Mon, Apr 25, 2011 at 12:15 PM,  Robert Petersen rober...@buy.com
 wrote:
  The  search and index analyzer stack are the same.
 
 Ahhh, they should not  be!
 Using both generate and catenate in WDF at query time is a no-no.
 Same  reason you can't have multi-word synonyms at query time:
  http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 
 I'd  recommend going back to the WDF settings in the solr example
 server as a  starting point.
 
 
 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User  Conference, May
 25-26, San Francisco
 


RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Robert Petersen
Yeah I am about to try turning one on at a time and see what happens.  I
had a meeting so couldn't do it yet...  (darn those meetings)  (lol)


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, April 26, 2011 2:37 PM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

Hi Robert,

I'm no WDFF expert, but all these zeros look suspicious:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
generateNumberParts=0, catenateWords=0, generateWordParts=0,
catenateAll=0, catenateNumbers=0}

A quick visit to 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
 makes me think you want:

splitOnCaseChange=1  (if you want Mc Afee for some reason?)
generateWordParts=1 (if you want Mc Afee for some reason?)
preserveOriginal=1


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Sent: Tue, April 26, 2011 4:39:49 PM
 Subject: RE: term position question from analyzer stack for 
WordDelimiterFilterFactory
 
 OK this is even more weird... everything is working much better except
 for  one thing: I was testing use cases with our top query terms to
make
 sure the  below query settings wouldn't break any existing behavior,
and
 got this most  unusual result.  The analyzer stack completely
eliminated
 the word  McAfee from the query terms!  I'm like huh?  Here is the
 analyzer  page output for that search term:
 
 Query  Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term  position 1
 term text McAfee
 term  type word
 source start,end  0,6
 payload 
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=query_synonyms.txt,  expand=true, ignoreCase=true}
 term position 1
 term  text McAfee
 term type word
 source  start,end 0,6
 payload 
 org.apache.solr.analysis.StopFilterFactory  {words=stopwords.txt,
 ignoreCase=true}
 term position  1
 term text McAfee
 term type  word
 source start,end 0,6
 payload 
 org.apache.solr.analysis.WordDelimiterFilterFactory
{preserveOriginal=0,
 generateNumberParts=0, catenateWords=0,  generateWordParts=0,
 catenateAll=0, catenateNumbers=0}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.LowerCaseFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload
 
 
 
 -Original Message-
 From: Robert  Petersen [mailto:rober...@buy.com] 
 Sent: Monday, April 25,  2011 11:27 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject:  RE: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 Aha!  I knew something must be  awry, but when I looked at the
analysis
 page output, well it sure looked like  it should match.  :)
 
 OK here is the query side WDF that finally  works, I just turned
 everything off.  (yay)  First I tried just  completely removeing WDF
from
 the query side analyzer stack but that didn't  work.  So anyway I
suppose
 I should turn off the catenate all plus the  preserve original
settings,
 reindex, and see if I still get a match  huh?  (PS  thank you very
much
 for the help!!!)
 
filter  class=solr.WordDelimiterFilterFactory
  generateWordParts=0
  generateNumberParts=0
  catenateWords=0
  catenateNumbers=0
  catenateAll=0
  preserveOriginal=0
  /
 
 
 
 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of  Yonik
 Seeley
 Sent: Monday, April 25, 2011 9:24 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 On Mon, Apr 25, 2011 at 12:15 PM,  Robert Petersen rober...@buy.com
 wrote:
  The  search and index analyzer stack are the same.
 
 Ahhh, they should not  be!
 Using both generate and catenate in WDF at query time is a no-no.
 Same  reason you can't have multi-word synonyms at query time:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 
 I'd  recommend going back to the WDF settings in the solr example
 server as a  starting point.
 
 
 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User  Conference, May
 25-26, San Francisco
 


Reader per query request

2011-04-26 Thread cyang2010
Hi,

I was wondering: does Solr open a new Lucene IndexReader for every query
request?

From a performance point of view, is there any problem with opening a lot of
IndexReaders concurrently, or should the application have some logic to reuse
the same IndexReader?


Thanks,


cy




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867778.html
Sent from the Solr - User mailing list archive at Nabble.com.


Field Length and Highlight

2011-04-26 Thread Alejandro Delgadillo
Hi,

I've been using Solr with ColdFusion 9. I've made a couple of adjustments to
it in order to fulfill the needs of my client. I'm using Solr as a document
search engine for an online library which has documents larger than 20MB, and
some of them have more than 20 pages.

The thing is that... at first Solr didn't index all the text. I
already fixed that by changing the maxFieldLength setting in the
collections; now when I search for some word at the end of a document that
has like 150 pages, it shows me the document but won't highlight the words
that are almost at the end.

Any ideas?


Re: SynonymFilterFactory case changes

2011-04-26 Thread Erick Erickson
Yes, order does matter.  You're right, putting, say, lowercase in front
of WordDelimiter... will mess up the operations of WDFF.

The admin/analysis page is *extremely* useful for understanding what
happens in the analysis of input. Make sure to check the verbose
checkbox.

Best
Erick

On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote:
 So if there is a hit in the synonym filter factory, do I need to put the
 various case changes for a term so that the following
 WordDelimiterFilter analyzer can do its 'split on case changes' work?
 Here we see SynonymFilterFactory makes all terms lowercase because this
 is what is in my synonyms.txt file and I have ignoreCase=true:
 macafee, mcafee

 Index Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term position   1
 term text       McAfee
 term type       word
 source start,end        0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
 term position   1
 term text       macafee
 mcafee
 term type       word
 word
 source start,end        0,6
 0,6
 payload




Re: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Erick Erickson
I second Otis' comments. Is it possible that you've gotten twisted
around by trying to modify these settings and would be better off
going back to the WDFF settings in the example schema? I've
sometimes found that to be very useful.

Also (although I don't think it applies in this case) be aware that
the analysis page may introduce its own errors, so when you see
something really wonky, try a query with debugQuery=on and see
if the parsed query squares with the results on the analysis page...

 Best
Erick

On Tue, Apr 26, 2011 at 5:44 PM, Robert Petersen rober...@buy.com wrote:
 Yeah I am about to try turning one on at a time and see what happens.  I
 had a meeting so couldn't do it yet...  (darn those meetings)  (lol)


 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Tuesday, April 26, 2011 2:37 PM
 To: solr-user@lucene.apache.org
 Subject: Re: term position question from analyzer stack for
 WordDelimiterFilterFactory

 Hi Robert,

 I'm no WDFF expert, but all these zeros look suspicious:

 org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
 generateNumberParts=0, catenateWords=0, generateWordParts=0,
 catenateAll=0, catenateNumbers=0}

 A quick visit to
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDel
 imiterFilterFactory
  makes me think you want:

 splitOnCaseChange=1  (if you want Mc Afee for some reason?)
 generateWordParts=1 (if you want Mc Afee for some reason?)
 preserveOriginal=1


 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Sent: Tue, April 26, 2011 4:39:49 PM
 Subject: RE: term position question from analyzer stack for
WordDelimiterFilterFactory

 OK this is even more weird... everything is working much better except
 for  one thing: I was testing use cases with our top query terms to
 make
 sure the  below query settings wouldn't break any existing behavior,
 and
 got this most  unusual result.  The analyzer stack completely
 eliminated
 the word  McAfee from the query terms!  I'm like huh?  Here is the
 analyzer  page output for that search term:

 Query  Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term  position     1
 term text     McAfee
 term  type     word
 source start,end      0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=query_synonyms.txt,  expand=true, ignoreCase=true}
 term position     1
 term  text     McAfee
 term type     word
 source  start,end     0,6
 payload
 org.apache.solr.analysis.StopFilterFactory  {words=stopwords.txt,
 ignoreCase=true}
 term position      1
 term text     McAfee
 term type      word
 source start,end     0,6
 payload
 org.apache.solr.analysis.WordDelimiterFilterFactory
 {preserveOriginal=0,
 generateNumberParts=0, catenateWords=0,  generateWordParts=0,
 catenateAll=0, catenateNumbers=0}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.LowerCaseFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term  position
 term text
 term type
 source  start,end
 payload
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory  {}
 term position
 term text
 term type
 source  start,end
 payload



 -Original Message-
 From: Robert  Petersen [mailto:rober...@buy.com]
 Sent: Monday, April 25,  2011 11:27 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject:  RE: term position question from analyzer stack  for
 WordDelimiterFilterFactory

 Aha!  I knew something must be  awry, but when I looked at the
 analysis
 page output, well it sure looked like  it should match.  :)

 OK here is the query side WDF that finally  works, I just turned
 everything off.  (yay)  First I tried just  completely removeing WDF
 from
 the query side analyzer stack but that didn't  work.  So anyway I
 suppose
 I should turn off the catenate all plus the  preserve original
 settings,
 reindex, and see if I still get a match  huh?  (PS  thank you very
 much
 for the help!!!)

            <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="0"
                  generateNumberParts="0"
                  catenateWords="0"
                  catenateNumbers="0"
                  catenateAll="0"
                  preserveOriginal="0"
                  />



 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of  Yonik
 Seeley
 Sent: Monday, April 25, 2011 9:24 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: term position question from analyzer stack  for
 WordDelimiterFilterFactory

 On Mon, Apr 25, 2011 at 12:15 PM,  Robert Petersen rober...@buy.com
 wrote:
  The  search and index analyzer stack 
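
The split-on-case-change behavior argued over in this thread is easy to
model; here is a rough Python simulation (not Solr's actual
WordDelimiterFilter code) of what splitOnCaseChange=1 with
generateWordParts=1 roughly produces for alphabetic tokens:

```python
import re

def split_on_case_change(token):
    # Rough model of WDF splitOnCaseChange=1 + generateWordParts=1:
    # break a token at lower->upper case boundaries. Digits and
    # punctuation handling are omitted; this is only a sketch.
    return re.findall(r'[A-Z]+[a-z]*|[a-z]+', token)

print(split_on_case_change("McAfee"))   # ['Mc', 'Afee']
print(split_on_case_change("mcafee"))   # ['mcafee']
```

Note how the all-lowercase form yields a single token, which is why a
lowercase-only synonym replacement defeats the split discussed above.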

Re: Reader per query request

2011-04-26 Thread Erick Erickson
See below

On Tue, Apr 26, 2011 at 6:15 PM, cyang2010 ysxsu...@hotmail.com wrote:
 Hi,

 I was wondering if Solr opens a new Lucene IndexReader for every query
 request?

No, absolutely not. Solr only opens a reader when the underlying index
has changed, e.g. when a commit or a replication happens.

 From a performance point of view, is there any problem with opening a lot of
 IndexReaders concurrently, or should the application have some logic to reuse
 the same IndexReader?

Every time you open a reader, a whole new set of caches is initialized.
I have a hard time imagining a situation in which opening a new searcher
for each request would be a good idea. Opening a new reader, especially
for a large index, is a very expensive operation and should be done as
rarely as possible. But Solr does this automatically for you; by and
large you don't have to think about it.

Best
Erick



 Thanks,


 cy




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867778.html
 Sent from the Solr - User mailing list archive at Nabble.com.

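The reuse-until-commit policy Erick describes can be sketched as a toy
model; plain Python standing in for Solr's internals (none of these names
are Solr APIs):

```python
class SearcherCache:
    # Toy model: reuse one searcher until the index version changes
    # (i.e. until a commit or replication produces a new version).
    def __init__(self, open_searcher):
        self._open = open_searcher   # factory: version -> searcher
        self._version = None
        self._searcher = None

    def get(self, current_version):
        if current_version != self._version:   # only reopen on change
            self._searcher = self._open(current_version)
            self._version = current_version
        return self._searcher

opened = []
cache = SearcherCache(lambda v: opened.append(v) or f"searcher@{v}")
for version in [1, 1, 1, 2, 2]:   # three queries, a commit, two more queries
    cache.get(version)
print(opened)   # [1, 2] -- only two expensive opens for five requests
```

The point of the sketch: the cost of opening (and re-warming caches) is
paid per index change, not per query.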


Re: Too many open files exception related to solrj getServer too often?

2011-04-26 Thread cyang2010
Just pushing up the topic and look for answers.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Too-many-open-files-exception-related-to-solrj-getServer-too-often-tp2808718p2867976.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Reader per query request

2011-04-26 Thread cyang2010
Thanks a lot.  That makes sense.  -- CY

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Reader-per-query-request-tp2867778p2867995.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: SynonymFilterFactory case changes

2011-04-26 Thread Robert Petersen
But in this case lowercase is after WDF.  The question is: when you get a hit 
in the SynonymFilter on a synonym, and the entries in the synonyms.txt file 
are all in lower case, do I need to add the case-changing versions to make WDF 
work on case changes?  It appears the synonym text is replaced verbatim by 
what is in the txt file, and that defeats the WDF filter.  In fact, adding the 
case-changing versions of this term to the synonyms.txt file makes this use 
case work.  (yay)

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 26, 2011 3:39 PM
To: solr-user@lucene.apache.org
Subject: Re: SynonymFilterFactory case changes

Yes, order does matter.  You're right, putting, say, lowercase in front
of WordDelimiter... will mess up the operations of WDFF.

The admin/analysis page is *extremely* useful for understanding what
happens in the analysis of input. Make sure to check the verbose
checkbox.

Best
Erick

On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote:
 So if there is a hit in the synonym filter factory, do I need to put the
 various case changes for a term so that the following
 WordDelimiterFilter analyzer can do its 'split on case changes' work?
 Here we see SynonymFilterFactory makes all terms lowercase because this
 is what is in my synonyms.txt file and I have ignoreCase=true:
 macafee, mcafee

 Index Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term position   1
 term text       McAfee
 term type       word
 source start,end        0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
 term position   1
 term text       macafee
 mcafee
 term type       word
 word
 source start,end        0,6
 0,6
 payload




Re: Field Length and Highlight

2011-04-26 Thread Koji Sekiguchi

(11/04/27 7:35), Alejandro Delgadillo wrote:

Hi,

I've been using Solr with ColdFusion 9. I've made a couple of adjustments to
it in order to fulfill the needs of my client. I'm using Solr as a document
search engine for an online library which has documents larger than 20MB, and
some of them have more than 20 pages.

The thing is that... at first Solr didn't index all the text. I already fixed
that by raising the maxFieldLength setting in the
collections; now when I search for some word at the end of a document that
has like 150 pages, it shows me the document but won't highlight the words
that are almost at the end.

Any ideas?



So your maxAnalyzedChars is too small?
http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars

Koji
--
http://www.rondhuit.com/en/
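
A hedged solrconfig.xml sketch of raising that limit in a handler's
defaults (the handler name and value are illustrative; hl.maxAnalyzedChars
can also be passed per request):

```xml
<!-- solrconfig.xml: let the highlighter analyze far more of the field than
     the first 51200 characters (the default); value is illustrative -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="hl.maxAnalyzedChars">2147483647</int>
  </lst>
</requestHandler>
```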


Re: Question on Batch process

2011-04-26 Thread Charles Wardell
Thank you Otis.
Without trying to appear too stupid: when you refer to having the params 
match my # of CPU cores, are you talking about the # of threads I can 
spawn with the StreamingUpdateSolrServer object?
Up until now, I have been just utilizing post.sh or post.jar. Are these capable 
of that or do I need to write some code to collect a bunch of files into the 
buffer and send it off?

Also, do you have a sense for how long it should take to index 100,000 files, or 
in my case 100,000,000 documents?
StreamingUpdateSolrServer
public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int 
threadCount) throws MalformedURLException

Thanks again,
Charlie

-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
 Charlie,
 
 How's this:
 * -Xmx2g
 * ramBufferSizeMB 512
 * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
 * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
 * use StreamingUpdateSolrServer (with params matching your number of CPU 
 cores) 
 or send batches of say 1000 docs with the other SolrServer impl using N 
 threads 
 (N=# of your CPU cores)
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
  From: Charles Wardell charles.ward...@bcsolution.com
  To: solr-user@lucene.apache.org
  Sent: Tue, April 26, 2011 2:32:29 PM
  Subject: Question on Batch process
  
  I am sure that this question has been asked a few times, but I can't seem 
  to 
  find the sweetspot for indexing.
  
  I have about 100,000 files each containing 1,000 xml documents ready to be 
  posted to Solr. My desire is to have it index as quickly as possible and 
  then 
  once completed the daily stream of ADDs will be small in comparison.
  
  The individual documents are small. Essentially web postings from the net. 
  Title, postPostContent, date. 
  
  
  What would be the ideal configuration? For RamBufferSize, mergeFactor, 
  MaxbufferedDocs, etc..
  
  My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP
  I have 16GB of available ram.
  
  
  Thanks in advance.
  Charlie
 
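
post.sh/post.jar are single-threaded, so the parallelism has to come from
the client. A toy Python sketch (not SolrJ) of the "batches of ~1000 docs
across N worker threads" advice, with a counter standing in for
SolrServer.add():

```python
from concurrent.futures import ThreadPoolExecutor

def batches(docs, size=1000):
    # Split the document stream into fixed-size batches.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def send(batch):
    # Placeholder for SolrServer.add(batch); here we just count documents.
    return len(batch)

docs = list(range(4500))
with ThreadPoolExecutor(max_workers=8) as pool:   # N = # of CPU cores
    sent = sum(pool.map(send, batches(docs)))
print(sent)   # 4500 documents across five batches (4 x 1000 + 500)
```

With StreamingUpdateSolrServer the queueing and threading happen inside the
client object instead, via its queueSize and threadCount constructor
parameters.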


Re: SynonymFilterFactory case changes

2011-04-26 Thread Erick Erickson
Ahhh, I misread your post...

First, it's not the SynonymFilterFactory that's lowercasing anything. The
ignoreCase=true affects the matching, not the output. The output is
probably lowercased because you have it that way in the synonyms.txt
file. At least that's what I just saw using the analysis page from the
Solr admin page.

So yes, if you want the WDF to do anything with tokens put into the input
stream by the SynonymFilterFactory, you need to make the
replacement use the accurate case.

But I think you already figured all that out.

Best
Erick

On Tue, Apr 26, 2011 at 7:19 PM, Robert Petersen rober...@buy.com wrote:
 But in this case lowercase is after WDF.  The question is: when you get a 
 hit in the SynonymFilter on a synonym, and the entries in the synonyms.txt 
 file are all in lower case, do I need to add the case-changing versions to 
 make WDF work on case changes?  It appears the synonym text is replaced 
 verbatim by what is in the txt file, and that defeats the WDF filter.  In 
 fact, adding the case-changing versions of this term to the synonyms.txt file 
 makes this use case work.  (yay)

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, April 26, 2011 3:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SynonymFilterFactory case changes

 Yes, order does matter.  You're right, putting, say, lowercase in front
 of WordDelimiter... will mess up the operations of WDFF.

 The admin/analysis page is *extremely* useful for understanding what
 happens in the analysis of input. Make sure to check the verbose
 checkbox.

 Best
 Erick

 On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote:
 So if there is a hit in the synonym filter factory, do I need to put the
 various case changes for a term so that the following
 WordDelimiterFilter analyzer can do its 'split on case changes' work?
 Here we see SynonymFilterFactory makes all terms lowercase because this
 is what is in my synonyms.txt file and I have ignoreCase=true:
 macafee, mcafee

 Index Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term position   1
 term text       McAfee
 term type       word
 source start,end        0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
 term position   1
 term text       macafee
 mcafee
 term type       word
 word
 source start,end        0,6
 0,6
 payload
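A synonyms.txt sketch of the workaround that emerged in this thread: list a
case-preserving variant in the expansion so the downstream
WordDelimiterFilter still sees the case change (entries are illustrative):

```
# synonyms.txt -- keep a cased variant in the expansion so WDF's
# split-on-case-change still fires on "McAfee" (illustrative entries)
macafee, mcafee, McAfee
```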





Suggester or spellcheck return stored fields

2011-04-26 Thread wakemaster 39
Hello all,

I am trying to build an autocomplete solution for a website that I run. The
current implementation is going to be used for choosing who you want to send
PMs to. I have it basically working up to this point: the UI is done and the
suggester is working, returning possible suggestions without any major
problems. The problem I am currently running into is that the suggestions it
returns are not necessarily unique.

To solve this, I would like to return the user ID (a stored field) along
with the suggestion. This would help in other areas as well, but would ensure
things are unique. Is it possible to make the suggester return these other
fields, or does it strictly return text, as I assume is the case? I know I am
likely stretching what the suggester is supposed to do, so I am OK with
rolling back to a different plan using normal queries, but I would prefer to
be able to use the suggester if possible.

Thanks for the help,

Cameron


Re: How to Update Value of One Field of a Document in Index?

2011-04-26 Thread Peter Spam
My schema: id, name, checksum, body, notes, date

I'd like for a user to be able to add notes to the notes field, and not have to 
re-index the document (since the body field may contain 100MB of text).  Some 
ideas:

1) How about creating another core which only contains id, checksum, and notes? 
 Then, updating (delete followed by add) wouldn't be that painful?

2) What about using a multiValued field?  Could you just keep adding values as 
the user enters more notes?


Pete

On Sep 9, 2010, at 11:06 PM, Liam O'Boyle wrote:

 Hi Savannah,
 
 You can only reindex the entire document; if you only have the ID,
 then do a search to retrieve the rest of the data, then reindex.  This
 assumes that all of the fields you need to index are stored (so that
 you can retrieve them) and not just indexed.
 
 Liam
 
 On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 savannah_becket...@yahoo.com wrote:
 
 I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
 update the value of one of the fields of a document in the solr index after 
 the
 document was already indexed, and I have only the document id.  How do I do
 that?
 
 Thanks.
 
 
 

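The fetch-then-reindex approach from Liam's reply can be sketched as a toy
model; a plain Python dict stands in for the index, since Solr at this point
has no partial document update:

```python
# Toy model of "update one field" in a store without partial updates:
# fetch all stored fields, change one, and re-add the whole document.
index = {"doc1": {"id": "doc1", "body": "100MB of text...", "notes": ""}}

def update_field(doc_id, field, value):
    doc = dict(index[doc_id])   # retrieve every stored field
    doc[field] = value          # modify just the one field
    index[doc_id] = doc         # re-adding replaces the old document

update_field("doc1", "notes", "user note: reviewed")
print(index["doc1"]["notes"])   # user note: reviewed
```

This only works if every field you need is stored, not merely indexed,
which is exactly the caveat Liam raises.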


Re: What initialize new searcher?

2011-04-26 Thread Solr Beginner
Thank you for the answers. I'm moving forward and have a few more
questions, but for separate threads.

On Tue, Apr 26, 2011 at 10:47 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hi,

 Yes, typically after your index has been replicated from master to a slave a
 commit will be issued and the new searcher will be opened.  Before being 
 exposed
 to regular clients it's a good practice to warm things up.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Solr Beginner solr_begin...@onet.pl
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 8:50:21 AM
 Subject: What initialize new searcher?

 Hi,

 I'm reading solr cache documentation -
 http://wiki.apache.org/solr/SolrCaching I found there The current
 Index  Searcher serves requests and when a new searcher is opened
 Could you  explain when new searcher is opened? Does it have something
 to do with index  commit?

 Best Regards,
 Solr Beginner




fieldCache only on stats page

2011-04-26 Thread Solr Beginner
Hi,

I can see only fieldCache (nothing about the filter, query, or document
caches) on the stats page. What am I doing wrong? We have two servers with
replication. There are two cores (prod, dev) on each server. Maybe I
have to add something to the solrconfig.xml of the cores?

Best Regards,
Solr Beginner


DataImportHandler in Solr 3.1.0: not updating dataimport.properties last_index_time on delta-import?

2011-04-26 Thread Scott Bigelow
Title pretty much says it all; I've configured the DIH in 3.1.0, and
it works great, except the delta-imports are always from the last time
a full-import happened, not a delta-import. After a delta-import,
dataimport.properties is completely untouched. The documentation
implies that the delta-import should update the last_index_time:

The DataImportHandler exposes a variable called last_index_time which
is a timestamp value denoting the last time full-import 'or'
delta-import was run
- http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example

Is there a configuration preventing delta-import from updating
dataimport.properties? It updates properly on each full-import.