Cached document view in solr
Hi, Nutch search results provide a link for viewing the cached copy of a document. It fetches the raw content from the segments based on the document id (cached.jsp). Is it possible to have similar functionality in Solr? What can be done to achieve this? Any pointers. Thanks, Ram DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Re: How to set User.dir or CWD for Solr during Tomcat startup
On 07.01.2010 at 00:07, Turner, Robbin J wrote: I've been doing a bunch of googling and haven't seen if there is a parameter to set within Tomcat other than the solr/home which is set up in the solr.xml under $CATALINA_HOME/conf/Catalina/localhost/. Hi. We set this in solr.xml:

<Context docBase="/opt/solr-tomcat/apache-tomcat-6.0.20/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr-tomcat/solr" override="true"/>
</Context>

http://wiki.apache.org/solr/SolrTomcat#Simple_Example_Install hope this helps. olivier -- Olivier Dobberkau . . . . . . . . . . . . . . Je TYPO3, desto d.k.d
Is there other way than sorting by date?
Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! -- 梅旺生
Re: DisMaxRequestHandler bf configuration
It wouldn't be q.alt though, just q, in the config file. q.alt is typically *:*; it's the fallback query when no q is provided. Though, thinking about it, q.alt would work here, but I'd use q personally. On Jan 6, 2010, at 9:45 PM, Andy wrote: Let me make sure I understand you. I'd get my regular query from haystack as qq=foo rather than q=foo. Then, within the dismax section of solrconfig, I put:

<str name="q.alt">{!boost b=$popularityboost v=$qq}</str>
<str name="popularityboost">log(popularity)</str>

Is that what you meant? --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 8:42 PM On Wed, Jan 6, 2010 at 8:24 PM, Andy angelf...@yahoo.com wrote: I meant: can I do it with dismax without modifying every single query? I'm accessing Solr through haystack and all queries are generated by haystack. I'd much rather not have to go under haystack to modify the generated queries. Hence I'm trying to find a way to boost every query by default. If you can get haystack to pass through the user query as something like qq, then yes - just use something like the last link I showed at http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents and set defaults for everything except qq. -Yonik http://www.lucidimagination.com --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 7:48 PM On Wed, Jan 6, 2010 at 7:43 PM, Andy angelf...@yahoo.com wrote: So if I want to configure Solr to turn every query q=foo into q={!boost b=log(popularity)}foo, dismax wouldn't work but edismax would? You can do it with dismax; it's just that the syntax is slightly more convoluted.
Check out the section on boosting newer documents: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
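Putting Yonik's suggestion together, the handler configuration would look roughly like the following. This is only a sketch assembled from the wiki link above: the handler name and qf value are illustrative, while qq, popularityboost, and log(popularity) come from the thread itself.

```xml
<!-- solrconfig.xml: the application sends the raw user query as qq=...;
     the configured q wraps it in a popularity boost -->
<requestHandler name="/boosted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">{!boost b=$popularityboost v=$qq defType=dismax}</str>
    <str name="popularityboost">log(popularity)</str>
    <str name="qf">text</str>
  </lst>
</requestHandler>
```

With this in place, a request like /boosted?qq=foo is equivalent to q={!boost b=log(popularity)}foo without the client ever seeing the boost syntax.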
RE: Cached document view in solr
Nutch search results provide a link for viewing the cached copy of a document. It fetches the raw content from the segments based on the document id (cached.jsp). Is it possible to have similar functionality in Solr? What can be done to achieve this? Any pointers. I could retrieve the content using the text field ('fl=text'), so content can be retrieved. But it is the parsed text, with font formatting lost. Can the original content be stored in any field as-is? Thanks, Ram
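On the last question (keeping the original, unparsed content), the usual approach is a separate field that is stored but never analyzed or indexed, and to send the raw markup to that field at index time; Solr returns stored values verbatim. A sketch, with an illustrative field name:

```xml
<!-- schema.xml: holds the raw document for a "cached copy" view.
     stored="true" so it can be returned with fl=raw_content;
     indexed="false" so it is never tokenized or searched -->
<field name="raw_content" type="string" indexed="false" stored="true"/>
```

Note that Solr only stores what the client sends, so the indexing pipeline has to supply the original content alongside the parsed text.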
Re: Is there other way than sorting by date?
2010/1/7 Wangsheng Mei hairr...@gmail.com Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! Perhaps this can help? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents -- Regards, Shalin Shekhar Mangar.
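The recipe on that wiki page boils down to an additive boost function on the date field. With the publishTime field from the question it would look roughly like this (the constants are the FAQ's one-year half-life example; tune them to taste):

```xml
<!-- solrconfig.xml, inside the dismax handler's defaults:
     newer documents get a higher additive boost, so relevance
     and recency are combined instead of hard-sorting by date -->
<str name="bf">recip(ms(NOW,publishTime),3.16e-11,1,1)</str>
```

The same function can also be passed per-request as a bf parameter rather than baked into the config.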
Re: replication -- missing field data file
Actually it does not. BTW, FYI, backup is just for taking periodic backups; it is not necessary for the ReplicationHandler to work. On Thu, Jan 7, 2010 at 2:37 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you tell when the backup is done? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 12:23 PM To: solr-user Subject: Re: replication -- missing field data file the index dir is named 'index'; others will be stored as 'index.<date-as-number>' On Wed, Jan 6, 2010 at 10:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you differentiate between the backup and the normal index files? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 11:52 AM To: solr-user Subject: Re: replication -- missing field data file On Wed, Jan 6, 2010 at 9:49 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: I set up replication between 2 cores on one master and 2 cores on one slave. Before doing this the master was working without issues, and I stopped all indexing on the master. Now that replication has synced the index files, an .fdt file is suddenly missing on both the master and the slave. Pretty much every operation (core reload, commit, add document) fails with an error like the one posted below. How could this happen? How can one recover from such an error? Is there any way to regenerate the .fdt file without re-indexing everything? This brings me to a question about backups. If I run the replication?command=backup command, where is this backup stored? I've tried this a few times and get an OK response from the machine, but I don't see the backup generated anywhere. The backup is done asynchronously, so it always gives an OK response immediately. The backup is created in the data dir itself. Thanks, Gio.
org.apache.solr.common.SolrException: Error handling 'reload' action
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:412)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:142)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:298)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:425)
    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:486)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:409)
    ... 18 more
Caused by: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(Unknown Source)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:78)
    at
Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
Hi All, I have a document indexed in Solr, which is as follows:

<doc>
  <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str>
  <arr name="keywords">
    <str>Philips</str>
    <str>LCD TVs</str>
  </arr>
  <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str>
</doc>

Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output:

P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1)
    0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1
  0.91647065 = (MATCH) max of:
    0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of:
      0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of:
        0.5 = boost
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.023686381 = queryNorm
      6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of:
        1.0 = tf(termFreq(keywords:lcd tvs)=1)
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.625 = fieldNorm(field=keywords, doc=40)

I am not sure what it means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up. Regards, Gunjan -- View this message in context: http://old.nabble.com/Meaning-of-this-error%3A-Failure-to-meet-condition%28s%29-of-required-prohibited-clause%28s%29tp27058008p27058008.html Sent from the Solr - User mailing list archive at Nabble.com.
solr updateCSV
I am trying to use Solr's CSV updater to index data. I am trying to handle a .dat format consisting of a field separator, a text qualifier and a line separator, for example:

field1 <field-separator> field2 <field-separator> <line-separator>
<text-qualifier>value for field 1<text-qualifier> <field-separator> <text-qualifier>value for field 2<text-qualifier> <field-separator> <line-separator>

Can we specify the text qualifier and line separator as well? I have tested that we can specify a separator and it works well. -- Nipen Mark
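This went unanswered in the thread, but for reference: alongside separator, the CSV update handler does accept an encapsulator parameter (the text qualifier) and an escape parameter; the record separator, however, is always a newline, so a custom line separator is not supported. A sketch of building such a request; the host, file path and field names are made up for illustration:

```python
# Sketch: constructing an update/csv request with a tab field separator
# and a double-quote text qualifier.  Server URL, stream.file path and
# fieldnames are hypothetical.
from urllib.parse import urlencode

params = {
    "commit": "true",
    "separator": "\t",        # field separator (tab)
    "encapsulator": '"',      # text qualifier
    "stream.file": "/data/input.dat",
    "fieldnames": "field1,field2",
}
url = "http://localhost:8983/solr/update/csv?" + urlencode(params)
print(url)
```

The resulting URL can be fetched with curl exactly like the examples elsewhere on this list.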
Re: Is there other way than sorting by date?
This is exactly what I need, I really appreciate it. 2010/1/7 Shalin Shekhar Mangar shalinman...@gmail.com 2010/1/7 Wangsheng Mei hairr...@gmail.com Hi guys, I am getting started with Solr. When I search a collection of data, I care about both the document score (relevance to the user's query words) and the document's publishTime (which is another field in each document). If I simply sort the matching documents by the publishTime field, then the score is not considered. How should I handle this? Maybe I should use the publishTime field as another search field, and compute a composite score together with the relevance score? Any hints, thanks very much! Perhaps this can help? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents -- Regards, Shalin Shekhar Mangar. -- 梅旺生
Field highlighting
Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why?
Re: Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
How are these fields defined in your schema.xml? Note that String types are indexed without tokenization, so if your fields are defined with a String field type, that may be part of your problem (try the text type if so). If this is irrelevant, please show us the relevant parts of your schema and the query you're submitting. Erick On Thu, Jan 7, 2010 at 6:17 AM, gunjan_versata gunjanga...@gmail.com wrote: Hi All, I have a document indexed in Solr, which is as follows:

<doc>
  <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str>
  <arr name="keywords">
    <str>Philips</str>
    <str>LCD TVs</str>
  </arr>
  <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str>
</doc>

Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output:

P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1)
    0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1
  0.91647065 = (MATCH) max of:
    0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of:
      0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of:
        0.5 = boost
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.023686381 = queryNorm
      6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of:
        1.0 = tf(termFreq(keywords:lcd tvs)=1)
        11.127175 = idf(docFreq=34, maxDocs=875476)
        0.625 = fieldNorm(field=keywords, doc=40)

I am not sure what it means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up.
Regards, Gunjan
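If Erick's diagnosis applies (the explain output does show "keywords:lcd tvs" surviving as a single untokenized term, which is what a String field produces), the fix would be to switch the field to a tokenized type. A sketch only; these attribute values are guesses, not Gunjan's actual schema:

```xml
<!-- schema.xml: "text" is the analyzed type from the example schema,
     so "LCD TVs" indexes as the separate terms lcd and tvs, which the
     per-word dismax query can then match -->
<field name="keywords" type="text" indexed="true" stored="true" multiValued="true"/>
```

A full re-index would be needed after such a change, since existing terms were written with the old type.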
Re: Field highlighting
It's really hard to provide any response with so little information. Could you show us the difference between a field that works and one that doesn't? Especially the relevant schema.xml entries and the query that fails to highlight. Erick On Thu, Jan 7, 2010 at 7:47 AM, Xavier Schepler xavier.schep...@sciences-po.fr wrote: Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why?
Re: Field highlighting
Erick Erickson wrote: It's really hard to provide any response with so little information. Could you show us the difference between a field that works and one that doesn't? Especially the relevant schema.xml entries and the query that fails to highlight. Erick On Thu, Jan 7, 2010 at 7:47 AM, Xavier Schepler xavier.schep...@sciences-po.fr wrote: Hi, I'm trying to highlight short text values. The field they come from has a type shared with other fields. I have highlighting working on other fields, but not on this one. Why? Thanks for your response. Here are some extracts from my schema.xml:

<fieldtype name="textFr" class="solr.TextField">
  <analyzer>
    <!-- remove meaningless stopwords -->
    <filter class="solr.StopFilterFactory" words="french-stopwords.txt" ignoreCase="true"/>
    <!-- split into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- strip accents -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <!-- strip the dots at the end of acronyms -->
    <filter class="solr.StandardFilterFactory"/>
    <!-- lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stemming with the Porter filter -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <!-- synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="test-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldtype>

Here's a field on which highlighting works:

<field name="questionsLabelsFr" required="false" type="textFr" multiValued="true" indexed="true" stored="true" compressed="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>

Here's the field on which it doesn't:

<field name="modalitiesLabelsFr" required="false" type="textFr" multiValued="true" indexed="true" stored="true" compressed="false" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true"/>

They are kinda the same. But modalitiesLabelsFr contains mostly short strings like: Côtes-d'Armor, Creuse, Dordogne, Doubs, Drôme, Eure, Eure-et-Loir, Finistère. When matches are found in them, I get a list like this, with no text:

<lst name="highlighting">
  <lst name="dbbd3642-db1d-4b35-9280-11582523903d"/>
  <lst name="f1d8be2d-1070-4111-b16e-94d16c8c0bc6"/>
</lst>

The name attribute is the uid of the document. I tried several values for hl.fragsize (0, 1, 2, ...) with no success at all.
Combining frange with other query parameters?
Hey, I'm doing a query which involves using an frange in the filter query, and I was wondering if there is a way of combining the frange with other parameters. Something like ({!frange l=x u=y}do stuff) AND field:param, but obviously this doesn't work. Is there a way of doing this? Oliver
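There was no reply in the archive, but two standard ways exist for this; both are sketches using placeholder names (x, y, field, myfunc). Either put the frange in its own fq parameter, since multiple filter queries are implicitly intersected with the main query, or nest the frange inside a lucene query via the _query_ hook:

```
q=field:param&fq={!frange l=x u=y}myfunc(otherfield)

q=field:param AND _query_:"{!frange l=x u=y}myfunc(otherfield)"
```

The fq form is usually preferable: it keeps the range test out of scoring and makes it independently cacheable.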
Re: Strange Behavior When Using CSVRequestHandler
Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is the fact that re-adding the file a second, third, fourth, etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that? Dan Erick Erickson wrote: I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key) In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq count and numDocs even if you got by the above. I.e. StopFilter would make the lines "a problem" and "the problem" identical. WordDelimiter would do all kinds of interesting things. LowerCaseFilter would make "Myproblem" and "myproblem" identical. RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical. You could define a second field, make *that* one unique and NOT analyze it in any way. You could hash your sentences and define the hash as your unique key. You could... HTH, Erick On Wed, Jan 6, 2010 at 1:06 PM, danben dan...@gmail.com wrote: The problem: Not all of the documents that I expect to be indexed are showing up in the index.
The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

My input is a utf-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the Solr index. If I print the number of unique lines in the file (using cat <file> | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M. I use the following to start indexing:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times?
I also have this line in solrconfig.xml, if it matters:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>

Thanks, Dan
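Erick's "hash your sentences" suggestion can be sketched as follows: derive a deterministic id from each line, make that id the uniqueKey (as a plain, unanalyzed string field), and leave the analyzed 'query' field out of uniqueness entirely. Field names and sample sentences here are illustrative, not Dan's data:

```python
# Sketch: a stable per-line document id, so the analysis chain on the
# 'query' field no longer interferes with duplicate detection.
import hashlib

def line_key(line: str) -> str:
    """Deterministic id for one input line."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()

# Each CSV row would then carry (id, query); 'id' becomes the uniqueKey
# in schema.xml, typed as string so it is never tokenized.
rows = [(line_key(s), s) for s in ["first sentence", "first sentence", "second"]]
unique_ids = {k for k, _ in rows}
print(len(unique_ids))  # duplicate lines collapse to the same id
```

With this scheme, numDocs after indexing should equal the Unix `sort | uniq | wc -l` count exactly, since deduplication no longer depends on analyzed tokens.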
Re: High Availability
I've tried having two servers set up to replicate each other, and it is not a pretty thing. It seems that Solr doesn't really check whether the version # on the master is newer than the version # on the slave before deciding to replicate; it only looks to see if it's different. As a result, what ends up happening is this:

1. Both servers at the same revision, say revision 100
2. Update Master 1 to revision 101
3. Master 2 starts a pull of revision 101
4. Master 1 sees Master 2 has a different revision and starts a pull of revision 100

See where it's going? Eventually, both servers seem to end up back at revision 100, and my updates get lost. My sequencing might be a little out of whack here, but nonetheless having two servers set up as slaves to each other does not work properly. I would think, though, that a small code change to check whether the revision # has increased before pulling the files would solve the issue. In the meantime, my plan is to:

1. Set up two index update servers as masters behind an F5 load balancer, with a VIP in an active/passive configuration.
2. Set up N search servers as slaves behind an F5 load balancer, with a VIP in a round-robin configuration. Replication would be from the masters' VIP, instead of any one particular master.
3. The index update servers would have a handler that would do delta updates every so often to keep both servers in sync with the database (I'm only indexing a complex database here, which doesn't lend itself well to SQL querying on the fly).

Ideally, I'd love to be able to force the master servers to update if either one of them switches from passive to active state, but am not sure how to accomplish that. mattin...@yahoo.com Once you start down the dark path, forever will it dominate your destiny.
Consume you it will - Yoda - Original Message From: r...@intelcompute.com r...@intelcompute.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:37:22 AM Subject: Re: High Availability Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion though, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28, Matthew Inger mattin...@yahoo.com wrote: So, when the masters switch back, does that mean we have to force a full delta update, correct? Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: To: Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13, Matthew Inger wrote: I'm kind of stuck and looking for suggestions for high-availability options. I've figured out without much trouble how to get master-slave replication working. This eliminates any single point of failure in terms of the application's searching capability. I would set up a master which would create the index, and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up. The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface.
The basic idea of the app is that there are multiple Oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc.) prevent any sort of fast querying of the documents via SQL. The solution is to build a Lucene index (via Solr) and use that for searching. When updates are made in the UI, we also send the updates directly to the Solr server (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here: if the master is down, the sending of the updates to the master Solr server will fail, causing an application exception. I have tried configuring multiple Solr servers which are each set up as both master and slave to the other, but they keep clobbering each other's index updates and rolling back each other's delta updates. It seems that the replication doesn't take the generation # into account and check that the generation it's fetching is newer than the generation it already has before it applies it. I thought of maybe
Re: No Analyzer, tokenizer or stemmer works at Solr
Erik, you mean everything is okay, but I do not see it? "Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." If I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? If yes, I have got another problem: I don't want to waste any disk space. Does copyField store the same data twice? I mean: I have got originalField and copiedField. originalField gets indexed with text_analyzer and copiedField with a stemmer. Does this mean I am storing the original data twice in public form, and once analyzed per analyzer? Or does Solr store the original input only once and make a reference to the public data of originalField? Thank you, Mitch Erik Hatcher-4 wrote: Mitch, again, I think you're misunderstanding what analysis does. You must be expecting (we think, though you've not provided exact duplication steps to be sure) that the value you get back from Solr is the analyzer-processed output. It's not; it's exactly what you provide. Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone. There's some thinking going on about implementing it such that analyzed output is stored. You can, however, use the analysis request handler componentry to get analyzed stuff back as you see it in analysis.jsp, on a per-document or per-field text basis, if you're looking to leverage the analyzer output in that fashion from a client. Erik On Jan 7, 2010, at 1:21 AM, MitchK wrote: Hello Erick, thank you for answering. I can do whatever I want - Solr does nothing. For example: if I use the textgen fieldtype, which is predefined, nothing happens to the text. Even the StopFilter is not working - no stopword from stopwords.txt was replaced.
I think that this only affects the index, because if I query for "for" it returns nothing, which is quite correct, due to the work of the StopFilter. Everything works fine on analysis.jsp, but not in reality. If you have any test-case data you want me to add, please tell me and I will show you the saved data afterwards. Thank you. Mitch Erick Erickson wrote: "Well, I have noticed that Solr isn't using ANY analyzer." How do you know this? Because it's highly unlikely that Solr is completely broken on that level. Erick On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote: I have tested a lot, and all the time I thought I had set wrong options for my custom analyzer. Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer. It seems like it only stores the original input. I am using the example configuration of the current Solr 1.4 release. What's wrong? Thank you!
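On Mitch's disk-space question: stored and indexed are independent per field, so the usual pattern is to store the text once on the source field and mark the copyField target stored="false". The copy then costs index (inverted-term) space only, and the original text is not stored twice. A sketch; the field and type names are illustrative, not from an actual schema:

```xml
<!-- schema.xml: original text stored once, retrievable via fl -->
<field name="originalField" type="text_analyzer" indexed="true" stored="true"/>
<!-- stored="false": only the stemmed inverted index is kept for the copy -->
<field name="copiedField" type="text_stemmed" indexed="true" stored="false"/>
<copyField source="originalField" dest="copiedField"/>
```

Searches can hit copiedField while display always reads originalField, so nothing is duplicated on the stored side.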
RE: replication -- missing field data file
Right, but if you want to take periodic backups and ship them to tape or some DR site, you need to be able to tell when the backup is actually complete. It seems very strange to me that you can actually track the replication progress on a slave, but you can't track the backup progress on a master. To me that suggests that the only reliable way of performing backups is to set up replication to some slave without a regular polling interval. Then force a poll, wait for the sync to complete, and ship the slave's index to redundant storage. Seems like a pretty backwards way of doing things... -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Thursday, January 07, 2010 5:56 AM To: solr-user Subject: Re: replication -- missing field data file Actually it does not. BTW, FYI, backup is just to take periodic backups; it is not necessary for the ReplicationHandler to work. On Thu, Jan 7, 2010 at 2:37 AM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you tell when the backup is done? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 12:23 PM To: solr-user Subject: Re: replication -- missing field data file The index dir has the name "index"; others will be stored as index<date-as-number> On Wed, Jan 6, 2010 at 10:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: How can you differentiate between the backup and the normal index files? -Original Message- From: noble.p...@gmail.com [mailto:noble.p...@gmail.com] On Behalf Of Noble Paul Sent: Wednesday, January 06, 2010 11:52 AM To: solr-user Subject: Re: replication -- missing field data file On Wed, Jan 6, 2010 at 9:49 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: I set up replication between 2 cores on one master and 2 cores on one slave. 
Before doing this the master was working without issues, and I stopped all indexing on the master. Now that replication has synced the index files, an .fdt file is suddenly missing on both the master and the slave. Pretty much every operation (core reload, commit, add document) fails with an error like the one posted below. How could this happen? How can one recover from such an error? Is there any way to regenerate the .fdt file without re-indexing everything? This brings me to a question about backups. If I run the replication?command=backup command, where is this backup stored? I've tried this a few times and get an OK response from the machine, but I don't see the backup generated anywhere. The backup is done asynchronously, so it always gives an OK response immediately. The backup is created in the data dir itself. Thanks, Gio.

org.apache.solr.common.SolrException: Error handling 'reload' action
    at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:412)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:142)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:298)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Y:\solrData\FilingsCore2\index\_a0r.fdt (The system cannot find the file specified)
    at
ontology support
hello, i'm trying to use an ontology (homegrown :) ) to support the search, i.e. I'd like my search engine to report search results for "barack obama" even if I look for "president". I see there's some support in the Nutch API (org.apache.nutch.ontology), so (if it does what I'm looking for) I'm wondering whether something like that comes with Solr too. Any ideas? Claudio -- Claudio Martella Digital Technologies Unit Research Development - Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 claudio.marte...@tis.bz.it http://www.tis.bz.it
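For simple cases, Solr itself can approximate this kind of expansion with a synonym filter rather than a full ontology. A minimal sketch (the field type name and the synonyms file contents are illustrative, not from the thread):

```xml
<!-- schema.xml: index-time synonym expansion (illustrative names) -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt could contain a line like:
         president, barack obama -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Multi-word synonyms such as "barack obama" are generally safer applied at index time, as shown here, since query-time expansion of multi-word entries interacts poorly with the query parser's tokenization.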
Re: No Analyzer, tokenizer or stemmer works at Solr
On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan
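The indexed/stored distinction Ryan describes is controlled per field in schema.xml. A hedged sketch (the field names are made up for illustration):

```xml
<!-- indexed = searchable (analyzed); stored = returned verbatim in results -->
<field name="title"    type="text"   indexed="true"  stored="true"/>  <!-- search and display -->
<field name="body_idx" type="text"   indexed="true"  stored="false"/> <!-- search only -->
<field name="body_raw" type="string" indexed="false" stored="true"/>  <!-- display only -->
```

Disk usage follows directly from these flags: a value is written to the stored-fields files only when stored="true", independently of how it is analyzed for the index.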
Re: Strange Behavior When Using CSVRequestHandler
It puzzles me too. I don't know the internals of that code well enough to speculate, but once you're into undefined behavior, I have great faith in *many* inexplicable things happening. Erick On Thu, Jan 7, 2010 at 9:45 AM, danben dan...@gmail.com wrote: Erick - thanks very much, all of this makes sense. But the one thing I still find puzzling is the fact that re-adding the file a second, third, fourth etc. time causes numDocs to increase, and ALWAYS by the same amount (141,645). Any ideas as to what could cause that? Dan Erick Erickson wrote: I think the root of your problem is that unique fields should NOT be multivalued. See http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=(unique)|(key) In this case, since you're tokenizing, your query field is implicitly multi-valued; I don't know what the behavior will be. But there's another problem: all the filters in your analyzer definition will mess up the correspondence between the Unix uniq and numDocs even if you got by the above. I.e., StopFilter would make the lines "a problem" and "the problem" identical. WordDelimiter would do all kinds of interesting things. LowerCaseFilter would make "Myproblem" and "myproblem" identical. RemoveDuplicatesFilter would make "interesting interesting" and "interesting" identical. You could define a second field, make *that* one unique and NOT analyze it in any way... You could hash your sentences and define the hash as your unique key. You could HTH Erick On Wed, Jan 6, 2010 at 1:06 PM, danben dan...@gmail.com wrote: The problem: Not all of the documents that I expect to be indexed are showing up in the index. 
The background: I start off with an empty index based on a schema with a single field named 'query', marked as unique and using the following analyzer:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

My input is a utf-8 encoded file with one sentence per line. Its total size is about 60MB. I would like each line of the file to correspond to a single document in the solr index. If I print the number of unique lines in the file (using cat <file> | sort | uniq | wc -l), I get a little over 2M. Printing the total number of lines in the file gives me around 2.7M. I use the following to start indexing:

curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&stream.file=/home/gkropitz/querystage2map/file1&stream.contentType=text/plain;charset=utf-8&fieldnames=query&escape=\'

When this command completes, I see numDocs is approximately 470k (which is what I find strange) and maxDocs is approximately 890k (which is fine since I know I have around 700k duplicates). Even more confusing is that if I run this exact command a second time without performing any other operations, numDocs goes up to around 610k, and a third time brings it up to about 750k. Can anyone tell me what might cause Solr not to index everything in my input file the first time, and why it would be able to index new documents the second and third times? 
I also have this line in solrconfig.xml, if it matters: <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" /> Thanks, Dan -- View this message in context: http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-tp27026926p27026926.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Strange-Behavior-When-Using-CSVRequestHandler-%28Solr-1.4%29-tp27026926p27061086.html Sent from the Solr - User mailing list archive at Nabble.com.
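Erick's suggestion to hash each sentence into the unique key could be sketched as follows (the class and method names are made up, and MD5 is just one reasonable digest choice):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SentenceKey {
    // Derive a stable uniqueKey for a sentence so that exact duplicates
    // collapse in the index without analyzing the key field itself.
    public static String md5Hex(String sentence) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b)); // lowercase hex, two chars per byte
        }
        return sb.toString();
    }
}
```

The hash would go into a non-analyzed string field declared as the uniqueKey, while the raw sentence goes into the analyzed query field; that way the dedup count matches the Unix uniq count regardless of what the analyzer chain does.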
Re: No Analyzer, tokenizer or stemmer works at Solr
Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrJ and query parameters
Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. Thanks in advance for any help. Jon
Re: SolrJ and query parameters
--- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
Sharding and Index Update
All, I have two indices - one has 23M documents and the other has less than 1000. The small index is for real-time updates. Does updating the small index (with commit) hurt the overall performance? (We cannot update the big 23M index in real time because of heavy traffic and its size). Thanks, Jae Joo
RE: SolrJ and query parameters
Thanks for the reply. Using SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); would make more sense if it were not for the other methods available on SolrQuery. For example, there is a setFields(String..) method. So what happens if I call setFields("title", "description") after having set the query to the above value? What do I end up with? Something like this: {!lucene q.op=AND df=text}title:(foo +bar -baz) description:(foo +bar -baz) I'm still having trouble understanding how the class is intended to be used. Cheers Jon -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: 07 January 2010 17:38 To: solr-user@lucene.apache.org Subject: Re: SolrJ and query parameters --- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
RE: SolrJ and query parameters
I've also just noticed that QueryParsing is not in the SolrJ API; it's in one of the other Solr jar dependencies. I'm beginning to think that maybe the best approach is to write a query string generator which can generate strings of the form: q={!lucene q.op=AND df=text}myfield:foo +bar -baz Then just set this on a SolrQuery instance and send it over the wire. It's not the kind of string you'd want an end user to have to type out. -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: 07 January 2010 17:38 To: solr-user@lucene.apache.org Subject: Re: SolrJ and query parameters --- On Thu, 1/7/10, Jon Poulton jon.poul...@vyre.com wrote: From: Jon Poulton jon.poul...@vyre.com Subject: SolrJ and query parameters To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 7:25 PM Hi there, I'm trying to understand how the query syntax specified on the Solr Wiki (http://wiki.apache.org/solr/SolrQuerySyntax) fits in with the usage of the SolrJ class SolrQuery. There are not too many examples of usage to be found. For example, say I wanted to replicate the following query using SolrQuery: q={!lucene q.op=AND df=text}myfield:foo +bar -baz The whole string is the value of the parameter q: SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); How would I do it so that q.op was set to OR instead of AND? There is no method I can see on SolrQuery to set q.op, only a query string, which in this case is presumably the text "+bar -baz", as the rest can be specified by calling set methods on SolrQuery. if you want to set q.op to AND you can use SolrQuery.set(QueryParsing.OP, "AND"); hope this helps.
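Jon's query-string-generator idea might look something like this (a hedged sketch; the class is hypothetical and not part of SolrJ, and it does no escaping of parameter values):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds "{!parser k1=v1 k2=v2}userQuery" strings for the q parameter.
public class LocalParamsQuery {
    private final String parser;
    // LinkedHashMap keeps the params in insertion order for stable output.
    private final Map<String, String> params = new LinkedHashMap<>();

    public LocalParamsQuery(String parser) {
        this.parser = parser;
    }

    public LocalParamsQuery set(String key, String value) {
        params.put(key, value);
        return this; // fluent style, so calls can be chained
    }

    public String render(String userQuery) {
        StringBuilder sb = new StringBuilder("{!").append(parser);
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.append('}').append(userQuery).toString();
    }
}
```

It could then be used as, e.g., solrQuery.setQuery(new LocalParamsQuery("lucene").set("q.op", "OR").set("df", "text").render("myfield:foo +bar -baz")) to answer the original q.op=OR question.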
RE: SolrJ and query parameters
Using SolrQuery.setQuery("{!lucene q.op=AND df=text}myfield:foo +bar -baz"); would make more sense if it were not for the other methods available on SolrQuery. For example, there is a setFields(String..) method. So what happens if I call setFields("title", "description") after having set the query to the above value? What do I end up with? Something like this: {!lucene q.op=AND df=text}title:(foo +bar -baz) description:(foo +bar -baz) No. setFields is equivalent to fl=title,description. It determines which fields will be returned in the result. I'm still having trouble understanding how the class is intended to be used. SolrQuery extends ModifiableSolrParams. If you look at its source code you can understand. For example, the setQuery method invokes this.set(CommonParams.Q, query); You can set anything in the search URL with this class. key=value is equal to SolrQuery.set(key, value). There are some multivalued keys like fq and facet.field; in those cases you can use the add() method.
Re: No Analyzer, tokenizer or stemmer works at Solr
On Jan 7, 2010, at 12:11 PM, MitchK wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? see erik's response on 'analysis request handler' ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No Analyzer, tokenizer or stemmer works at Solr
What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point. Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc. It makes no sense to then display this data to a user. Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way. 1) take some data and index it in f1 but do NOT store it in f1. Store it in f2 but do NOT index it in f2. 2) take that same data, index AND store it in f3. 1) is almost entirely equivalent to 2) in terms of index resources. Practically though, 1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? 
ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com.
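Erick's two options, written out as schema.xml entries (a sketch using his field names; the types and the copyField direction are assumptions):

```xml
<!-- option 1): index in f1 only, store in f2 only -->
<field name="f1" type="text"   indexed="true"  stored="false"/>
<field name="f2" type="string" indexed="false" stored="true"/>
<copyField source="f2" dest="f1"/>

<!-- option 2): one field that is both indexed and stored -->
<field name="f3" type="text" indexed="true" stored="true"/>
```

Option 2) is the usual choice: clients search and retrieve the same field name, and the on-disk cost is essentially the same as the two-field split.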
Re: SolrJ and query parameters
On Jan 7, 2010, at 1:05 PM, Jon Poulton wrote: I've also just noticed that QueryParsing is not in the SolrJ API; it's in one of the other Solr jar dependencies. I'm beginning to think that maybe the best approach is to write a query string generator which can generate strings of the form: q={!lucene q.op=AND df=text}myfield:foo +bar -baz Then just set this on a SolrQuery instance and send it over the wire. It's not the kind of string you'd want an end user to have to type out. Yes, if you need to manipulate the local params, that seems like a good approach. Solrj was written before the local params syntax was introduced. A patch that adds LocalParams support to solrj would be welcome :) ryan
Re: No Analyzer, tokenizer or stemmer works at Solr
The difference between stored and indexed is clear now. You are right, if you are responding only to normal users. Use case: You have a stored field "The good, the bad and the ugly". And you have a really fantastic analyzer, which does some magic to this movie title. Let's say the analyzer translates the title into MD5 or into another abstract expression. Instead of doing the same magical function on the client's side again and again, the client only needs to take the prepared data from your response. Another use case could be: Imagine you have got two categories, cheap and expensive, and your document has a title, a label, an owner and a price field. Imagine you would analyze, index and store them like you normally do, and afterwards you want to set whether the document belongs to the expensive item group or not. If the price for the item is higher than $500, it belongs to the expensive ones; otherwise not. I think this would be a job for a special analyzer - and this only makes sense if I also store the analyzed data. I think information retrieval is a really interesting use case. Erick Erickson wrote: What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point. Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc. It makes no sense to then display this data to a user. 
Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way. 1) take some data and index it in f1 but do NOT store it in f1. Store it in f2 but do NOT index it in f2. 2) take that same data, index AND store it in f3. 1) is almost entirely equivalent to 2) in terms of index resources. Practically though, 1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be the need to return on the one hand the stored value and on the other hand the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean everything is okay, but I do not see it? "Internally for searching the analysis takes place and writes to the index in an inverted fashion, but the stored stuff is left alone." So if I use an analyzer, Solr stores its output two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. Indexed fields and stored fields are different. Solr shows stored fields in the results (however, facets are based on indexed fields). Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably Luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. 
You have control over what is stored and what is indexed -- how that is configured is up to you. ryan -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27065305.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Basic sentence parsing with the regex highlighter fragmenter
On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson erickerick...@gmail.com wrote: "Hmmm, I'll have to defer to the highlighter experts here" I've looked at the source code for the highlighter, and I think I know what's going on. I haven't had time to play with this yet, so I could be wrong, but this is my impression. The highlighter builds a highlighted fragment by reading tokens in and appending their contents to a string buffer. Now, every time a token is appended to a fragment, it adds the whitespace between the previous token and the current token (this isn't strictly whitespace, but really anything that was removed from the source text by the tokenizer, like punctuation etc.). I believe what is happening in my case is that the leading "." is the whitespace between the last token (of the previous fragment) and the first token of the current fragment. And, of course, the trailing punctuation is being cut off because the fragment builder doesn't APPEND whitespace after the last token, it just prepends this whitespace. You can see the code that does this, from Highlighter#getBestTextFragments (line 233 in Lucene 3.0.0), here: http://gist.github.com/271515 If I do what I said in my second email (add preserveOriginal=1 to the WordDelimiterFilter), things work because the ending punctuation is stored with the token, and just the real whitespace is prepended by this code. I'm not sure what the solution is, but currently I'm just trimming the leading punctuation plus a space off on the client side, and leaving the sentence terminator-less. -- Caleb Land
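The preserveOriginal workaround Caleb mentions is a one-attribute change to the filter definition; a sketch, assuming the surrounding analyzer chain and the other attribute values shown here:

```xml
<!-- keep the original token (with its punctuation) alongside the split parts -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        splitOnCaseChange="1" preserveOriginal="1"/>
```

With preserveOriginal="1" the unmodified token is emitted in addition to the delimiter-split tokens, so trailing punctuation survives into the highlighted fragment, at the cost of a somewhat larger index.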
Corrupted Index
Hi all, Our application uses solrj to communicate with our solr servers. We started a fresh index yesterday after upping the maxFieldLength setting in solrconfig. Our task indexes content in batches and all appeared to be well until noonish today, when after 40k docs I started seeing errors. I've placed three stack traces below: the first occurred once and was the initial error; the second occurred a few times before the third started occurring on each request. I'd really appreciate any insight into what could have caused this, a missing file and then a corrupt index. If you know we'll have to nuke the entire index and start over, I'd like to know that too; oddly enough, searches against the index appear to be working. Thanks! Jake #1 January 7, 2010 12:10:06 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update January 7, 2010 12:10:07 PM CST solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) 
org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) Caused by: solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) solr-home/core0/data/index/_fsk_1uj.del (No such file or directory) request: /core0/update org.apache.solr.common.SolrException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105) org.apache.solr.client.solrj.SolrServer#commit(86) org.apache.solr.client.solrj.SolrServer#commit(75) org.bookshare.search.solr.SolrSearchServerWrapper#add(63) org.bookshare.search.solr.SolrSearchEngine#index(232) org.bookshare.service.task.SearchEngineIndexingTask#initialInstanceLoad(95) org.bookshare.service.task.SearchEngineIndexingTask#run(53) org.bookshare.service.scheduler.TaskWrapper#run(233) java.util.TimerThread#mainLoop(512) java.util.TimerThread#run(462) #2 January 7, 2010 12:10:10 PM CST Caught error; TaskWrapper block 1 January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update January 7, 2010 12:10:10 PM CST org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _hug: fieldsReader shows 8 but segmentInfo shows 2 request: /core0/update org.benetech.exception.WrappedException org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(424) org.apache.solr.client.solrj.impl.CommonsHttpSolrServer#request(243) org.apache.solr.client.solrj.request.AbstractUpdateRequest#process(105)
Re: Corrupted Index
what version of solr are you running?

On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote:
[quoted message trimmed; see the original "Corrupted Index" post above]
RE: Corrupted Index
Yes, that would be helpful to include, sorry: the official 1.4.

-----Original Message-----
From: Ryan McKinley [mailto:ryan...@gmail.com]
Sent: Thursday, January 07, 2010 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Corrupted Index

what version of solr are you running?

On Jan 7, 2010, at 3:08 PM, Jake Brownell wrote:
[quoted message trimmed; see the original "Corrupted Index" post above]
RE: How to set User.dir or CWD for Solr during Tomcat startup
That's just setting the solr/home environment, not the user.dir variable; I already have that set. But when I go to the solr/admin page, the top shows Solr Admin (schemaname), the hostname, and cwd=/root, SolrHome=/opt/solr. How do I get cwd to be /opt/solr instead of /root? robbin

-----Original Message-----
From: Olivier Dobberkau [mailto:olivier.dobber...@dkd.de]
Sent: Thursday, January 07, 2010 3:28 AM
To: solr-user@lucene.apache.org
Subject: Re: How to set User.dir or CWD for Solr during Tomcat startup
[quoted message trimmed; see Olivier's reply above]
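[For what it's worth, the cwd shown on the admin page is just the working directory of the Tomcat JVM, i.e. wherever you were when Tomcat was started (here, /root). One way to control it is to cd before launching Tomcat, e.g. in a wrapper or init script. A sketch; all paths below are assumptions, adjust to your install:

```shell
#!/bin/sh
# Start Tomcat from a fixed directory so Solr's admin page reports
# cwd=/opt/solr-tomcat/solr rather than the operator's home directory.
CATALINA_HOME=/opt/solr-tomcat/apache-tomcat-6.0.20
cd /opt/solr-tomcat/solr    # this becomes user.dir for the JVM
exec "$CATALINA_HOME/bin/startup.sh"
```

Setting -Duser.dir in JAVA_OPTS is generally unreliable, since the JVM's actual working directory is fixed at process start.]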
Re: Corrupted Index
If you need to fix the index and maybe lose some data (in bad segments), check Lucene's CheckIndex (cmd-line app) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
From: Jake Brownell ja...@benetech.org
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Thu, January 7, 2010 3:08:55 PM
Subject: Corrupted Index
[quoted message trimmed; see the original "Corrupted Index" post above]
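[The CheckIndex invocation looks roughly like this; the jar name and index path are assumptions for a Solr 1.4 / Lucene 2.9 install. Note that -fix drops unreadable segments and their documents, so stop Solr and back up the index directory first:

```shell
# Inspect the index (read-only):
java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex \
     solr-home/core0/data/index

# Repair by removing bad segments (LOSES the documents in those segments):
java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex \
     solr-home/core0/data/index -fix
```
]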
Re: Sharding and Index Update
Won't hurt the performance - that *is* why people use the BIG+small core trick. :) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Jae Joo jae...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 12:40:16 PM Subject: Sharding and Index Update All, I have two indices - one has 23M documents and the other has fewer than 1000. The small index is for real-time updates. Does updating the small index (with a commit) hurt overall performance? (We cannot update the big 23M index in real time because of heavy traffic and its size.) Thanks, Jae Joo
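[With the BIG+small core setup, the two cores are usually searched as one logical index via Solr's distributed-search shards parameter. A sketch; host and core names are assumptions:

```shell
# Query the big (rarely committed) and small (frequently committed)
# cores together as a single logical index:
curl 'http://localhost:8983/solr/big/select?q=foo&shards=localhost:8983/solr/big,localhost:8983/solr/small'
```
]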
Re: ontology support
Claudio, Check out Solr synonym support: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Claudio Martella claudio.marte...@tis.bz.it To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 11:17:54 AM Subject: ontology support hello, I'm trying to use an ontology (homegrown :) ) to support search, i.e. I'd like my search engine to return results for "barack obama" even if I search for "president". I see there's some support in the Nutch API (org.apache.nutch.ontology), so (if it does what I'm looking for) I'm wondering whether something like that comes with Solr too. any ideas? Claudio -- Claudio Martella Digital Technologies Unit Research Development - Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 claudio.marte...@tis.bz.it http://www.tis.bz.it
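[A minimal sketch of wiring the synonym filter into schema.xml, assuming a synonyms.txt in the core's conf directory; the field type name and entries are illustrative:

```xml
<!-- Expand query-side synonyms so a search for "president" also
     matches documents containing "barack obama". -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A synonyms.txt line would then read `president, barack obama`. Note that multi-word synonyms like this are generally safer applied at index time than at query time.]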
Re: High Availability
Your setup with the master behind a LB VIP looks right. I don't think replication in Solr was meant to be bidirectional. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Matthew Inger mattin...@yahoo.com To: solr-user@lucene.apache.org; r...@intelcompute.com Sent: Thu, January 7, 2010 10:45:20 AM Subject: Re: High Availability I've tried having two servers set up to replicate each other, and it is not a pretty thing. It seems that SOLR doesn't really check whether the version # on the master is newer than the version # on the slave before deciding to replicate; it only checks whether it's different. As a result, what ends up happening is this: 1. Both servers at the same revision, say revision 100 2. Update Master 1 to revision 101 3. Master 2 starts a pull of revision 101 4. Master 1 sees Master 2 has a different revision and starts a pull of revision 100 See where it's going? Eventually, both servers seem to end up back at revision 100 and my updates get lost. My sequencing might be a little out of whack here, but nonetheless, having two servers set up as slaves to each other does not work properly. I would think, though, that a small code change to check whether the revision # has increased before pulling the file would solve the issue. In the meantime, my plan is to: 1. Set up two index update servers as masters behind an F5 load balancer with a VIP in an active/passive configuration. 2. Set up N search servers as slaves behind an F5 load balancer with a VIP in a round-robin configuration. Replication would be from the masters' VIP, instead of any one particular master. 3. Index update servers would have a handler that would do delta updates every so often to keep both servers in sync with the database (I'm only indexing a complex database here, which doesn't lend itself well to SQL querying on the fly).
Ideally, I'd love to be able to force the master servers to update if either one of them switches from passive to active state, but I'm not sure how to accomplish that. mattin...@yahoo.com Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: r...@intelcompute.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 11:37:22 AM Subject: Re: High Availability Even when Master 1 is alive again, it shouldn't get the floating IP until Master 2 actually fails. So you'd ideally want them replicating to each other, but since only one will be updated/live at a time, it shouldn't cause an issue with clobbering data (?). Just a suggestion though, not done it myself on Solr, only with DB servers. On Mon 04/01/10 16:28 , Matthew Inger wrote: So, when the masters switch back, that means we have to force a full delta update, correct? Once you start down the dark path, forever will it dominate your destiny. Consume you it will - Yoda - Original Message From: To: Sent: Mon, January 4, 2010 11:17:40 AM Subject: Re: High Availability Have you looked into a basic floating IP setup? Have the master also replicate to another hot-spare master. Any downtime during an outage of the 'live' master would be minimal as the hot-spare takes up the floating IP. On Mon 04/01/10 16:13 , Matthew Inger wrote: I'm kind of stuck and looking for suggestions for high availability options. I've figured out without much trouble how to get the master-slave replication working. This eliminates any single points of failure in the application in terms of the application's searching capability. I would set up a master which would create the index and several slaves to act as the search servers, and put them behind a load balancer to distribute the requests. This would ensure that if a slave node goes down, requests would continue to get serviced by the other nodes that are still up.
The problem I have is that my particular application also has the capability to trigger index updates from the user interface. This means that the master now becomes a single point of failure for the user interface. The basic idea of the app is that there are multiple oracle instances contributing to a single document. The volume and organization of the data (database links, normalization, etc...) prevents any sort of fast querying via SQL to do querying of the documents. The solution is to build a lucene index (via solr), and use that for searching. When updates are made in the UI, we will also send the updates directly to the solr server as well (we don't want to wait some arbitrary interval for a delta query to run). So you can see the problem here is that if the master is down, the sending of the updates to
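[For the slaves-replicate-from-a-VIP part of the plan, Solr 1.4's ReplicationHandler takes a masterUrl, which can point at the load balancer rather than a concrete host. A sketch; host names and the poll interval are assumptions:

```xml
<!-- solrconfig.xml on each slave: poll the masters' VIP, not a specific
     master host, so a master failover is transparent to the slaves. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-vip.example.com:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```
]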
Re: No Analyzer, tokenizer or stemmer works at Solr
Well, I'd approach either of these use cases by simply performing my computations on the input and storing the result in another (non-indexed, unless I wanted to search it) field. This wouldn't happen in the Analyzer, but in the code that populates the document fields. Which is a much cleaner solution IMO than creating some sort of "index this but store that" capability. The purpose of analysis is to produce *searchable* tokens, after all. But we're getting into angels dancing on pins here. Do you actually have a use case you're trying to implement, or is this mostly theoretical? Erick On Thu, Jan 7, 2010 at 2:08 PM, MitchK mitc...@web.de wrote: The difference between stored and indexed is clear now. You are right, if you are responding only to normal users. Use case: You've got a stored field "The good, the bad and the ugly". And you've got a really fantastic analyzer, which does some magic to this movie title. Let's say the analyzer translates the title into md5 or some other abstract expression. Instead of applying the same magic function on the client's side again and again, the client only needs to take the prepared data from your response. Another use case could be: Imagine you have two categories, cheap and expensive, and your document has a title-, a label-, an owner- and a price-field. Imagine you analyze, index and store them like you normally do, and afterwards you want to set whether the document belongs to the expensive item group or not. If the price for the item is higher than $500, it belongs to the expensive ones; otherwise not. I think this would be a job for a special analyzer - and it only makes sense if I also store the analyzed data. I think information retrieval is a really interesting use case. Erick Erickson wrote: What is your use case for responding sometimes with the indexed value? Other than reconstructing a field that hasn't been stored, I can't think of one. I still think you're missing the point.
Indexing and storing are orthogonal operations that have (almost) nothing to do with each other, for all that they happen at the same time on the same field. You never search against the stored data in a field. You *always* search against the indexed data. Contrariwise, you never display the indexed form to the user; you *always* show the stored data (unless you come up with a really interesting use case). Step back and consider what happens when you index data: it gets broken up all kinds of ways. Stop words are removed, case may change, etc, etc, etc. It makes no sense to then display this data for a user. Would you really like to have, say, a movie title "The Good, The Bad, and The Ugly"? Remove stopwords and punctuation, lowercase, and you index three tokens: good, bad, ugly. Even if you reconstruct this field, the user would see "good bad ugly". Bad, very bad. Yet I want to display the original title to the user in response to searching on "ugly", so I need the original, unanalyzed data. Perhaps it would help to think of it this way: (1) take some data and index it in f1 but do NOT store it in f1; store it in f2 but do NOT index it in f2. (2) take that same data and index AND store it in f3. (1) is almost entirely equivalent to (2) in terms of index resources. Practically, though, (1) is harder to use, because you have to remember to use f1 for searching and f2 for getting the raw data. HTH Erick On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote: Thank you, Ryan. I will have a look at Lucene's material and Luke. I think I got it. :) Sometimes there will be a need to return both the stored value and the indexed version of the value. How can I fulfill such needs? Doing copyField on indexed-only fields? ryantxu wrote: On Jan 7, 2010, at 10:50 AM, MitchK wrote: Eric, you mean, everything is okay, but I do not see it? Internally, for searching, the analysis takes place and writes to the index in an inverted fashion, but the stored data is left alone.
If I use an analyzer, Solr stores its output in two ways? One public output, which is similar to the original input, and one hidden or internal output, which is based on the analyzer's work? Did I understand that right? yes. indexed fields and stored fields are different. Solr results show stored fields in the results (however facets are based on indexed fields) Take a look at Lucene in Action for a better description of what is happening. The best tool to get your head around what is happening is probably luke (http://www.getopt.org/luke/) If yes, I have got another problem: I don't want to waste any disk space. You have control over what is stored and what is indexed -- how that is configured is up to you. ryan
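[Erick's (1)-vs-(2) comparison corresponds to schema.xml field definitions like these; a sketch with hypothetical field names:

```xml
<!-- (1) split the data: f1 is searchable only, f2 is retrievable only -->
<field name="f1" type="text"   indexed="true"  stored="false"/>
<field name="f2" type="string" indexed="false" stored="true"/>

<!-- (2) one field that is both searched and returned -->
<field name="f3" type="text" indexed="true" stored="true"/>

<!-- Mitch's copyField idea: keep the raw value in 'title' and an
     analyzed, search-only copy in 'title_analyzed' -->
<copyField source="title" dest="title_analyzed"/>
```
]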
Re: Basic sentence parsing with the regex highlighter fragmenter
Regular expressions won't work well for sentence boundary detection. If you want something free, you could plug in OpenNLP or GATE. Or LingPipe, but that's not free. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Caleb Land caleb.l...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 2:05:18 PM Subject: Basic sentence parsing with the regex highlighter fragmenter Hello, I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse basic sentences, and I'm running into a problem. I'm using the default regex specified in the example solr configuration: [-\w ,/\n\']{20,200} But I am using a larger fragment size (140) with a slop of 1.0. Given the passage: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue vitae, molestie quis nunc. When I search for Nulla (the first word of the second sentence) and grab the first highlighted snippet, this is what I get: . Nulla a neque a ipsum accumsan iaculis at id lacus As you can see, there's a leading period from the previous sentence and the period from the current sentence is missing. I understand this regex isn't that advanced, but I've tried everything I can think of, regex-wise, to get this to work, and I always end up with this problem. For example, I've tried: \w[^.!?]{0,200}[.!?] Which seems like it should include the ending punctuation, but it doesn't, so I think I'm missing something. Does anybody know a regex that works? -- Caleb Land
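[For what it's worth, applying Caleb's second regex directly in Java does capture the sentence terminator, which suggests the trimming he saw happens inside the regex fragmenter rather than in the regex itself. A small standalone demonstration, not Solr's actual fragmenter code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFragmentDemo {
    // Collect every non-overlapping match of the regex in the text,
    // roughly what a regex-based fragmenter would treat as fragments.
    public static List<String> fragments(String text, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
                    + "Nulla a neque a ipsum accumsan iaculis at id lacus.";
        // Caleb's regex: a word char, up to 200 non-terminators, then a terminator.
        for (String f : fragments(text, "\\w[^.!?]{0,200}[.!?]")) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Run directly, each printed fragment ends with its own period and carries no leading punctuation, unlike the snippets Solr's fragmenter produced.]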
Re: Removing facets which frequency match the result count
Hi, Either I don't understand this or this doesn't make much sense. Are you saying you want to show only facet values whose counts == # of hits? If so, what would be the value of showing facets -- they wouldn't be narrowing down the result set. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: joeMcElroy pho...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 5:25:18 AM Subject: Removing facets which frequency match the result count Is there any way to specify to solr only to bring back facet filter options where the frequency is less than the total results found? I found facets which match the result count are not helpful to the user, and produce noise within the UI to filter results. I can obviously do this within the view but would be better if solr dealt with this logic. Cheers! Joe -- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27026359.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Any way to modify result ranking using an integer field?
Not sure if this was answered. Yes, you can set the default params/values for a request handler in solrconfig.xml. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Andy angelf...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 4:56:14 PM Subject: Re: Any way to modify result ranking using an integer field? Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending debugQuery=on to your search url.
Re: SOLR Performance Tuning: Pagination
Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
NPE when sharding multiple layers
Hi all, I've got an index split across 28 cores -- 4 cores on each of 7 boxes (multiple cores per box in order to use more of its CPUs.) When I configure a toplevel core to fan out to all 28 index cores, it works, but is slower than I'd have expected: Toplevel core == all 28 index cores In case it is the aggregation of 28 shards that is slow, I wanted to try 2 layers of sharding. I changed the toplevel core to shard to 1 midlevel core per box, which in turn shards to the 4 index cores on localhost: Toplevel core == 7 midlevel cores, 1 per box == 4 localhost index cores If I search for *:*, this works. If I search for an actual field:value, the midlevel cores throw an NPE. I am configuring toplevel and midlevel cores' shards= parameters via default values in their solrconfigs, so my request URL just looks like host/solr/toplevel/select/q=field:value. Is this a known bug, or am I just doing something wrong? Thanks in advance! - Michael PS: The NPE, which is thrown by the midlevel cores: Jan 7, 2010 4:01:02 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardDoc.java:210) at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:134) at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:255) at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:114) at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156) at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:141) at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:445) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) 
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454) at java.lang.Thread.run(Thread.java:619)
Re: NPE when sharding multiple layers
On Thu, Jan 7, 2010 at 4:17 PM, Michael solrco...@gmail.com wrote: I wanted to try 2 layers of sharding. Distrib search was written with multi-level in mind, but it's not supported yet. -Yonik http://www.lucidimagination.com
Re: NPE when sharding multiple layers
Thanks, Yonik. Does "not supported" mean "we can't guarantee whether it will work or not", or "you may be able to figure it out on your own"? Apparently I am able to get *some* queries through, just not those that pass through the fieldtype that I really need (a complex analyzer). When I search for foo:value where foo is a field whose analyzer uses StandardTokenizer, LowerCaseFilter, WordDelimiterFilter, and TrimFilter, I *don't* get an NPE. Thanks, Michael On Thu, Jan 7, 2010 at 4:25 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jan 7, 2010 at 4:17 PM, Michael solrco...@gmail.com wrote: I wanted to try 2 layers of sharding. Distrib search was written with multi-level in mind, but it's not supported yet. -Yonik http://www.lucidimagination.com
Re: Using IDF to find Collocations and SIPs...?
Christopher, It's not Lucene or Solr, but have a look at http://www.sematext.com/products/key-phrase-extractor/index.html There is an unofficial demo for it (uses Reuters news feeds with 2 1-week long windows for SIPs): http://www.sematext.com/demo/kpe/i.html (it looks like the CollateFilter option on the left is kaput, so ignore it -- though that filter is actually quite useful and without it you may see some phrase overlap) Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Subscriptions sub.scripti...@metaheuristica.com To: solr-user@lucene.apache.org Sent: Sun, December 27, 2009 9:43:56 PM Subject: Using IDF to find Collocations and SIPs...? I am trying to write a query analyzer to pull: 1. Common phrases (also known as Collocations) within a query 2. Highly unusual phrases (also known as Statistically Improbable Phrases, or SIPs) within a query The Collocations would be similar to facets, except I am also trying to get multi-word phrases as well as single terms. So I suppose I could write something that does a chained query off the facet query, looking for words in proximity. Conceptually (as I understand it) this should just be a question of using the IDF (inverse document frequency, i.e. the measure of how often the term appears across the index). * Has anyone tried to write an analyzer that looks for the words that typically occur within a given proximity of another word? The highly unusual phrases, on the other hand, require getting a handle on the IDF, which at present only appears to be available via the explain function of debugging. * Has anyone written something to go directly after the IDF score only? * If I do have to go down the path of writing this from scratch, is the org.apache.lucene.search.Similarity class the one to leverage? Most grateful for any feedback or insights, Christopher
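For anyone who wants to poke at the IDF idea directly instead of scraping it out of debugQuery explains, the quantity itself is simple to compute from document frequencies. A toy Python sketch (the formula below mirrors Lucene's classic DefaultSimilarity idf, 1 + log(numDocs/(docFreq+1)); treat that as an assumption and check your actual Similarity implementation):

```python
import math
from collections import Counter

def idf(num_docs: int, doc_freq: int) -> float:
    # Classic Lucene-style IDF (assumed formulation; variants exist):
    # rarer terms score higher -- the raw material for spotting SIPs.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

docs = [
    "solr is a search server",
    "lucene is a search library",
    "a rare improbable phrase appears here",
]

# document frequency: number of docs each term occurs in
df = Counter(term for doc in docs for term in set(doc.split()))

scores = {term: idf(len(docs), df[term]) for term in df}
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Collecting these scores per term (and per n-gram, for phrases) is the kind of statistic the Similarity class exposes at scoring time.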
Re: Any way to modify result ranking using an integer field?
Right. But my understanding is that the handler default setting in solrconfig doesn't take the parameter {!boost}, it only takes the parameter bf, which adds the function query instead of multiplying it. Seems like the only way to have a default for the {!boost} parameter is to use edismax, which won't be available till 1.5. --- On Thu, 1/7/10, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: From: Otis Gospodnetic otis_gospodne...@yahoo.com Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 4:07 PM Not sure if this was answered. Yes, you can set the default params/values for a request handler in solrconfig.xml. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Andy angelf...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, January 4, 2010 4:56:14 PM Subject: Re: Any way to modify result ranking using an integer field? Thank you Ahmet. Is there any way I can configure Solr to always use {!boost b=log(popularity)} as the default for all queries? I'm using Solr through django-haystack, so all the Solr queries are actually generated by haystack. It'd be much cleaner if I could configure Solr to always use BoostQParserPlugin for all queries instead of manually modifying every single query generated by haystack. --- On Mon, 1/4/10, Ahmet Arslan wrote: From: Ahmet Arslan Subject: Re: Any way to modify result ranking using an integer field? To: solr-user@lucene.apache.org Date: Monday, January 4, 2010, 2:33 PM Thanks Ahmet. Do I need to do anything to enable BoostQParserPlugin in Solr, or is it already enabled? I just confirmed that it is already enabled. You can see the effect of it by appending debugQuery=on to your search url.
Re: NPE when sharding multiple layers
On Thu, Jan 7, 2010 at 4:33 PM, Michael solrco...@gmail.com wrote: Does not supported mean we can't guarantee whether it will work or not, or you may be able to figure it out on your own? Not implemented, and not expected to work. For example, some info such as sortFieldValues would need to be merged and returned as is done for leaf requests. There are probably other little things like that, but I can't list them off the top of my head. -Yonik http://www.lucidimagination.com
Re: Indexing content on Windows file shares?
Matt: http://sharehound.sourceforge.net/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Matt Wilkie matt.wil...@gov.yk.ca To: solr-user@lucene.apache.org Sent: Thu, December 10, 2009 3:06:38 PM Subject: Indexing content on Windows file shares? Hello, I'm new to Solr, I know nothing about it other than it's been touted in a couple of places as a possible competitor to Google Search Appliance, which is what brought me here. I'm looking for a search engine which can index files on windows shares and websites, and, hopefully, integrate with Active Directory to ensure results are not returned to users who don't have access to those files(s). Can Solr do this? If so where is the documentation for it? Reconnaisance searches of the mailing list and wiki have not turned up anything, so far. thanks, -- matt wilkie Geomatics Analyst Information Management and Technology Yukon Department of Environment 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9 867-667-8133 Tel * 867-393-7003 Fax http://environmentyukon.gov.yk.ca/geomatics/
Re: Adaptive search?
Shalin, - Original Message From: Shalin Shekhar Mangar shalinman...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, December 23, 2009 2:45:21 AM Subject: Re: Adaptive search? On Wed, Dec 23, 2009 at 4:09 AM, Lance Norskog wrote: Nice! Siddhant: Another problem to watch out for is the feedback problem: someone clicks on a link and it automatically becomes more interesting, so someone else clicks, and it gets even more interesting... So you need some kind of suppression. For example, as individual clicks get older, you can push them down. Or you can put a cap on the number of clicks used to rank the query. We use clicks/views instead of just clicks to avoid this problem. Doesn't a click imply a view? You click to view. I must be missing something... Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
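To unpack the clicks/views point: a click does imply a view, but not the reverse -- "views" here counts impressions (how often the result was shown in a results list), so dividing clicks by impressions keeps heavily-shown results from snowballing. A minimal sketch of the idea (the smoothing constants are purely illustrative, not from anyone's actual system):

```python
def popularity(clicks: int, impressions: int,
               prior: float = 0.1, smoothing: int = 10) -> float:
    # Smoothed click-through rate: items with few impressions stay near
    # the prior instead of jumping to 0.0 or 1.0 on a single click.
    return (clicks + prior * smoothing) / (impressions + smoothing)

# A result shown 1000 times with 50 clicks...
popular = popularity(50, 1000)
# ...vs. one shown only 5 times with 3 clicks: smoothing keeps thin
# evidence from producing an extreme score.
fresh = popularity(3, 5)
print(popular, fresh)
```

Time decay (aging out old clicks, as Lance suggests) can then be layered on top of the same ratio.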
Re: Replicating multiple cores
For me, the Java replication is nice because it's much easier to set up and has fewer moving pieces (vs. rsync server, scripts config file, event hook, external shell scripts). Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Jason Rutherglen jason.rutherg...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, December 8, 2009 7:44:07 PM Subject: Re: Replicating multiple cores Yes. I'd highly recommend using the Java replication though. Is there a reason? I understand it's new etc, however I think one issue with it is it's somewhat non-native access to the filesystem. Can you illustrate a real world advantage other than the enhanced admin screens? On Mon, Dec 7, 2009 at 11:13 PM, Shalin Shekhar Mangar wrote: On Tue, Dec 8, 2009 at 11:48 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: If I've got multiple cores on a server, I guess I need multiple rsyncd's running (if using the shell scripts)? Yes. I'd highly recommend using the Java replication though. -- Regards, Shalin Shekhar Mangar.
Re: Retrieving large num of docs
Strange. Did you ever figure out the source of the performance difference? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com To: solr-user@lucene.apache.org Sent: Sat, December 5, 2009 12:05:49 PM Subject: Re: Retrieving large num of docs Hi Otis, I think my experiments are not conclusive about reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after changing the two multi-valued text fields from stored to un-stored, retrieval time (query time + time to load the stored fields) became much faster. I was expecting the enableLazyFieldLoading setting in solrconfig to take care of this but apparently it is not working as expected. Out of curiosity, I removed these 2 fields from the index (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search time reduction. It may be either because of 2 fewer fields to search in, or because of the reduction in size of the index, or maybe something else. I am not sure if enableLazyFieldLoading has any part in explaining this. - Raghu On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic wrote: Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig and after seeing fl didn't include those big fields, I thought, hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true based on your experiment? And what I'm calling "your experiment" means that you reindexed the same data, but without the 2 multi-valued text fields...and that was the only change you made and got ca. 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand. 
Thanks, Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Raghuveer Kancherla To: solr-user@lucene.apache.org Sent: Thu, December 3, 2009 8:43:16 AM Subject: Re: Retrieving large num of docs Hi Hoss, I was experimenting with various queries to solve this problem and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig. Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :) - Raghu On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote: : I think I solved the problem of retrieving 300 docs per request for now. The : problem was that I was storing 2 moderately large multivalued text fields : though I was not retrieving them during search time. I reindexed all my : data without storing these fields. Now the response time (time for Solr to : return the http response) is very close to the QTime Solr is showing in the Hmmm two comments: 1) the example URL from your previous mail... : http://localhost:1212/solr/select/?rows=300q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29start=0wt=python ...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response? 2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml: <enableLazyFieldLoading>true</enableLazyFieldLoading> ...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned. -Hoss
Re: Removing facets which frequency match the result count
Hi there, If I have two documents with a field indexing a taxonomy path, for example doc1: bags/handbags/clutch doc2: bags/handbags/beach and that field tokenizes on the forward slash, the facets produced will be: bags(2), handbags(2), beach(1), clutch(1). If I select clutch, the facets returned by Solr will be handbags(1) and bags(1). I would like to have no facets returned. Therefore I want facets only to be returned if the facet frequency is smaller than the total results found. This will return a more helpful selection of facets for the user to then refine their search. In this example the user would not want to select 'bags' when they have selected handbags, as it will not help the user in their search. We can remove these facets within the view but I was asking if there is a more elegant way to do this in Solr. Otis Gospodnetic wrote: Hi, Either I don't understand this or this doesn't make much sense. Are you saying you want to show only facet values whose counts == # of hits? If so, what would be the value of showing facets -- they wouldn't be narrowing down the result set. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: joeMcElroy pho...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, January 5, 2010 5:25:18 AM Subject: Removing facets which frequency match the result count Is there any way to specify to solr only to bring back facet filter options where the frequency is less than the total results found? I found facets which match the result count are not helpful to the user, and produce noise within the UI to filter results. I can obviously do this within the view but would be better if solr dealt with this logic. Cheers! Joe -- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27026359.html Sent from the Solr - User mailing list archive at Nabble.com. 
-- View this message in context: http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-tp27026359p27068209.html Sent from the Solr - User mailing list archive at Nabble.com.
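Until Solr grows such an option, the filtering Joe describes is cheap to do client-side over the facet counts in the response. A sketch (this assumes you have already parsed one facet field from Solr's response into a dict of value -> count; the names are illustrative):

```python
def useful_facets(facet_counts: dict, num_found: int) -> dict:
    # Keep only facet values that would actually narrow the result set,
    # i.e. whose count is strictly less than the total number of hits
    # (and nonzero). Values with count == num_found are pure noise.
    return {value: count
            for value, count in facet_counts.items()
            if 0 < count < num_found}

# After selecting "clutch" (1 hit), the ancestor facets bags/handbags
# have count == numFound and drop out entirely:
facets = {"bags": 1, "handbags": 1, "clutch": 1}
print(useful_facets(facets, num_found=1))
```

With two hits under bags(2)/handbags(2) and one each for clutch(1)/beach(1), only the leaf values survive the filter, which matches the behavior Joe wants.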
What is the process to build LucidWorks Solr?
I am using LucidWorks Solr v1.4 and I would like to compile in a search component; however, it does not seem like a very straightforward process. The ant script in the solr directory is that of the stock Solr installation, which does not compile out of the box. Has anyone been able to successfully compile LucidWorks Solr?
Exact matches without field copying?
Hi, I think the "how can I perform both exact and non-exact (no stemming involved) searches?" question is a pretty common FAQ, but it looks like we don't have an answer for it on the Wiki. The advice is typically to copy a field and apply different analysis to it (one stemmed, the other not stemmed), and then search on the appropriate field. Is there a better way of doing this? * CASE 1: index-time stemming input word: house => indexed as token: hous exact-match desired (non-stemmed query): house => house => no match --- so if you want exact matches, you can't stem at index-time stemmed query: house => hous => match * CASE 2: no index-time stemming input word: house => indexed as token: house exact-match desired (non-stemmed query): house => house => match stemmed query: house => hous => no match --- so if you don't stem at index-time, non-exact matching stops working Thanks, Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
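The two cases can be made concrete with a toy stemmer standing in for the field's analysis chain (a sketch; `stem` below is a stand-in for whatever stemmer the analyzer applies, though Porter does in fact reduce "house" to "hous"):

```python
def stem(word: str) -> str:
    # Toy stand-in for a real stemmer; like Porter, it maps "house" -> "hous".
    return word[:-1] if word.endswith("e") else word

# CASE 1: index-time stemming -- the exact form is gone from the index.
indexed = stem("house")            # "hous"
assert indexed != "house"          # exact (unstemmed) query misses
assert indexed == stem("house")    # stemmed query matches

# CASE 2: no index-time stemming -- the stemmed form never matches.
indexed = "house"
assert indexed == "house"          # exact query matches
assert indexed != stem("house")    # stemmed query misses
```

Since a single indexed token can only be on one side of this asymmetry, the copy-field approach (one stemmed field, one not) is what resolves it.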
Re: Custom Analyzer/Tokenizer works but results were not saved
Analysis is called when creating the indexed data for content, but not when storing the content. copyField copies one field's raw values to another field for storage. The source and target fields can be of any type. copyField does not analyse the source data and then feed it to another field's analyzer stack. You will have to copy the raw data and add your analyzer to the existing analyzer stack for the target field. On Tue, Jan 5, 2010 at 2:56 PM, MitchK mitc...@web.de wrote: Hello community, I wrote another mail today, but I think something went wrong (I can't find my post in the mailing list) - if not, I am sorry for the double post - I am using a mailing list for the first time. I have created a custom analyzer, which consists of a LowerCaseTokenizer, a StopFilter and a custom TestFilter. The analyzer works as expected when I test it with analysis.jsp. However, it does not work when I try to index or query real data via post.jar. I use the analyzer for a testField. This testField gets its value via copyField from the nameField's value. I am speculating that Solr only copies the value without analyzing it afterwards. Here is some xml from my schema: The field: <copyField source="name" dest="test"/> <field name="test" type="testAnalyzer" indexed="true" stored="true"/> The analyzer: <fieldType name="testAnalyzer" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.LowerCaseTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="my.package.solr.analysis.TestFilterFactory" mode="Main"/> </analyzer> </fieldType> How can I force Solr to use my analyzer the way it does when I test it with analysis.jsp? Restart and re-indexing does not solve the problem. Hopefully you can help me with this. Thank you! 
Kind regards from Germany Mitch -- View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026739.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Tokenizing problem with numbers in query
Hi, Did you re-start tomcat and re-index your collection? Yes. Do you want to search inside alphanumeric strings? Or are you interested only in prefix queries? Can you give us more examples, like target documents and queries. Searching inside would be required, yes. If the above example worked, I would already be glad. Bernd
Re: Cached document view in solr
If you index the raw document, that is what is returned by the search. The analyzers create separate data that is stored in various files, but is only used in searching. Searching, facets, and sorting use this analyzed output, but search returns pull the original. On Thu, Jan 7, 2010 at 2:28 AM, Ramchandra Phadake ramchandra_phad...@persistent.co.in wrote: nutch search results provide a link for getting the cached document copy. It fetches the raw content from segments based on document id. {cached.jsp} Is it possible to have similar functionality in solr, what can be done to achieve this? Any pointers. I could retrieve the content using the text field ('fl=text'), so content can be retrieved. But it's parsed text, with font formatting lost. Can the original content be stored in any field as is? Thanks, Ram -- Lance Norskog goks...@gmail.com
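On the last question -- yes, the original content can be stored as-is: a stored-but-not-indexed field keeps the raw value untouched, since (as Lance notes above) analysis never alters what is stored. A sketch of such a field for schema.xml (the field name is illustrative; you would send the raw document text into this field yourself at indexing time):

```xml
<!-- Stored but not indexed: returned verbatim in search results,
     never passed through an analyzer, never searchable itself. -->
<field name="raw_content" type="string" indexed="false" stored="true"/>
```

Requesting fl=raw_content would then return the unparsed original, giving a cached-copy view similar to Nutch's cached.jsp.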
Re: SOLR Performance Tuning: Pagination
Great - this issue? https://issues.apache.org/jira/browse/LUCENE-2127 Sounds like it would be a real win for lucene. -Peter On Thu, Jan 7, 2010 at 4:12 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. 
peter.wola...@acquia.com
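The cost Grant points to is easy to demonstrate: with the priority-queue approach, serving page start=N means tracking the top N+rows documents, so work and memory grow with the offset rather than the page size. A toy sketch of the idea (not Solr's actual code; the heap size is the point):

```python
import heapq

def page(scores, start, rows):
    # To return the page at `start`, a bounded heap of size start+rows
    # must be maintained and sorted -- this is why start=28838540 hurts:
    # cost scales with the offset, not with the number of rows returned.
    heap_size = start + rows
    top = heapq.nlargest(heap_size, scores)
    return top[start:start + rows]

scores = [0.1, 0.9, 0.5, 0.7, 0.3]
print(page(scores, start=1, rows=2))
```

A deep offset therefore forces a huge heap (and a huge final sort), which is what the unlimited-depth paging patch discussed above aims to avoid.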
Re: Solr 1.4 - stats page slow
I recently noticed the same sort of thing. The attached screenshot shows the transition on a search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page such as from the lucene field cache. Is this amount of load expected? -Peter On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill jayallenh...@gmail.com wrote: Also, what is your heap size and the amount of RAM on the machine? I've also noticed that, when watching memory usage through JConsole or YourKit while loading the stats page, the memory usage spikes dramatically - are you seeing this as well? -Jay On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill jayallenh...@gmail.com wrote: I've noticed this as well, usually when working with a large field cache. I haven't done in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index both in document count and file size? Another approach to get data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. -Jay On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.comwrote: We've been using Solr 1.4 for a few days now and one slight downside we've noticed is the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. 
This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: DisMaxRequestHandler bf configuration
Thanks. Can I use the standard request handler for this purpose? So something like: <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? --- On Thu, 1/7/10, Erik Hatcher erik.hatc...@gmail.com wrote: From: Erik Hatcher erik.hatc...@gmail.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 4:56 AM it wouldn't be q.alt though, just q, in the config file. q.alt is typically *:*, it's the fall back query when no q is provided. though, in thinking about it, q.alt would work here, but i'd use q personally. On Jan 6, 2010, at 9:45 PM, Andy wrote: Let me make sure I understand you. I'd get my regular query from haystack as qq=foo rather than q=foo. Then I put in solrconfig within the dismax section: <str name="q.alt">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> Is that what you meant? --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 8:42 PM On Wed, Jan 6, 2010 at 8:24 PM, Andy angelf...@yahoo.com wrote: I meant can I do it with dismax without modifying every single query? I'm accessing Solr through haystack and all queries are generated by haystack. I'd much rather not have to go under haystack to modify the generated queries. Hence I'm trying to find a way to boost every query by default. If you can get haystack to pass through the user query as something like qq, then yes - just use something like the last link I showed at http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents and set defaults for everything except qq. 
-Yonik http://www.lucidimagination.com --- On Wed, 1/6/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Wednesday, January 6, 2010, 7:48 PM On Wed, Jan 6, 2010 at 7:43 PM, Andy angelf...@yahoo.com wrote: So if I want to configure Solr to turn every query q=foo into q={!boost b=log(popularity)}foo, dismax wouldn't work but edismax would? You can do it with dismax it's just that the syntax is slightly more convoluted. Check out the section on boosting newer documents: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
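Yonik's suggestion above — fix everything in the handler defaults except the raw user query, which arrives as qq — might look like the following solrconfig.xml sketch. The handler name and the qf fields are assumptions for illustration; the {!boost} pattern itself comes from the SolrRelevancyFAQ link in the thread:

```xml
<!-- Hypothetical handler: the client sends only qq=<user query>;
     the q default wraps it in a boost by log(popularity). -->
<requestHandler name="/boosted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2 body</str>
    <str name="q">{!boost b=log(popularity) v=$qq}</str>
  </lst>
</requestHandler>
```

A request then looks like /boosted?qq=foo, with no q parameter sent by the client.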
Re: Solr 1.4 - stats page slow
I'd love to see the screenshot, but it didn't come through - got stripped by ML manager. Maybe upload it somewhere? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 9:32:26 PM Subject: Re: Solr 1.4 - stats page slow I recently noticed the same sort of thing. The attached screenshot shows the transition on a search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page such as from the lucene field cache. Is this amount of load expected? -Peter On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill wrote: Also, what is your heap size and the amount of RAM on the machine? I've also noticed that, when watching memory usage through JConsole or YourKit while loading the stats page, the memory usage spikes dramatically - are you seeing this as well? -Jay On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill wrote: I've noticed this as well, usually when working with a large field cache. I haven't done in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index both in document count and file size? Another approach to get data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. 
-Jay On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss wrote: We've been using Solr 1.4 for a few days now and one slight downside we've noticed is the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: SOLR Performance Tuning: Pagination
Si si, that issue. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin peter.wola...@acquia.com To: solr-user@lucene.apache.org Sent: Thu, January 7, 2010 9:27:04 PM Subject: Re: SOLR Performance Tuning: Pagination Great - this issue? https://issues.apache.org/jira/browse/LUCENE-2127 Sounds like it would be a real win for lucene. -Peter On Thu, Jan 7, 2010 at 4:12 PM, Otis Gospodnetic wrote: Peter - Aaron just commented on a recent Solr issue (reading large result sets) and mentioned his patch. So far he has 2 x +1 from Grant and me to stick his patch in JIRA. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Peter Wolanin To: solr-user@lucene.apache.org Sent: Sun, January 3, 2010 3:37:01 PM Subject: Re: SOLR Performance Tuning: Pagination At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and the linked discussion on java-dev. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. 
http://www.tokenizer.ca/ Data Mining, Vertical Search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: DisMaxRequestHandler bf configuration
On Jan 7, 2010, at 9:51 PM, Andy wrote: Thanks. Can I use the standard request handler for this purpose? So something like: Yes, but... <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}popularityboost=log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? popularityboost needs to be a separate str parameter. Erik
Re: DisMaxRequestHandler bf configuration
Oh I see. Is popularityboost the name of the parameter? <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}</str> <str name="popularityboost">log(popularity)</str> </lst> </requestHandler> --- On Thu, 1/7/10, Erik Hatcher erik.hatc...@gmail.com wrote: From: Erik Hatcher erik.hatc...@gmail.com Subject: Re: DisMaxRequestHandler bf configuration To: solr-user@lucene.apache.org Date: Thursday, January 7, 2010, 9:57 PM On Jan 7, 2010, at 9:51 PM, Andy wrote: Thanks. Can I use the standard request handler for this purpose? So something like: Yes, but... <requestHandler name="standard" class="solr.StandardRequestHandler"> <lst name="defaults"> <str name="q">{!boost b=$popularityboost v=$qq}popularityboost=log(popularity)</str> </lst> </requestHandler> Or do I still need the dismax handler? popularityboost needs to be a separate str parameter. Erik
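Putting Erik's correction together: the boost function goes in its own str parameter, which the local-params reference $popularityboost then resolves. As well-formed XML, this looks like:

```xml
<requestHandler name="standard" class="solr.StandardRequestHandler">
  <lst name="defaults">
    <!-- $popularityboost and $qq are dereferenced as request/default parameters -->
    <str name="q">{!boost b=$popularityboost v=$qq}</str>
    <str name="popularityboost">log(popularity)</str>
  </lst>
</requestHandler>
```

The user query is then passed as qq=foo, and the popularity boost applies to every request without the client having to change its queries.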
Re: solr updateCSV
http://www.lucidimagination.com/search/s:wiki?q=update+csv You can set the field names on the URL or as the first line. On Thu, Jan 7, 2010 at 3:48 AM, Mark N nipen.m...@gmail.com wrote: I am trying to use Solr's CSV updater to index the data. I am trying to specify a .dat format consisting of a field separator, a text qualifier and a line separator, for example: field1 [field separator] field2 [field separator] [line separator] [text qualifier]value for field 1[text qualifier] [field separator] [text qualifier]value for field 2[text qualifier] [line separator] Can we specify the text qualifier and line separator as well? I have tested that we can specify a separator and it works well. -- Nipen Mark -- Lance Norskog goks...@gmail.com
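For reference, the field separator and the text qualifier (called the encapsulator in Solr's CSV update parameters) can be set either on the update URL or as handler defaults. A sketch assuming Solr 1.4's CSVRequestHandler; the separator values here are examples, and as far as I know the line separator itself is not configurable — rows are expected to be newline-separated:

```xml
<requestHandler name="/update/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
    <str name="separator">;</str>              <!-- field separator -->
    <str name="encapsulator">"</str>           <!-- text qualifier -->
    <str name="fieldnames">field1,field2</str> <!-- or put the names on the first line -->
  </lst>
</requestHandler>
```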
How to Split Index file.
Hi, I would like to split the existing index into two indexes (the inverse of the merge-index function). My index directory is around 20 GB, with 10 million documents. -Kalidoss.m ** DISCLAIMER ** Information contained and transmitted by this E-MAIL is proprietary to Sify Limited and is intended for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If this is a forwarded message, the content of this E-MAIL may not have been sent with the authority of the Company. If you are not the intended recipient, an agent of the intended recipient or a person responsible for delivering the information to the named recipient, you are notified that any use, distribution, transmission, printing, copying or dissemination of this information in any way or in any manner is strictly prohibited. If you have received this communication in error, please delete this mail and notify us immediately at ad...@sifycorp.com
Understanding the query parser
I am using Solr 1.3. I have an index with a field called name. It is of type text (the unmodified, stock text field from Solr). My query field:foo-bar is parsed as the phrase query field:"foo bar". I was rather expecting it to be parsed as field:(foo bar), i.e. field:foo field:bar. Is there an expectation mismatch? Can I make it work as I expect it to? Cheers Avlesh
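What is likely happening, assuming the stock 1.3 text type: its WordDelimiterFilterFactory splits foo-bar into the tokens foo and bar at query time, and when a single query term analyzes into multiple tokens the Lucene query parser emits a phrase query. One workaround is to pre-split on the client side and send field:(foo bar); another is a field type whose analyzer does not split on hyphens, e.g. this hypothetical sketch:

```xml
<!-- Hypothetical type: whitespace tokenization only, no WordDelimiterFilter,
     so foo-bar stays a single token instead of becoming a phrase query. -->
<fieldType name="text_nosplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with text_nosplit the indexed tokens change too, so the field would need to be reindexed with the same analyzer on both sides.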
Re: Meaning of this error: Failure to meet condition(s) of required/prohibited clause(s)???
I have defined a field type in schema.xml: <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.TrimFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> </analyzer> </fieldType> <field name="keywords" type="lowercase" indexed="true" stored="true" multiValued="true"/> <field name="defaultKeywords" type="lowercase" indexed="true" stored="true" multiValued="true"/> <field name="subKeywords" type="textTight" indexed="true" stored="true" multiValued="true"/> The other fields don't have "lcd tvs" in them. And the handler used is: <requestHandler name="product" class="solr.DisMaxRequestHandler"> <lst name="defaults"> <str name="echoParams">explicit</str> <str name="qf">title^0.7 contributors^0.2 keywords^0.5 defaultKeywords^1 subKeywords^0.1</str> <str name="pf">keywords^0.5 defaultKeywords^1</str> <str name="mm">50%</str> </lst> </requestHandler> -Regards, Gunjan Erick Erickson wrote: How are these fields defined in your schema.xml? Note that String types are indexed without tokenization, so if str is defined as a String field type, that may be part of your problem (try the text type if so). If this is irrelevant, please show us the relevant parts of your schema and the query you're submitting... Erick On Thu, Jan 7, 2010 at 6:17 AM, gunjan_versata gunjanga...@gmail.com wrote: Hi All, I have a document indexed in Solr which is as follows: <doc> <str name="id">P-E-HE-Philips-32PFL5409-98-Black-32</str> <arr name="keywords"> <str>Philips</str> <str>LCD TVs</str> </arr> <str name="title">Philips 32PFL5409-98 32 LCD TV with Pixel Plus HD (Black, 32)</str> </doc> Now when I search for "lcd tvs", I don't see the above doc in the search results. On doing explainOther, I got the following output: 
P-E-HE-Philips-32PFL5409-98-Black-32: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s) 0.0 = no match on required clause (((subKeywords:lcd^0.1 | keywords:lcd^0.5 | defaultKeywords:lcd | contributors:lcd^0.5 | title:lcd) (subKeywords:televis^0.1 | keywords:tvs^0.5 | defaultKeywords:tvs | contributors:tvs^0.5 | (title:televis title:tv title:tvs)))~1) 0.0 = (NON-MATCH) Failure to match minimum number of optional clauses: 1 0.91647065 = (MATCH) max of: 0.91647065 = (MATCH) weight(keywords:lcd tvs^0.5 in 40), product of: 0.13178125 = queryWeight(keywords:lcd tvs^0.5), product of: 0.5 = boost 11.127175 = idf(docFreq=34, maxDocs=875476) 0.023686381 = queryNorm 6.9544845 = (MATCH) fieldWeight(keywords:lcd tvs in 40), product of: 1.0 = tf(termFreq(keywords:lcd tvs)=1) 11.127175 = idf(docFreq=34, maxDocs=875476) 0.625 = fieldNorm(field=keywords, doc=40) I am not sure what this means, and whether I can tweak it or not. Please note, this score was higher than that of the results which did show up... Regards, Gunjan
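A likely reading of that debug output: the lowercase type uses KeywordTokenizerFactory, so "LCD TVs" is indexed as the single token lcd tvs, while dismax analyzes each whitespace-separated query word on its own — the per-word clauses keywords:lcd and keywords:tvs can therefore never match, and only the pf phrase clause does. One hedged fix is to make the qf copy of the field tokenized, e.g. this sketch (the type name is hypothetical; reindexing would be required):

```xml
<!-- Hypothetical sketch: tokenize keywords so individual query words can
     match via qf; keep a KeywordTokenizer variant for pf if whole-phrase
     matching is still wanted. -->
<fieldType name="text_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
<field name="keywords" type="text_keywords" indexed="true" stored="true" multiValued="true"/>
```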