Re: Replication and querying

2010-02-10 Thread Julian Hille
Hi,

It would be possible to add that to the main Solr, but here is the problem.
Let's take an example:
We have roughly 1.5 million documents in the Solr master. These documents are
books.
These books have fields like title, IDs, numbers, authors, and more.
This Solr instance is global.

Now: the slave Solr is for a local library which has all these books, but wants
to sort them in another way and wants to add its own fields, for sorting and
output (these fields don't need to be indexed or searched).

So we try to replicate the whole database but have a slightly different
schema.xml in the slaves.


Secondly, for another project we need to know whether it is possible to change
data on insert and on update, so that the replicated data gets edited before it
is actually inserted. Is there some kind of hook?
As an example, take the book example from above:
on replication the slave gets an updated document set, but before it is updated
in the slave's index we would like to add fields which come from another
database, or replace strings in some fields, and so on.

Is that possible?

Thanks for any answers.



On 09.02.2010 at 16:53, Jan Høydahl / Cominvent wrote:

 Hi,
 
 Index replication in Solr makes an exact copy of the original index.
 Is it not possible to add the 6 extra fields to both instances?
 An alternative to replication is to feed two independent Solr instances - 
 full control :)
 Please elaborate on your specific use case if this is not a useful answer for
 you.
 
 --
 Jan Høydahl  - search architect
 Cominvent AS - www.cominvent.com
 
 On 9. feb. 2010, at 13.21, Julian Hille wrote:
 
 Hi,
 
 I'd like to know if it's possible to have a Solr server with a schema and, let's
 say, 10 fields indexed.
 I now want to replicate this whole index to another Solr server which has a
 slightly different schema.
 There are 6 additional fields; these fields change the sort order for a
 product whose base is our Solr database.
 
 Is this kind of replication possible?
 
 Is there another way to interact with data in Solr? We'd like to calculate
 some fields when they are added.
 I can't seem to find good documentation about the possible calls in the
 query itself, nor documentation about queries/calculations which should be
 done on update.
 
 
 so far,
 Julian Hille
 
 
 ---
 NetImpact KG
 Altonaer Straße 8
 20357 Hamburg
 
 Tel: 040 / 6738363 2
 Mail: jul...@netimpact.de
 
 Managing Director: Tarek Müller
 

Kind regards,
Julian Hille


---
NetImpact KG
Altonaer Straße 8
20357 Hamburg

Tel: 040 / 6738363 2
Mail: jul...@netimpact.de

Managing Director: Tarek Müller



Re: after flush: fdx size mismatch on query during writes

2010-02-10 Thread Michael McCandless
Yes, more details would be great...

Is this easily repeated?

The exists?=false is particularly spooky.

It means, somehow, a new segment was being flushed, containing 1285
docs, but then after closing the doc stores, the stored fields index
file (_X.fdx) had been deleted.

Can you turn on IndexWriter.setInfoStream, get this error to happen
again, and then post the output?  Thanks.

Mike

On Wed, Feb 10, 2010 at 12:59 AM, Lance Norskog goks...@gmail.com wrote:
 We need more information. How big is the index in disk space? How many
 documents? How many fields? What's the schema? What OS? What Java
 version?

 Do you run this on a local hard disk or is it over an NFS mount?

 Does this software commit before shutting down?

 If you run with asserts on, do you get errors before this happens?
    -ea:org.apache.lucene... as a JVM argument

 On Tue, Feb 9, 2010 at 5:08 PM, Acadaca ph...@acadaca.com wrote:

 We are using Solr 1.4 in a multi-core setup with replication.

 Whenever we write to the master we get the following exception:

 java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs 0
 length in bytes of _gqg.fdx file exists?=false
 at
 org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
 at
 org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)

 Has anyone had any success debugging this one?

 thx.
 --
 View this message in context: 
 http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com



Solr-JMX/Jetty agentId

2010-02-10 Thread Jan Simon Winkelmann
Hi,

I am (still) trying to get JMX to work. I have finally managed to get a Jetty
installation running with the right parameters to enable JMX. Now the next
problem has appeared: I need to get Solr to register its MBeans with the Jetty
MBeanServer. Using <jmx serviceUrl="service:jmx:rmi:///jndi/rmi:///jettymbeanserver" />,
Solr doesn't complain on loading, but the MBeans simply don't show up in JConsole,
so I would like to use <jmx agentId="agentId" />. But where do I get the agentId?
And what exactly does this Id represent? Does it change every time I restart
Jetty?

Thanks in advance!
Jan-Simon Winkelmann


spellcheck

2010-02-10 Thread michaelnazaruk

Hello, all!
I have a problem with spellcheck. I downloaded, built, and connected a
dictionary (~500,000 words), and it works fine. But I get suggestions for every
word (even correct words). Is it possible to get suggestions only for wrong
words?
-- 
View this message in context: 
http://old.nabble.com/spellcheck-tp27527425p27527425.html
Sent from the Solr - User mailing list archive at Nabble.com.



How to not limit maximum number of documents?

2010-02-10 Thread egon . o
Hi at all,

I'm working with Solr 1.4 and came across the fact that Solr limits the number
of documents returned in a response. This number can be changed with the
common query parameter 'rows'.

In my scenario it is very important that the response contains ALL documents in
the index! I played around with the 'rows' parameter but couldn't find a way to
do it.

I was not able to find any hint in the mailing list.
Thanks a lot in advance.

Cheers,
Egon


Re: Solr-JMX/Jetty agentId

2010-02-10 Thread Tim Terlegård
2010/2/10 Jan Simon Winkelmann winkelm...@newsfactory.de:
 I am (still) trying to get JMX to work. I have finally managed to get a Jetty
 installation running with the right parameters to enable JMX. Now the next
 problem has appeared: I need to get Solr to register its MBeans with the Jetty
 MBeanServer. Using <jmx serviceUrl="service:jmx:rmi:///jndi/rmi:///jettymbeanserver" />,
 Solr doesn't complain on loading, but the MBeans simply don't show up in JConsole,
 so I would like to use <jmx agentId="agentId" />. But where do I get the
 agentId? And what exactly does this Id represent? Does it change every time I
 restart Jetty?

I just have <jmx /> in solrconfig.xml. On the command line I start Solr with this:
$ java -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -jar start.jar

In jconsole I can browse the solr beans just fine.

/Tim
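
For reference, the jmx element Tim mentions lives inside the config element of
solrconfig.xml; a rough sketch of the variants discussed in this thread (the
agentId and serviceUrl values here are placeholders, not working examples):

<config>
  <!-- use an MBeanServer already started in the JVM, e.g. via
       -Dcom.sun.management.jmxremote, as described above -->
  <jmx />

  <!-- or attach to a specific agent or RMI service instead -->
  <!-- <jmx agentId="myAgentId" /> -->
  <!-- <jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solr" /> -->
</config>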


RE: analysing wild carded terms

2010-02-10 Thread Fuad Efendi
 hello *, quick question, what would i have to change in the query
 parser to allow wildcarded terms to go through text analysis?

I believe it is illogical: wildcarded terms will go through the terms
enumerator.




Getting max/min dates from solr index

2010-02-10 Thread Mark N
How can we get the max and min date from the Solr index? I need these
dates to draw a graph (for example, a timeline graph).


Also, can we use date faceting to show how many documents are indexed each
month?
Consider that I need to draw a timeline graph for the current year showing how
many records are indexed each month, so I will have months on the X axis and
the number of documents on the Y axis.

What would be a good approach to designing a schema to achieve this
functionality?


Any suggestions would be appreciated

thanks


-- 
Nipen Mark
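
Not from the thread, but as a sketch of one common approach: the min and max of a
date field can be pulled with two sorted one-row queries, and per-month counts can
come from Solr 1.4's date faceting. The field name indexed_date below is an
assumption, and the URLs are abbreviated:

.../select?q=*:*&rows=1&fl=indexed_date&sort=indexed_date+asc    (oldest document)
.../select?q=*:*&rows=1&fl=indexed_date&sort=indexed_date+desc   (newest document)

.../select?q=*:*&rows=0&facet=true&facet.date=indexed_date
    &facet.date.start=2010-01-01T00:00:00Z
    &facet.date.end=2011-01-01T00:00:00Z
    &facet.date.gap=%2B1MONTH        (i.e. +1MONTH, URL-encoded)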


RE: How to not limit maximum number of documents?

2010-02-10 Thread stefan.maric
I was just thinking along similar lines

As far as I can tell, you can use the parameters start & rows in combination to
control the retrieval of query results.

So
http://host:port/solr/select/?q=query
will retrieve up to results 1..10

http://host:port/solr/select/?q=query&start=11&rows=10
will retrieve up to results 11..20

So it is up to your application to control result traversal/pagination.


Question - does this mean that 
http://host:port/solr/select/?q=query&start=11&rows=10
runs the query a 2nd time?

And so on


Regards
Stefan Maric 


Cannot get like exact searching to work

2010-02-10 Thread Aaron Zeckoski
I am using SOLR 1.3 and my server is embedded and accessed using SOLRJ.
I would like to setup my searches so that exact matches are the first
results returned, followed by near matches, and finally token based
matches.
For example, if I have a summary field in the schema which is created
using copyField from a bunch of other fields:
"My item title, keyword, other, stuff"

I want this search to match the item above first and foremost:
1) "My item title*"

Then this one:
2) "my item*"

and finally this one should also work:
3) "my title"

I tried creating a field to hold exact match data (summaryExact) which
actually works if I paste in the precise text, but stops working as
soon as I add any wildcard to it. In other words, I get no matches for
"My item title*" but I get 1 match for "My item title". I also tried
this:
(summary:"my item" || summaryExact:"my item*"^3)
but that results in 0 matches as well.

I could not quite figure out which tokenizer to use if I don't want
any tokens created but just want to trim and lowercase the string, so
let me know if you have ideas on this. Basically, I want something
similar to database LIKE matching, without case sensitivity and probably
trimmed as well. I don't really want the field to be tokenized, though.
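
A rough sketch of such a field type, using stock Solr analysis factories (the
fieldType name is arbitrary); the KeywordTokenizer keeps the whole value as a
single token, so only lowercasing and trimming are applied:

<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- the entire field value becomes one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>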

I am attaching my schema in case that helps.
I have spent a few days reading through the SOLR documentation and
forums and trying various things to get this to work but I just end up
making the matching worse when I make changes. I appreciate any
pointers, links, or ideas.
Thanks!
-AZ


--
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
<?xml version="1.0" encoding="UTF-8" ?>

<!--
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default)
 or located where the classloader for the Solr webapp can find it.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<!-- Steeple Portal project schema - Aaron Zeckoski (aa...@caret.cam.ac.uk) -->
<schema name="steeple" version="1.1">
  <!-- this is a unified schema of multiple types since the searches need to be combined,
not completely sure if this is required
-->

  <types>
<!--
omitNorms - If you have tokenized fields of variable size and you want the field length to
affect the relevance score, then you do not want to omit norms.  Omitting norms is good for
fields where length is of no importance (e.g. gender=Male vs. gender=Female).  Omitting
norms saves you heap/RAM, one byte per doc per field without norms, I believe.

positionIncrementGap - Used for multivalued fields
With a position increment gap of 0, a phrase query of "doe bob" would
be a match.  But often it is undesirable for that kind of match across
different field values.  A position increment gap controls the virtual
space between the last token of one field instance and the first token
of the next instance.  With a gap of 100, this prevents phrase queries
(even with a modest slop factor) from matching across instances.

Comma delimited splitter (maybe for keywords if they are delimited)
<analyzer class="org.apache.lucene.analysis.PatternTokenizerFactory" pattern=", *" />
-->

<!-- The identifier should always be extremely simple so there are no filters on it -->
<fieldType name="identifier" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true" />
<!-- special field for exact text matches, no processing -->
<fieldType name="exact" class="solr.TextField" compressed="false" indexed="true" stored="true" />
<!-- name indicates names, titles, and summaries,
these are not tokenized but are flattened (html and special chars) to make searches easier -->
<fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- splits things up <filter class="solr.StandardFilterFactory"/> -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="year" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true" />
<fieldtype name="keywords" class="solr.TextField" positionIncrementGap="10" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldtype>

<!-- standard field 

Re: How to not limit maximum number of documents?

2010-02-10 Thread egon . o
Hi Stefan,

you are right. I noticed this page-based result handling too. For web pages it 
is handy to maintain a number-of-results-per-page parameter together with an 
offset to browse result pages. Both can be done with Solr's 'start' and 'rows' 
parameters.
But as I don't use Solr in a web context, it's important for me to get all 
results in one go.

While waiting for answers I was working on a work-around and came across the 
LukeRequestHandler (http://wiki.apache.org/solr/LukeRequestHandler). It allows 
you to query the index and obtain meta information about it. I found a parameter in 
the response called 'numDocs' which seems to contain the current number of 
index rows.

So I was now thinking about first asking for the number of index rows via the 
LukeRequestHandler and then setting the 'rows' parameter to this value. 
Apparently, this is quite expensive, as one front-end query always leads to two 
back-end queries. So I'm still searching for a better way to do this!

Cheers,
Egon



 Original Message 
 Date: Wed, 10 Feb 2010 13:19:05 +
 From: stefan.ma...@bt.com
 To: solr-user@lucene.apache.org
 Subject: RE: How to not limit maximum number of documents?

 I was just thinking along similar lines
 
 As far as I can tell you can use the parameters start & rows in
 combination to control the retrieval of query results
 
 So
 http://host:port/solr/select/?q=query
 Will retrieve up to results 1..10
 
 http://host:port/solr/select/?q=query&start=11&rows=10
 Will retrieve up results 11..20
 
 So it is up to your application to control result traversal/pagination
 
 
 Question - does this mean that 
 http://host:port/solr/select/?q=query&start=11&rows=10
 Runs the query a 2nd time
 
 And so on
 
 
 Regards
 Stefan Maric 



RE: How to not limit maximum number of documents?

2010-02-10 Thread stefan.maric
Egon

If you first run your query with q=query&rows=0

then you get back an indication of the total number of docs:
<result name="response" numFound="53" start="0"/>

Now your app can query again to get the first n rows & manage forward/backward 
traversal of results with subsequent queries



Regards
Stefan Maric 
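
A minimal SolrJ sketch of that two-step approach (the server URL and query string
are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("some query");
    q.setRows(0);                            // first pass: only numFound is needed
    long total = server.query(q).getResults().getNumFound();
    q.setRows((int) total);                  // second pass: ask for everything
    SolrDocumentList all = server.query(q).getResults();
    System.out.println("fetched " + all.size() + " of " + total);
  }
}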

-Original Message-
From: ego...@gmx.de [mailto:ego...@gmx.de] 
Sent: 10 February 2010 14:08
To: solr-user@lucene.apache.org
Subject: Re: How to not limit maximum number of documents?

Hi Stefan,

you are right. I noticed this page-based result handling too. For web pages it 
is handy to maintain a number-of-results-per-page parameter together with an 
offset to browse result pages. Both can be done be solr's 'start' and 'rows' 
parameters.
But as I don't use Solr in a web context it's important for me to get all 
results in one go.

While waiting for answers I was working on a work-around and came across the 
LukeRequestHandler (http://wiki.apache.org/solr/LukeRequestHandler). It allows 
to query the index and obtain meta information about it. I found a parameter in 
the response called 'numDocs' which seams to contain the current number of 
index rows.

So I was now thinking about first asking for the number of index rows via the 
LukeRequestHandler and then setting the 'rows' parameter to this value. 
Apparently, this is quite expensive as one front-end query always leads to two 
back-end queries. So I'm still searching for a better way to do this!

Cheers,
Egon



 Original Message 
 Date: Wed, 10 Feb 2010 13:19:05 +
 From: stefan.ma...@bt.com
 To: solr-user@lucene.apache.org
 Subject: RE: How to not limit maximum number of documents?

 I was just thinking along similar lines
 
 As far as I can tell you can use the parameters start & rows in 
 combination to control the retrieval of query results
 
 So
 http://host:port/solr/select/?q=query
 Will retrieve up to results 1..10
 
 http://host:port/solr/select/?q=query&start=11&rows=10
 Will retrieve up results 11..20
 
 So it is up to your application to control result traversal/pagination
 
 
 Question - does this mean that
 http://host:port/solr/select/?q=query&start=11&rows=10
 Runs the query a 2nd time
 
 And so on
 
 
 Regards
 Stefan Maric



Re: How to not limit maximum number of documents?

2010-02-10 Thread Ron Chan
Just set rows to a very large number, larger than the number of documents 
available. 

It is useful to set the fl parameter to just the fields required, to avoid memory 
problems if each document contains a lot of information. 
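
As a rough illustration of this (the number and field names are placeholders):

http://host:port/solr/select/?q=query&rows=1000000&fl=id,name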


- Original Message - 
From: stefan maric stefan.ma...@bt.com 
To: solr-user@lucene.apache.org 
Sent: Wednesday, 10 February, 2010 2:14:05 PM 
Subject: RE: How to not limit maximum number of documents? 

Egon 

If you first run your query with q=query&rows=0 

Then you get back an indication of the total number of docs: 
<result name="response" numFound="53" start="0"/> 

Now your app can query again to get 1st n rows & manage forward|backward 
traversal of results by subsequent queries 



Regards 
Stefan Maric 

-Original Message- 
From: ego...@gmx.de [mailto:ego...@gmx.de] 
Sent: 10 February 2010 14:08 
To: solr-user@lucene.apache.org 
Subject: Re: How to not limit maximum number of documents? 

Hi Stefan, 

you are right. I noticed this page-based result handling too. For web pages it 
is handy to maintain a number-of-results-per-page parameter together with an 
offset to browse result pages. Both can be done be solr's 'start' and 'rows' 
parameters. 
But as I don't use Solr in a web context it's important for me to get all 
results in one go. 

While waiting for answers I was working on a work-around and came across the 
LukeRequestHandler (http://wiki.apache.org/solr/LukeRequestHandler). It allows 
to query the index and obtain meta information about it. I found a parameter in 
the response called 'numDocs' which seams to contain the current number of 
index rows. 

So I was now thinking about first asking for the number of index rows via the 
LukeRequestHandler and then setting the 'rows' parameter to this value. 
Apparently, this is quite expensive as one front-end query always leads to two 
back-end queries. So I'm still searching for a better way to do this! 

Cheers, 
Egon 



 Original Message 
 Date: Wed, 10 Feb 2010 13:19:05 + 
 From: stefan.ma...@bt.com 
 To: solr-user@lucene.apache.org 
 Subject: RE: How to not limit maximum number of documents? 

 I was just thinking along similar lines 
 
 As far as I can tell you can use the parameters start & rows in 
 combination to control the retrieval of query results 
 
 So 
 http://host:port/solr/select/?q=query 
 Will retrieve up to results 1..10 
 
 http://host:port/solr/select/?q=query&start=11&rows=10 
 Will retrieve up results 11..20 
 
 So it is up to your application to control result traversal/pagination 
 
 
 Question - does this mean that 
 http://host:port/solr/select/?q=query&start=11&rows=10 
 Runs the query a 2nd time 
 
 And so on 
 
 
 Regards 
 Stefan Maric 



AW: Solr-JMX/Jetty agentId

2010-02-10 Thread Jan Simon Winkelmann
2010/2/10 Jan Simon Winkelmann winkelm...@newsfactory.de:
 I am (still) trying to get JMX to work. I have finally managed to get a Jetty
 installation running with the right parameters to enable JMX. Now the next
 problem has appeared: I need to get Solr to register its MBeans with the Jetty
 MBeanServer. Using <jmx serviceUrl="service:jmx:rmi:///jndi/rmi:///jettymbeanserver" />,
 Solr doesn't complain on loading, but the MBeans simply don't show up in JConsole,
 so I would like to use <jmx agentId="agentId" />. But where do I get the
 agentId? And what exactly does this Id represent? Does it change every time I
 restart Jetty?

I just have <jmx /> in solrconfig.xml. On the command line I start Solr with this:
$ java -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -jar start.jar

In jconsole I can browse the solr beans just fine.




Thanks for that, it appears my thinking was just too complicated here. Works 
fine now :)

Best
Jan


Re: How to not limit maximum number of documents?

2010-02-10 Thread egon . o
Setting the 'rows' parameter to a number larger than the number of documents 
available requires that you know how many are available. That's what I intended 
to retrieve via the LukeRequestHandler.

Anyway, nice approach, Stefan. I'm afraid I had forgotten this 'numFound' aspect. :)
But still, it feels like a hack. Originally I was searching more for something 
like:

q=query&rows=-1

which leaves the API to do the job (efficiently!). :)
The question is:
does Solr support something like this? Or should we write a feature request?

Cheers,
Egon



 Original-Message 
 Date: Wed, 10 Feb 2010 14:38:51 + (GMT)
 From: Ron Chan rc...@i-tao.com
 To: solr-user@lucene.apache.org
 Subject: Re: How to not limit maximum number of documents?

 just set the rows to a very large number, larger than the number of
 documents available 
 
 useful to set the fl parameter with the fields required to avoid memory
 problems, if each document contains a lot of information 
 
 
 - Original Message - 
 From: stefan maric stefan.ma...@bt.com 
 To: solr-user@lucene.apache.org 
 Sent: Wednesday, 10 February, 2010 2:14:05 PM 
 Subject: RE: How to not limit maximum number of documents? 
 
 Egon 
 
 If you first run your query with q=query&rows=0 
 
 Then you get back an indication of the total number of docs: 
 <result name="response" numFound="53" start="0"/> 
 
 Now your app can query again to get 1st n rows & manage forward|backward
 traversal of results by subsequent queries 
 
 
 
 Regards 
 Stefan Maric


RE: How to not limit maximum number of documents?

2010-02-10 Thread stefan.maric
Yes, I tried q=query&rows=-1 the other day and gave up.

But as you say, it wouldn't help, because you might get 
a) timeouts, because you have to wait a 'long' time for the large set of results 
to be returned
b) exceptions being thrown, because you're retrieving too much info to be thrown 
around the system



Regards
Stefan Maric 

-Original Message-
From: ego...@gmx.de [mailto:ego...@gmx.de] 
Sent: 10 February 2010 15:06
To: solr-user@lucene.apache.org
Subject: Re: How to not limit maximum number of documents?

Setting the 'rows' parameter to a number larger than the number of documents 
available requires that you know how much are available. That's what I intended 
to retrieve via the LukeRequestHandler.

Anyway, nice approach Stefan. I'm afraid I forgot this 'numFound' aspect. :) 
But still, it feels like a hack. Originally I was searching more for something 
like:

q=query&rows=-1

Which leaves the API to do the job (efficiently!). :) The question is:
Does Solr support something? Or should we write a feature request?

Cheers,
Egon



 Original-Message 
 Date: Wed, 10 Feb 2010 14:38:51 + (GMT)
 From: Ron Chan rc...@i-tao.com
 To: solr-user@lucene.apache.org
 Subject: Re: How to not limit maximum number of documents?

 just set the rows to a very large number, larger than the number of 
 documents available
 
 useful to set the fl parameter with the fields required to avoid 
 memory problems, if each document contains a lot of information
 
 
 - Original Message -
 From: stefan maric stefan.ma...@bt.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, 10 February, 2010 2:14:05 PM
 Subject: RE: How to not limit maximum number of documents? 
 
 Egon
 
 If you first run your query with q=query&rows=0
 
 Then you get back an indication of the total number of docs: 
 <result name="response" numFound="53" start="0"/>
 
 Now your app can query again to get 1st n rows & manage 
 forward|backward traversal of results by subsequent queries
 
 
 
 Regards
 Stefan Maric


Re: How to not limit maximum number of documents?

2010-02-10 Thread Walter Underwood
Solr will not do this efficiently. Getting all rows will be very slow. Adding a 
parameter will not make it fast.

Why do you want to do this?

wunder

On Feb 10, 2010, at 7:06 AM, ego...@gmx.de wrote:

 Setting the 'rows' parameter to a number larger than the number of documents 
 available requires that you know how much are available. That's what I 
 intended to retrieve via the LukeRequestHandler.
 
 Anyway, nice approach Stefan. I'm afraid I forgot this 'numFound' aspect. :)
 But still, it feels like a hack. Originally I was searching more for 
 something like:
 
  q=query&rows=-1
 
 Which leaves the API to do the job (efficiently!). :)
 The question is:
 Does Solr support something? Or should we write a feature request?
 
 Cheers,
 Egon
 
 
 
  Original-Message 
  Date: Wed, 10 Feb 2010 14:38:51 + (GMT)
  From: Ron Chan rc...@i-tao.com
  To: solr-user@lucene.apache.org
  Subject: Re: How to not limit maximum number of documents?
 
 just set the rows to a very large number, larger than the number of
 documents available 
 
 useful to set the fl parameter with the fields required to avoid memory
 problems, if each document contains a lot of information 
 
 
 - Original Message - 
 From: stefan maric stefan.ma...@bt.com 
 To: solr-user@lucene.apache.org 
 Sent: Wednesday, 10 February, 2010 2:14:05 PM 
 Subject: RE: How to not limit maximum number of documents? 
 
 Egon 
 
  If you first run your query with q=query&rows=0 
 
  Then you get back an indication of the total number of docs: 
  <result name="response" numFound="53" start="0"/> 
 
  Now your app can query again to get 1st n rows & manage forward|backward
 traversal of results by subsequent queries 
 
 
 
 Regards 
 Stefan Maric
 



Re: How to not limit maximum number of documents?

2010-02-10 Thread egon . o
Okay. So we have to leave this question open for now. There might be other 
(more advanced) users who can answer it. For sure, the solution we found is 
not great.

In the meantime, I will look for a way to submit a feature request. :)



 Original-Message 
 Date: Wed, 10 Feb 2010 15:13:49 +
 From: stefan.ma...@bt.com
 To: solr-user@lucene.apache.org
 Subject: RE: How to not limit maximum number of documents?

 Yes, I tried the q=query&rows=-1 - the other day and gave up
 
 But as you say it wouldn't help because you might get 
 a) timeouts because you have to wait a 'long' time for the large set of
 results to be returned
 b) exceptions being thrown because you're retrieving too much info to be
 thrown around the system


Re: How to not limit maximum number of documents?

2010-02-10 Thread Ron Chan
I meant available in total, not just what satisfies the particular query. 

You should have at least an estimate of the total number of documents, even if 
it grows daily. 

And if you are talking about millions of rows and you are trying to retrieve them 
all, IMHO not getting all of them will be the least of your problems. 


- Original Message - 
From: egon o ego...@gmx.de 
To: solr-user@lucene.apache.org 
Sent: Wednesday, 10 February, 2010 3:06:25 PM 
Subject: Re: How to not limit maximum number of documents? 

Setting the 'rows' parameter to a number larger than the number of documents 
available requires that you know how much are available. That's what I intended 
to retrieve via the LukeRequestHandler. 

Anyway, nice approach Stefan. I'm afraid I forgot this 'numFound' aspect. :) 
But still, it feels like a hack. Originally I was searching more for something 
like: 

q=query&rows=-1 

Which leaves the API to do the job (efficiently!). :) 
The question is: 
Does Solr support something? Or should we write a feature request? 

Cheers, 
Egon 



 Original-Message  
 Date: Wed, 10 Feb 2010 14:38:51 + (GMT) 
 From: Ron Chan rc...@i-tao.com 
 To: solr-user@lucene.apache.org 
 Subject: Re: How to not limit maximum number of documents? 

 just set the rows to a very large number, larger than the number of 
 documents available 
 
 useful to set the fl parameter with the fields required to avoid memory 
 problems, if each document contains a lot of information 
 
 
 - Original Message - 
 From: stefan maric stefan.ma...@bt.com 
 To: solr-user@lucene.apache.org 
 Sent: Wednesday, 10 February, 2010 2:14:05 PM 
 Subject: RE: How to not limit maximum number of documents? 
 
 Egon 
 
 If you first run your query with q=query&rows=0 
 
 Then you get back an indication of the total number of docs: 
 <result name="response" numFound="53" start="0"/> 
 
 Now your app can query again to get 1st n rows & manage forward|backward 
 traversal of results by subsequent queries 
 
 
 
 Regards 
 Stefan Maric 


delete via DIH

2010-02-10 Thread Lukas Kahwe Smith
Hi,

There is a solution to update via DIH, but is there also a way to define a 
query that fetches IDs for documents that should be removed?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
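
For reference (this is not from the thread): DIH's delta import can pick up
deletions through the deletedPkQuery attribute on an entity. A sketch with
hypothetical table and column names:

<entity name="item" pk="id"
        query="select id, name from item"
        deltaQuery="select id from item
                    where last_modified &gt; '${dataimporter.last_index_time}'"
        deletedPkQuery="select id from deleted_items
                        where deleted_at &gt; '${dataimporter.last_index_time}'">
  <field column="id" name="id"/>
  <field column="name" name="name"/>
</entity>

The IDs returned by deletedPkQuery are removed from the index during delta-import.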





question/suggestion for Solr-236 patch

2010-02-10 Thread gdeconto

I have been able to apply and use the solr-236 patch (field collapsing)
successfully.

Very, very cool and powerful.

My one comment/concern is that the collapseCount and aggregate function
values in the collapse_counts list only represent the collapsed documents
(ie the ones that are not shown in results).

Are there any plans to include the non-collapsed (?) document in the
collapseCount and aggregate function values (ie so that it includes ALL
documents, not just the collapsed ones)?  Possibly via some parameter like
collapse.includeAll?

I think this would be a great addition to the collapse code (and solr
functionality) via what I would think is a small change.
-- 
View this message in context: 
http://old.nabble.com/question-suggestion-for-Solr-236-patch-tp27533695p27533695.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: analysing wild carded terms

2010-02-10 Thread Joe Calderon
sorry, what I meant to say is: apply text analysis to the part of the
query that is wildcarded. For example, if a term with latin1 diacritics
is wildcarded, I'd still like to run it through ISOLatin1Filter.

On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi f...@efendi.ca wrote:
 hello *, quick question, what would i have to change in the query
 parser to allow wildcarded terms to go through text analysis?

 I believe it is illogical. wildcarded terms will go through terms
 enumerator.





Re: question/suggestion for Solr-236 patch

2010-02-10 Thread gdeconto


Joe Calderon-2 wrote:
 
 you can do that very easily yourself in a post processing step after
 you receive the solr response
 

True (and I am already doing so).

I was thinking that having this done as part of the field collapsing code
might be faster than doing it via post-processing (i.e. no need to navigate
the XML results for two different values for each collapsed set, add the
numbers to get the total, etc.).

It was just a suggestion. Field collapsing is a great feature.
-- 
View this message in context: 
http://old.nabble.com/question-suggestion-for-Solr-236-patch-tp27533695p27535153.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: analysing wild carded terms

2010-02-10 Thread Steven A Rowe
Hi Joe,

See this recent thread from a user with a very similar issue:

http://old.nabble.com/No-wildcards-with-solr.ASCIIFoldingFilterFactory--td24162104.html

In the above thread, Mark Miller mentions that Lucene's AnalyzingQueryParser 
should do the trick, but would need to be integrated into Solr.

Down at the bottom of the thread, the original poster has a patch file 
implementing Solr integration that he says worked for him.

Steve

From: Joe Calderon [calderon@gmail.com]
Sent: Wednesday, February 10, 2010 12:14 PM
To: solr-user@lucene.apache.org
Subject: Re: analysing wild carded terms

sorry, what i meant to say is apply text analysis to the part of the
query that is wildcarded, for example if a term with latin1 diacritics
is wildcarded ide still like to run it through ISOLatin1Filter

On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi f...@efendi.ca wrote:
 hello *, quick question, what would i have to change in the query
 parser to allow wildcarded terms to go through text analysis?

 I believe it is illogical. wildcarded terms will go through terms
 enumerator.




Re: Distributed search and haproxy and connection build up

2010-02-10 Thread Ian Connor
Thanks,

I bypassed haproxy as a test and it did reduce the number of connections -
but it did not seem as though these connections were hurting anything.

Ian.

On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog goks...@gmail.com wrote:

 This goes through the Apache Commons HTTP client library:
 http://hc.apache.org/httpclient-3.x/

 We used 'balance' at another project and did not have any problems.

 On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor ian.con...@gmail.com wrote:
  I have been using distributed search with haproxy but noticed that I am
  suffering a little from tcp connections building up waiting for the OS
 level
  closing/time out:
 
  netstat -a
  ...
  tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
  CLOSE_WAIT
  tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
  TIME_WAIT
  tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
  CLOSE_WAIT
  tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
  TIME_WAIT
  tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
  CLOSE_WAIT
  ...
 
  Digging a little into the haproxy documentation, it seems that they do
 not
  support persistent connections.
 
  Does solr normally persist the connections between shards (would this
  problem happen even without haproxy)?
 
  Ian.
 



 --
 Lance Norskog
 goks...@gmail.com




-- 
Regards,

Ian Connor


dismax and multi-language corpus

2010-02-10 Thread Claudio Martella
Hello list,

I have a corpus with 3 languages, so I set up a text content field (with
no stemming) and 3 text-[en|it|de] fields with specific snowball stemmers.
I copyField the text to my language-aware fields. So, I set up this dismax
search handler:

<requestHandler name="content" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="pf">title^1.2 content-en^0.8 content-it^0.8
content-de^0.8</str>
    <str name="bf">title^1.2 content-en^0.8 content-it^0.8
content-de^0.8</str>
    <str name="qf">title^1.2 content-en^0.8 content-it^0.8
content-de^0.8</str>
    <float name="tie">0.1</float>
  </lst>
</requestHandler>


but i get this error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Expected
',' at position 7 in 'content-en'

type Status report

message org.apache.lucene.queryParser.ParseException: Expected ',' at
position 7 in 'content-en'

description The request sent by the client was syntactically incorrect
(org.apache.lucene.queryParser.ParseException: Expected ',' at position
7 in 'content-en').

Any idea?

TIA

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research  Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it





RE: Indexing / querying multiple data types

2010-02-10 Thread Stefan Maric
Lance

After a bit more reading and cleaning up my configuration (case sensitivity 
corrected, but it didn't appear to be affecting the indexing & I don't use the 
atomID field for querying anyhow),

I've added a docType field when I index my data and now use the fq parameter to 
filter on that new field.
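
A rough sketch of that setup, combining the docType field mentioned above with
Lance's TemplateTransformer suggestion (quoted below); the entity names follow the
thread, and schema.xml would also need a matching docType field:

<document>
  <entity name="name1" transformer="TemplateTransformer"
          query="select id, atomID, name, description from v_1">
    <field column="docType" template="name1"/>
  </entity>
  <entity name="name2" transformer="TemplateTransformer"
          query="select id, atomID, name, description from V_2">
    <field column="docType" template="name2"/>
  </entity>
</document>

Queries can then filter per source with something like fq=docType:name1.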





-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: 10 February 2010 03:28
To: solr-user@lucene.apache.org
Subject: Re: Indexing / querying multiple data types


A couple of minor problems:

The qt parameter (Que Tee) selects the parser for the q (Q for query)
parameter. I think you mean 'qf':

http://wiki.apache.org/solr/DisMaxRequestHandler#qf_.28Query_Fields.29

Another problems with atomID, atomId, atomid: Solr field names are
case-sensitive. I don't know how this plays out.

Now, to the main part: the entity name="name1" part does not create
a column named "name1".
The two queries only populate the same namespace of four fields: id,
atomID, name, description.

If you want data from each entity to have a constant field
distinguishing it, you have to create a new field with a constant
value. You do this with the TemplateTransformer.

http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer

Add this as an entity attribute to both entities:
transformer="TemplateTransformer"
and add this as a column to each entity:
<field column="name" template="name1"/> (and then "name2").

You may have to do something else for these to appear in the document.

On Tue, Feb 9, 2010 at 12:41 AM,  stefan.ma...@bt.com wrote:
 Sven

 In my data-config.xml I have the following
 <document>
   <entity name="name1" query="select id, atomID, name, description from v_1" />
   <entity name="name2" query="select id, atomID, name, description from V_2" />
 </document>

 In my schema.xml I have
   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="atomId" type="string" indexed="false" stored="true" required="true" />
   <field name="description" type="text" indexed="true" stored="true" />

 And in my solrconfig.xml I have
  <requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="name1" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">name^1.5 description^1.0</str>
    </lst>
  </requestHandler>

  <requestHandler name="contacts" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">name^1.5 description^1.0</str>
    </lst>
  </requestHandler>

 And the
  <requestHandler name="dismax" class="solr.SearchHandler">
 Has been untouched

 So when I run
 http://localhost:7001/solr/select/?q=food&qt=name1
 I was expecting to get results from the data that had been indexed by entity
 name="name1".


 Regards
 Stefan Maric




-- 
Lance Norskog
goks...@gmail.com




DataImportHandler - too many connections MySQL error after upgrade to Solr 1.4 release

2010-02-10 Thread Bojan Šmid
Hi all,

  I had DataImportHandler working perfectly on Solr 1.4 nightly build from
June 2009. I upgraded the Solr to 1.4 release and started getting errors:


Caused by: com.mysql.jdbc.exceptions.MySQLNonTransientConnectionException:
Server connection failure during transaction. Due to underlying exception:
'com.mysql.jdbc.except
ions.MySQLNonTransientConnectionException: Too many connections'.


  This is the same machine and the same setup (except the new Solr) that never had
problems. The error doesn't pop up at the beginning; DIH runs for a few hours
and then breaks (after a few million rows are processed).

  Solr is the only process using MySQL, and max_connections on MySQL is set to
100, so it seems like there might be a connection leak in DIH. A few
more details on the setup:
  MySQL version 5.0.67
  driver: mysql-connector-java-5.0.8-bin.jar
  Java: 1.6.0_14
  connection URL parameters : autoReconnect=true, batchSize=-1
  OS : CentOS 5.2

  Did anyone else have similar problems with the 1.4 release?


  Regards


implementing profanity detector

2010-02-10 Thread Mike Perham
FYI this does not work.  It appears that the update runs on a
different thread from the analysis, perhaps because the update is done
when the commit happens?  I'm sending the document XML with
commitWithin=6.

I would appreciate any other ideas.  I'm drawing a blank on how to
implement this efficiently with Lucene/Solr.

mike

On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 How about this crazy idea - a custom TokenFilter that stores the safe flag in 
 ThreadLocal?



 - Original Message 
  From: Mike Perham mper...@onespot.com
  To: solr-user@lucene.apache.org
  Sent: Thu, January 28, 2010 4:46:54 PM
  Subject: implementing profanity detector
 
  We'd like to implement a profanity detector for documents during indexing.
  That is, given a file of profane words, we'd like to be able to mark a
  document as safe or not safe if it contains any of those words so that we
  can have something similar to google's safe search.
 
  I'm trying to figure out how best to implement this with Solr 1.4:
 
  - An UpdateRequestProcessor would allow me to dynamically populate a safe
  boolean field but requires me to pull out the content, tokenize it and run
  each token through my set of profanities, essentially running the analysis
  pipeline again.  That's a lot of overhead AFAIK.
 
  - A TokenFilter would allow me to tap into the existing analysis pipeline so
  I get the tokens for free but I can't access the document.
 
  Any suggestions on how to best implement this?
 
  Thanks in advance,
  mike
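
For reference, the ThreadLocal-based TokenFilter Otis floats above would look
roughly like this with the Lucene 2.9-era attribute API (class and field names are
illustrative, and as Mike reports above the approach breaks down if the update
logic runs on a different thread than the analysis):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class ProfanityFlagFilter extends TokenFilter {
  // per-thread "safe" flag; the caller would reset it before analyzing each document
  public static final ThreadLocal<Boolean> SAFE = new ThreadLocal<Boolean>() {
    @Override protected Boolean initialValue() { return Boolean.TRUE; }
  };

  private final Set<String> profanities;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public ProfanityFlagFilter(TokenStream input, Set<String> profanities) {
    super(input);
    this.profanities = profanities;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (profanities.contains(termAtt.term())) {
      SAFE.set(Boolean.FALSE);   // flag the current thread's document as unsafe
    }
    return true;
  }
}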



Re: dismax and multi-language corpus

2010-02-10 Thread Otis Gospodnetic
Claudio - fields with '-' in them can be problematic.

Side comment: do you really want to search across all languages at once?  If 
not, maybe 3 different dismax configs would make your searches better.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
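
A sketch of what the handler might look like with '-' avoided in the field names
(content_en etc. are assumptions, not Claudio's actual schema; the bf entry is
omitted here because bf expects function queries rather than a list of fields):

<requestHandler name="content" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^1.2 content_en^0.8 content_it^0.8 content_de^0.8</str>
    <str name="pf">title^1.2 content_en^0.8 content_it^0.8 content_de^0.8</str>
    <float name="tie">0.1</float>
  </lst>
</requestHandler>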



- Original Message 
 From: Claudio Martella claudio.marte...@tis.bz.it
 To: solr-user@lucene.apache.org
 Sent: Wed, February 10, 2010 3:15:40 PM
 Subject: dismax and multi-language corpus
 
 Hello list,
 
 I have a corpus with 3 languages, so i setup a text content field (with
 no stemming) and 3 text-[en|it|de] fields with specific snowball stemmers.
 i copyField the text to my language-away fields. So, I setup this dismax
 searchHandler:
 
 
 
    <str name="defType">dismax</str>
    <str name="pf">title^1.2 content-en^0.8 content-it^0.8
  content-de^0.8</str>
    <str name="bf">title^1.2 content-en^0.8 content-it^0.8
  content-de^0.8</str>
    <str name="qf">title^1.2 content-en^0.8 content-it^0.8
  content-de^0.8</str>
    <float name="tie">0.1</float>
 
 
 
 
 but i get this error:
 
 HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Expected
 ',' at position 7 in 'content-en'
 
 type Status report
 
 message org.apache.lucene.queryParser.ParseException: Expected ',' at
 position 7 in 'content-en'
 
 description The request sent by the client was syntactically incorrect
 (org.apache.lucene.queryParser.ParseException: Expected ',' at position
 7 in 'content-en').
 
 Any idea?
 
 TIA
 
 Claudio
 
 -- 
 Claudio Martella
 Digital Technologies
 Unit Research  Development - Analyst
 
 TIS innovation park
 Via Siemens 19 | Siemensstr. 19
 39100 Bolzano | 39100 Bozen
 Tel. +39 0471 068 123
 Fax  +39 0471 068 129
 claudio.marte...@tis.bz.it http://www.tis.bz.it
 



Need a bit of help, Solr 1.4: type text.

2010-02-10 Thread Dickey, Dan
I'm using the standard "text" type for a field, and part of the data
being indexed is "13th", as in "Friday the 13th".

I can't seem to get it to match when I'm querying for "Friday the 13th",
either quoted or not.

One thing that does match is "13 th", if I send the search query with a
space in between...

Any suggestions?

I know this is short on detail, but it's been a long day... time to get
outta here.

Thanks for any and all help.

-Dan

 



Re: Need a bit of help, Solr 1.4: type text.

2010-02-10 Thread Yu-Shan Fung
Check out the configuration of WordDelimiterFilterFactory in your
schema.xml.

Depending on your settings, it's probably tokenizing "13th" into "13" and
"th". You can also have them concatenated back into a single token, but I
can't remember the exact parameter. I think it could be catenateAll.
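
The relevant knobs look roughly like this; a sketch based on the stock text type,
with catenateAll switched on as suggested (whether the other parameters should stay
this way depends on the data, and the index- and query-time analyzers usually need
to agree, with a reindex after any change):

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="1"
        splitOnCaseChange="1"/>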



On Wed, Feb 10, 2010 at 4:32 PM, Dickey, Dan dan.dic...@savvis.net wrote:

 I'm using the standard text type for a field, and part of the data
 being indexed is 13th, as in Friday the 13th.

 I can't seem to get it to match when I'm querying for Friday the 13th
 either quoted or not.

 One thing that does match is 13 th if I send the search query with a
 space between...

 Any suggestions?

 I know this is short on detail, but it's been a long day... time to get
 outta here.

 Thanks for any and all help.

-Dan








-- 
“When nothing seems to help, I go look at a stonecutter hammering away at
his rock perhaps a hundred times without as much as a crack showing in it.
Yet at the hundred and first blow it will split in two, and I know it was
not that blow that did it, but all that had gone before.” — Jacob Riis


Re: How to configure multiple data import types

2010-02-10 Thread Chris Hostetter

: Subject: How to configure multiple data import types
: In-Reply-To: 4b6c0de5.8010...@zib.de

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Re: Indexing / querying multiple data types

2010-02-10 Thread Chris Hostetter

: Subject: Indexing / querying multiple data types
: In-Reply-To: 8cf3f00d0572f8479efcd0783be11eb1927...@xmb-rcd-104.cisco.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking





-Hoss



Re: Faceting

2010-02-10 Thread Chris Hostetter

: NOTE: Please start a new email thread for a new topic (See 
: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)

FWIW: I'm the most nit-picky person I know about thread hijacking, but I 
don't see any MIME headers to indicate that Jose did that.

:  If i follow this path can i then facet on email and/or link ? For
:  example combining facet field with facet value params?

Any indexed field can be faceted on ... it's hard to be sure what exactly 
your goal is, but if you ultimately want to be able to have a list of 
search results, and then display facet info like "number of results 
containing an email address" and "number of results containing a URL", then 
yes: as long as you have a way of extracting that metadata and including 
it in an indexed field, you can facet on it ... you could use field 
faceting on something like a "properties" field (where the indexed 
values are "contains_email" and "contains_url", etc...) or you could use 
facet queries to check arbitrary criteria (ie: facet.query=has_email:true 
& facet.query=urls:[* TO *], etc...).
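
As a concrete illustration of the two styles (the field and facet names are taken
from the paragraph above; the rest of each URL is abbreviated):

.../select?q=foo&rows=0&facet=true&facet.field=properties
.../select?q=foo&rows=0&facet=true&facet.query=has_email:true&facet.query=urls:[*+TO+*]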



-Hoss


Re: How to not limit maximum number of documents?

2010-02-10 Thread Chris Hostetter

: Okay. So we have to leave this question open for now. There might be 
: other (more advanced) users that can answer this question. It's for 
: sure, the solution we found is not quite good.

The question really isn't open, it's a FAQ...

http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F


-Hoss



Query elevation based on field

2010-02-10 Thread Jason Chaffee
Is it possible to do query elevation based on a field?

Basically, I would like to search the same term on three different fields:

q=field1:term OR field2:term OR field3:term

and I would like to sort the results by a fourth field:

sort=field4+asc

However, I would like to elevate all of the field1 matches to the beginning,
with those matches sorted ascending, and the rest of the field2 and field3
matches sorted ascending after them.

Is this possible?

 

Thanks.



RE: Index Corruption after replication by new Solr 1.4 Replication

2010-02-10 Thread Osborn Chan
Hi All,

I found a file corruption issue when using both EmbeddedSolrServer and 
Solr 1.4 Java-based replication together on a slave server.


In my slave server, I have 2 webapps in a Tomcat instance:
1) a multicore webapp with the slave config
2) my custom webapp, which uses EmbeddedSolrServer to query the Solr index data.
 
Both webapps were set up according to the instructions from the Solr wiki.
However, I found a multi-threading issue which causes index file
corruption.

The following is the root cause:
EmbeddedSolrServer requires a CoreContainer object as a parameter. 
However, during the creation of the CoreContainer object, the process loads the 
slave Solr configuration, which silently creates an extra ReplicationHandler 
(SnapPuller) in the background. But there is already a ReplicationHandler 
(SnapPuller) created by the multicore webapp because of the slave configuration.

As a result, there are 2 threads doing file replication at the same time. This 
causes index corruption with various IOExceptions.
After I replaced the usage of EmbeddedSolrServer with CommonsHttpSolrServer 
(i.e. stopped creating a CoreContainer object on the slave server), Solr 1.4 
Java-based replication works perfectly without any file corruption issue.

In order to use EmbeddedSolrServer on a slave server, I think we need a way to 
create a CoreContainer object with the slave configuration without creating an 
extra thread to replicate files.
Should I file a bug?

Thanks,

Osborn



-Original Message-
From: Osborn Chan [mailto:oc...@shutterfly.com] 
Sent: Friday, January 15, 2010 12:35 PM
To: solr-user@lucene.apache.org
Subject: RE: Index Corruption after replication by new Solr 1.4 Replication

Hi Otis,

Thanks. There is no NFS anymore, and all index files are local. We migrated to 
the new Solr 1.4 replication in order to avoid all the NFS stale exceptions. 

Thanks,

Osborn

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, January 15, 2010 12:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Index Corruption after replication by new Solr 1.4 Replication

This is not a direct answer to your question, but can you avoid NFS?  My first 
guess would be that NFS somehow causes this problem.  If you check the ML 
archives for: NFS lock , you will see what I mean.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
 From: Osborn Chan oc...@shutterfly.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Fri, January 15, 2010 3:23:21 PM
  Subject: Index Corruption after replication by new Solr 1.4 Replication
 
 Hi all,
 
 I have migrated new Solr 1.4 Replication feature with multicore support from 
 Solr 1.2 with NFS mounting recently. The following exceptions are in 
 catalina.log from time to time, and there are some EOF exceptions which I 
 believe the slave index files are corrupted after replication from index 
 server. 
 I have following configuration with Solr 1.4, please correct me if it is 
 configured incorrectly. 
 
 (The index files are not corrupted in master servers, but it is corrupted in 
 slave servers. Usually only one of the slave servers are corrupted with EOF 
 exception, but not all.)
 
 1 Master Server: (Index Server)
 - 8 indexes with multicore configuration.
 - All indexes are configured to replicateAfter optimize only.
 - The sizes of the index data vary. The smallest index is only 2.5 MB; the 
 biggest is ~100 MB. 
 - There are infrequent optimize calls to the indexes (an optimize call 
 every ~30 mins to 6 hours, depending on the index).
 - There are many commit calls to all indexes. (But there is no concurrent 
 commit and optimize for all indexes.)
 - Did not configure commitReserveDuration in ReplicationHandler - Using 
 default values.
 
 4 Slave Servers (Search Server)
 - 8 indexes with multicore configuration.
 - All indexes are configured to poll every ~15 minutes.
 - All update handler configuration is removed in solrconfig-slave.xml 
 (solrconfig.xml) in order to prevent add/commit/optimize calls. 
 - (The slave search servers are only responsible for search operations.)
 - <updateHandler ...> removed.
 - <requestHandler ... /> removed.
 - <requestHandler ... class="solr.BinaryUpdateRequestHandler" /> removed.
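 
 For reference, a minimal sketch of the replication handler config being described 
 here (the handler name, master URL and poll interval are illustrative placeholders, 
 not copied from the actual setup):
 
 <!-- master solrconfig.xml -->
 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <str name="replicateAfter">optimize</str>
   </lst>
 </requestHandler>
 
 <!-- slave solrconfig-slave.xml -->
 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="slave">
     <str name="masterUrl">http://master-host:8080/solr/core0/replication</str>
     <str name="pollInterval">00:15:00</str>
   </lst>
 </requestHandler>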
 
 A) FileNotFoundException
 
 INFO: Total time taken for download : 1 secs
 Jan 15, 2010 10:34:16 AM org.apache.solr.handler.ReplicationHandler doFetch
 SEVERE: SnapPull failed
 org.apache.solr.common.SolrException: Index fetch failed :
 at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
 at 
 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
 at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
 at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
  

Re: source tree for lucene

2010-02-10 Thread Chris Hostetter

: i want to recompile lucene with
: http://issues.apache.org/jira/browse/LUCENE-2230, but im not sure
: which source tree to use, i tried using the implied trunk revision
: from the admin/system page but solr fails to build with the generated
: jars, even if i exclude the patches from 2230...

Hmmm... I think the problem you are running into is that the Lucene 
Implementation Version information that Solr displays only tells you the 
svn revision number -- but not the branch.

If you note the Solr 1.4 CHANGES.txt it says...

Versions of Major Components

Apache Lucene 2.9.1 (r832363 on 2.9 branch)
Apache Tika 0.4
Carrot2 3.1.0 

...so the key is to check out the 2.9 branch.

(none of which guarantees that any patches you try will actually compile)
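
For example, something along these lines should fetch that branch at the revision 
Solr 1.4 shipped with (the repository path assumes the standard pre-merge ASF 
layout for Lucene Java, so double-check it first):

svn checkout -r 832363 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/ lucene_2_9
cd lucene_2_9
ant jar-core    # then drop the rebuilt lucene-core jar into Solr's lib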




-Hoss



The Riddle of the Underscore and the Dollar Sign . . .

2010-02-10 Thread Christopher Ball
I am perplexed by the behavior I am seeing of the Solr Analyzer and Filters
with regard to Underscores.

I am trying to get rid of underscores('_') when shingling, but seem unable
to do so with a Stopwords Filter.

And yet underscores are being removed by the WordDelimiter Filter when I am not
even trying to remove them.

Conversely, I would like to retain dollar sign symbols ('$') when they are
adjacent to numbers, but seem unable to do so without having to accept all forms
of other syntax. 

1) How can I get rid of underscores('_') without using the wordDelimiter
Filter (which gets rid of other syntax I need)?

2) How can I stop the wordDelimiter Filter from removing dollar sign
symbols ('$')?
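
For question 1, one direction I have been considering (the field type name is made
up, and I have not verified this) is to strip underscores with a
PatternReplaceFilterFactory before shingling, instead of relying on a stopwords or
WordDelimiter filter:

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strip every underscore inside each token -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement="" replace="all"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>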

Most grateful for any guidance,

Christopher




RE: HTTP caching and distributed search

2010-02-10 Thread Chris Hostetter

: I tried your suggestion, Hoss, but committing to the new coordinator
: core doesn't change the indexVersion and therefore the ETag value isn't
: changed.

Hmmm... so the empty commit doesn't change the indexVersion? ... I 
didn't realize that.

Well, I suppose you could replace your empty commit with an update to a 
bogus document ... it's hackish, but it should work...

http://host/solr/coordinator/update?stream.body=<add><doc><field name="bogus">bogus</field></doc></add>&commit=true




-Hoss



Re: Which schema changes are incompatible?

2010-02-10 Thread Chris Hostetter

: 
http://wiki.apache.org/solr/FAQ#How_can_I_rebuild_my_index_from_scratch_if_I_change_my_schema.3F
: 
: but it is not clear about the times when this is needed. So I wonder, do I
: need to do it after adding a field, removing a field, changing field type,
: changing indexed/stored/multiValue properties? What happens if I don't do
: it, will Solr die?

there is no simple answer to that question ... if you add a field you 
don't need to rebuild (unless you want to ensure every doc gets a value 
indexed, or if you are depending on Solr to apply a default value).  If you 
remove a field you don't need to rebuild (but none of the space taken up 
by that field in the index will be reclaimed, and if it's stored it will 
still be included in the response). 

Changing a field type is one of the few situations where we can 
categorically say you *HAVE* to reindex everything.

: Also, the FAQ entry notes that one can delete all documents, change the
: schema.xml file, and then reload the core. Would it be possible to instead
: change schema.xml, reload the core, and then rebuild the index -- in effect
: slowly deleting the old documents, but never ending up with a completely
: empty index? I realize that some weird search results could happen during
: such a rebuild, but that may be preferable to having no search results at

The end result won't be 100% equivalent from an index standpoint -- when 
you delete all, Solr is actually able to completely start over with an 
empty index, absent all low-level metadata about fields that used to exist 
-- if you incrementally delete, some of that low-level metadata will still 
be in the index -- it probably won't be something that will ever affect 
you, but it is a distinction.
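
(For reference, the "delete all documents" step from that FAQ entry is just a 
delete-by-query plus a commit; something along these lines, with host and core 
name as placeholders:

http://host/solr/core/update?stream.body=<delete><query>*:*</query></delete>&commit=true
)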



-Hoss



Re: dismax and multi-language corpus

2010-02-10 Thread Jason Rutherglen
 Claudio - fields with '-' in them can be problematic.

Why's that?

On Wed, Feb 10, 2010 at 2:38 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Claudio - fields with '-' in them can be problematic.
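 
 If renaming the fields is an option, a hyphen-free qf line would look something 
 like this (field names here are purely illustrative):
 
 <str name="qf">title^1.2 content_en^0.8 content_it^0.8 content_de^0.8</str>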

 Side comment: do you really want to search across all languages at once?  If 
 not, maybe 3 different dismax configs would make your searches better.

  Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Hadoop ecosystem search :: http://search-hadoop.com/



 - Original Message 
 From: Claudio Martella claudio.marte...@tis.bz.it
 To: solr-user@lucene.apache.org
 Sent: Wed, February 10, 2010 3:15:40 PM
 Subject: dismax and multi-language corpus

 Hello list,

 I have a corpus with 3 languages, so I set up a text content field (with
 no stemming) and 3 text-[en|it|de] fields with specific snowball stemmers.
 I copyField the text to my language-aware fields. So, I set up this dismax
 searchHandler:



    dismax
    title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
    title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
    title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
    0.1




 but i get this error:

 HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Expected
 ',' at position 7 in 'content-en'

 type Status report

 message org.apache.lucene.queryParser.ParseException: Expected ',' at
 position 7 in 'content-en'

 description The request sent by the client was syntactically incorrect
 (org.apache.lucene.queryParser.ParseException: Expected ',' at position
 7 in 'content-en').

 Any idea?

 TIA

 Claudio

 --
 Claudio Martella
 Digital Technologies
 Unit Research  Development - Analyst

 TIS innovation park
 Via Siemens 19 | Siemensstr. 19
 39100 Bolzano | 39100 Bozen
 Tel. +39 0471 068 123
 Fax  +39 0471 068 129
 claudio.marte...@tis.bz.it http://www.tis.bz.it





Question on Solr Scalability

2010-02-10 Thread abhishes

Suppose I am indexing very large data (5 billion rows in a database)

Now I want to use the Solr Core feature to split the index into manageable
chunks.

However I have two questions


1. Can Cores reside on different physical servers?

2. When a query comes in, will it be answered by the index in one core, or will
the query be sent to all the cores?

My desire is to have a system which from outside appears as a single large
index... but inside it is multiple small indexes running on different
hardware machines.
-- 
View this message in context: 
http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27543068.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Question on Solr Scalability

2010-02-10 Thread Juan Pedro Danculovic
To scale Solr, take a look at this article:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr



Juan Pedro Danculovic
CTO - www.linebee.com


On Thu, Feb 11, 2010 at 4:12 AM, abhishes abhis...@gmail.com wrote:


 Suppose I am indexing very large data (5 billion rows in a database)

 Now I want to use the Solr Core feature to split the index into manageable
 chunks.

 However I have two questions


 1. Can Cores reside on difference physical servers?

 2. when a query comes, will the query be answered by index in 1 core or the
 query will be sent to all the cores?

 My desire is to have a system which from outside appears as a single large
 index... but inside it is multiple small indexes running on different
 hardware machines.
 --
 View this message in context:
 http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27543068.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Question on Solr Scalability

2010-02-10 Thread David Stuart

Hi,

I think your needs would be better met by Distributed Search: http://wiki.apache.org/solr/DistributedSearch 
It allows shards to live on different servers and will search  
across all of those shards when a query comes in. There are a few patches  
which will hopefully be available in the Solr 1.5 release that will  
improve this, including distributed tf-idf across shards.
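
For illustration, a sharded request just lists the participating cores in the 
shards parameter (host names below are placeholders):

http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr,host3:8983/solr&q=title:books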


Regards,

David
On 11 Feb 2010, at 07:12, abhishes abhis...@gmail.com wrote:



Suppose I am indexing very large data (5 billion rows in a database)

Now I want to use the Solr Core feature to split the index into  
manageable

chunks.

However I have two questions


1. Can Cores reside on difference physical servers?

2. when a query comes, will the query be answered by index in 1 core  
or the

query will be sent to all the cores?

My desire is to have a system which from outside appears as a single  
large

index... but inside it is multiple small indexes running on different
hardware machines.
--
View this message in context: 
http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27543068.html
Sent from the Solr - User mailing list archive at Nabble.com.