Using Solr for indexing emails

2008-11-23 Thread Timo Sirainen
Hi,

A while ago I implemented searching emails with Solr for my IMAP server
(www.dovecot.org). Seems to work ok, but now I'm having a bit of trouble
trying to figure out how to implement searching from multiple mailboxes
efficiently. Would be great if someone had suggestions how to do things
better.

The main problem is that before doing the search, I first have to check
if there are any unindexed messages and then add them to Solr. This is
done using a query like:

 - fl=uid
 - rows=1
 - sort=uid desc
 - q=uidv:uidvalidity box:mailbox user:user

So it returns the highest IMAP UID field (which is an always-ascending
integer) for the given mailbox (you can ignore the uidvalidity). I can
then add all messages with higher UIDs to Solr before doing the actual
search.
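
A minimal sketch of how the highest-UID lookup above could be assembled as request parameters. The field names (uid, uidv, box, user) come from the query in this post; the helper name, the sample values and the quoting of the mailbox and user values are assumptions for illustration only.

```python
from urllib.parse import urlencode


def highest_uid_params(user, mailbox, uidvalidity):
    """Build select parameters that return only the highest indexed UID."""
    # Quoting the values is an assumption so that mailbox names containing
    # spaces stay in one clause; quotes inside a name are not escaped here.
    query = 'uidv:%d box:"%s" user:"%s"' % (uidvalidity, mailbox, user)
    # One row, sorted by uid descending, returning only the uid field.
    return {"q": query, "fl": "uid", "rows": 1, "sort": "uid desc"}


params = highest_uid_params("timo", "INBOX", 1227456000)
print(urlencode(params))
```

Messages with a UID above the returned value are the ones still to be added before the real search runs.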

When searching multiple mailboxes the above query would have to be sent
to every mailbox separately. That really doesn't seem like the best
solution, especially when there are a lot of mailboxes. But I don't
think Solr has a way to return highest uid field for each
box:mailbox?

Is that above query even efficient for a single mailbox? I did consider
using separate documents for storing the highest UID for each mailbox,
but that causes annoying desynchronization possibilities. Especially
because currently I can just keep sending documents to Solr without
locking and let it drop duplicates automatically (should be rare). With
per-mailbox highest-uid documents I can't really see a way to do this
without locking or allowing duplicate fields to be added and later some
garbage collection deleting all but the one highest value (annoyingly
complex).

I could of course also keep track of what's indexed on Dovecot's side,
but that could also lead to desynchronization issues and I'd like to
avoid them.

I guess the ideal solution would be if it was somehow possible to create
a SQL-like trigger that updates the per-mailbox highest-uid document
whenever adding a new document with a higher UID value.




Re: Question about Query Phrase Slop (qs) in dismax

2008-11-23 Thread anuvenk

Could somebody please help clear this up? What more can I do with the dismax
handler so that documents in which 'word1', 'word2', 'word3', etc. from the
search phrase are not within 5 words of one another do not come up in the
results?


anuvenk wrote:
 
 From the Solr wiki, it sounded like if qs is set to 5, for example, and
 the search term is 'child custody', only docs with 'child' and 'custody'
 within 5 words of one another would be returned in the results. Is this
 correct? If so, it doesn't seem to be working for me. I see docs with
 'child' and 'custody' more than 5 words apart (excluding stop
 words), which results in a bad user experience as those docs are not very
 relevant. What more could I do to improve the quality of the results?
 

-- 
View this message in context: 
http://www.nabble.com/Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20648109.html
Sent from the Solr - User mailing list archive at Nabble.com.



[VOTE] Community Logo Preferences

2008-11-23 Thread Ryan McKinley

Please submit your preferences for the solr logo.

For full voting details, see:
  http://wiki.apache.org/solr/LogoContest#Voting

The eligible logos are:
  http://people.apache.org/~ryan/solr-logo-options.html

Any and all members of the Solr community are encouraged to reply to  
this thread and list (up to) 5 ranked choices by listing the Jira  
attachment URLs. Votes will be assigned a point value based on rank.  
For each vote, 1st choice has a point value of 5, 5th place has a  
point value of 1, and all others follow a similar pattern.


https://issues.apache.org/jira/secure/attachment/12345/yourfrstchoice.jpg
https://issues.apache.org/jira/secure/attachment/34567/yoursecondchoice.jpg
...

This poll will be open until Wednesday November 26th, 2008 @ 11:59PM GMT

When the poll is complete, the solr committers will tally the  
community preferences and take a final vote on the logo.


A big thanks to everyone who submitted possible logos -- it's great
to see so many good options.

Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Mark Lindeman

https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394265/apache_solr_b_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg

By the way, two logos are missing:

https://issues.apache.org/jira/secure/attachment/12394270/apache_solr_d_blue.jpg
and
https://issues.apache.org/jira/secure/attachment/12394271/apache_solr_d_red.jpg





Compiling Solr 1.3.0 + KStem

2008-11-23 Thread Chris Haggstrom
I was hoping to try using KStem with Solr 1.3.0, but am having trouble  
getting it to compile.


With a fresh Solr 1.3.0 that will build successfully, I unzipped the  
KStemSolr.zip within the apache-solr-1.3.0 directory, but when I then  
try to build (using Ant 1.7.1 and Sun HotSpot JDK 1.6.0 update 10), I  
get:


[EMAIL PROTECTED]:/usr/local/build/apache-solr-1.3.0$ ant compile
Buildfile: build.xml

init-forrest-entities:
[mkdir] Created dir: /usr/local/build/apache-solr-1.3.0/build
[mkdir] Created dir: /usr/local/build/apache-solr-1.3.0/build/web

compile-common:
[mkdir] Created dir: /usr/local/build/apache-solr-1.3.0/build/common
[javac] Compiling 36 source files to /usr/local/build/apache-solr-1.3.0/build/common
[javac] Note: /usr/local/build/apache-solr-1.3.0/src/java/org/apache/solr/common/util/FastInputStream.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

compile:
[mkdir] Created dir: /usr/local/build/apache-solr-1.3.0/build/core
[javac] Compiling 350 source files to /usr/local/build/apache-solr-1.3.0/build/core
[javac] /usr/local/build/apache-solr-1.3.0/src/java/org/apache/solr/analysis/KStemFilterFactory.java:63: cannot find symbol
[javac] symbol  : method init(org.apache.solr.core.SolrConfig,java.util.Map<java.lang.String,java.lang.String>)
[javac] location: class org.apache.solr.analysis.BaseTokenFilterFactory
[javac] super.init(solrConfig, args);
[javac]  ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error

BUILD FAILED
/usr/local/build/apache-solr-1.3.0/build.xml:125: The following error occurred while executing this line:
/usr/local/build/apache-solr-1.3.0/common-build.xml:149: Compile failed; see the compiler error output for details.


I've also tried to build the KStem filter factory using the KStem.jar  
via the instructions on the Wiki, but I am not sure I'm doing the  
right things in steps 3 and 5:


3.  Modify the package name on the source files to match your install

Does that mean changing "package org.apache.lucene.analysis;" to
"package org.apache.solr.analysis;"?


5.  Build the jar file and drop that into your Solr /lib directory.

Nothing I've tried here gives me any .class files, just more "cannot
find symbol" errors.


Any suggestions would be much appreciated.  I am definitely a novice  
in building Java apps, so I could be missing something very simple  
here.  Thanks,


-Chris


Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Tricia Williams

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394366/solr3_maho.png
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394218/solr-solid.png


Re: QueryElevationComponent

2008-11-23 Thread Paolo Ruscitti
Thanks Ryan for your answer.

 The only thing that may be weird is that if your id field is named myid,
 your elevate.xml file still refers to id as the unique key.  Is that what
 you are referring to?

Yes, my id field is named myid, but elevate.xml expects its name to be id.

Please find more info below.

I'm using the latest revision (720030).

I also tried both

<elevate>
 <query text="cars">
  <doc myid="77b81d932353a5d16880043bdb4fe22b"/>
 </query>
</elevate>


and

<elevate>
 <query text="cars">
  <doc id="myid:77b81d932353a5d16880043bdb4fe22b"/>
 </query>
</elevate>


In the former case I get a Tomcat error:

HTTP Status 500 - Severe errors in solr configuration. Check your log files
for more detailed information on what may be wrong. If you want solr to
continue after configuration errors, change:
<abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
-
org.apache.solr.common.SolrException: Error initializing QueryElevationComponent.
 at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:200)
 at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:319)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:563)
 at ...

In the latter case Solr works but the query elevation does not.

The query I'm using is:
http://localhost:8080/solr/post1/select/?q=cars&version=2.2&start=0&rows=10&indent=on&enableElevation=true

thanks
Paolo

On Sun, Nov 23, 2008 at 12:29 AM, Ryan McKinley [EMAIL PROTECTED] wrote:

 hmm -- that *should* not be the case.  The id field in
 QueryElevationComponent uses the globally defined field:

SchemaField sf = core.getSchema().getUniqueKeyField();
...
idField = sf.getName().intern();

 The only thing that may be weird is that if your id field is named myid,
 your elevate.xml file still refers to id as the unique key.  Is that what
 you are referring to?

 I have not tested this, so it may very well be broken.

 ryan




 On Nov 22, 2008, at 5:31 PM, Paolo Ruscitti wrote:

  I have a question about QueryElevationComponent.

 I'm trying to use it, but it seems to work properly if, and only if, the
 id field name in the uniqueKey definition is 'id'.

 So if I have <uniqueKey>myid</uniqueKey>, it does not work.


 Could you please tell me what I'm doing wrong?
 Thanks a lot

 Paolo

 - this is my elevate.xml

 <elevate>
  <query text="cars">
   <doc id="77b81d932353a5d16880043bdb4fe22b"/>
  </query>
 </elevate>

 - I added at the tail of solrconfig.xml file
 ...

 <!-- a search component that enables you to configure the top results for
  a given query regardless of the normal lucene scoring.-->
  <searchComponent name="elevator" class="solr.QueryElevationComponent">
   <!-- pick a fieldType to analyze queries -->
   <str name="queryFieldType">string</str>
   <str name="config-file">elevate.xml</str>
  </searchComponent>

  <!-- a request handler utilizing the elevator component -->
  <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
   <lst name="defaults">
    <str name="echoParams">explicit</str>
   </lst>
   <arr name="last-components">
    <str>elevator</str>
   </arr>
  </requestHandler>

 </config>

 - in my schema I have

 <field name="md" type="string" indexed="true" stored="true" required="true"/>
 ...
 <uniqueKey>myid</uniqueKey>





Re: QueryElevationComponent

2008-11-23 Thread Erik Hatcher


On Nov 23, 2008, at 3:06 PM, Paolo Ruscitti wrote:

 Thanks Ryan for your answer.

  The only thing that may be weird is that if your id field is named myid,
  your elevate.xml file still refers to id as the unique key.  Is that what
  you are referring to?

 Yes, my id field is named myid, but elevate.xml expects its name to be id.

 Please find more info below.

 I'm using the latest revision (720030).

 I also tried both

 <elevate>
  <query text="cars">
   <doc myid="77b81d932353a5d16880043bdb4fe22b"/>
  </query>
 </elevate>


As Ryan said, that is incorrect - it must be id=... regardless of
what your uniqueKey field is.


 <elevate>
  <query text="cars">
   <doc id="myid:77b81d932353a5d16880043bdb4fe22b"/>
  </query>
 </elevate>


Remove "myid:" from that value and you should be in good shape.

Granted it is confusing.  But what's the alternative?  Maybe calling  
every attribute that needs to refer to a uniqueKey literally  
uniqueKey?   I don't think we want to have attributes changing their  
name based on the uniqueKey field name.


Erik



Re: Pagination with Solr

2008-11-23 Thread lupiss

 OK! Thanks ryguasu for your answer. Now that I think of it, there are indeed
setStart and setRows methods; I'll try those and hope to be able to finish my
project. Thanks a thousand =)
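
As a rough illustration of the paging arithmetic behind setStart and setRows (which correspond to the start and rows request parameters), here is a minimal sketch. The helper name and the 1-based page numbering are assumptions, not something stated in the thread.

```python
def page_params(page, page_size=10):
    """Return start/rows values for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # start is the zero-based offset of the first result on the page.
    return {"start": (page - 1) * page_size, "rows": page_size}


print(page_params(1))
print(page_params(3, page_size=20))
```

The two values would then be passed to setStart and setRows, or sent as start and rows in the request.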



Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Mark Miller

https://issues.apache.org/jira/secure/attachment/12394218/solr-solid.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12393951/sslogo-solr-classic.png
https://issues.apache.org/jira/secure/attachment/12391946/apache_solr_burning.png
https://issues.apache.org/jira/secure/attachment/12392306/apache_solr_sun.png

- Mark


Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Chris Haggstrom

https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394268/apache_solr_c_red.jpg
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394366/solr3_maho.png
https://issues.apache.org/jira/secure/attachment/12393936/logo_remake.jpg

Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Jon Baer

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg


Re: filtering on blank OR specific range

2008-11-23 Thread Chris Hostetter

: I'm having difficultly filtering my documents when a field is either
: blank or set to a specific value.  I would have thought this would work
: 
:   fq=-Type:[* TO *] OR Type:blue

Rule #1: don't try to mix AND/OR syntax with +/- syntax ... it never works
the way you want.

"a OR b" is just syntactic sugar for "a b" ... "-a OR b" is equivalent to
"-a b" ... if you use debugQuery=true and look at the
parsed_filter_queries you'll see that your fq is being parsed as...

   -Type:[* TO *]  Type:blue

...looking at it that way, does it make sense why it doesn't match any
documents?  There is only one positive clause, which is that Type ==
blue.  But then you are excluding any docs where Type has a value, so you
get the empty set.


you could have a special Type_empty boolean field and use...

fq = Type_empty:true Type:blue

...or you can play tricks with the syntax, and do something like this...

fq = (*:* -Type:[* TO *]) Type:blue
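
A small sketch of building the second form of the filter and URL-encoding it for a request. The helper name is hypothetical, and the sketch assumes the default query operator is OR so that the two top-level clauses are alternatives.

```python
from urllib.parse import urlencode


def blank_or_value_filter(field, value):
    """Match documents where the field is empty OR equals the given value."""
    # (*:* -field:[* TO *]) starts from all docs and removes those with any
    # value in the field; the second clause adds docs with the wanted value.
    return "(*:* -%s:[* TO *]) %s:%s" % (field, field, value)


fq = blank_or_value_filter("Type", "blue")
print(fq)
print(urlencode({"q": "*:*", "fq": fq, "debugQuery": "true"}))
```

Sending debugQuery=true with the request makes it possible to check parsed_filter_queries as described above.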


-Hoss



Re: [VOTE] Community Logo Preferences

2008-11-23 Thread phil cryer
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394475/solr2_maho-vote.png
https://issues.apache.org/jira/secure/attachment/12394268/apache_solr_c_red.jpg



-- 
http://fak3r.com dim high beams for oncoming traffic
http://lefttochance.com know your rights, don't lose them


RE: Updating schema.xml without deleting index?

2008-11-23 Thread Chris Hostetter

: of myfield as the same result.  I wish there was an option to just
: completely reindex all data..i suppose optimize may do that a little
: bit?

optimize is just a low level lucene call to purge all deleted docs and
merge all index segments into a single segment.

and there is an option to reindex all data: take whatever you used to
index the data the first time, and do it again. :)

seriously though, if you use something like DataImportHandler this is
fairly easy; if you don't use something like DIH, it's a matter of
designing whatever system you do use so that it's easy to reindex later as
needed (unless you're certain that your schema is perfect and never needs
to change)

The way you solved your use case (exclude things that don't have a value)
is exactly how i go about dealing with situations like this routinely.



-Hoss



Re: Newbie Question - getting search results from dataimport request handler

2008-11-23 Thread Chris Hostetter

:  Logging an error and returning successfully (without adding any docs) is
:  still inconsistent with the way all other RequestHandlers work: fail the
:  request.
: 
:  I know DIH isn't a typical RequestHandler, but some things (like failing
:  on failure) seem like they should be a given.
: SOLR-842 .
: DIH is an ETL tool pretending to be a RequestHandler. Originally it
: was built to run outside of Solr using SolrJ. For better integration
: and ease of use we changed it later.
: 
: SOLR-853 aims to achieve the original goal
: 
: The goal of DIH is to become a full featured ETL tool.

Understood ... but shouldn't ETL tools fail on failure?

I mean forget Solr for a minute: if i've got a standalone ETL tool that
runs as a daemon, and on startup it logs some error messages because i've
got bad configs (and it can tell the fields i've listed for my
'target' system don't exist there), should it report success every time i
push data to it?

Based on this thread, that's what it sounds like DIH is doing right now in
situations like this.

If nothing else, we could give DIH a way to check the global
abortOnConfigurationError value from solrconfig.xml and make its
decision that way.



-Hoss



Re: not string or text fields and shards

2008-11-23 Thread Yonik Seeley
On Thu, Nov 20, 2008 at 7:41 AM, Marc Sturlese [EMAIL PROTECTED] wrote:
 I have started working with an index divided in 3 shards. When I did a
 distributed search I got an error with the fields that were not string or
 text. I read that the error was due to BinaryResponseWriter and not
 string/text empty fields.

I think it's more the case that if you have an invalid field value, it
could blow up at different points in different code paths.  The root
cause is still an invalid value in the field.

-Yonik


Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Norberto Meijome
On Sun, 23 Nov 2008 11:59:50 -0500
Ryan McKinley [EMAIL PROTECTED] wrote:

 Please submit your preferences for the solr logo.

https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394070/sslogo-solr-finder2.0.png
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg

thanks!!
B

_
{Beto|Norberto|Numard} Meijome

Tell a person you're the Metatron and they stare at you blankly. Mention 
something out of a Charleton Heston movie and suddenly everyone's a Theology 
scholar!
   Dogma

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: How can i protect the SOLR Cores?

2008-11-23 Thread Chris Hostetter

: 1) modify web.xml (part of the sources of solr.war, which you'll have to 
: rebuild)  to define the authentication constraints you want.

for many servlet containers, this isn't necessary.  Jetty, for example,
also lets you define security realms in the jetty.xml (there's an example
of this commented out in the example jetty.xml)



-Hoss



Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Nick Jenkin
https://issues.apache.org/jira/secure/attachment/12394366/solr3_maho.png
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12392306/apache_solr_sun.png
https://issues.apache.org/jira/secure/attachment/12394267/apache_solr_c_blue.jpg

Good work to all the people who contributed.
-Nick



RE: [VOTE] Community Logo Preferences

2008-11-23 Thread Vinu Kumar
https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394353/solr.s5.jpg
https://issues.apache.org/jira/secure/attachment/12394265/apache_solr_b_blue.jpg
https://issues.apache.org/jira/secure/attachment/12394167/solrlogo.jpg
https://issues.apache.org/jira/secure/attachment/12394376/solr_sp.png

- Vinu




Re: Wait Flush, Wait Searcher and commit Scenarios

2008-11-23 Thread Yonik Seeley
On Tue, Nov 18, 2008 at 10:55 PM, Mark Miller [EMAIL PROTECTED] wrote:
 Does waitFlush do anything now? I only see it being set if eclipse is not
 missing a reference...

Not currently.  The idea was that if waitFlush==false, the call
would be totally asynchronous and return immediately.  If
waitFlush==true, then the call would return only after everything was
flushed to stable storage (which is always the case now).
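
For context, a minimal sketch of composing a commit update message that carries the waitFlush and waitSearcher attributes. The helper name is hypothetical, and per the explanation above the waitFlush value currently makes no difference to behaviour.

```python
import xml.etree.ElementTree as ET


def commit_message(wait_flush=True, wait_searcher=True):
    """Build a <commit/> update message with explicit wait attributes."""
    commit = ET.Element("commit")
    commit.set("waitFlush", "true" if wait_flush else "false")
    commit.set("waitSearcher", "true" if wait_searcher else "false")
    # Serialize to a text string suitable for posting to the update handler.
    return ET.tostring(commit, encoding="unicode")


msg = commit_message(wait_flush=False)
print(msg)
```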

-Yonik

p.s. late replies since I'm getting back from a week of travel.


Re: Using Solr for indexing emails

2008-11-23 Thread Norberto Meijome
On Sun, 23 Nov 2008 16:02:16 +0200
Timo Sirainen [EMAIL PROTECTED] wrote:

 Hi,

Hi Timo,

 
[...]

 The main problem is that before doing the search, I first have to check
 if there are any unindexed messages and then add them to Solr. This is
 done using a query like:
  - fl=uid
  - rows=1
  - sort=uid desc
  - q=uidv:uidvalidity box:mailbox user:user

So, if I understand correctly, the process is :

1. user sends search query Q to search interface
2. interface checks highest indexed uidv in SOLR
3. checks in IMAP store for mailbox if there are any objects ('emails') newer
than uidv from 2.
4. anything found in 3. is processed, submitted to SOLR, committed.
5. interface submits search query Q to index, gets results
6. results are presented / returned to user
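
The serialized flow in steps 2 to 5 above can be sketched roughly as follows. All four callables are hypothetical stand-ins for the real Solr and IMAP operations; the sketch only shows that every search waits for the synchronization to finish first.

```python
def search_with_sync(query, highest_indexed_uid, fetch_newer,
                     index_and_commit, search):
    """Bring the index up to date, then run the user's query."""
    last_uid = highest_indexed_uid()        # step 2: ask the index
    new_messages = fetch_newer(last_uid)    # step 3: ask the mail store
    if new_messages:
        index_and_commit(new_messages)      # step 4: submit and commit
    return search(query)                    # step 5: the actual search


# Toy in-memory stand-ins: uid -> message body.
store = {1: "hello", 2: "solr question", 3: "solr answer"}
index = {1: "hello"}

results = search_with_sync(
    "solr",
    highest_indexed_uid=lambda: max(index, default=0),
    fetch_newer=lambda uid: {u: b for u, b in store.items() if u > uid},
    index_and_commit=index.update,
    search=lambda q: sorted(u for u, b in index.items() if q in b),
)
print(results)  # prints [2, 3]
```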

It strikes me that this may work OK in some situations but may not scale. I
would decouple the {find new documents / submit / commit} process from the
{search / presentation} layer - ESPECIALLY if you plan to have several
mailboxes in play now.

 So it returns the highest IMAP UID field (which is an always-ascending
 integer) for the given mailbox (you can ignore the uidvalidity). I can
 then add all messages with higher UIDs to Solr before doing the actual
 search.
 
 When searching multiple mailboxes the above query would have to be sent
 to every mailbox separately. 

hmm... not sure what you mean by "query would have to be sent to every
MAILBOX"...

 That really doesn't seem like the best
 solution, especially when there are a lot of mailboxes. But I don't
 think Solr has a way to return highest uid field for each
 box:mailbox?

hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
query for each box, i think...

 Is that above query even efficient for a single mailbox? 

i don't think so.

 I did consider
 using separate documents for storing the highest UID for each mailbox,
 but that causes annoying desynchronization possibilities. Especially
 because currently I can just keep sending documents to Solr without
 locking and let it drop duplicates automatically (should be rare). With
 per-mailbox highest-uid documents I can't really see a way to do this
 without locking or allowing duplicate fields to be added and later some
 garbage collection deleting all but the one highest value (annoyingly
 complex).

I have a feeling the issues arise from serialising the whole process (as I
described above). It makes more sense (to me) to implement something
similar to DIH, where you load data as needed (even a 'delta query', which
would only return new data). I am not sure whether you could use DIH itself
(an RSS feed from the IMAP store?).

 I could of course also keep track of what's indexed on Dovecot's side,
 but that could also lead to desynchronization issues and I'd like to
 avoid them.
 
 I guess the ideal solution would be if it was somehow possible to create
 a SQL-like trigger that updates the per-mailbox highest-uid document
 whenever adding a new document with a higher UID value.

I am not sure how much effort you want to put into this... but I would think
that writing a lean app that periodically (for a period that makes sense for
your hardware and users' expectations... 5 minutes? 10? 1?) crawls the IMAP
stores for UIDs, processes new messages, submits them to SOLR, and keeps its
own state (dbm or sqlite) may be a more flexible approach. Or, if Dovecot
supports this, a 'plugin / hook' that sends a message to your indexing app
every time a new document is created.
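
A rough sketch of the "keeps its own state" part using sqlite. The table and column names are made up for illustration, and the sketch leaves out the crawling and the submission to Solr entirely.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the indexer's state database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mailbox_state ("
        " username TEXT, mailbox TEXT, last_uid INTEGER,"
        " PRIMARY KEY (username, mailbox))"
    )
    return db


def last_uid(db, username, mailbox):
    """Highest UID recorded as indexed, or 0 if the mailbox is unknown."""
    row = db.execute(
        "SELECT last_uid FROM mailbox_state WHERE username=? AND mailbox=?",
        (username, mailbox),
    ).fetchone()
    return row[0] if row else 0


def record_uid(db, username, mailbox, uid):
    """Remember a newly indexed UID; the stored value never moves backwards."""
    if uid > last_uid(db, username, mailbox):
        db.execute(
            "INSERT OR REPLACE INTO mailbox_state VALUES (?, ?, ?)",
            (username, mailbox, uid),
        )
        db.commit()


db = open_state()
record_uid(db, "timo", "INBOX", 42)
record_uid(db, "timo", "INBOX", 7)   # ignored, lower than 42
print(last_uid(db, "timo", "INBOX"))
```

Keeping state outside the index reintroduces the desynchronization risk mentioned in the original post, so a periodic cross-check against the index may still be worthwhile.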

I am interested to hear what you decide to go with, and why.

cheers,
B

_
{Beto|Norberto|Numard} Meijome

All parts should go together without forcing. You must remember that the parts
you are reassembling were disassembled by you. Therefore, if you can't get them
together again, there must be a reason. By all means, do not use hammer. IBM
maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: Newbie Question - getting search results from dataimport request handler

2008-11-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Nov 24, 2008 at 7:25 AM, Chris Hostetter
[EMAIL PROTECTED] wrote:

 :  Logging an error and returning successfully (without adding any docs) is
 :  still inconsistent with the way all other RequestHandlers work: fail the
 :  request.
 : 
 :  I know DIH isn't a typical RequestHandler, but some things (like failing
 :  on failure) seem like they should be a given.
 : SOLR-842 .
 : DIH is an ETL tool pretending to be a RequestHandler. Originally it
 : was built to run outside of Solr using SolrJ. For better integration
 : and ease of use we changed it later.
 :
 : SOLR-853 aims to achieve the original goal
 :
 : The goal of DIH is to become a full featured ETL tool.

 Understood ... but shouldn't ETL tools fail on failure?

 I mean forget Solr for a minute: if i've got a standalone ETL tool that
 runs as a daemon, and on startup it logs some error messages because i've
 got bad configs (and it can tell the fields i've listed for my
 'target' system don't exist there), should it report success every time i
 push data to it?

 Based on this thread, that's what it sounds like DIH is doing right now in
 situations like this.

 If nothing else, we could give DIH a way to check the global
 abortOnConfigurationError value from solrconfig.xml and make its
 decision that way.

We considered these. The severity of errors is very much specific to
the source of data. It is very unlikely that a DB source throws
errors. With XML data sources, say 1 or 2 out of x URLs are wrong:
would the user wish to ignore them, or abort the entire import?


So we decided to give more options, and the implementations are left to
the EntityProcessor. Moreover, the default is set to onError=abort.





 -Hoss





-- 
--Noble Paul


Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

2008-11-23 Thread anuvenk

Thanks for the response. My current ps setting works great for most
search terms. But take this typical example, "north dakota 1031 exchange
lawyers": we don't have any relevant docs in the index, yet Solr returns
an irrelevant doc just because it found "lawyer", "exchange" and
"north dakota" somewhere. I thought it would be great if there were a way
to not return any results when the terms are not within close proximity.

Yonik Seeley wrote:
 
 On Sun, Nov 23, 2008 at 11:51 PM, anuvenk [EMAIL PROTECTED] wrote:
 Please help someone... I've been waiting for an answer for the last couple
 of days and no one seems to be helping out here. I did search the wiki and
 this forum for an answer, but couldn't find one. I know that if ps is set
 to 5, words within 5 words of one another receive a boost in score. But is
 there a way to not return results that have the search terms more than 5
 words apart?
 
 Not with dismax.  I'm not sure why it's a problem, given that with
 enough boost you should be able to ensure that all of the results with
 a slop less than 5 appear before other results.
 Anyway, if you want to restrict results to those with a slop of 5, use
 the standard query parser with an explicit sloppy phrase query:
 
 "north dakota 1031 exchange lawyers"~5
 
 -Yonik
 
 
 Typical example: north dakota 1031 exchange lawyers
 My first result is absolutely irrelevant. It returned a North Dakota doc,
 but one that only had an occurrence of 'attorney' somewhere and an
 occurrence of 'exchange' (not related to 1031 exchange). They were not
 within 5 words of one another. My guys have been hammering me regarding
 this relevancy issue. Please help someone.

 anuvenk wrote:

 From the Solr wiki, it sounded like if qs is set to 5, for example, and
 the search term is 'child custody', only docs with 'child' and 'custody'
 within 5 words of one another would be returned in results. Is this
 correct? If so, it doesn't seem to be working for me. I see docs with
 'child' and 'custody' more than 5 words apart (excluding stop words),
 which results in a bad user experience as those docs are not so
 relevant. What more could I do to improve the quality of the results?


 --
 View this message in context:
 http://www.nabble.com/Please-Help-%21%21-Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20654906.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/Please-Help-%21%21-Question-about-Query-Phrase-Slop-%28qs%29-in-dismax-tp20643003p20655014.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Please Help !! Question about Query Phrase Slop (qs) in dismax

2008-11-23 Thread Yonik Seeley
If you boost the phrase queries by enough, you could tell when you hit
the less relevant documents by the score.

-Yonik
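
A minimal sketch of the two suggestions in this thread, in Python: building
the explicit sloppy phrase query for the standard query parser, and cutting
off results below a score threshold on the client side. The base URL, the
function names and the threshold value are assumptions for illustration;
only the q, fl and rows request parameters and the "..."~N phrase syntax
come from Solr/Lucene itself.

```python
from urllib.parse import urlencode


def sloppy_phrase(terms, slop):
    """Build a Lucene-syntax sloppy phrase query: "t1 t2 ..."~slop."""
    # Escape embedded double quotes so the phrase stays a single phrase.
    phrase = " ".join(terms).replace('"', '\\"')
    return '"%s"~%d' % (phrase, slop)


def select_url(base_url, query, rows=10):
    """Build a select URL that also asks Solr to return the score."""
    # base_url is an assumption, e.g. http://localhost:8983/solr/select
    params = {"q": query, "fl": "*,score", "rows": rows}
    return base_url + "?" + urlencode(params)


def above_score(docs, min_score):
    """Keep only docs whose score reaches min_score (hypothetical cutoff)."""
    return [d for d in docs if d.get("score", 0.0) >= min_score]


if __name__ == "__main__":
    q = sloppy_phrase(["north", "dakota", "1031", "exchange", "lawyers"], 5)
    print(select_url("http://localhost:8983/solr/select", q))
```

A suitable min_score depends on the boosts in use, so it would have to be
found by looking at the scores of known good and known bad results.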





Query for Distributed search -

2008-11-23 Thread souravm
Hi,

Looking for some insight on distributed search.

Say I have an index distributed across 3 boxes, and the index contains time
and text data (a typical log file). Each box holds the index for a different
time period: say Box 1 for Jan to April, Box 2 for May to August, and Box 3
for Sep to Dec.

Now if I search for a text string, will the search happen in parallel on
all 3 boxes, or sequentially?

Regards,
Sourav



RE: facet sort by ranking

2008-11-23 Thread Amit
Hi,

We have 100 categories, and each category has its own internal ranking.
Suppose I search for a product and it falls under 30 categories; we
show the top 10 categories in the filter so that users can filter their
results.

Consider a hypothetical example (we don't have correct data yet and are
still testing Solr features):
Categories values and internal ranking:

Category  Ranking
Cat1      1
Cat2      2
Cat3      3
Cat4      4
Cat5      5
Cat6      6
Cat7      7
Cat8      8
Cat9      9
Cat10     10
Cat11     11
Cat12     12
Cat13     13
Cat14     14
Cat15     15

If I search for a product, it will return this result:

Category  Count (sorted by count)
Cat2      20
Cat3      17
Cat4      15
Cat1      14
Cat7      13
Cat8      12
Cat9      10
Cat15     9
Cat13     8
Cat10     7
Cat11     6
Cat12     5

Now we want to show only the top 10 values, so we will miss Cat11 and Cat12
because the facet is sorted by count, not by ranking.

We would like the result below:

Cat15
Cat13
Cat12
Cat11
Cat10
Cat9
Cat8
Cat7
Cat4
Cat3
Cat2
Cat1

Hope this will convey what we want
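
One way to get this ordering, sketched in Python under the assumption that
all facet values are fetched from Solr (for example with facet.limit=-1)
and then re-sorted on the client. This is not a built-in Solr sort. The
function name and the two dictionaries are hypothetical; the data is taken
from the example above.

```python
def top_facets_by_rank(facet_counts, ranking, limit=10):
    """Return facet values that matched, ordered by internal ranking.

    facet_counts: {category: count} as returned for the search.
    ranking:      {category: internal ranking}, higher sorts first.
    """
    # Keep only categories that actually occur in the result set.
    present = [name for name, count in facet_counts.items() if count > 0]
    # Sort by ranking, highest first; unknown categories sort last.
    present.sort(key=lambda name: ranking.get(name, 0), reverse=True)
    return present[:limit]


# Internal ranking from the example: CatN has ranking N.
ranking = {"Cat%d" % n: n for n in range(1, 16)}

# Facet counts from the example search.
facet_counts = {
    "Cat2": 20, "Cat3": 17, "Cat4": 15, "Cat1": 14, "Cat7": 13, "Cat8": 12,
    "Cat9": 10, "Cat15": 9, "Cat13": 8, "Cat10": 7, "Cat11": 6, "Cat12": 5,
}

if __name__ == "__main__":
    for name in top_facets_by_rank(facet_counts, ranking, limit=10):
        print(name, facet_counts[name])
```

With limit=10 the list stops at Cat3; passing limit=12 gives the full list
shown above.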

Have a great day :)

Thanks and Regards,
Amit


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: 22 November 2008 22:51
To: solr-user@lucene.apache.org
Subject: Re: facet sort by ranking

On Sat, Nov 22, 2008 at 12:05 PM, Amit [EMAIL PROTECTED] wrote:
 Actually, we have a ranking associated with the field on which we are
 faceting. We want to show only the top 10 facet values, which are currently
 sorted by count, but we want to sort them by their ranking.

I think you're going to have to give some concrete examples of what
your documents look like, and what results you want back.

-Yonik
