Filtering results based on score

2010-11-01 Thread sivaprasad

Hi,
As part of the Solr results I am able to get the max score. I want to filter
the results based on the max score: say the max score is 10, and I need
only the results scoring between the max score and 50% of the max score.
This max score changes dynamically. How can we implement this? Do we need
to customize Solr? Any suggestions, please.


Regards,
JS


Solr Relevancy Calculation

2010-11-01 Thread sivaprasad

Hi,
I have 25 indexed fields in my document. But by default, if I give
q=laptops, the search runs over five fields, and I get the score as part
of the search results. How does Solr calculate the score? Does it
calculate it only over the five fields, or over all 25 indexed fields?
In what order does it calculate the score? Any documents on this topic
would be helpful.

Regards,
JS


Boosting the score based on a certain field

2010-11-01 Thread sivaprasad

Hi,

In my document I have a field called category. It contains
electronics, games, etc. For some of the category values I need to boost
the document score. Let us say that for the electronics category I will
set a boosting parameter greater than for the games category. Does
anybody have an idea how to achieve this functionality?

Regards,
Siva




Re: Filtering results based on score

2010-11-01 Thread Ahmet Arslan
 As part of the Solr results I am able to get the max score. I want to
 filter the results based on the max score: say the max score is 10, and
 I need only the results scoring between the max score and 50% of the
 max score. This max score changes dynamically. How can we implement
 this? Do we need to customize Solr? Any suggestions, please.

frange is advised in a similar discussion:
http://search-lucene.com/m/4AHNF17wIJW1/
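
A sketch of that approach (the threshold value is hypothetical; since fq
cannot reference the max score directly, it has to be computed
client-side, e.g. from the maxScore of a first request):

  q=ipod
  fq={!frange l=5.0}query($q)

Here l=5.0 stands for 50% of a previously observed max score of 10;
query($q) evaluates the main query as a function, so the filter keeps
only documents whose score is at least l.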





Multiple Keyword Search

2010-11-01 Thread Pawan Darira
Hi

There is a situation where I search for more than one keyword & my main
two fields are ad_title & ad_description. I want the results that match
all of the keywords in both fields to come on top; then, one by one,
keywords can be dropped in further results.

E.g. in a search with 3 keywords, suppose there are 100 results. If 35
contain all the keywords combined in ad_title & ad_description, they
should come first. If 50 results contain a combination of any 2
keywords, they should come next. Finally, results with a single keyword
should come last.

Please suggest

-- 
Thanks,
Pawan Darira


Re: Re: problem of solr replication's speed

2010-11-01 Thread kafka0102
I hacked SnapPuller to log the cost, and the log looks like this:
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 980
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
[2010-11-01 
17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979


It's saying that transferring 1 MB costs about 1000 ms every few reads,
while the other reads take only a few milliseconds. I used Jetty as the
server and embedded Solr in my app. I'm confused. What have I done
wrong?


At 2010-11-01 10:12:38,Lance Norskog goks...@gmail.com wrote:

If you are copying from an indexer while you are indexing new content,
this would cause contention for the disk head. Does indexing slow down
during this period?

Lance

2010/10/31 Peter Karich peat...@yahoo.de:
  we have an identical-sized index and it takes ~5minutes


 It takes about one hour to replicate a 6 GB index for Solr in my
 environment. But my network can transfer files at about 10-20 MB/s
 using scp. So Solr's HTTP replication is too slow; is this normal, or
 am I doing something wrong?






-- 
Lance Norskog
goks...@gmail.com


Re: Re: Re: problem of solr replication's speed

2010-11-01 Thread kafka0102
I suspected my app had some sleeping op every 1 s, so I changed
ReplicationHandler.PACKET_SZ to 1024 * 1024 * 10; // 10 MB

and the log now looks like this:
[2010-11-01 
17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3184
[2010-11-01 
17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3426
[2010-11-01 
17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3359
[2010-11-01 
17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3166
[2010-11-01 
17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3513
[2010-11-01 
17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3140
[2010-11-01 
17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
3471

That means it's still as slow as before: about 10 MB every 3.3 s, i.e.
roughly 3 MB/s. What's wrong with my environment?

At 2010-11-01 17:30:32,kafka0102 kafka0...@163.com wrote:
I hacked SnapPuller to log the cost, and the log looks like this:
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 980
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
[2010-11-01 
17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
[2010-11-01 
17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 979


It's saying that transferring 1 MB costs about 1000 ms every few reads,
while the other reads take only a few milliseconds. I used Jetty as the
server and embedded Solr in my app. I'm confused. What have I done
wrong?


At 2010-11-01 10:12:38,Lance Norskog goks...@gmail.com wrote:

If you are copying from an indexer while you are indexing new content,
this would cause contention for the disk head. Does indexing slow down
during this period?

Lance

2010/10/31 Peter Karich peat...@yahoo.de:
  we have an identical-sized index and it takes ~5minutes


 It takes about one hour to replicate a 6 GB index for Solr in my
 environment. But my network can transfer files at about 10-20 MB/s
 using scp. So Solr's HTTP replication is too slow; is this normal, or
 am I doing something wrong?






-- 
Lance Norskog
goks...@gmail.com




Re: Design and Usage Questions

2010-11-01 Thread torin farmer
Hm, I do not have a webserver set up, for security reasons. I use SVNKit
to connect to SVN via the file:// protocol, and what I get then is the
ByteArrayOutputStream. What would the buffer solution or the dual-thread
writer/reader pair look like?
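
A sketch of both options Lance describes below, using only plain JDK I/O
(the class and method names are made up; nothing here is a SolrJ API):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class StreamBridge {

    // Buffer solution: a ByteArrayOutputStream already holds the whole
    // document in memory, so just expose its bytes as an InputStream
    // the consumer can pull from. No second thread is needed.
    public static InputStream fromBuffer(ByteArrayOutputStream buffered) {
        return new ByteArrayInputStream(buffered.toByteArray());
    }

    // Dual-thread writer/reader pair: a writer thread pushes the data
    // into a pipe while the caller pulls from the returned InputStream.
    // Useful when the source cannot be buffered in memory.
    public static InputStream fromWriterThread(final ByteArrayOutputStream source)
            throws IOException {
        final PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        new Thread(new Runnable() {
            public void run() {
                try {
                    source.writeTo(out);                  // push side
                } catch (IOException ignored) {
                } finally {
                    try { out.close(); } catch (IOException e) {}
                }
            }
        }).start();
        return in;                                        // pull side
    }
}

Since the SVN export is already a ByteArrayOutputStream, the buffer
variant is the simpler fit here.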

-----Original Message-----
From: Lance Norskog goks...@gmail.com
Sent: Nov 1, 2010 3:23:55 AM
To: solr-user@lucene.apache.org
Subject: Re: Design and Usage Questions



2.
The SolrJ library handling of content streams is pull, not push.
That is, you give it a reader and it pulls content when it feels like
it. If your software to feed the connection wants to write the data,
you have to either buffer the whole thing or do a dual-thread
writer/reader pair.

The easiest way to pull stuff from SVN is to use one of the web server
apps. Solr takes a stream.url parameter. (Also stream.file.) Note
that there is no outbound authentication supported; your web server
has to be open (at least to the Solr instance).


On Sun, Oct 31, 2010 at 4:06 PM, getagrip getag...@web.de wrote:

 Hi,

 I've got some basic usage / design questions.

 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
   instance for all requests to avoid connection leaks.
   So if I create a singleton instance upon application startup, can I
   safely use this instance for ALL queries/updates throughout my
   application without running into performance issues?

 2. My system's documents are stored in a Subversion repository.
   For fast search results I want to periodically index new documents
   from the repository.

   What I get from the repository is a ByteArrayOutputStream. How can I
   pass this stream to Solr?

   I only see possibilities to pass files, but in my case it does not
   make sense to write the ByteArrayOutputStream to disk again, as this
   would cause performance issues apart from making no sense anyway.

 3. Are there any disadvantages to using SolrJ over some other HTTP-based
   solution, e.g. creating & sending my own HTTP requests? Do I even
   have to use HTTP?
   I see the EmbeddedSolrServer exists. Any drawbacks to using that?

 Any hints are welcome. Thanks!


--
Lance Norskog
goks...@gmail.com


Re: Design and Usage Questions

2010-11-01 Thread getagrip

Ok, so if I did NOT use SolrJ I could PUSH a stream to Solr somehow?
I do not depend on SolrJ; any connection method would suffice.

On 11/01/2010 03:23 AM, Lance Norskog wrote:

2.
The SolrJ library handling of content streams is pull, not push.
That is, you give it a reader and it pulls content when it feels like
it. If your software to feed the connection wants to write the data,
you have to either buffer the whole thing or do a dual-thread
writer/reader pair.

The easiest way to pull stuff from SVN is to use one of the web server
apps. Solr takes a stream.url parameter. (Also stream.file.) Note
that there is no outbound authentication supported; your web server
has to be open (at least to the Solr instance).


On Sun, Oct 31, 2010 at 4:06 PM, getagrip getag...@web.de wrote:

 Hi,

 I've got some basic usage / design questions.

 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
    instance for all requests to avoid connection leaks.
    So if I create a singleton instance upon application startup, can I
    safely use this instance for ALL queries/updates throughout my
    application without running into performance issues?

 2. My system's documents are stored in a Subversion repository.
    For fast search results I want to periodically index new documents
    from the repository.

    What I get from the repository is a ByteArrayOutputStream. How can I
    pass this stream to Solr?

    I only see possibilities to pass files, but in my case it does not
    make sense to write the ByteArrayOutputStream to disk again, as this
    would cause performance issues apart from making no sense anyway.

 3. Are there any disadvantages to using SolrJ over some other HTTP-based
    solution, e.g. creating & sending my own HTTP requests? Do I even
    have to use HTTP?
    I see the EmbeddedSolrServer exists. Any drawbacks to using that?

 Any hints are welcome. Thanks!







Re: Custom Sorting in Solr

2010-11-01 Thread Ezequiel Calderara
Ok, I imagined that the doubly linked list would be far too complicated
for Solr.

Now, how can I make Solr connect to a webservice and do the import?

I'm sorry if I'm not clear; sometimes my English gets fuzzy :P

On Fri, Oct 29, 2010 at 4:51 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Fri, Oct 29, 2010 at 3:39 PM, Ezequiel Calderara ezech...@gmail.com
 wrote:
  Hi all guys!
  I'm in a weird situation here.
  We have indexed a set of documents which are ordered using a linked
  list (each document has a reference to the previous and the next).
 
  Is there a way, when sorting in the Solr search, to use the linked
  list to sort?

 It seems like you should be able to encode this linked list as an
 integer instead, and sort by that?
 If there are multiple linked lists in the index, it seems like you
 could even use the high bits of the int to designate which list the
 doc belongs to, and the low order bits as the order in that list.

 -Yonik
 http://www.lucidimagination.com




-- 
__
Ezequiel.

http://www.ironicnet.com
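
A sketch of the encoding Yonik describes above (the bit split and the
field name below are assumptions; this supports up to 128 lists of up to
~16M documents each):

public final class ListOrder {

    private static final int POS_BITS = 24;                    // ~16M positions per list
    private static final int POS_MASK = (1 << POS_BITS) - 1;
    private static final int MAX_LISTS = 1 << (31 - POS_BITS); // 128 lists

    // Pack the list id into the high bits and the position within that
    // list into the low bits; index the result as a sortable int field.
    public static int encode(int listId, int position) {
        if (listId < 0 || listId >= MAX_LISTS)
            throw new IllegalArgumentException("listId out of range");
        if (position < 0 || position > POS_MASK)
            throw new IllegalArgumentException("position out of range");
        return (listId << POS_BITS) | position;
    }

    public static int listId(int encoded)   { return encoded >>> POS_BITS; }
    public static int position(int encoded) { return encoded & POS_MASK; }
}

Sorting on such a field (e.g. sort=list_order asc, assuming it is indexed
as list_order) then returns documents in linked-list order within each
list.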


Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Roxana Angheluta
Hi,

Yes, sometimes it takes 5 minutes for a query. I agree this is not
desirable. However, if the application has no control over the input
queries other than closing the socket after a while, Solr should not
continue writing the response, but terminate the thread.

In general, is there a way to quantify the complexity of a given query on a 
certain index? Some general guidelines which can be used by non-technical 
people?

Thanks a lot,
roxana 

--- On Sun, 10/31/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: solr stuck in writing to inexisting sockets
 To: solr-user@lucene.apache.org
 Date: Sunday, October 31, 2010, 2:29 AM
 Are you saying that your Solr server is at times taking 5 minutes to
 complete? If so, I'd get to the bottom of that first off. My first
 guess would be you're either hitting memory issues and swapping
 horribly, or... well, that would be my first guess.
 
 Best
 Erick
 
On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta anghelu...@yahoo.com wrote:
 
  Hi all,
 
  We are using Solr over Jetty with a large index, sharded and
  distributed over multiple machines. Our queries are quite long,
  involving boolean and proximity operators. We cut the connection at
  the client side after 5 minutes. Also, we are using the timeAllowed
  parameter to stop executing a query on the server after a while.
  We quite often run into situations where Solr blocks. The load on the
  server increases, and a thread dump on the Solr process shows many
  threads like the one below:
 
  btpool0-49 prio=10 tid=0x7f73afe1d000 nid=0x3581 runnable [0x451a]
    java.lang.Thread.State: RUNNABLE
         at java.io.PrintWriter.write(PrintWriter.java:362)
         at org.apache.solr.common.util.XML.escape(XML.java:206)
         at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
         at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
         at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
         at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
         at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
         at org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
         at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
         at org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
         at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
         at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
         at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
         at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  ..
 
  A netstat on the machine shows sockets in state CLOSE_WAIT. However,
  they are fewer than the number of RUNNABLE threads as above.
 
  Why is this happening? Is there anything we can do to avoid getting
  into these situations?
 
  Thanks,
  roxana
 
 
 
 
 





big terms in UnInvertedField

2010-11-01 Thread Koji Sekiguchi
Hello,

With the Solr example, using facet.field=text creates an UnInvertedField
for the text field in fieldValueCache. After that, I looked at the stats
page and was surprised that the counters in *filterCache* were up:

lookups : 213
hits : 106
hitratio : 0.49
inserts : 107
evictions : 0
size : 107
warmupTime : 0
cumulative_lookups : 213
cumulative_hits : 106
cumulative_hitratio : 0.49
cumulative_inserts : 107
cumulative_evictions : 0

Is this caused by big terms in the UnInvertedField?
If so, when using both faceting on a multiValued field and faceting on a
single-valued field or facet queries, it is difficult to estimate the
size of the filterCache.

Koji
-- 
http://www.rondhuit.com/en/


Re: big terms in UnInvertedField

2010-11-01 Thread Yonik Seeley
2010/11/1 Koji Sekiguchi k...@r.email.ne.jp:
 With the Solr example, using facet.field=text creates an UnInvertedField
 for the text field in fieldValueCache. After that, I looked at the stats
 page and was surprised that the counters in *filterCache* were up:

 Is this caused by big terms in the UnInvertedField?

Yes. Big terms (defined as matching more than 5% of the index) are
not uninverted, since it's more efficient (both CPU and memory) to use
the filterCache and calculate intersections.

 If so, when using both faceting on a multiValued field and faceting on a
 single-valued field or facet queries, it is difficult to estimate the
 size of the filterCache.

Yep. At least fieldValueCache (for UnInvertedField) tells you the
number of big terms in each field you are faceting on, though.

-Yonik
http://www.lucidimagination.com


Re: big terms in UnInvertedField

2010-11-01 Thread Koji Sekiguchi

Yonik,

Thank you for your reply. I just wanted to share my surprise. :)

Koji
--
http://www.rondhuit.com/en/

(10/11/01 23:17), Yonik Seeley wrote:

2010/11/1 Koji Sekiguchi k...@r.email.ne.jp:

 With the Solr example, using facet.field=text creates an UnInvertedField
 for the text field in fieldValueCache. After that, I looked at the stats
 page and was surprised that the counters in *filterCache* were up:

 Is this caused by big terms in the UnInvertedField?

Yes. Big terms (defined as matching more than 5% of the index) are
not uninverted, since it's more efficient (both CPU and memory) to use
the filterCache and calculate intersections.

 If so, when using both faceting on a multiValued field and faceting on a
 single-valued field or facet queries, it is difficult to estimate the
 size of the filterCache.

Yep. At least fieldValueCache (for UnInvertedField) tells you the
number of big terms in each field you are faceting on, though.

-Yonik
http://www.lucidimagination.com





Re: Solr Relevancy Calculation

2010-11-01 Thread Erick Erickson
Here's a good place to start:
http://search.lucidimagination.com/search/out?u=http://lucene.apache.org/java/2_4_0/scoring.html

But what do you mean by this is going to search on five fields? This
sounds like you're using DisMax, in which case it throws out all but the
top-scoring clause when it calculates the score for the document.
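
A quick way to see exactly how each document's score is computed
(standard Solr debug output; the query just echoes the one from your
mail):

  q=laptops&debugQuery=on

The explain section of the response then breaks the score down clause by
clause and field by field.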

HTH
Erick

On Sun, Oct 31, 2010 at 10:48 PM, sivaprasad sivaprasa...@echidnainc.com wrote:


 Hi,
 I have 25 indexed fields in my document. But by default, if I give
 q=laptops, the search runs over five fields, and I get the score as
 part of the search results. How does Solr calculate the score? Does it
 calculate it only over the five fields, or over all 25 indexed fields?
 In what order does it calculate the score? Any documents on this topic
 would be helpful.

 Regards,
 JS



Re: Boosting the score based on a certain field

2010-11-01 Thread Erick Erickson
Would simple boosting work? As in category:electronics^2?

If not, perhaps you can explain a bit more about what you're trying to
accomplish...
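
For example, with dismax (a sketch; the field names and boost factors
are placeholders):

  defType=dismax
  q=laptops
  qf=name description
  bq=category:electronics^2 category:games^1.1

or, with the standard request handler, as an optional boosting clause in
the query itself:

  q=+laptops category:electronics^2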

Best
Erick

On Sun, Oct 31, 2010 at 10:55 PM, sivaprasad sivaprasa...@echidnainc.com wrote:


 Hi,

 In my document I have a field called category. It contains
 electronics, games, etc. For some of the category values I need to
 boost the document score. Let us say that for the electronics category
 I will set a boosting parameter greater than for the games category.
 Does anybody have an idea how to achieve this functionality?

 Regards,
 Siva





Re: Multiple Keyword Search

2010-11-01 Thread Erick Erickson
I'm not sure this exactly fits your use-case, but it may come
close enough. Have you looked at disMax and the mm parameter
(minimum should match)?
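
For example (a sketch; the keyword string is made up, and with mm=1 at
least one keyword must match, while documents matching more of the
keywords tend to score higher):

  defType=dismax
  q=red wireless mouse
  qf=ad_title ad_description
  mm=1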

Best
Erick

On Mon, Nov 1, 2010 at 5:00 AM, Pawan Darira pawan.dar...@gmail.com wrote:

 Hi

 There is a situation where I search for more than one keyword & my main
 two fields are ad_title & ad_description. I want the results that match
 all of the keywords in both fields to come on top; then, one by one,
 keywords can be dropped in further results.

 E.g. in a search with 3 keywords, suppose there are 100 results. If 35
 contain all the keywords combined in ad_title & ad_description, they
 should come first. If 50 results contain a combination of any 2
 keywords, they should come next. Finally, results with a single keyword
 should come last.

 Please suggest

 --
 Thanks,
 Pawan Darira



Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Erick Erickson
I'm going to nudge you in the direction of understanding why the queries
take so long in the first place rather than going toward the blunt approach
of cutting them off after some time. The fact that you don't control the
queries submitted doesn't prevent you from trying to understand what
is taking so long.

The first thing I'd look for is whether the system is memory starved. What
JVM are you using and what memory parameters are you giving it? What
version of Solr are you using? Have you tried any performance monitoring
to determine what is happening?

The reason I'm pushing in this direction is that 5 minute searches are
pathological. Once you're up in that range, virtually any fix you come up
with will simply mask the underlying problems, and you'll be forever
chasing the next manifestation of the underlying problem.

Besides, I don't know how you'd stop Solr processing a query midway
through; I don't know of any way to make that happen.

Best
Erick

On Mon, Nov 1, 2010 at 9:30 AM, Roxana Angheluta anghelu...@yahoo.com wrote:

 Hi,

 Yes, sometimes it takes 5 minutes for a query. I agree this is not
 desirable. However, if the application has no control over the input
 queries other than closing the socket after a while, Solr should not
 continue writing the response, but terminate the thread.

 In general, is there a way to quantify the complexity of a given query
 on a certain index? Some general guidelines which can be used by
 non-technical people?

 Thanks a lot,
 roxana

 --- On Sun, 10/31/10, Erick Erickson erickerick...@gmail.com wrote:

  From: Erick Erickson erickerick...@gmail.com
  Subject: Re: solr stuck in writing to inexisting sockets
  To: solr-user@lucene.apache.org
  Date: Sunday, October 31, 2010, 2:29 AM
  Are you saying that your Solr server is at times taking 5 minutes to
  complete? If so, I'd get to the bottom of that first off. My first
  guess would be you're either hitting memory issues and swapping
  horribly, or... well, that would be my first guess.

  Best
  Erick
 
  On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta anghelu...@yahoo.com
 wrote:
 
   Hi all,
  
   We are using Solr over Jetty with a large index, sharded and
   distributed over multiple machines. Our queries are quite long,
   involving boolean and proximity operators. We cut the connection at
   the client side after 5 minutes. Also, we are using the timeAllowed
   parameter to stop executing a query on the server after a while.
   We quite often run into situations where Solr blocks. The load on the
   server increases, and a thread dump on the Solr process shows many
   threads like the one below:
  
   btpool0-49 prio=10 tid=0x7f73afe1d000 nid=0x3581 runnable [0x451a]
     java.lang.Thread.State: RUNNABLE
          at java.io.PrintWriter.write(PrintWriter.java:362)
          at org.apache.solr.common.util.XML.escape(XML.java:206)
          at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
          at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
          at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
          at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
          at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
          at org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
          at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
          at org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
          at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
          at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
          at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
          at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   ..
  
   A netstat on the machine shows sockets in state CLOSE_WAIT. However,
   they are fewer than the number of RUNNABLE threads as above.
  
   Why is this happening? Is there anything we can do to avoid getting
   into these situations?
  
   Thanks,
   roxana
  
  
  
  
 






Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
We are trying to solve some multilingual issues with our Solr analysis filter 
chain and would like to use the new Lucene 3.x filters that are Unicode 
compliant.

Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
UAX#29 support from Solr?

Is it just a matter of writing the appropriate Solr filter factories?  Are 
there any tricky gotchas in writing such a filter?

If so, should I open a JIRA issue or two JIRA issues so the filter factories 
can be contributed to the Solr code base?

Tom



Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Robert Muir
On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are trying to solve some multilingual issues with our Solr analysis filter 
 chain and would like to use the new Lucene 3.x filters that are Unicode 
 compliant.

 Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
 UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
Just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the old StandardTokenizer for
backwards compatibility.

  <!--
    Controls what version of Lucene various components of Solr adhere
    to. Generally, you want to use the latest version to get all bug
    fixes and improvements. It is highly recommended that you fully
    re-index after changing this setting as it can affect both how text
    is indexed and queried.
  -->
  <luceneMatchVersion>LUCENE_31</luceneMatchVersion>

But if you want the pure UAX#29 tokenizer without this, there isn't a
factory. Also, if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.
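
Such a factory might be only a few lines; a sketch against the 3.x
analysis API (untested, and the ICUTokenizer constructor signature is an
assumption):

package org.apache.solr.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

public class ICUTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new ICUTokenizer(input); // default UAX#29-based segmentation
  }
}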

 If so, should I open a JIRA issue or two JIRA issues so the filter factories 
 can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?)!


Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Jonathan Rochkind
I think you guys are talking about two different kinds of 'virtual
hosts'. Lance is talking about CPU virtualization; Eric appears to be
talking about Apache virtual web hosts, although Eric hasn't told us how
Apache is involved in his setup in the first place, so it's unclear.

Assuming you are using Apache to reverse proxy to Solr, there is no
reason I can think of that your front-end Apache setup would affect CPU
utilization by Solr, let alone by Nutch.


Eric Martin wrote:

Oh. So I should take out the installations and move them to /some_dir as
opposed to inside my virtual host of /home/my solr & nutch is here/www

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU & memory quotas to your
different VMs. This allows you to control the Nutch vs. The World
problem. Unfortunately, you cannot allocate disk channel. With two
I/O-bound apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin e...@makethembite.com wrote:
  

Excellent information. Thank you. Solr is acting just fine then. I can
connect to it with no issues, it indexes fine, and there didn't seem to
be any complication with it. Now I can rule it out and go about solving
what you pointed out, and I agree, it looks to be a Java/Nutch issue.

Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is
open source and found on apache.org.

Thanks for your time.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat?
Something else? Are you fronting it with Apache on top of that? (I think
maybe you are; otherwise I'm not sure how the phrase 'virtual host'
applies.)

In general, Solr of course doesn't care what directory it's in on disk,
as long as the process running Solr has the necessary read/write
permissions to the necessary directories (and if it doesn't, you'd
usually find out right away with an error message). And clients of Solr
don't care what directory it's in on disk either; they only care that
they can get to it by connecting to a certain port at a certain
hostname. In general, if they can't get to it on a certain port at a
certain hostname, that's something you'd discover right away, not
something that would be intermittent. But I'm not familiar with Nutch;
you may want to try connecting to the port you have Solr running on (the
hostname/port you have told Nutch to find Solr on?) yourself manually,
and just make sure it is connectable.

I can't think of any reason that the directory you have Solr in could
cause CPU utilization issues. I think it's got nothing to do with that.

I am not familiar with Nutch; if it's Nutch that's taking 100% of your
CPU, you might want to find some Nutch experts to ask. Perhaps there's a
Nutch listserv? I am also not familiar with Hadoop; you mention in
passing that you're using Hadoop too, so maybe that's an added
complication, I don't know.

One obvious reason Nutch could be taking 100% CPU would be simply
because you've asked it to do a lot of work quickly, and it's trying to.

One reason I have seen Solr take 100% of CPU and become unresponsive is
when the Solr process gets caught up in terrible Java garbage
collection. If that's what's happening, then giving the Solr JVM a
higher maximum heap size can sometimes help (although, confusingly, I've
seen people suggest that if you give the Solr JVM too MUCH heap it can
also result in long GC pauses), and if you have a multi-core/multi-CPU
machine, I've found the JVM argument -XX:+UseConcMarkSweepGC to be very
helpful.

Other than that, it sounds to me like you've got a Nutch/Hadoop issue,
not a Solr issue.

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi,

Thank you. This is more than idle curiosity. I am trying to debug an
issue I am having with my installation, and this is one step in
verifying that I have a setup that does not consume resources. I am
trying to debunk my internal myth that having Solr and Nutch in a
virtual host could be causing these issues. Here is the main issue that
involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/Drupal site

I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
jetty for my Solr. My server is not rooted.

Nutch is using 100% of my CPUs. I see this in my CPU utilization in my WHM:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
-Dhadoop.log.file=hadoop.log

Facet count of zero

2010-11-01 Thread Tod
I'm trying to exclude certain facet results from a facet query. It
seems to work, but rather than being excluded from the facet list it is
returned with a count of zero.


Ex:
q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true


This returns bar with a count of zero. All the other foo's show up with
valid counts.


Can I do this? Is my syntax incorrect?



Thanks - Tod


Problem with phrase matches in Solr

2010-11-01 Thread Moazzam Khan
Hey guys,

I have a Solr index where I store information about experts from
various fields. The thing is, when I search for channel marketing I
get people that have the word channel or marketing in their data. I
only want people who have that entire phrase in their bio. I copy the
contents of bio to the default search field (which is text).

How can I make sure that exact phrase matching works while the search
is agile enough that partial searches match too (like uni matching
university, etc.; this works, but not phrase matching)?

I hope I was able to explain my problem properly. If not, please let me know.

Thanks in advance,
Moazzam


Re: Facet count of zero

2010-11-01 Thread Yonik Seeley
On Mon, Nov 1, 2010 at 12:55 PM, Tod listac...@gmail.com wrote:
 I'm trying to exclude certain facet results from a facet query. It
 seems to work, but rather than being excluded from the facet list it is
 returned with a count of zero.

If you don't want to see 0 counts, use facet.mincount=1

http://wiki.apache.org/solr/SimpleFacetParameters
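
For example, adding it to the query from the original mail (a sketch,
with the ampersands restored):

  q=(-foo:bar)&facet=true&facet.field=foo&facet.mincount=1&facet.sort=idx&wt=json&indent=true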

-Yonik
http://www.lucidimagination.com


 Ex:
 q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true

 This returns bar with a count of zero. All the other foo's show up with
 valid counts.

 Can I do this? Is my syntax incorrect?



 Thanks - Tod



RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Eric Martin
I was speaking about Apache virtual hosts. I was concerned that there
was an increase in processing time due to the Solr and Nutch instances
being housed inside a virtual host as opposed to being dropped in the
root of my distro.

Thank you for the astute clarification.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Monday, November 01, 2010 9:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

I think you guys are talking about two different kinds of 'virtual
hosts'. Lance is talking about CPU virtualization; Eric appears to be
talking about Apache virtual web hosts, although Eric hasn't told us how
Apache is involved in his setup in the first place, so it's unclear.

Assuming you are using Apache to reverse proxy to Solr, there is no
reason I can think of that your front-end Apache setup would affect CPU
utilization by Solr, let alone by Nutch.

Eric Martin wrote:
 Oh. So I should take out the installations and move them to /some_dir
 as opposed to inside my virtual host of /home/my solr & nutch is
 here/www

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: Sunday, October 31, 2010 7:26 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr in virtual host as opposed to /lib

 With virtual hosting you can give CPU & memory quotas to your
 different VMs. This allows you to control the Nutch vs. The World
 problem. Unfortunately, you cannot allocate disk channel. With two
 I/O-bound apps, this is a problem.

 On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin e...@makethembite.com wrote:
   
 Excellent information. Thank you. Solr is acting just fine then. I can
 connect to it with no issues, it indexes fine, and there didn't seem to
 be any complication with it. Now I can rule it out and go about solving
 what you pointed out, and I agree, it looks to be a Java/Nutch issue.

 Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is
 open source and found on apache.org.

 Thanks for your time.

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Sunday, October 31, 2010 4:33 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Solr in virtual host as opposed to /lib

 What servlet container are you putting your Solr in? Jetty? Tomcat?
 Something else? Are you fronting it with Apache on top of that? (I
 think maybe you are; otherwise I'm not sure how the phrase 'virtual
 host' applies.)

 In general, Solr of course doesn't care what directory it's in on disk,
 as long as the process running Solr has the necessary read/write
 permissions to the necessary directories (and if it doesn't, you'd
 usually find out right away with an error message). And clients of Solr
 don't care what directory it's in on disk either; they only care that
 they can get to it by connecting to a certain port at a certain
 hostname. In general, if they can't get to it on a certain port at a
 certain hostname, that's something you'd discover right away, not
 something that would be intermittent. But I'm not familiar with Nutch;
 you may want to try connecting to the port you have Solr running on
 (the hostname/port you have told Nutch to find Solr on?) yourself
 manually, and just make sure it is connectable.

 I can't think of any reason that the directory you have Solr in could
 cause CPU utilization issues. I think it's got nothing to do with that.

 I am not familiar with Nutch; if it's Nutch that's taking 100% of your
 CPU, you might want to find some Nutch experts to ask. Perhaps there's
 a Nutch listserv? I am also not familiar with Hadoop; you mention in
 passing that you're using Hadoop too, so maybe that's an added
 complication, I don't know.

 One obvious reason Nutch could be taking 100% CPU would be simply
 because you've asked it to do a lot of work quickly, and it's trying
 to.

 One reason I have seen Solr take 100% of CPU and become unresponsive is
 when the Solr process gets caught up in terrible Java garbage
 collection. If that's what's happening, then giving the Solr JVM a
 higher maximum heap size can sometimes help (although, confusingly,
 I've seen people suggest that if you give the Solr JVM too MUCH heap it
 can also result in long GC pauses), and if you have a
 multi-core/multi-CPU machine, I've found the JVM argument
 -XX:+UseConcMarkSweepGC to be very helpful.

 Other than that, it sounds to me like you've got a Nutch/Hadoop issue,
 not a Solr issue.
 
 From: Eric Martin [e...@makethembite.com]
 Sent: Sunday, October 31, 2010 7:16 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Solr in virtual host as opposed to /lib

 Hi,

 Thank you. This is more than idle curiosity. I am trying to debug an
 issue I am having with my installation, and this is one step in
 verifying that I have a setup that does not consume resources. I am
 trying to debunk my internal myth that having Solr and Nutch in a
 virtual host could be causing these issues.

Re: Problem with phrase matches in Solr

2010-11-01 Thread darren
Take a look at term proximity and phrase query.

http://wiki.apache.org/solr/SolrRelevancyCookbook
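
A quick sketch of the difference (assuming the phrase is searched
against your default text field):

  q=channel marketing       matches documents containing either word
  q="channel marketing"     matches only the exact phrase
  q="channel marketing"~3   phrase with up to 3 positions of slop (proximity)

For the uni-matches-university behavior combined with phrase matching,
one common setup (an assumption about your schema, not something your
mail confirms) is an EdgeNGram-analyzed copyField for prefix matching
alongside a plain text field for phrase queries.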

 Hey guys,

 I have a Solr index where I store information about experts from
 various fields. The thing is, when I search for channel marketing I
 get people that have the word channel or marketing in their data. I
 only want people who have that entire phrase in their bio. I copy the
 contents of bio to the default search field (which is text).

 How can I make sure that exact phrase matching works while the search
 is agile enough that partial searches match too (like uni matching
 university, etc.; this works, but not phrase matching)?

 I hope I was able to explain my problem properly. If not, please let
 me know.

 Thanks in advance,
 Moazzam




RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
Thanks Robert,

I'll use the workaround for now (using StandardTokenizerFactory and
specifying version 3.1), but I suspect that I don't want the added
URL/IP address recognition, due to my use case. I've also talked to a
couple of people who recommended using the ICUTokenizer with some rule
modifications, but I haven't had a chance to investigate that yet.

I opened two JIRA issues: https://issues.apache.org/jira/browse/SOLR-2210
and https://issues.apache.org/jira/browse/SOLR-2211. Sometime later this
week I'll try writing the FilterFactories and upload patches. (Unless
someone beats me to it :)

Tom

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support 
from Solr

On Mon, Nov 1, 2010 at 12:24 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are trying to solve some multilingual issues with our Solr analysis filter 
 chain and would like to use the new Lucene 3.x filters that are Unicode 
 compliant.

 Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with 
 UAX#29 support from Solr?

right now, you can use the StandardTokenizerFactory (which is UAX#29 +
URL and IP address recognition) from Solr.
Just make sure you set the Version to 3.1 in your solrconfig.xml with
branch_3x, otherwise it will use the old StandardTokenizer for
backwards compatibility.

  <!--
    Controls what version of Lucene various components of Solr adhere
    to. Generally, you want to use the latest version to get all bug
    fixes and improvements. It is highly recommended that you fully
    re-index after changing this setting as it can affect both how text
    is indexed and queried.
  -->
  <luceneMatchVersion>LUCENE_31</luceneMatchVersion>

But if you want the pure UAX#29 tokenizer without this, there isn't a
factory. Also, if you want customization/supplementary character
support, there is no factory for ICUTokenizer at the moment.

 If so, should I open a JIRA issue or two JIRA issues so the filter factories 
 can be contributed to the Solr code base?

Please open issues for a factory for the pure UAX#29 tokenizer, and
for the ICU factories (maybe we can just put this into a contrib for
now?)!


RE: How does DIH multithreading work?

2010-11-01 Thread Dyer, James
Mark,

I have the same question, so I did a little research on this. Not a
complete answer, but here is what I've found:

- threads was added with SOLR-1352
(https://issues.apache.org/jira/browse/SOLR-1352).

- Also see
http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler
for background info.

- Only available in 3.x and trunk. Committed on 1/12/2010 by Noble Paul
(who can surely tell you more accurate info than I can).

- It seems that when using it, each thread will call nextRow on your
root entity's datasource in parallel.

- Not sure this will help with child entities (i.e., I had hoped I could
get it to build child caches in parallel, but I don't think this is the
case).

- A doc comment on ThreadedEntityProcessorWrapper indicates this will
help speed up running transformers, because they'd run in parallel.
This would make sense if your database can only pull rows back so fast
but you then have an intensive transformer: adding a thread would make
your processing no slower than the db.
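
A minimal sketch of where the attribute goes (the table, columns, and
driver are hypothetical; per SOLR-1352 the threads attribute is only
honored on root entities):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db" />
  <document>
    <entity name="item" threads="4" query="SELECT id, name FROM item">
      <field column="id" name="id" />
      <field column="name" name="name" />
    </entity>
  </document>
</dataConfig>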

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: markwaddle [mailto:m...@markwaddle.com] 
Sent: Tuesday, October 26, 2010 2:25 PM
To: solr-user@lucene.apache.org
Subject: How does DIH multithreading work?


I understand that the thread count is specified on root entities only. Does
it spawn multiple threads per root entity? Or multiple threads per
descendant entity? Can someone give an example of how you would make a
database query in an entity with 4 threads that would select 1 row per
thread?

Thanks,
Mark


RE: indexing '-

2010-11-01 Thread PeterKerk

Guys, the string type did the trick :)

Thanks


Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Robert Muir
On Mon, Nov 1, 2010 at 1:34 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Robert,

 I'll use the workaround for now (using StandardTokenizerFactory and
 specifying version 3.1), but I suspect that I don't want the added
 URL/IP address recognition, due to my use case. I've also talked to a
 couple of people who recommended using the ICUTokenizer with some rule
 modifications, but I haven't had a chance to investigate that yet.


Yes; as far as doing rule modifications, we can think about how to
hook this in. At the end of the day, if we allow someone to specify
the classname of their ICUTokenizerConfig (default:
DefaultICUTokenizerConfig), that would at least allow this
customization.

Separately, I'd be interested in hearing about whatever rule
modifications might be useful for different purposes.

  I opened two JIRA issues: https://issues.apache.org/jira/browse/SOLR-2210
  and https://issues.apache.org/jira/browse/SOLR-2211. Sometime later this
  week I'll try writing the FilterFactories and upload patches. (Unless
  someone beats me to it :)


Thanks Tom. There are actually a lot of analysis factories (even in
just ICU itself) not exposed to Solr, so it's a good deal of work. I
know I have a few of them, but they aren't the best. I suggested on
SOLR-2210 that we could make a contrib like 'extraAnalyzers' and put all
the analyzers that have large dependencies/dictionaries (e.g.
SmartChinese too) in there.

So there's a lot to be done... including tests; any help is appreciated!


Testing/packaging question

2010-11-01 Thread Bernhard Reiter
Hi, 

I'm pretty much a Solr newbie, currently packaging solrpy for Debian;
see
http://svn.debian.org/viewsvn/python-modules/packages/python-solrpy/trunk/

In order to run solrpy's supplied tests at build time, I'd need Solr to
know about the schema.xml that comes with the tests.
Can anyone tell me how to do that properly? I'd basically need Solr to
temporarily recognize that schema.xml without permanently installing it.
Is there any way to do this, e.g. via environment variables?
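
One approach that might work (a sketch; the path is an assumption about
your build tree, while solr.solr.home is the standard system property
Solr uses to locate its conf/ directory): point a throwaway Jetty/Solr
instance at a temporary Solr home containing the test schema for the
duration of the build:

  java -Dsolr.solr.home=/path/to/build/tests/solr -jar start.jar

where /path/to/build/tests/solr/conf/schema.xml is the schema shipped
with the tests; nothing gets installed permanently.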

TIA
Bernhard Reiter



Re: Facet count of zero

2010-11-01 Thread Tod

On 11/1/2010 1:03 PM, Yonik Seeley wrote:

On Mon, Nov 1, 2010 at 12:55 PM, Tod listac...@gmail.com wrote:

 I'm trying to exclude certain facet results from a facet query. It
 seems to work, but rather than being excluded from the facet list it is
 returned with a count of zero.

If you don't want to see 0 counts, use facet.mincount=1

http://wiki.apache.org/solr/SimpleFacetParameters

-Yonik
http://www.lucidimagination.com

 Ex:
 q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true

 This returns bar with a count of zero. All the other foo's show up with
 valid counts.

 Can I do this? Is my syntax incorrect?

 Thanks - Tod





Excellent, I completely missed it - thanks!


Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Chris Hostetter

: References: aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
: aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
: aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
: aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: In-Reply-To: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss


Re: Reverse range search

2010-11-01 Thread Jan Høydahl / Cominvent
Hi,

I think I have seen a comment on the list from someone with the same
need a few months ago. He planned to make a new fieldType to support
this, e.g. a MinMaxRangeFieldType, which would be a polyField type
holding both a min and a max value; then you could query it with
q=myminmaxfield:123.

I did not find it as a JIRA issue, however, but I can see how it would
be useful for a lot of use cases. Perhaps you can create a JIRA issue
for it and supply a patch? :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 28. okt. 2010, at 23.24, kenf_nc wrote:

 
 Doing a range search is straightforward: I have a fixed value in a
 document field, I search on [x TO y], and if the fixed value is in the
 requested range it gets a hit. But what if I have data in a document
 where there is a min value and a max value, my query is a fixed value,
 and I want a hit if the query value is in that range? For example:
 
 Solr Doc1:
 field  min_price:100
 field  max_price:500
 
 Solr Doc2:
 field  min_price:300
 field  max_price:500
 
 and my query is price:250. I could create a query of (min_price:[* TO
 250] AND max_price:[250 TO *]) and that should work; it should find
 only doc 1. However, if I have several fields like this, and complex
 queries that include most of those fields, it becomes a very ugly
 query. Ideally I'd like to do something similar to what the spatial
 contrib guys do, where they make lat/long a single point. If I had a
 min/max field, I could call it Price(100, 500) or Price(300, 500) and
 just do a query of Price:250, and Solr would see if 250 was in the
 appropriate range.
 
 Looong question short... is there something out there already that
 does this? Does anyone else do something like this and have some
 suggestions?
 Thanks,
 Ken



RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Eric Martin
I don't think you read the entire thread. I'm assuming you made a mistake.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Monday, November 01, 2010 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib


: References: aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
: aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
: aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
: aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: In-Reply-To:
aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: Subject: Solr in virtual host as opposed to /lib

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: Solr in virtual host as opposed to /lib

2010-11-01 Thread Markus Jelsma
No, he didn't make a mistake, but you did. Next time, please start a new
thread instead of conveniently replying to an existing thread and just
changing the subject. Now we have two threads in one thread. :)

 I don't think you read the entire thread. I'm assuming you made a mistake.
 
 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Monday, November 01, 2010 11:49 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr in virtual host as opposed to /lib
 
 : References:
 : aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
 : 
 : aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
 : aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
 : aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
 : aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
 : 
 : In-Reply-To:
 aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
 
 : Subject: Solr in virtual host as opposed to /lib
 
 http://people.apache.org/~hossman/#threadhijack
 Thread Hijacking on Mailing Lists
 
 When starting a new discussion on a mailing list, please do not reply to
 an existing message, instead start a fresh email.  Even if you change the
 subject line of your email, other mail headers still track which thread
 you replied to and your question is hidden in that thread and gets less
 attention.   It makes following discussions in the mailing list archives
 particularly difficult.
 See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
 
 
 
 -Hoss


RE: Solr in virtual host as opposed to /lib

2010-11-01 Thread Chris Hostetter

: I don't think you read the entire thread. I'm assuming you made a mistake.

No mistake. When you sent your first message with the subject Solr in
virtual host as opposed to /lib, you did so in response to a completely
unrelated thread (Searching with wrong keyboard layout or using
translit).

Please note the headers i quoted below documenting this, or consult any 
mailing list archive that displays full threads...

http://markmail.org/thread/bjl23qcigp6w3kyl


: 
: -Original Message-
: From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
: Sent: Monday, November 01, 2010 11:49 AM
: To: solr-user@lucene.apache.org
: Subject: Re: Solr in virtual host as opposed to /lib
: 
: 
: : References: aanlktimvv5foc2b=gxo+xs1zwgps9o5t5jorwv3id...@mail.gmail.com
: : aanlktim30aat8s0nxq_8utxcokv8myyabz8wtxeyl...@mail.gmail.com
: : aanlktimpo9v_krgaxomd4hocqabibgzdhc+jhhgsq...@mail.gmail.com
: : aanlktimdvaawj7=b7=pgu+rzm+nobvzdfh4o39nkp...@mail.gmail.com
: : aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: : In-Reply-To:
: aanlktindzuwyjxwqqmtr5-rrp4gekvmj5vzzc_f0n...@mail.gmail.com
: : Subject: Solr in virtual host as opposed to /lib
: 
: http://people.apache.org/~hossman/#threadhijack
: Thread Hijacking on Mailing Lists
: 
: When starting a new discussion on a mailing list, please do not reply to 
: an existing message, instead start a fresh email.  Even if you change the 
: subject line of your email, other mail headers still track which thread 
: you replied to and your question is hidden in that thread and gets less 
: attention.   It makes following discussions in the mailing list archives 
: particularly difficult.
: See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
: 
: 
: 
: -Hoss
: 

-Hoss


is my search fast ?! date search i need some feedback :D

2010-11-01 Thread stockiii

my index is 13M documents and I have not indexed all of my documents yet; the
index in the production system should be about 30M documents.

so with my 13M test index I try a search over all documents, with the
first query: q:[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00]
then I run the next query, for statistics, grouped by currency_id, and get
the amounts of these currencies.

that's my result:
- EUR Sum: 437.259.518,28 € Found: 3712331
- CHF Sum: 2.048.147,62 SFr. Found: 10473
- GBP Sum: 1.221,41 £ Found: 181

for getting the result solr needs 9 seconds ... I don't think that's really
fast =(
what do you think ?


for a faster search I want to try changing precisionStep="6" to -- for dropping
the milliseconds. what's the value for dropping the seconds as well? we only
need HH:MM and not HH:MM:SS:MSMS
and I'll change the date search from q to fq ...

thx


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/is-my-search-fast-date-search-i-need-some-feedback-D-tp1820821p1820821.html
Sent from the Solr - User mailing list archive at Nabble.com.
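
For readers trying the same change, a minimal sketch (the field name date is an
assumption, and Solr's date syntax is ISO-8601, not the format quoted above):

    # move the range into a filter query: fq results are cached and skip scoring
    q=*:*&fq=date:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]

    <!-- schema.xml: the stock trie date type; precisionStep trades index
         size for range-query speed, it does not truncate the stored values -->
    <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
               precisionStep="6" positionIncrementGap="0"/>

One clarification on precisionStep: it controls how many extra terms are indexed
per value to speed up range queries; dropping seconds or milliseconds would
instead be done by rounding the values before indexing.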


Re: Use SolrCloud (SOLR-1873) on trunk, or with 1.4.1?

2010-11-01 Thread Jeremy Hinegardner
I took a swag at applying SOLR-1873 to branch_3x.  It mostly applied; most
of the rest of the issues were Zookeeper integrations, and those
applied cleanly by hand.

There were also a few constants and such that needed to be pulled in from trunk.

At the moment, it passes all the tests.  I have not actually used it yet,
and probably won't for a few weeks, but if someone else wants to try it out:

http://github.com/collectiveintellect/lucene-solr/tree/branch_3x-cloud

Have at it.

enjoy,

-jeremy

On Thu, Oct 28, 2010 at 11:21:12PM +0200, Jan Høydahl / Cominvent wrote:
 Hi,
 
 I would aim for reindexing on branch3_x, which will be the 3.1 release soon. 
 I don't know if SOLR-1873 applies cleanly to 3_x now, but it would surely be 
 less effort to have it apply to 3_x than to 1.4. Perhaps you can help 
 backport the patch to 3_x?
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 On 28. okt. 2010, at 03.04, Jeremy Hinegardner wrote:
 
  Hi all,
  
  I see that as of r1022188 Solr Cloud has been committed to trunk.
  
  I was wondering about the stability of Solr Cloud on trunk.  We are
  planning to do a major reindexing soon (within 30 days), several billion 
  docs,
  and would like to switch to a Solr Cloud based infrastructure. 
  
  We are wondering should use trunk as it is now that SOLR-1873 is applied, or
  should we take SOLR-1873 and apply it to Solr 1.4.1.
  
  Has anyone used 1.4.1 + SOLR-1873?  In production?
  
  Thanks,
  
  -jeremy
  
  -- 
  
  Jeremy Hinegardner  jer...@hinegardner.org 
  
 

-- 

 Jeremy Hinegardner  jer...@hinegardner.org 



Re: How does DIH multithreading work?

2010-11-01 Thread Lance Norskog
It is useful for parsing PDFs on a multi-processor machine, or when a
sub-entity does an outbound I/O call to a database, a file, or another
SOLR (SOLR-1499).

Anything where the pipeline time outweighs disk i/o time.

Threading happens on a per-document level- there is no concurrent
access inside a document pipeline.

There is a bug which causes EntityProcessors that look up attributes to
throw an exception. This makes Tika unusable inside a thread. Two other
EPs also won't work, but I did not test them.

https://issues.apache.org/jira/browse/SOLR-2186

On Mon, Nov 1, 2010 at 10:43 AM, Dyer, James james.d...@ingrambook.com wrote:
 Mark,

 I have the same question so I did a little research on this.  Not a complete 
 answer but here is what I've found:

 - threads was added with SOLR-1352 
 (https://issues.apache.org/jira/browse/SOLR-1352).

 - Also see 
 http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler
  for background info.

 - Only available in 3.x and trunk.  Committed on 1/12/2010 by Noble Paul (who 
 surely can tell you more accurate info than I can).

 - Seems like, when using it, each thread will call nextRow on your root entity 
 datasource in parallel.

 - Not sure this will help with child entities (i.e., I had hoped I could get it 
 to build child caches in parallel, but I don't think this is the case).

 - A doc comment on ThreadedEntityProcessorWrapper indicates this will help 
 speed up running transformers because they'd run in parallel.  This would 
 make sense if your database can only pull rows back so fast but you then 
 have an intensive transformer; adding a thread would make your 
 processing no slower than the db...

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: markwaddle [mailto:m...@markwaddle.com]
 Sent: Tuesday, October 26, 2010 2:25 PM
 To: solr-user@lucene.apache.org
 Subject: How does DIH multithreading work?


 I understand that the thread count is specified on root entities only. Does
 it spawn multiple threads per root entity? Or multiple threads per
 descendant entity? Can someone give an example of how you would make a
 database query in an entity with 4 threads that would select 1 row per
 thread?

 Thanks,
 Mark
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-does-DIH-multithreading-work-tp1776111p1776111.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com
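
For reference, a minimal data-config.xml sketch of the threads attribute added
by SOLR-1352 (the driver, query, and field names are made up for illustration):

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"/>
      <document>
        <!-- threads is honored on the root entity: each thread pulls rows
             and runs the document pipeline in parallel -->
        <entity name="item" threads="4" query="SELECT id, title FROM item">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
        </entity>
      </document>
    </dataConfig>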


Re: solr stuck in writing to inexisting sockets

2010-11-01 Thread Lance Norskog
 Besides, I don't know how you'd stop Solr processing a query mid-way
 through,
 I don't know of any way to make that happen.
The timeAllowed parameter causes a timeout in the Solr server to kill
the searching thread. They use that now.

But, yes, Erick is right- there is a fundamental problem you should
solve. Since they are all stuck in returning XML results, there is
something wrong in reading back results.

It is possible that there is a bug in timeAllowed, where the
kill-this-thread timeout hits while the results are being returned and the
handler for that case does not work correctly. It would be great
if someone wrote a unit test for this (not me) and posted it.
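
For anyone following along, timeAllowed is a plain request parameter in
milliseconds; a minimal sketch (host, core, and query are placeholders):

    http://localhost:8983/solr/select?q=foo+AND+bar&timeAllowed=300000

As noted above, it appears to bound only the search phase itself, which would
explain threads still blocking while the XML response is written to a dead
socket.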

On Mon, Nov 1, 2010 at 8:44 AM, Erick Erickson erickerick...@gmail.com wrote:
 I'm going to nudge you in the direction of understanding why the queries
 take so long in the first place rather than going toward the blunt approach
 of cutting them off after some time. The fact that you don't control the
 queries submitted doesn't prevent you from trying to understand what
 is taking so long.

 The first thing I'd look for is whether the system is memory starved. What
 JVM are you using and what memory parameters are you giving it? What
 version of Solr are you using? Have you tried any performance monitoring
 to determine what is happening?

 The reason I'm pushing in this direction is that 5 minute searches are
 pathological. Once you're up in that range, virtually any fix you come up
 with will simply mask the underlying problems, and you'll be forever
 chasing the next manifestation of the underlying problem.

 Besides, I don't know how you'd stop Solr processing a query mid-way
 through,
 I don't know of any way to make that happen.

 Best
 Erick

 On Mon, Nov 1, 2010 at 9:30 AM, Roxana Angheluta anghelu...@yahoo.comwrote:

 Hi,

 Yes, sometimes it takes 5 minutes for a query. I agree this is not
 desirable. However, if the application has no control over the input queries
 other than closing the socket after a while, solr should not continue
 writing the response, but terminate the thread.

 In general, is there a way to quantify the complexity of a given query on a
 certain index? Some general guidelines which can be used by non-technical
 people?

 Thanks a lot,
 roxana

 --- On Sun, 10/31/10, Erick Erickson erickerick...@gmail.com wrote:

  From: Erick Erickson erickerick...@gmail.com
  Subject: Re: solr stuck in writing to inexisting sockets
  To: solr-user@lucene.apache.org
  Date: Sunday, October 31, 2010, 2:29 AM
  Are you saying that your Solr server
  is at times taking 5 minutes to
  complete? If so,
  I'd get to the bottom of that first off. My first guess
  would be you're
  either hitting
  memory issues and swapping horribly or..well, that would be
  my first guess.
 
  Best
  Erick
 
  On Thu, Oct 28, 2010 at 5:23 AM, Roxana Angheluta anghelu...@yahoo.com
 wrote:
 
   Hi all,
  
   We are using Solr over Jetty with a large index,
  sharded and distributed
   over multiple machines. Our queries are quite long,
  involving boolean and
   proximity operators. We cut the connection at the
  client side after 5
   minutes. Also, we are using parameter timeAllowed to
  stop executing it on
   the server after a while.
   We quite often run into situations when solr blocks.
  The load on the
   server increases and a thread dump on the solr process
  shows many threads
   like below:
  
  
   btpool0-49 prio=10 tid=0x7f73afe1d000 nid=0x3581
  runnable
   [0x451a]
     java.lang.Thread.State: RUNNABLE
          at
  java.io.PrintWriter.write(PrintWriter.java:362)
          at
  org.apache.solr.common.util.XML.escape(XML.java:206)
          at
  org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
          at
  org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:832)
          at
  org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:684)
          at
  org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:564)
          at
  org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:435)
          at
  org.apache.solr.request.XMLWriter$2.writeDocs(XMLWriter.java:514)
          at
  
  org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:485)
          at
  
 
 org.apache.solr.request.XMLWriter.writeSolrDocumentList(XMLWriter.java:494)
          at
  org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:588)
          at
  
  org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
          at
  
 
 org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
          at
  
 
 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
          at
  
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
          at
  
 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
          at
  
  

Re: Design and Usage Questions

2010-11-01 Thread Lance Norskog
Yes, you can write your own app to read the file with SVNkit and post
it to the ExtractingRequestHandler. This would be easiest.

On Mon, Nov 1, 2010 at 5:49 AM, getagrip getag...@web.de wrote:
 Ok, so if I did NOT use Solr_J I could PUSH a Stream to Solr somehow?
 I do not depend on Solr_J, any connection-method would suffice.

 On 11/01/2010 03:23 AM, Lance Norskog wrote:

 2.
 The SolrJ library handling of content streams is pull, not push.
 That is, you give it a reader and it pulls content when it feels like
 it. If your software to feed the connection wants to write the data,
 you have to either buffer the whole thing or do a dual-thread
 writer/reader pair.

 The easiest way to pull stuff from SVN is to use one of the web server
 apps. Solr takes a stream.url parameter. (Also stream.file.) Note
 that there is no outbound authentication supported; your web server
 has to be open (at least to the Solr instance).


 On Sun, Oct 31, 2010 at 4:06 PM, getagripgetag...@web.de  wrote:

 Hi,

 I've got some basic usage / design questions.

 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
   instance for all requests to avoid connection leaks.
   So if I create a Singleton instance upon application-startup I can
   securely use this instance for ALL queries/updates throughout my
   application without running into performance issues?

 2. My System's documents are stored in a Subversion repository.
   For fast searchresults I want to periodically index new documents
   from the repository.

   What I get from the repository is a ByteArrayOutputStream. How can I
   pass this Stream to Solr?

   I only see possibilities to pass Files but in my case it does not
   make sense to write the ByteArrayOutputStream to disk again as this
   would cause performance issues apart from making no sense anyway.

 3. Are there any disadvantages using Solrj over some other HTTP based
   solution e.g. creating  sending my own HTTP requests? Do I even
   have to use HTTP?
   I see the EmbeddedSolrServer exists. Any drawbacks using that?

 Any hints are welcome, Thanks!








-- 
Lance Norskog
goks...@gmail.com
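
If the stream really must be pushed from memory, a SolrJ sketch along these
lines may work, assuming your SolrJ version exposes
ContentStreamUpdateRequest.addContentStream and the stock /update/extract
handler is enabled (the class name and literal.id parameter are illustrative):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class SvnDocIndexer {
      // Wrap the in-memory buffer as a ContentStream and post it to
      // the ExtractingRequestHandler, then commit.
      public static void index(SolrServer server, final ByteArrayOutputStream baos,
                               String id) throws Exception {
        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addContentStream(new ContentStreamBase() {
          @Override
          public InputStream getStream() {
            return new ByteArrayInputStream(baos.toByteArray());
          }
        });
        req.setParam("literal.id", id);  // unique key for the new document
        req.setParam("commit", "true");  // commit as part of this request
        server.request(req);
      }
    }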


Re: Design and Usage Questions

2010-11-01 Thread Xin Li
If you just want a quick way to query the Solr server, the Perl module
WebService::Solr is pretty good.


On Mon, Nov 1, 2010 at 4:56 PM, Lance Norskog goks...@gmail.com wrote:

 Yes, you can write your own app to read the file with SVNkit and post
 it to the ExtractingRequestHandler. This would be easiest.

 On Mon, Nov 1, 2010 at 5:49 AM, getagrip getag...@web.de wrote:
  Ok, so if I did NOT use Solr_J I could PUSH a Stream to Solr somehow?
  I do not depend on Solr_J, any connection-method would suffice.
 
  On 11/01/2010 03:23 AM, Lance Norskog wrote:
 
  2.
  The SolrJ library handling of content streams is pull, not push.
  That is, you give it a reader and it pulls content when it feels like
  it. If your software to feed the connection wants to write the data,
  you have to either buffer the whole thing or do a dual-thread
  writer/reader pair.
 
  The easiest way to pull stuff from SVN is to use one of the web server
  apps. Solr takes a stream.url parameter. (Also stream.file.) Note
  that there is no outbound authentication supported; your web server
  has to be open (at least to the Solr instance).
 
 
  On Sun, Oct 31, 2010 at 4:06 PM, getagripgetag...@web.de  wrote:
 
  Hi,
 
  I've got some basic usage / design questions.
 
  1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
instance for all requests to avoid connection leaks.
So if I create a Singleton instance upon application-startup I can
securely use this instance for ALL queries/updates throughout my
application without running into performance issues?
 
  2. My System's documents are stored in a Subversion repository.
For fast searchresults I want to periodically index new documents
from the repository.
 
What I get from the repository is a ByteArrayOutputStream. How can I
pass this Stream to Solr?
 
I only see possibilities to pass Files but in my case it does not
make sense to write the ByteArrayOutputStream to disk again as this
would cause performance issues apart from making no sense anyway.
 
  3. Are there any disadvantages using Solrj over some other HTTP based
solution e.g. creating  sending my own HTTP requests? Do I even
have to use HTTP?
I see the EmbeddedSolrServer exists. Any drawbacks using that?
 
  Any hints are welcome, Thanks!
 
 
 
 
 



 --
 Lance Norskog
 goks...@gmail.com



Re: is my search fast ?! date search i need some feedback :D

2010-11-01 Thread Erick Erickson
Careful here. First searches are known to be slow, various caches
are filled up the first time they are used etc. So even though you're
measuring the second query, it's still perhaps filling caches.

And what are you measuring? The raw search time or the entire response
time? These can be quite different. Try running with debugQuery=on and one
of the things you'll get back is the search time (not including assembling
the response).

You're right, though, 9 seconds is far too long. If you have a relatively
small
number of currency_ids, think about the enum method (see:
http://wiki.apache.org/solr/SimpleFacetParameters#facet.method)

Also, think about autowarming and firstSearcher queries to prepare your
solr instance for faster responses.

If none of that helps, please post the relevant parts of your schema.xml and
the results of running your query with debugQuery=on, that'll give us a lot
more info to go on.

Best
Erick

On Mon, Nov 1, 2010 at 5:37 AM, stockiii stock.jo...@gmail.com wrote:


 my index is 13M documents and I have not indexed all of my documents yet; the
 index in the production system should be about 30M documents.

 so with my 13M test index I try a search over all documents, with the
 first query: q:[2008-10-27 12:23:00:00 TO 2009-04-29 23:59:00:00]
 then I run the next query, for statistics, grouped by currency_id, and get
 the amounts of these currencies.

 that's my result:
 - EUR Sum: 437.259.518,28 € Found: 3712331
 - CHF Sum: 2.048.147,62 SFr. Found: 10473
 - GBP Sum: 1.221,41 £ Found: 181

 for getting the result solr needs 9 seconds ... I don't think that's really
 fast =(
 what do you think ?


 for a faster search I want to try changing precisionStep="6" to -- for
 dropping
 the milliseconds. what's the value for dropping the seconds as well? we only
 need HH:MM and not HH:MM:SS:MSMS
 and I'll change the date search from q to fq ...

 thx


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/is-my-search-fast-date-search-i-need-some-feedback-D-tp1820821p1820821.html
 Sent from the Solr - User mailing list archive at Nabble.com.
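
A sketch of the two suggestions above, assuming a currency_id field with few
distinct values and a stock solrconfig.xml (the queries are illustrative):

    # enum faceting walks the handful of currency terms rather than
    # iterating over every matching document
    ...&facet=true&facet.field=currency_id&facet.method=enum

    <!-- solrconfig.xml: fill the caches before the first user query -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="fq">date:[2008-10-27T12:23:00Z TO 2009-04-29T23:59:00Z]</str>
        </lst>
      </arr>
    </listener>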



Which is faster -- delete or update?

2010-11-01 Thread Andy
My documents have a down_vote field. Every time a user votes down a document, 
I increment the down_vote field in my database and also re-index the document 
in Solr to reflect the new down_vote value.

During searches, I want to restrict the results to only documents with, say, 
fewer than 3 down_votes. Two ways to implement that:
1) When a user down-votes a document, check to see if total down votes have 
reached 3. If they have, delete the document from the Solr index.
2) When a user down-votes a document, update the document in the Solr index to 
reflect the new down_vote value even if total down votes might have been more 
than 3. During the query, add an fq to restrict results to documents with 
fewer than 3 down votes.

Which approach is better? Is it faster to delete a document from the index or to 
update the document to reflect the new down_vote value?

Thanks.
Andy
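
For approach 2, the restriction is a single cached filter, assuming down_votes
is indexed as a trie or sortable int field (a sketch, not tested):

    fq=down_votes:[* TO 2]   # i.e. fewer than 3 down votes

Since filter queries are cached, the cost is paid once per commit rather than
once per search.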


  

Re: Which is faster -- delete or update?

2010-11-01 Thread Peter Karich
From the user perspective I wouldn't delete it, because the down-vote could 
be a mistake or spam or something, and up-voting can 
resurrect it.
It could also be wise to keep the docs to see which content (from which 
users?) is down-voted, to catch spam accounts.


From the dev perspective you should benchmark it, if really necessary. 
(I guess updating is more expensive, because I think it is a 
delete plus a completely-new-add.)


Regards,
Peter.


My documents have a down_vote field. Every time a user votes down a document, I 
increment the down_vote field in my database and also re-index the document to Solr to 
reflect the new down_vote value.
During searches, I want to restrict the results to only documents with, say, 
fewer than 3 down_votes. Two ways to implement that:
1) When a user down-votes a document, check to see if total down votes have 
reached 3. If they have, delete the document from the Solr index.
2) When a user down-votes a document, update the document in the Solr index to reflect the new 
down_vote value even if total down votes might have been more than 3. During the query, add an 
fq to restrict results to documents with fewer than 3 down votes.
Which approach is better? Is it faster to delete a document from the index or to 
update the document to reflect the new down_vote value?
Thanks.
Andy





Re: Which is faster -- delete or update?

2010-11-01 Thread Erick Erickson
Just deleting a document is faster because all that really happens
is the document is marked as deleted. An update is really
a delete followed by an add of the same document, so by definition
an update will be slower...

But... does it really make a difference? How often do you expect this to
happen? Peter Karich added a note while I was typing this, and he
makes some cogent points.

I'm starting to think that I don't care about "better" unless and until my
users notice (or I have a reasonable expectation that they #will# notice).
I'm far more interested in simpler code that I can maintain than I am
shaving off another 4 milliseconds from the response time. That gives
me more chance to put in cool new features that the user will notice...

Best
Erick

On Mon, Nov 1, 2010 at 5:04 PM, Andy angelf...@yahoo.com wrote:

 My documents have a down_vote field. Every time a user votes down a
 document, I increment the down_vote field in my database and also re-index
 the document to Solr to reflect the new down_vote value.
 During searches, I want to restrict the results to only documents with, say,
 fewer than 3 down_votes. Two ways to implement that:
 1) When a user down-votes a document, check to see if total down votes have
 reached 3. If they have, delete the document from the Solr index.
 2) When a user down-votes a document, update the document in the Solr index to
 reflect the new down_vote value even if total down votes might have been
 more than 3. During the query, add an fq to restrict results to documents with
 fewer than 3 down votes.
 Which approach is better? Is it faster to delete a document from the index or
 to update the document to reflect the new down_vote value?
 Thanks.
 Andy





Re: Which is faster -- delete or update?

2010-11-01 Thread Jonathan Rochkind
The actual time it takes to delete or update the document is unlikely to 
make a difference to you.


What might make a difference to you is the time it takes to actually 
finalize the commit, and the time it takes to re-warm your indexes after 
a commit, and especially the time it takes to run any warming queries 
you have set in newSearcher. Most of these probably won't differ between 
delete or update, but could be a problem either way; one way to find 
out, try it and measure it.


Whether you do a delete or an update, if you're planning on making 
changes to your index more often than, oh, every 10 or 20 minutes, 
you may run into trouble. Solr isn't so good at frequent changes to the 
index like that.  I haven't looked at it myself, but the Solr patches 
that get called "near real-time" seem like they're intended to deal with
this, among other things, and allow frequent commits without killing 
performance or RAM usage.


I am not sure how/if other people are effectively dealing with 
user-generated content that needs to be included in the index for 
filtering and searching against. Would be very curious if anyone has any 
successful strategies to share. Another example would be user-generated 
tagging.


Erick Erickson wrote:

Just deleting a document is faster because all that really happens
is the document is marked as deleted. An update is really
a delete followed by an add of the same document, so by definition
an update will be slower...

But... does it really make a difference? How often to you expect this to
happen? Perter Karich added a note while I was typing this, and he
makes some cogent points.

I'm starting to think that I don't care about better unless and until my
users notice (or I have a reasonable expectation that they #will# notice).
I'm far more interested in simpler code that I can maintain than I am
shaving off another 4 milliseconds from the response time. That gives
me more chance to put in cool new features that the user will notice...

Best
Erick

On Mon, Nov 1, 2010 at 5:04 PM, Andy angelf...@yahoo.com wrote:

  

My documents have a down_vote field. Every time a user votes down a
document, I increment the down_vote field in my database and also re-index
the document to Solr to reflect the new down_vote value.
During searches, I want to restrict the results to only documents with, say,
fewer than 3 down_votes. Two ways to implement that:
1) When a user down-votes a document, check to see if total down votes have
reached 3. If they have, delete the document from the Solr index.
2) When a user down-votes a document, update the document in the Solr index to
reflect the new down_vote value even if total down votes might have been
more than 3. During the query, add an fq to restrict results to documents with
fewer than 3 down votes.
Which approach is better? Is it faster to delete a document from the index or
to update the document to reflect the new down_vote value?
Thanks.
Andy
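
If the index really must change that often, one common mitigation is to let the
server batch commits instead of committing per request; a sketch using the
stock solrconfig.xml autoCommit block (the thresholds are illustrative):

    <!-- solrconfig.xml: commit pending documents at most once a minute,
         or sooner if many documents have queued up -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>   <!-- milliseconds -->
      </autoCommit>
    </updateHandler>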






  


Field boosting in DataImportHandler transformer

2010-11-01 Thread Brad Kellett
It's not looking very promising, but is there something I'm missing to be able 
to apply a field boost from within a transformer in the DataImportHandler? Not 
a boost defined within the schema, but a boost applied to the field from the 
transformer itself.

I know you can do a document boost, but I can't see anything for a field boost.

~bck
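
Field-level boosts do not appear to be exposed there. For completeness, a
sketch of the document-level boost mentioned above, using DIH's $docBoost
special field from a custom Java transformer (the class name and boost value
are illustrative):

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    // Attach on the entity with transformer="com.example.BoostTransformer"
    public class BoostTransformer extends Transformer {
      @Override
      public Object transformRow(Map<String, Object> row, Context context) {
        // $docBoost is a DIH special field: it boosts the whole document,
        // not an individual field
        row.put("$docBoost", 2.0f);
        return row;
      }
    }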



Possible memory leaks with frequent replication

2010-11-01 Thread Simon Wistow
We've been trying to get a setup in which a slave replicates from a 
master every few seconds (ideally every second but currently we have it 
set at every 5s).

Everything seems to work fine until, periodically, the slave just stops 
responding, from what looks like running out of memory:

org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.OutOfMemoryError: Java heap space


(our monitoring seems to confirm this).

Looking around, my suspicion is that new Readers take longer to warm 
than the gap between replications, and thus they just build up until all 
memory is consumed (which, I suppose, isn't really memory 'leaking' per 
se, more just resource consumption).

That said, we've tried turning off caching on the slave and that didn't 
help either so it's possible I'm wrong.

Is there anything we can do about this? I'm reluctant to increase the 
heap space since I suspect that will mean that there's just a longer 
period between failures. Might Zoie help here? Or should we just query 
against the Master?


Thanks,

Simon
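
One setting worth checking first: solrconfig.xml's maxWarmingSearchers caps how
many searchers may warm at once, so overlapping replications fail fast instead
of piling warming readers onto the heap (the stock value is shown):

    <!-- solrconfig.xml: error out on commit if this many searchers
         are already warming, rather than opening yet another one -->
    <maxWarmingSearchers>2</maxWarmingSearchers>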


Re: Re:Re: problem of solr replication's speed

2010-11-01 Thread Lance Norskog
This is the time to replicate and open the new index, right? Opening a
new index can take a lot of time. How many autowarmers and queries are
there in the caches? Opening a new index re-runs all of the queries in
all of the caches.

2010/11/1 kafka0102 kafka0...@163.com:
 I suspected my app has some sleeping op every 1s, so
 I changed ReplicationHandler.PACKET_SZ to 1024 * 1024*10; // 10MB

 and log result is like thus :
 [2010-11-01 
 17:49:29][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3184
 [2010-11-01 
 17:49:32][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3426
 [2010-11-01 
 17:49:36][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3359
 [2010-11-01 
 17:49:39][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3166
 [2010-11-01 
 17:49:42][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3513
 [2010-11-01 
 17:49:46][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3140
 [2010-11-01 
 17:49:50][INFO][pool-6-thread-1][SnapPuller.java(1038)]readFully10485760 cost 
 3471

 That means it's still slow, like before. What's wrong with my env?

 At 2010-11-01 17:30:32,kafka0102 kafka0...@163.com wrote:
 I hacked SnapPuller to log the cost, and the log is like thus:
 [2010-11-01 
 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
 979
 [2010-11-01 
 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
 [2010-11-01 
 17:21:19][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
 [2010-11-01 
 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
 980
 [2010-11-01 
 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 4
 [2010-11-01 
 17:21:20][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 5
 [2010-11-01 
 17:21:21][INFO][pool-6-thread-1][SnapPuller.java(1037)]readFully1048576 cost 
 979


 It's saying it costs about 1000 ms to transfer 1M of data every second packet. I 
 used Jetty as the server and embedded Solr in my app. I'm so confused. What have I 
 done wrong?


 At 2010-11-01 10:12:38,Lance Norskog goks...@gmail.com wrote:

If you are copying from an indexer while you are indexing new content,
this would cause contention for the disk head. Does indexing slow down
during this period?

Lance

2010/10/31 Peter Karich peat...@yahoo.de:
  we have an identical-sized index and it takes ~5 minutes


 It takes about one hour to replicate a 6G index for Solr in my env. But my
 network can transfer files at about 10-20M/s using scp. So Solr's HTTP
 replication is too slow; is that normal or am I doing something wrong?






--
Lance Norskog
goks...@gmail.com






-- 
Lance Norskog
goks...@gmail.com
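
For scale, the numbers above work out to roughly the same rate either way:

    1,048,576 bytes in ~1,000 ms on every other packet  ->  about 2 MB/s
    10,485,760 bytes in ~3,300 ms                       ->  about 3 MB/s

Both are an order of magnitude below the 10-20 MB/s measured with scp, so the
packet size itself does not look like the bottleneck.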


Re: Possible memory leaks with frequent replication

2010-11-01 Thread Lance Norskog
You should query against the indexer. I'm impressed that you got 5s
replication to work reliably.

On Mon, Nov 1, 2010 at 4:27 PM, Simon Wistow si...@thegestalt.org wrote:
 We've been trying to get a setup in which a slave replicates from a
 master every few seconds (ideally every second but currently we have it
 set at every 5s).

 Everything seems to work fine until, periodically, the slave just stops
 responding from what looks like it running out of memory:

 org.apache.catalina.core.StandardWrapperValve invoke
 SEVERE: Servlet.service() for servlet jsp threw exception
 java.lang.OutOfMemoryError: Java heap space


 (our monitoring seems to confirm this).

 Looking around my suspicion is that it takes new Readers longer to warm
 than the gap between replication and thus they just build up until all
 memory is consumed (which, I suppose isn't really memory 'leaking' per
 se, more just resource consumption)

 That said, we've tried turning off caching on the slave and that didn't
 help either so it's possible I'm wrong.

 Is there anything we can do about this? I'm reluctant to increase the
 heap space since I suspect that will mean that there's just a longer
 period between failures. Might Zoie help here? Or should we just query
 against the Master?


 Thanks,

 Simon




-- 
Lance Norskog
goks...@gmail.com


Phrase Query Problem?

2010-11-01 Thread Tod
I have a number of fields I need to do an exact match on.  I've defined 
them as 'string' in my schema.xml.  I've noticed that I get back query 
results that don't have all of the words I'm using to search with.


For example:

q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))start=0indent=truewt=json

Should, with an exact match, return only one entry, but it returns five, 
some of which don't have any of the words I've specified.  I've tried 
this both with and without quotes.


What could I be doing wrong?


Thanks - Tod



Re: Phrase Query Problem?

2010-11-01 Thread Ken Stanley
On Mon, Nov 1, 2010 at 10:26 PM, Tod listac...@gmail.com wrote:

 I have a number of fields I need to do an exact match on.  I've defined
 them as 'string' in my schema.xml.  I've noticed that I get back query
 results that don't have all of the words I'm using to search with.

 For example:


 q=(((mykeywords:Compliance+With+Conduct+Standards)OR(mykeywords:All)OR(mykeywords:ALL)))start=0indent=truewt=json

 Should, with an exact match, return only one entry, but it returns five, some
 of which don't have any of the words I've specified.  I've tried this both
 with and without quotes.

 What could I be doing wrong?


 Thanks - Tod



Tod,

Without knowing your exact field definition, my first guess would be your
first boolean query; because it is not quoted, what SOLR typically does is
to transform that type of query into something like (assuming your default
search field is id): (mykeywords:Compliance id:With id:Conduct id:Standards). If you do
(mykeywords:"Compliance With Conduct Standards") you might see different
(better?) results. Otherwise, append debugQuery=on to your URL and you can
see exactly how SOLR is parsing your query. If none of that helps, what is
your field definition in your schema.xml?

- Ken
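
Concretely, the exact-match form of the first clause would be (same field names
as the original post; shown unencoded and URL-encoded):

    q=(mykeywords:"Compliance With Conduct Standards" OR mykeywords:All OR mykeywords:ALL)

    q=(mykeywords:%22Compliance+With+Conduct+Standards%22+OR+mykeywords:All+OR+mykeywords:ALL)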


RE: Ensuring stable timestamp ordering

2010-11-01 Thread Dennis Gearon
how about a timestamp with a GUID appended on the end of it?


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sun, 10/31/10, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 From: Toke Eskildsen t...@statsbiblioteket.dk
 Subject: RE: Ensuring stable timestamp ordering
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Sunday, October 31, 2010, 12:18 PM
 Dennis Gearon [gear...@sbcglobal.net]
 wrote:
  Even microseconds may not be enough on some really
 good, fast machine.
 
 True, especially since the timer might not provide
 microsecond granularity although the returned value is in
 microseconds. However, an unique timestamp generator should
 keep track of the previous timestamp to guard against
 duplicates. Uniqueness can thus be guaranteed by waiting a
 bit or cheating on the decimals. With microseconds one can
 produce 1 million timestamps / second. While I agree that
 duplicates within microseconds can occur on a fast machine,
 guaranteeing uniqueness by waiting should only be a
 performance problem when the number of duplicates is high.
 That's still a few years off, I think.
 
 As Michael pointed out, using normal timestamps as unique
 IDs might not be such a great idea as it effectively locks
 index-building to a single JVM. By going the ugly route and
 expressing the time in nanos with only microsecond
 granularity and use the last 3 decimals for a builder ID
 this could be fixed. Not very clean though, as the contract
 is not expressed in the data themselves but must
 nevertheless be obeyed by all builders to avoid collisions.
 It also raises the question of who should assign the builder
 IDs. Not trivial in an anarchistic setup where new builders
 can be added by different controllers.
 
 Pragmatists might use the PID % 1000 or similar for the
 builder ID as it does not require coordination, but this is
 where the Birthday Paradox hits us again: The chance of two
 processes on different machines having the same PID is 10%
 if just 15 machines are used (1% for 5 machines, 50% for 37
 machines). I don't like those odds and that's assuming that
 the PIDs will be randomly distributed, which they won't. It
 could be lowered by reserving more decimals for the salt,
 but then we would decrease the maximum amount of timestamps
 / second, still without guaranteed uniqueness. Guys a lot
 smarter than me has spend time on the unique ID problem and
 it's clearly not easy: Java's UUID takes up 128 bits.
 
 - Toke
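
As a concrete rendering of the keep-the-previous-timestamp idea above, a
minimal single-JVM sketch (multi-JVM coordination, the hard part discussed
above, is deliberately out of scope):

    // Unique, monotonically increasing "microsecond" timestamps.
    // The clock only has millisecond granularity, so collisions are
    // resolved by cheating on the low decimals, as described above.
    public final class UniqueTimestamp {
      private static long last = 0L;

      public static synchronized long nextMicros() {
        long now = System.currentTimeMillis() * 1000L; // millis -> "micros"
        if (now <= last) {
          now = last + 1; // bump into the unused microsecond decimals
        }
        last = now;
        return now;
      }
    }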


Default file locking on trunk

2010-11-01 Thread Lance Norskog
Scenario:

Git update to current trunk (Nov 1, 2010).
Build all
Run solr in trunk/solr/example with 'java -jar start.jar'
Hit ^C
Jetty reports doing shutdown hook

There is now a data/index with a write lock file in it. I have not
attempted to read the index, let alone add something to it.
I start solr again, and it cannot open the index because of the write lock.

Why is there a write lock file when I have not tried to index anything?

-- 
Lance Norskog
goks...@gmail.com
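
Until the root cause is found, the stock solrconfig.xml has a blunt workaround
for stale lock files (use with care: it assumes no other writer is alive):

    <!-- solrconfig.xml, inside <mainIndex>: remove a leftover write lock
         on startup instead of refusing to open the index -->
    <unlockOnStartup>true</unlockOnStartup>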