Personalized Search

2010-05-20 Thread Rih
Has anybody done personalized search with Solr? I'm thinking of including
fields such as "bought" or "like" per member/visitor via dynamic fields in a
product search schema. Another option is to have a multi-valued field that
can contain user IDs. What are the possible performance issues with this
setup?
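
For concreteness, the two options might look like this in schema.xml (field
names are just illustrative):

<!-- Option 1: a dynamic field per member, e.g. bought_123 / like_123 -->
<dynamicField name="bought_*" type="boolean" indexed="true" stored="false"/>
<dynamicField name="like_*" type="boolean" indexed="true" stored="false"/>

<!-- Option 2: multi-valued fields holding the IDs of members who bought/liked -->
<field name="boughtBy" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="likedBy" type="string" indexed="true" stored="false" multiValued="true"/>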

Looking forward to your ideas.

Rih


RE: how to achieve filters

2010-05-20 Thread Doddamani, Prakash
Hi All 

I am getting an error in Solr:

Error loading class 'Solr.TrieField'

I have added the following in the types section of the schema file:

<fieldType name="tint" class="solr.TrieField" omitNorms="true"/>

And in the custom fields of the schema I have added:

<field name="bitrate" type="tint" indexed="true" stored="true"/>

I am using Solr version 1.3. Can't I handle filters (my example: bitrate) with sint?
  

Thanks
Prakash



Ahmet wrote:

 Yep content is string, and bitrate is int.

bitrate should be a trie-based tint, not int, for range queries to work correctly.
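
On Solr 1.4 that type is declared like this in the example schema:

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>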

 I am digging more now Can we combine both the scenarios.
 
 q=rock&fq={!field f=content}mp3
 q=rock&fq:bitrate:[* TO 128]
 
 Say if I want only mp3 from 0 to 128

You can append as many filter queries (fq) as you want.

q=rock&fq={!field f=content}mp3&fq=bitrate:[* TO 128]


 

-Original Message-
From: Doddamani, Prakash [mailto:prakash.doddam...@corp.aol.com] 
Sent: Tuesday, May 18, 2010 9:06 PM
To: solr-user@lucene.apache.org
Subject: RE: how to achieve filters

Hey

q=rock&fq:bitrate:[* TO 128]

bitrate is int.
This also returns docs with bitrate greater than 128. Is there something I am doing
wrong?

Regards
prakash

-Original Message-
From: Doddamani, Prakash [mailto:prakash.doddam...@corp.aol.com]
Sent: Tuesday, May 18, 2010 8:44 PM
To: solr-user@lucene.apache.org
Subject: RE: how to achieve filters

Thanks much Ahmet,

Yep content is string, and bitrate is int.

I am digging more now. Can we combine both scenarios?

q=rock&fq={!field f=content}mp3
q=rock&fq:bitrate:[* TO 128]

Say if I want only mp3 from 0 to 128

Regards
Prakash

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, May 18, 2010 8:24 PM
To: solr-user@lucene.apache.org
Subject: Re: how to achieve filters

 I am using a dismax query to fetch docs from Solr, where I have set 
 some boost on each of the fields.
 
 If I search for the query Rock, I get the following docs with the 
 boost values I have specified:
 
  
 
 <doc>
   <float name="score">19.494072</float>
   <int name="bitrate">120</int>
   <str name="content">mp3</str>
   <str name="genre">Rock</str>
   <str name="id">1</str>
   <str name="name">st name 1</str>
 </doc>
 <doc>
   <float name="score">19.494052</float>
   <int name="bitrate">248</int>
   <str name="content">aac+</str>
   <str name="genre">Rock</str>
   <str name="id">2</str>
   <str name="name">st name 2</str>
 </doc>
 <doc>
   <float name="score">19.494042</float>
   <int name="bitrate">127</int>
   <str name="content">aac+</str>
   <str name="genre">Rock</str>
   <str name="id">3</str>
   <str name="name">st name 3</str>
 </doc>
 <doc>
   <float name="score">19.494032</float>
   <int name="bitrate">256</int>
   <str name="content">mp3</str>
   <str name="genre">Rock</str>
   <str name="id">4</str>
   <str name="name">st name 5</str>
 </doc>
  
 
 I am looking for something like the below. What is the best way to achieve 
 them?

With filter queries (fq).

 1. Query=rock where content=mp3: it should return only the first and 
 last docs, where content=mp3

Assuming that content is string-typed: q=rock&fq={!field f=content}mp3

 2. Query=rock where bitrate<128: it should return only the first and 
 third docs, where bitrate<128

q=rock&fq:bitrate:[* TO 128] (for this, the bitrate field must be of tint type).



  


Re: Personalized Search

2010-05-20 Thread findbestopensource
Hi Rih,

Are you going to include one or two common fields ("bought" or "like") shared
across members/visitors, or a unique field per member/visitor?

If only one or two common fields are included, there will be no impact on
performance. If you want a unique field per user, you need to consider a
multi-valued field instead; otherwise you will certainly hit a wall.

Regards
Aditya
www.findbestopensource.com




On Thu, May 20, 2010 at 12:13 PM, Rih tanrihae...@gmail.com wrote:

 [...]



Re: Personalized Search

2010-05-20 Thread dc tech
Another approach would be to do query-time boosts of 'my' items, under
the assumption that the count is limited:
- keep the Solr index independent of bought/like
- have a db table with user prefs on a per-item basis
- at query time, specify boosts for 'my items'

We are planning to do this in the context of document management, where
documents in 'my (used/favorited) folders' provide a boost factor
to the results.
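
As a sketch, with dismax that could be a boost query built from the user's
item ids (the field name docId and the ids here are made up):

q=report&qt=dismax&bq=docId:(4711^5.0 OR 4812^5.0 OR 4913^5.0)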



On 5/20/10, findbestopensource findbestopensou...@gmail.com wrote:
 [...]



-- 
Sent from my mobile device


RE: how to achieve filters

2010-05-20 Thread Ahmet Arslan

 I am getting an error in Solr:
 Error loading class 'Solr.TrieField'

 I have added the following in the types section of the schema file:

 <fieldType name="tint" class="solr.TrieField" omitNorms="true"/>

 And in the custom fields of the schema I have added:

 <field name="bitrate" type="tint" indexed="true" stored="true"/>

 I am using Solr version 1.3. Can't I handle filters (my example: bitrate) with sint?

Sure you can; if you are using Solr 1.3.0 you need to use the sint type.
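
In a 1.3 schema.xml that is the sortable int type, declared something like:

<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>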





RE: how to achieve filters

2010-05-20 Thread Doddamani, Prakash
Hey Ahmet

I have added:

<field name="bitrate" type="sint" indexed="true" stored="true" default="0"/>

And the request I am passing is:

/solr/select?indent=on&version=2.2&q=rock&fq={!field%20f=content}mp3&fq:bitrate:[* TO 127]&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&explainOther=&hl.fl=

Still I am seeing documents above bitrate 127.

Regards
Prakash

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, May 20, 2010 4:09 PM
To: solr-user@lucene.apache.org
Cc: Doddamani, Prakash
Subject: RE: how to achieve filters


 [...]


  


RE: how to achieve filters

2010-05-20 Thread Ahmet Arslan

  
 And the request I am passing is:
 /solr/select?indent=on&version=2.2&q=rock&fq={!field%20f=content}mp3&fq:bitrate:[* TO 127]&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&explainOther=&hl.fl=

 Still I am seeing documents above bitrate 127

There is a typo: instead of fq: there should be fq=.

fq=bitrate:[* TO 127]


  


RE: how to achieve filters

2010-05-20 Thread Doddamani, Prakash
Oops my bad, 

Thanks much 

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, May 20, 2010 4:31 PM
To: solr-user@lucene.apache.org
Subject: RE: how to achieve filters


  
 [...]


  


Statistics exposed as JSON

2010-05-20 Thread Andreas Jung

Are the Solr 1.4 statistics like #docs, #docsPending etc. exposed in
JSON format?

Andreas


solr caches from external caching system like memcached

2010-05-20 Thread bharath venkatesh
Hi,

Is it possible to use Solr caches such as the query cache, filter cache, and
document cache from an external caching system like memcached? It has several
advantages, such as a centralized caching system and reduced JVM
garbage-collection pause times, since we can assign less memory to the JVM.

Thanks,
Bharath


Re: index merge

2010-05-20 Thread uma m

Hi All,

The problem is resolved. It was purely due to the filesystem: my filesystem was
32-bit, running on a 64-bit OS. I changed to a 64-bit filesystem and all
works as expected.

Uma


Re: Personalized Search

2010-05-20 Thread MitchK

Hi dc,



 - at query time, specify boosts for 'my items'

Do you mean something like a document boost, or do you want to include
something like
OR myItemId:100^100
?

Can you tell us how you would specify document boosts at query time? Or
are you querying something like a boolean field (i.e. isFavorite:true^10) or
a numeric field?

Kind regards
- Mitch


Re: Personalized Search

2010-05-20 Thread Ken Krugler


On May 19, 2010, at 11:43pm, Rih wrote:

[...]


Mitch is right, what you're looking for here is a recommendation engine,
if I understand your question properly.

And yes, Mahout should work, though the Taste recommendation engine it
supports is pretty new. But Sean Owen & Robin Anil have a Mahout in
Action book that's in early release via Manning, and it has lots of
good information about Mahout & recommender systems.

Assuming you have a list of recommendations for a given user, based on
their past behavior and the recommendation engine, then you could use
this to adjust search results. I'm waiting for Hoss to jump in here on
how best to handle that :)


-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Machine utilization while indexing

2010-05-20 Thread Thijs

Hi.

I have a question about how I can get Solr to index quicker than it does
at the moment.

I have to index (and re-index) some 3-5 million documents. These
documents are preprocessed by a Java application that effectively
combines multiple database tables with each other to form the
SolrInputDocument.

What I'm seeing, however, is that the queue of documents that are ready to
be sent to the Solr server exceeds my preset limit, telling me that Solr
somehow can't process the documents fast enough.

(I have created my own queue in front of Solrj.StreamingUpdateSolrServer,
as it would not process the documents fast enough, causing
OutOfMemoryExceptions due to the large number of documents building up
in its queue.)

I have an index that consists 95% of IDs (Long). We don't do any
analysis on the fields that are being indexed. The schema is rather
straightforward.


Most fields look like:

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
with 4GB of RAM, running Linux, Java version 1.6.0_17, Tomcat 6, and Solr
version 1.4.

What I'm seeing is that the network almost never reaches more than 10%
of the 1Gb/s connection, and that the CPU utilization is always below 25%
(1 core is used, not the others).

I don't see heavy disk I/O. Also, while indexing, the memory consumption is:
Free memory: 212.15 MB; Total memory: 509.12 MB; Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but this
slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I
have a feeling that my machine is capable of doing more (using more
CPUs); I just can't figure out how.


Thijs


seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Hey everyone,

I've recently been given a requirement that is giving me some trouble. I need 
to retrieve up to 100 documents, but I can't see a way to do it without making 
100 different queries.

My schema has a multi-valued field like 'listOfIds'. Each document has between 
0 and N of these ids associated with it.

My input is up to 100 of these ids at random, and I need to retrieve the most 
recent document for each id (N Ids as input, N docs returned). I'm currently 
planning on doing a single query for each id, requesting 1 row, and caching the 
result. This could work OK since some of these ids should repeat quite often. 
Of course I would prefer to find a way to do this in Solr, but I'm not sure 
it's capable.
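
(The per-id query would be something like q=listOfIds:42&sort=dateCreated desc&rows=1, 
with a hypothetical dateCreated field marking recency.)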

Any ideas?

Thanks,
-Kallin Nagelberg


RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
How about throwing a BlockingQueue, 
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
 between your document creator and the Solr server? Give it a size of 10,000 or 
something, with one thread trying to feed it, and one thread waiting for it to 
get near full and then draining it. Take the drained results and add them to the 
server (maybe try not using StreamingUpdateSolrServer). Something like that worked 
well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.
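
A minimal sketch of that setup with SolrJ (the class name and batch size are
illustrative, and buildDocument() is a stand-in for your own document assembly):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueuedIndexer {
    public static void main(String[] args) throws Exception {
        final BlockingQueue<SolrInputDocument> queue =
            new ArrayBlockingQueue<SolrInputDocument>(10000);
        final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Feeder thread: blocks on put() whenever the queue is full.
        new Thread(new Runnable() {
            public void run() {
                // for each source record: queue.put(buildDocument(record));
            }
        }).start();

        // Drainer thread: waits for documents, drains a batch, sends one add() call.
        new Thread(new Runnable() {
            public void run() {
                try {
                    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                    while (true) {
                        batch.add(queue.take());    // wait for at least one document
                        queue.drainTo(batch, 999);  // grab up to 999 more without blocking
                        solr.add(batch);            // one HTTP request per batch
                        batch.clear();
                    }
                    // commit once at the end of the run rather than per batch
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}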

-Kallin Nagelberg

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

[...]


Solr Delta Queries

2010-05-20 Thread Vladimir Sutskever
I have an indexed_timestamp field in my index, which lets me know when a
document was indexed:

<field name="indexed_timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

For some reason, when doing delta indexing via DIH, this field is not being
updated.

Are timestamp fields updated during delta updates?


Kind regards,

Vladimir Sutskever
Investment Bank - Technology
JPMorgan Chase, Inc.




RE: Machine utilization while indexing

2010-05-20 Thread Dennis Gearon
It takes that long to do indexing? I'm HOPING to have a site that has low tens 
of millions of documents, up to billions.

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For 
the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the indexing 
over to 5-10 machines during indexing. Combine the index, shut down the EC2 
instances. You could probably get it down to half an hour, without impacting your 
current queries.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

 [...]



Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Tim Gilbert
Hi guys/gals,

 

I am using apache-solr-1.4.0.war deployed to GlassFish v3 on my development 
machine, which is Ubuntu 9.10 64-bit. I am using SolrJ 1.4 with the 
CommonsHttpSolrServer connection to that Solr instance 
(http://localhost:8080/apache-solr-1.4.0) during my development. To simplify 
things, however, I have found that I can duplicate my issue directly from the Solr 
example admin page, so for ease of confirmation I will use the Solr example 
admin page for this example:

I deployed the apache-solr-1.4.0/dist/apache-solr-1.4.0.war file to my 
GlassFish v3 application server. It deploys successfully. I access 
http://localhost:8080/apache-solr-1.4.0/admin/form.jsp and enter into the 
Solr/Lucene Statement textarea this word:

numéro  (Note the é)

When I check the server.log file, I see this:

INFO: [] webapp=/apache-solr-1.4.0 path=/select 
params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=}
 hits=0 status=0 QTime=16 

As well, the output from the admin system shows the same incorrect decoding.

In my SolrJ-using application, I have a test case which queries for numéro 
and succeeds if I use Embedded and fails if I use CommonsHttpSolrServer... I 
don't want to use embedded for a number of reasons, including that it's not 
recommended (http://wiki.apache.org/solr/EmbeddedSolr).

I am sorry if you've dealt with this issue in the past; I've spent a few hours 
googling for solr utf-8 query and glassfishv3 utf-8 uri plus other 
permutations/combinations, but there were seemingly endless amounts of chaff 
and I couldn't find anything useful after scouring it for a few hours. I 
can't decide whether it's a GlassFish issue or not, so I am not sure where to 
direct my energy. Any tips or advice are appreciated!

Thanks in advance,

Tim Gilbert



Re: Machine utilization while indexing

2010-05-20 Thread Thijs
I already have a BlockingQueue in place (that's my custom queue), and 
luckily I'm indexing faster than what you're doing. Currently it takes 
about 2 hours to index the 5M documents I'm talking about. But I still 
feel as if my machine is underutilized.


Thijs


On 20-5-2010 17:16, Nagelberg, Kallin wrote:

[...]




Re: Machine utilization while indexing

2010-05-20 Thread Thijs
Why would I need faster hardware if my current hardware isn't reaching 
its max capacity?

I'm already using different machines for querying and indexing, so while 
indexing, the queries aren't affected. Pulling an optimized snapshot 
isn't even noticeable on the query machines.


Thijs


On 20-5-2010 17:25, Dennis Gearon wrote:

[...]





RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
Well, to be fair, I'm indexing on a modest virtualized machine with only 2 gigs 
of RAM, and a doc size of 5-10k may be substantially larger than what you have. 
They could be substantially smaller too. As another point of reference, my index 
ends up being about 20 gigs with the 5 million docs.

I should also point out I only need to do this once: I'm not constantly 
reindexing everything. My indexed documents rarely change, and when they do we 
have a process that selectively updates those few that need it. Combine that 
with a constant trickle of new documents, and indexing performance isn't much of 
a concern.

You should be able to experiment with a small subset of your documents to 
speedily test new schemas, etc. In my case I selected a representative sample 
and stored them in my project for unit testing.

-Kallin Nagelberg


-Original Message-
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

[...]

RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
Are you sure it's not blocking on indexing I/O? If not, then I guess it must be a 
thread waiting unnecessarily in Solr or your loading program. To get my loader 
running at full speed, I hooked it up to JProfiler's thread views to see where 
the stalls were and optimized from there.

-Kallin Nagelberg

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

[...]



RE: Machine utilization while indexing

2010-05-20 Thread Dennis Gearon
Here is a good article from IBM, with code, on how to do hybrid/cloud computing.

http://www.ibm.com/developerworks/library/x-cloudpt1/


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

 [...]



Re: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Ahmet Arslan
 In my SolrJ-using application, I have a test case which queries for "numéro" 
 and succeeds if I use Embedded and fails if I use CommonsHttpSolrServer... I 
 don't want to use embedded for a number of reasons including that it's not 
 recommended (http://wiki.apache.org/solr/EmbeddedSolr) [...]

I have never used GlassFish, but I am pretty sure it is a GlassFish issue. The 
same thing happens in Tomcat if you don't set URIEncoding=UTF-8:

http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
http://forums.java.net/jive/thread.jspa?threadID=38020
http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding
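
For reference, the Tomcat side of the fix is just an attribute on the HTTP
connector in server.xml (port numbers here are whatever your install uses):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>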
 



 





  

Re: seemingly impossible query

2010-05-20 Thread darren
OK, I think I understand. What's impossible about this?

If you have a single field called id that is multivalued,
then you can retrieve the documents with something like:

id:1 OR id:2 OR id:56 ... id:100

then limit the result to 100 rows.

There's probably a more succinct way to do this, but I'll leave that to
the experts.

If you also only want the documents within a certain time, then you can also
create a time field and use a conjunction, (id:0 ...) AND time:NOW-1H
or something similar to this. Check the query syntax wiki for specifics.

Darren


 [...]




RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Thanks Darren,

The problem with that is that it may not return one document per id, which is 
what I need. I.e., I could give 100 ids in that OR query and retrieve 100 
documents, all containing just one of the ids.

-Kallin Nagelberg

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] 
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

[...]




RE: seemingly impossible query

2010-05-20 Thread darren
I see. Well, now you're asking Solr to ignore its prime directive of
returning hits that match a query. Hehe.

I'm not sure if Solr has a unique attribute for this.

But this sounds, to me, like you will have to filter the results yourself.
At least you'd hit Solr only once before doing so.

Good luck!

[...]




Re: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Abdelhamid ABID
I had the same issue with Tomcat. Further to what Ahmet wrote, I recommend
plugging a filter into your Solr context that forces requests and responses
to be encoded in UTF-8.
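
A minimal sketch of such a filter (the class name is illustrative; map it to
/* in the webapp's web.xml):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ForceUtf8Filter implements Filter {
    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Decode request parameters and encode the response as UTF-8.
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}

Note that setCharacterEncoding on the request only affects POST bodies; GET
query strings are decoded by the connector, so the container-level URIEncoding
setting Ahmet mentioned is still needed for Tomcat.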

On Thu, May 20, 2010 at 5:11 PM, Ahmet Arslan iori...@yahoo.com wrote:

 [...]















-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB


Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Would each id need to return a different doc?

If not, you could probably use FieldCollapsing:
http://wiki.apache.org/solr/FieldCollapsing
i.e.: collapse on listOfIds (see the wiki entry for syntax) and constrain the
field to only return the ids you want, e.g.:
q=listOfIds:10 OR listOfIds:5 ... OR listOfIds:56
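
(With the field-collapsing patch described on that wiki page, the request would
be along the lines of q=listOfIds:(10 OR 5 OR 56)&collapse.field=listOfIds; the
exact parameter names depend on the patch version, so check the wiki entry.)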

Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 [...]



RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Yeah, I need something like:
(id:1 AND maxhits:1) OR (id:2 AND maxhits:1)... something crazy like that.

I'm not sure how I can hit Solr once. If I do try to do them all in one big OR 
query, then I'm probably not going to get a hit for each id. I would need to 
request probably 1000 documents to find all 100, and even then there's no 
guarantee and no way of knowing how deep to go.

-Kallin Nagelberg

-Original Message-
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] 
Sent: Thursday, May 20, 2010 12:27 PM
To: solr-user@lucene.apache.org
Subject: RE: seemingly impossible query

[...]




RE: seemingly impossible query

2010-05-20 Thread darren
The problem here, I think, is that you only want 1 of many _results_ for a
particular ID. How would Solr know which result you want to keep? And
which to throw away?

However...

You can do this in two queries if you want. Have a separate Solr document
whose unique ID equals each listOfIds value as it is indexed (one such
document for each unique id).

On that _id document_, store a field pointing to the ID of the real
document you want, as documents are indexed.

Each time the _id document_ is rewritten with a document id, it
overwrites any prior data for that unique _id document_.

Now, you first query the _id documents_ using the 100 ids you receive.
Each has a reference to a _single_ real document. Then you take the
document field of each of those to write a single query that fetches all the
last-indexed real documents for those ids.

It would work.
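
A rough SolrJ sketch of that two-step lookup (the field names id and docId are
illustrative, not from the original posts, and it assumes at least one
_id document_ matches):

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class TwoStepLookup {
    // Step 1: fetch the "_id documents" for the incoming ids;
    // step 2: fetch the real documents they point at, in a single query.
    public static SolrDocumentList lookup(SolrServer server, List<String> ids)
            throws Exception {
        StringBuilder idQ = new StringBuilder();
        for (String id : ids) {
            if (idQ.length() > 0) idQ.append(" OR ");
            idQ.append("id:").append(id);
        }
        SolrQuery step1 = new SolrQuery(idQ.toString());
        step1.setRows(ids.size());
        SolrDocumentList idDocs = server.query(step1).getResults();

        StringBuilder docQ = new StringBuilder();
        for (SolrDocument d : idDocs) {
            if (docQ.length() > 0) docQ.append(" OR ");
            docQ.append("id:").append(d.getFieldValue("docId"));
        }
        SolrQuery step2 = new SolrQuery(docQ.toString());
        step2.setRows(ids.size());
        return server.query(step2).getResults();
    }
}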

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

 Thanks Darren,

 The problem with that is that it may not return one document per id,
 which
 is what I need.  IE, I could give 100 ids in that OR query and retrieve
 100 documents, all containing just 1 of the IDs.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: seemingly impossible query

 Ok. I think I understand. What's impossible about this?

 If you have a single field named id that is multivalued,
 then you can retrieve the documents with something like:

 id:1 OR id:2 OR id:56 ... id:100

 then add limit 100.

 There's probably a more succinct way to do this, but I'll leave that to
 the experts.

 If you also only want the documents within a certain time, then you also
 create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
 or something similar to this. Check the query syntax wiki for specifics.

 Darren


 Hey everyone,

 I've recently been given a requirement that is giving me some trouble.
 I
 need to retrieve up to 100 documents, but I can't see a way to do it
 without making 100 different queries.

 My schema has a multi-valued field like 'listOfIds'. Each document has
 between 0 and N of these ids associated to them.

 My input is up to 100 of these ids at random, and I need to retrieve
 the
 most recent document for each id (N Ids as input, N docs returned). I'm
 currently planning on doing a single query for each id, requesting 1
 row,
 and caching the result. This could work OK since some of these ids
 should
 repeat quite often. Of course I would prefer to find a way to do this
 in
 Solr, but I'm not sure it's capable.

 Any ideas?

 Thanks,
 -Kallin Nagelberg








Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Hi Kallin,

again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing);
that should do the trick.
Basically: first you constrain the field 'listOfIds' to only match docs
that contain any of the (up to) 100 random ids, as you already know how to do.

Next, in the same query, specify to collapse on the field 'listOfIds',
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified, you are left with 1 matching
doc for each id.

Again, it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement, I think this will suffice.

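An illustrative SolrJ version (field collapsing comes from the SOLR-236 patch,
so this assumes a Solr build with that patch applied; the collapse parameters
are set as plain params because SolrJ has no typed API for them):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("listOfIds:1 OR listOfIds:10 OR listOfIds:24");
        // Collapse on the constrained field so at most one doc per id comes back
        q.set("collapse.field", "listOfIds");
        q.set("collapse.threshold", "1");
        q.set("collapse.type", "normal");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound() + " collapsed hits");
    }
}
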
Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field named id that is multivalued,
  then you can retrieve the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




Solr highlighter and custom queries?

2010-05-20 Thread Daniel Shane
Hi all!

I'm trying to do some simple highlighting, but I cannot seem to figure out how 
to make it work.

I'm using my own QueryParser which generates custom made queries. I would like 
Solr to be able to highlight them.

I've tried many options in the highlighter but cannot get any snippets to show.

However, if I change the QueryParser to the default solr parser it works.

Surely there is a place in the config or in the query parser where I can 
specify how Solr should highlight my custom queries?

I checked a bit in the source code, and in WeightedSpanTermExtractor class, in 
the method extract(Query query, Map terms), there is a huge list of 
instanceof's that check which type of query we are attempting to match.

Is that the only place where the conversion between query -> highlighting 
happens? If so, it looks pretty hard-coded and would not work with any 
queries other than the ones included in Lucene.

I guess there must be a good reason for this, but is there any other way of 
making the highlighter work without having to hard-code all the possible 
queries in a big chain of if / instanceof checks?

If we could somehow reuse the code contained in each query to find possible 
matches, it would avoid having to recode the same logic elsewhere.

But as I said, there must be a good reason for doing it the way it's already 
coded.

Any ideas on how to work this out with the existing code base would be greatly 
appreciated :)

Daniel Shane



RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Thanks, I'm going to take a look at the field collapsing query as it seems like it 
should do the trick!

-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing);
that should do the trick.
Basically: first you constrain the field 'listOfIds' to only match docs
that contain any of the (up to) 100 random ids, as you already know how to do.

Next, in the same query, specify to collapse on the field 'listOfIds',
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified, you are left with 1 matching
doc for each id.

Again, it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement, I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field named id that is multivalued,
  then you can retrieve the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




Debugging - DIH Delta Queries

2010-05-20 Thread Vladimir Sutskever
Hi All,

How can I see all of the queries sent to my DB during a Delta Import?

It seems like my documents are not being updated via delta-import.
When I use Solr's DataImport Handler Console with delta-import selected, I see

<lst name="entity:getall">
  <lst name="document#1"/>
</lst>
<lst name="entity:getall">
  <lst name="document#1"/>
</lst>
<lst name="entity:getall">
  <lst name="document#1"/>
</lst>


But that’s not very helpful - I want to see the exact queries


Thank You



Re: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Chris Hostetter

: I am using apache-solr-1.4.0.war deployed to glassfishv3 on my 
...
: INFO: [] webapp=/apache-solr-1.4.0 path=/select 
 : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=}
: hits=0 status=0 QTime=16
...
: In my SolrJ using application, I have a test case which queries for 
: numéro and succeeds if I use Embedded and fails if I use 
: CommonsHttpSolrServer... I don't want to use embedded for a number of 
...
: I am sorry if you'd dealt with this issue in the past, I've spent a few 
: hours googling for solr utf-8 query and glassfishv3 utf-8 uri plus other 
: permutations/combinations but there were seemingly endless amounts of 
: chaff that I couldn't find anything useful after scouring it for a few 
: hours.  I can't decide whether it's a glassfish issue or not so I am not 
: sure where to direct my energy.  Any tips or advice are appreciated!

I suspect if you switched to using POST instead of GET your problem would 
go away -- this stems from ambiguity in the way HTTP servers/browsers deal 
with encoding UTF8 in URLs.  A quick search for glassfish url encoding 
turns up this thread...

  http://forums.java.net/jive/thread.jspa?threadID=38020

which references...

http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding

...it looks like you want to modify the default-charset attribute of the 
parameter-encoding


-Hoss


Re: Machine utilization while indexing

2010-05-20 Thread Chris Hostetter

I'm really only guessing here, but based on your description of what you 
are doing it sounds like you only have one thread streaming documents to 
solr (via a single StreamingUpdateSolrServer instance which creates a 
single HTTP connection)

Have you at all attempted to have parallel threads in your client initiate 
parallel connections to Solr via multiple instances of 
StreamingUpdateSolrServer objects?


-Hoss



RE: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Tim Gilbert
Chris,

You are the best.  Switching to POST solved the problem.  I hadn't noticed that 
option earlier but after finding: 
https://issues.apache.org/jira/browse/SOLR-612 I found the option in the code.

Thank you, you just made my day.

Secondly, in an effort to narrow down whether this was a glassfish issue or 
not, here is what I found.

Starting with glassfishv3 (I think) UTF-8 is the default for URI.  You can see 
this by going to the admin site, clicking on Network Config | Network Listeners 
| then select the listener.  Select the tab HTTP and about half way down, you 
will see URI Encoding: UTF-8.

HOWEVER, that doesn't appear to be correct because following Abdelhamid Abid's 
advice, I deployed Solr to Tomcat, then followed the directions here:
http://wiki.apache.org/solr/SolrTomcat to force Tomcat to UTF-8 for URIs.  Then 
I deployed Solr to Tomcat and, using CommonsHttpSolrServer, connected to that 
Tomcat-served instance.  It worked - first time.

So, it appears that there is a problem with glassfishv3 and UTF-8 URIs for at 
least the apache-solr-1.4.0.war.  I wonder if, had I added that sun-web.xml file 
into the war to force UTF-8, it might have worked... not sure.  However, the workaround 
is to change the method to POST as Chris suggested.  You can do that in SolrJ 
here:

server.query(solrQuery, METHOD.POST);

and it works as you'd expect.
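
Spelled out a little more fully, a minimal sketch (the core URL and the query 
text are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQueryExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery solrQuery = new SolrQuery("numéro");
        // POST carries the UTF-8 query text in the request body instead of
        // the URI, so the container's URI-encoding settings never apply.
        QueryResponse rsp = server.query(solrQuery, METHOD.POST);
        System.out.println(rsp.getResults().getNumFound() + " hits");
    }
}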

Thanks for the advice/tips,

Tim

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, May 20, 2010 2:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Non-English query via Solr Example Admin corrupts text


: I am using apache-solr-1.4.0.war deployed to glassfishv3 on my 
...
: INFO: [] webapp=/apache-solr-1.4.0 path=/select 
 : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=}
: hits=0 status=0 QTime=16
...
: In my SolrJ using application, I have a test case which queries for 
: numéro and succeeds if I use Embedded and fails if I use 
: CommonsHttpSolrServer... I don't want to use embedded for a number of 
...
: I am sorry if you'd dealt with this issue in the past, I've spent a few 
: hours googling for solr utf-8 query and glassfishv3 utf-8 uri plus other 
: permutations/combinations but there were seemingly endless amounts of 
: chaff that I couldn't find anything useful after scouring it for a few 
: hours.  I can't decide whether it's a glassfish issue or not so I am not 
: sure where to direct my energy.  Any tips or advice are appreciated!

I suspect if you switched to using POST instead of GET your problem would 
 go away -- this stems from ambiguity in the way HTTP servers/browsers deal 
 with encoding UTF8 in URLs.  A quick search for glassfish url encoding 
turns up this thread...

  http://forums.java.net/jive/thread.jspa?threadID=38020

 which references...

http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding

...it looks like you want to modify the default-charset attribute of the 
parameter-encoding


-Hoss


RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
StreamingUpdateSolrServer already has multiple threads and uses multiple 
connections under the covers. At least the API says 'Uses an internal 
MultiThreadedHttpConnectionManager to manage http connections'. The constructor 
allows you to specify the number of threads used: 
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int)

-Kallin Nagelberg

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, May 20, 2010 3:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing


I'm really only guessing here, but based on your description of what you 
are doing it sounds like you only have one thread streaming documents to 
solr (via a single StreamingUpdateSolrServer instance which creates a 
single HTTP connection)

Have you at all attempted to have parallel threads in your client initiate 
parallel connections to Solr via multiple instances of 
StreamingUpdateSolrServer objects?


-Hoss



RE: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Chris Hostetter

: Starting with glassfishv3 (I think) UTF-8 is the default for URI.  You 
: can see this by going to the admin site, clicking on Network Config | 
: Network Listeners | then select the listener.  Select the tab HTTP and 
: about half way down, you will see URI Encoding: UTF-8.
: 
: HOWEVER, that doesn't appear to be correct because following Abdelhamid 
...

I know nothing about glassfish, but according to that forum URL i 
mentioned before, the URI Encoding option in glassfish explicitly (and 
evidently  
contentiously) does not apply to the query args -- only the path, hence 
the two different config options mentioned in the FAQ...


:   http://forums.java.net/jive/thread.jspa?threadID=38020
...
: http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding



-Hoss



Re: Subclassing DIH

2010-05-20 Thread Chris Hostetter

: I am trying to subclass DIH to add I am having a hard time trying to get
: access to the current Solr Context. How is this possible? 

I don't think DIH was particularly designed to be subclassed (i'm surprised 
it's not final) ... it was built with the assumption that people would 
write plugins (transformers, datasources, etc...)

If you elaborate a little bit more on what you hope to achieve by 
subclassing, people can provide more insight into the best way to go about 
it...

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss



RE: Machine utilization while indexing

2010-05-20 Thread Chris Hostetter

: StreamingUpdateSolrServer already has multiple threads and uses multiple 
: connections under the covers. At least the api says ' Uses an internal 

Hmmm... i think one of us misunderstands the point behind 
StreamingUpdateSolrServer and its internal threads/queues.  (it's very 
possible that it's me)

my understanding is that this allows it to manage the batching of multiple 
operations for you, reusing connections as it goes -- so the the 
queueSize is how many individual requests it buffers before sending the 
batch to Solr, and the threadCount controls how many batches it can send 
in parallel (in the event that one thread is still waiting for the 
response when the queue next fills up)

But if you are only using a single thread to feed SolrRequests to a single 
instance of StreamingUpdateSolrServer then there can still be lots of 
opportunities for Solr itself to be idle -- as i said, it's not clear to 
me if you are using multiple threads to write to your 
StreamingUpdateSolrServer ... even if you reuse the same 
StreamingUpdateSolrServer instance, multiple threads in your client code 
may increase the throughput (assuming that at the moment the threads in 
StreamingUpdateSolrServer are largely idle)
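
For what it's worth, a minimal sketch of that multiple-client-threads idea
(the URL, counts, and field name are illustrative; it assumes SolrJ 1.4's
StreamingUpdateSolrServer):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
        // queueSize=100 buffered requests, threadCount=4 internal sender threads
        final StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        // Several client-side feeder threads writing to the one shared instance
        Thread[] feeders = new Thread[4];
        for (int t = 0; t < feeders.length; t++) {
            final int offset = t;
            feeders[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = offset; i < 10000; i += 4) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            feeders[t].start();
        }
        for (Thread f : feeders) f.join();
        server.commit();
    }
}

The idea is simply to keep the internal queue full so the sender threads are
rarely idle.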

But as i said ... this is all mostly a guess.  I'm not intimately 
familiar with SolrJ.


-Hoss



RE: seemingly impossible query

2010-05-20 Thread Nagelberg, Kallin
Yeah this looks perfect. Too bad it's not in 1.4; I guess I can build from 
trunk and patch it. This is probably a stupid question, but is there any feeling 
as to when 1.5 might come out? 

Thanks,
-Kallin Nagelberg

-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing);
that should do the trick.
Basically: first you constrain the field 'listOfIds' to only match docs
that contain any of the (up to) 100 random ids, as you already know how to do.

Next, in the same query, specify to collapse on the field 'listOfIds',
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified, you are left with 1 matching
doc for each id.

Again, it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement, I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field named id that is multivalued,
  then you can retrieve the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




Re: Subclassing DIH

2010-05-20 Thread Blargy

Ok to further explain myself.

Well, first off, I was experiencing a StackOverflowError during my
delta-imports after doing a full-import. The strange thing was, it only
happened sometimes. Thread is here:
http://lucene.472066.n3.nabble.com/StackOverflowError-during-Delta-Import-td811053.html#a824780

I never did find a good solution to that bug; however, I did come up with a
workaround. I noticed that if I removed my deletedPkQuery then the delta-import
would work as expected. Obviously I still need to delete items out
of the index during indexing, so I wanted to subclass the DataImportHandler
to first update all documents and then delete all the documents that my
deletedPkQuery would have deleted.

I can actually accomplish the above behavior using the onImportEnd
EventListener; however, I lose the ability to know how many documents were
actually deleted, since my manual deletion of documents doesn't get picked up in
the data importer's cumulativeStatistics. 

My hope was that I could subclass DIH and massage the cumulativeStatistics
after my manual deletion of documents.

FYI my manual deletion is accomplished by sending a deleteById query to an
instance of CommonsHttpSolrServer that I create from the current context of
the EventListener. Side question: how can I retrieve the number of items actually
removed from the index after a deleteById query?

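A sketch of that manual deletion step (deleteStale and the origin of staleIds
are hypothetical; deleteById returns no per-document count, which is exactly
why the statistics go missing):

import java.util.List;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ManualDeleter {
    // Called from an onImportEnd EventListener once the delta-import finishes.
    // staleIds would come from running the same SQL as the deletedPkQuery.
    public static void deleteStale(String solrUrl, List<String> staleIds)
            throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(solrUrl);
        if (!staleIds.isEmpty()) {
            server.deleteById(staleIds);  // no deletion count comes back
            server.commit();
        }
    }
}

One workaround for the count: query for those ids first and record numFound
before deleting.
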
Thoughts on the process? There just has to be an easier way.


RE: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Tim Gilbert
I wanted to improve the documentation in the solr wiki by adding in my
findings.  However, when I try to log in and create a new account, I
receive this error message:

You are not allowed to do newaccount on this page. Login and try again.

Does anyone know how I can get permission to add a page to the
documentation?

Tim


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, May 20, 2010 3:21 PM
To: solr-user@lucene.apache.org
Subject: RE: Non-English query via Solr Example Admin corrupts text


: Starting with glassfishv3 (I think) UTF-8 is the default for URI.  You

: can see this by going to the admin site, clicking on Network Config | 
: Network Listeners | then select the listener.  Select the tab HTTP
and 
: about half way down, you will see URI Encoding: UTF-8.
: 
: HOWEVER, that doesn't appear to be correct because following
Abdelhamid 
...

I know nothing about glassfish, but according to that forum URL i 
mentioned before, the URI Encoding option in glassfish explicitly (and
evidently  
contentiously) does not apply to the query args -- only the path, hence 
the two different config options mentioned in the FAQ...


:   http://forums.java.net/jive/thread.jspa?threadID=38020
...
:
http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEnco
ding



-Hoss



Re: Solr highlighter and custom queries?

2010-05-20 Thread Daniel Shane
Actually, it's not so much a Solr problem as a Lucene one; as it turns out, the 
WeightedSpanTermExtractor is in Lucene and not Solr.

Why they decided to only highlight queries that are built into Lucene I don't know, but 
what I did to solve this problem was simply to make my queries extend a Lucene 
query class instead of just Query. 

So I decided to extend BooleanQuery, which is the closest fit to what mine 
actually does.

This makes the highlighting do something, even though it's not perfect.

Daniel Shane


Endeca vs Solr?

2010-05-20 Thread kkieser

First of all, I'd like to apologize in advance for being a pretty raw newbie
when it comes to search technologies, so please bear with me!

The situation:
My company has a system that moderates 15 character free form text fields.
We have a dictionary of words in our database that are banned due to various
legal reasons (profanity, copyright issues, etc.). Our system does an initial
check when the user is entering their choices for these fields and
auto-rejects anything that matches, or comes close to matching (based on
phonetics, purposeful misspellings, etc.), anything on our banned list. Once
the order is placed, the system checks again to see if the fields are exact
matches to anything on our auto-approve list (held in the same database as the
previous list) and passes those on through. Items that do not match either
list are moved to a review queue where a customer service rep manually
reviews the items. During the review the CSR can add a word to either list,
which will prevent future orders using the newly added value from needing to
be reviewed. 

My question:
Currently our system simply holds all the words in a hash map in memory, but
we're worried about scalability. I've been asked to try and find out more
about Solr and how it compares to Endeca, which another of our departments
uses but I'm not very familiar with. I've been reading the wiki and other
articles I've found online, but it seems like there's a lot of overlap of
features between Solr and Endeca; the main difference just seems to be cost.
Endeca also seems to have better support for real-time searches, and has a
stricter sorting algorithm. On the other hand, it sounds like Solr
re-indexes quickly enough for my purposes, and its
sorting algorithm can be tweaked to match what I need. Are there any other
technical differences between the two if used in the scenario I described
above? Also, are there any important hardware footprint differences? I'm no
admin, but I believe our system runs on JBoss on a Solaris box last I
checked.

Any help or insight you guys can provide would help greatly. Thanks!


Re: Solr Shard - Strange results

2010-05-20 Thread TonyBray

I know this post is old but did you ever get a resolution to this problem?  I
am running into the exact same issue.  I even switched my id from text to
string and reindexed as that was the last suggestion and still no
resolution.

--Tony


Re: Solr Shard - Strange results

2010-05-20 Thread TonyBray

So are we the only ones who never got sharding working with multi-cores? 
Bummer...  Hopefully someone else will chime in with an answer.

--Tony


Re: Endeca vs Solr?

2010-05-20 Thread David Smiley (@MITRE.org)

Hello kkieser.  
I've used both, and my name may have come up in your searches.  For your
system, I would definitely not use Endeca as it's too complicated for the
relatively simple needs that you have.  You asked if there are technical
differences and of course, being two different systems, the answer is yes --
but both can fit your needs.

I'm not quite convinced that either would be worthwhile for what you
describe over something more home-grown with a database.  I could see you
re-using Lucene's analysis package to tokenize, and then matching each token
against a hashtable.  By the way, Solr is going to use a
hashtable as well, at either index or query time, to handle synonyms.  Your
scenario does not suggest that this list would be so large as to be concerning. 
Of course, if you want other features in Solr like highlighting and faceting
and the other goodies, then it's clearly worthwhile.

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


Re: Non-English query via Solr Example Admin corrupts text

2010-05-20 Thread Dennis Gearon
<rant_by_HTTP_Verb_Nazi>

Using POST totally violates the access model for an entity in the HTTP Verb 
model.

Basically:

GET=READ
POST=CREATE
PUT=MODIFY
DELETE=(drum roll please)DELETE

Granted, the whole web uses POST for modify, but let's not make the situation 
worse by using it for everything.

</rant_by_HTTP_Verb_Nazi>

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Chris Hostetter hossman_luc...@fucit.org wrote:

 From: Chris Hostetter hossman_luc...@fucit.org
 Subject: Re: Non-English query via Solr Example Admin corrupts text
 To: solr-user@lucene.apache.org
 Date: Thursday, May 20, 2010, 11:40 AM
 
 : I am using apache-solr-1.4.0.war deployed to glassfishv3
 on my 
     ...
 : INFO: [] webapp=/apache-solr-1.4.0 path=/select 
 : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=}
 : hits=0 status=0 QTime=16
     ...
 : In my SolrJ using application, I have a test case which
 queries for 
 : numéro and succeeds if I use Embedded and fails if I
 use 
 : CommonsHttpSolrServer... I don't want to use embedded for
 a number of 
     ...
 : I am sorry if you'd dealt with this issue in the past,
 I've spent a few 
 : hours googling for solr utf-8 query and glassfishv3 utf-8
 uri plus other 
 : permutations/combinations but there were seemingly
 endless amounts of 
 : chaff that I couldn't find anything useful after scouring
 it for a few 
 : hours.  I can't decide whether it's a glassfish
 issue or not so I am not 
 : sure where to direct my energy.  Any tips or advice
 are appreciated!
 
 I suspect if you switched to using POST instead of GET your
 problem would 
 go away -- this stems from ambiguity in the way HTTP
 servers/browsers deal 
 with encoding UTF8 in URLs.  A quick search for
 glassfish url encoding 
 turns up this thread...
 
   http://forums.java.net/jive/thread.jspa?threadID=38020
 
 which references...
 
 http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding
 
 ...it looks like you want to modify the default-charset
 attribute of the 
 parameter-encoding
 
 
 -Hoss



Re: Endeca vs Solr?

2010-05-20 Thread kkieser

Thanks for your response David! At the moment we have over 40,000 words on
our banned list, and only recently added the white list, so we anticipate
that this number will jump quite quickly. I've heard Solr can handle up to around 2
million records before slowing down, so I'm not too worried about hitting
that limit. Our database implementation has already started slowing down and
is causing complaints from the CSRs. This system is used on a public-facing
website that gets quite a lot of traffic, which is why we're looking into
swapping from having the full database hashmap in memory to something with
an efficient index that can handle both the high traffic of users creating
designs as well as the CSRs reviewing the ones that aren't auto-approved or
auto-rejected. 


Re: Endeca vs Solr?

2010-05-20 Thread David Smiley (@MITRE.org)

kkieser,

It just occurred to me that Solr might actually fit the bill.  Your scenario
definitely does not present a typical use of Solr, but a novel
use of Solr I am about to describe could totally get you what you want.

A Solr index is composed of documents which are typically similar to a
user document or database record or something like that.  But in your case,
the document would be one word that's either one of your good words or bad
words.  You could have a boolean indicating which type, and you could index
it several ways, including phonetically.  When you want to compare a piece of
text to see if it matches any words, you use Solr's More-Like-This feature,
configured appropriately, to tell you which indexed documents (e.g. naughty
words) it matches.  You could even facet on the naughty boolean to know how
many of each.  What I described is definitely not a task for Endeca.

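A minimal sketch of indexing one word per document along these lines (the
field names word and banned are assumptions; the phonetic indexing itself
would be handled in the schema, e.g. via a copyField to a phonetic field type):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class WordIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        String[] banned = {"badword1", "badword2"};
        for (String word : banned) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", word);      // one document per word
            doc.addField("word", word);    // analyzed/phonetic copies via schema
            doc.addField("banned", true);  // the boolean to facet on
            server.add(doc);
        }
        server.commit();
    }
}
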
~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


Re: jmx issue with solr

2010-05-20 Thread Lance Norskog
http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX

Ask the wiki!

On Wed, May 19, 2010 at 6:19 AM, Na_D nabam...@zaloni.com wrote:

 Thanks for the info, using the above properties solved the issue.




-- 
Lance Norskog
goks...@gmail.com


How real-time are Solr/Lucene queries?

2010-05-20 Thread Thomas J. Buhr
Hello Solr,

Solr looks like an excellent API and it's nice to have a tutorial that makes it 
easy to discover the basics of what Solr does, I'm impressed. I can see plenty 
of potential uses of Solr/Lucene and I'm interested now in just how real-time 
the queries made to an index can be?

For example, in my application I have time ordered data being processed by a 
paint method in real-time. Each piece of data is identified and its associated 
renderer is invoked. The Java2D renderer would then lookup any layout and style 
values it requires to render the current data it has received from the layout 
and style indexes. What I'm wondering is if this lookup which would be a Lucene 
search will be fast enough?

Would it be best to make Lucene queries for the relevant layout and style 
values required by the renderers ahead of rendering time and have the query 
results placed into the most performant collection (map/array) so renderer 
lookup would be as fast as possible? Or can Lucene handle many individual 
lookup queries fast enough so rendering is quick?
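
To illustrate the query-ahead idea, a minimal SolrJ sketch (the core URL, the
id field, and the row count are assumptions):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;

public class StyleCache {
    // Query once, ahead of rendering, and keep the results in a plain map
    // so the paint loop never touches the index.
    public static Map<String, SolrDocument> preload(SolrServer server)
            throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);  // enough to cover the layout/style documents
        Map<String, SolrDocument> byId = new HashMap<String, SolrDocument>();
        for (SolrDocument doc : server.query(q).getResults()) {
            byId.put((String) doc.getFieldValue("id"), doc);
        }
        return byId;
    }
}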

Best regards from Canada,

Thom




Special Circumstances for embedded Solr

2010-05-20 Thread Ken Krugler

Hi all,

We'd started using embedded Solr back in 2007, via a patched version  
of the in-progress 1.3 code base.


I recently was reading http://wiki.apache.org/solr/EmbeddedSolr, and  
wondered about the paragraph that said:
"The simplest, safest, way to use Solr is via Solr's standard HTTP  
interfaces. Embedding Solr is less flexible, harder to support, not  
as well tested, and should be reserved for special circumstances."


Given the current state of SolrJ, and the expected roadmap for Solr in  
general, what would be some guidelines for "special circumstances"  
that warrant the use of SolrJ?


I know what ours were back in 2007 - namely:

- we had multiple indexes, but didn't want to run multiple webapps  
(now handled by multi-core)
- we needed efficient generation of updated indexes, without  
generating lots of HTTP traffic (now handled by DIH, maybe with  
specific extensions?)
- we wanted tighter coupling of the front-end API with the back-end  
Solr search system, since this was an integrated system in the hands  
of customers - no "just restart the webapp container" option if  
anything got wedged (might still be an issue?)


Any other commonly compelling reasons to use SolrJ?

Thanks,

-- Ken



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: How real-time are Solr/Lucene queries?

2010-05-20 Thread Walter Underwood
Solr is a very good engine, but it is not real-time. You can turn off the 
caches and reduce the delays, but it is fundamentally not real-time.

I work at MarkLogic, and we have a real-time transactional search engine (and 
repository). If you are curious, contact me directly.

I do like Solr for lots of applications -- I chose it when I was at Netflix.

wunder

On May 20, 2010, at 7:22 PM, Thomas J. Buhr wrote:

 Hello Solr,
 
 Solr looks like an excellent API and it's nice to have a tutorial that makes 
 it easy to discover the basics of what Solr does, I'm impressed. I can see 
 plenty of potential uses of Solr/Lucene and I'm interested now in just how 
 real-time the queries made to an index can be?
 
 For example, in my application I have time ordered data being processed by a 
 paint method in real-time. Each piece of data is identified and its 
 associated renderer is invoked. The Java2D renderer would then lookup any 
 layout and style values it requires to render the current data it has 
 received from the layout and style indexes. What I'm wondering is whether this 
 lookup, which would be a Lucene search, will be fast enough?
 
 Would it be best to make Lucene queries for the relevant layout and style 
 values required by the renderers ahead of rendering time and have the query 
 results placed into the most performant collection (map/array) so renderer 
 lookup would be as fast as possible? Or can Lucene handle many individual 
 lookup queries fast enough so rendering is quick?
 
 Best regards from Canada,
 
 Thom