Optional filter queries

2012-01-03 Thread Allistair Crossley
Evening all,

A subset of my documents have a field, filterMinutes, which stores a number; 
other documents do not have this field at all.

I often issue a query that contains a filter query range, e.g.

fq=filterMinutes:[* TO 50]

I am finding that adding this filter excludes all documents that lack the 
field. What I want is for the filter query to act upon those documents that do 
have the field, while still returning the documents that don't have it at all.

Is this a possibility?
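From what I've read, the usual idiom may be to OR the range with a 
pure-negative clause that matches documents having no value in the field at 
all, e.g.

fq=filterMinutes:[* TO 50] OR (*:* -filterMinutes:[* TO *])

(the *:* is needed because a purely negative clause cannot match on its own) - 
but I'd welcome confirmation that this is the way to go.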

Best,

Allistair

Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Hi,

I am seeing an issue I do not understand and hope that someone can shed some 
light on it. For a particular search we are seeing a particular result rank in 
position 3 on one machine and position 8 on the production machine. Position 3 
is our desired and roughly expected ranking.

I have Solr on a local machine and a version deployed on a production server. 
My local machine's Solr and the production version are both checked out from 
our project's SVN trunk. They are identical files except for the data files 
(not in SVN) and database connection settings.

The index is populated exclusively via data import handler queries to a 
database. 

I have exported the production database as-is to my local development machine 
so that my local machine and production have access to the very same data.

I execute a total full-import on both.

Still, I see a different position for this document that should surely rank in 
the same location, all else being equal.

I diffed the debugQuery output to see how the scores were being computed. See 
the appendix at the foot of this email.

As far as I can tell every single query normalisation block of the debug is 
marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

This leads to a final score of 2.286596 (local) versus 1.0651637 (production), 
which is enough to skew the results from correct to incorrect (in terms of 
what we expect to see).

I cannot explain this difference. The database is the same. The configuration 
is the same. I have fully imported from scratch on both servers. What am I 
missing?

Thank you for your time

Allistair

- snip

APPENDIX - debugQuery=on DIFF 

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-<str name="L12411p">
+<str name="L12411">
 
-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-1.3198489 = (MATCH) max plus 0.01 times others of:
-  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-0.011795795 = queryWeight(text:dubai^0.1), product of:
-  0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+0.6151879 = (MATCH) max plus 0.01 times others of:
+  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+0.05489459 = queryWeight(text:dubai), product of:
   5.520305 = idf(docFreq=65, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
   1.4142135 = tf(termFreq(text:dubai)=2)
   5.520305 = idf(docFreq=65, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
-  1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
-0.32609802 = queryWeight(profile:dubai^2.0), product of:
+  0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
+0.15175761 = queryWeight(profile:dubai^2.0), product of:
   2.0 = boost
   7.6305184 = idf(docFreq=7, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
   1.4142135 = tf(termFreq(profile:dubai)=2)
   7.6305184 = idf(docFreq=7, maxDocs=6063)
   0.375 = fieldNorm(field=profile, doc=1551)
-0.36931866 = (MATCH) max plus 0.01 times others of:
-  0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
-0.003954251 = queryWeight(text:product^0.1), product of:
-  0.1 = boost
+0.17194802 = (MATCH) max plus 0.01 times others of:
+  0.00851347 = (MATCH) weight(text:product in 1551), product of:
+0.018402064 = queryWeight(text:product), product of:
   1.8505468 = idf(docFreq=2589, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of:
   1.0 = tf(termFreq(text:product)=1)
   1.8505468 = idf(docFreq=2589, maxDocs=6063)
   0.25 = fieldNorm(field=text, doc=1551)
-  0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of:
-0.1725098 = queryWeight(profile:product^2.0), product of:
+  0.17186289 = (MATCH) weight(profile:product^2.0 in 1551), product of:
+0.08028162 = queryWeight(profile:product^2.0), product of:
   2.0 = boost
   4.036637 = idf(docFreq=290, maxDocs=6063)
-  0.021368012 = queryNorm
+  0.009944122 = queryNorm
 2.14075 = (MATCH) fieldWeight(profile:product in 1551), product of:
   1.4142135 = tf(termFreq(profile:product)=2)
   4.036637 = idf(docFreq=290, maxDocs=6063)
   0.375 = fieldNorm(field=profile, doc=1551)
-  0.59742856 = (MATCH) max plus 0.01 times others of:
-0.59742856 = weight(profile:dubai product~10^0.5 in 1551), product of:
-  0.12465195 = queryWeight(profile:dubai product~10^0.5), product of:
+  

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Thanks. Good to know, but even so my problem remains - the end score should not 
be different, and it is causing a dramatically different ranking of a document 
(3 versus 8 is dramatic for my client). This must be down to the scoring debug 
differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:

 queryNorm is just a normalizing factor and is the same value across
 all the results for a query, to just make the scores comparable.
 So even if it varies in different environments, you should not be worried about it.
 
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 -
 Definition - queryNorm(q) is just a normalizing factor used to make
 scores between queries comparable. This factor does not affect
 document ranking (since all ranked documents are multiplied by the
 same factor), but rather just attempts to make scores from different
 queries (or even different indexes) comparable.
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 - snip

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
That's what I think, glad I am not going mad.

I've spent half a day comparing the config files, checking out from SVN again 
and ensuring the databases are identical. I cannot see what else I can do to 
make them equivalent. Both servers check out directly from SVN; I am convinced 
the files are the same. The database is definitely the same. 

Not sure what you mean about having identical indices - that's my problem - I 
don't - or do you mean something else I've missed? But yes everything else you 
mention is identical, I am as certain as I can be. 

I too think there must be a difference I have missed but I have run out of 
ideas for what to check!

Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote:

 Yes, but the identical index with the identical solrconfig.xml and the 
 identical query and the identical version of Solr on two different machines 
 should produce identical results.
 
 So it's a legitimate question why it's not. But perhaps queryNorm isn't 
 enough to answer that. Sorry, it's out of my league to try and figure it 
 out.
 
 But are you absolutely sure you have identical indexes, identical 
 solrconfig.xml, identical queries, and identical versions of Solr and any 
 other installed Java libraries... on both machines?  One of these being 
 different seems more likely than a bug in Solr, although that's possible.
 
 On 3/9/2011 4:34 PM, Jayendra Patil wrote:
 - snip

Re: Same index is ranking differently on 2 machines

2011-03-09 Thread Allistair Crossley
Oh wow, how did I miss that?

My apologies to anyone who read this post. I should have diffed my custom 
dismax handler. Looks like my SVN merge didn't work properly.

Embarrassing.

Thanks everyone ;)

On Mar 9, 2011, at 4:51 PM, Yonik Seeley wrote:

 On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil
 jayendra.patil@gmail.com wrote:
 Are you sure you have the same config ...
 The boost seems different for the field text - text:dubai^0.1 vs text:dubai
 
 Yep...
 Try adding echoParams=all and see all the parameters solr is acting on.
 http://wiki.apache.org/solr/CoreQueryParameters#echoParams
 
 -Yonik
 http://lucidimagination.com
 
 
 -2.286596 = (MATCH) sum of:
 -  1.6891675 = (MATCH) sum of:
 -1.3198489 = (MATCH) max plus 0.01 times others of:
 -  0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
 -0.011795795 = queryWeight(text:dubai^0.1), product of:
 -  0.1 = boost
 +1.0651637 = (MATCH) sum of:
 +  0.7871359 = (MATCH) sum of:
 +0.6151879 = (MATCH) max plus 0.01 times others of:
 +  0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
 +0.05489459 = queryWeight(text:dubai), product of:
 
 Regards,
 Jayendra
 
 On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
 - snip

Re: [Adding] Entities when indexing a DB

2010-12-15 Thread Allistair Crossley
mission.id and event.id, if they have the same value, will be overwriting each 
other's indexed document - your ids need to be unique across all documents. i 
usually have a field id_original that i map the table id to, and then for the 
per-entity id i usually prefix it with the entity name in the value mapped to 
the schema id field.
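a sketch of what i mean against the MISSIONS entity below (id_original is a 
field name i've made up, and the '+' with CAST is SQL Server string 
concatenation):

<entity name="missions"
  query="SELECT 'mission-' + CAST(IdMission AS varchar(20)) AS id,
         IdMission AS id_original,
         CoreGroup AS cat
         FROM dbo.tblMission">
  <field column="id" name="id" />
  <field column="id_original" name="id_original" />
</entity>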

On 15 Dec 2010, at 20:49, Adam Estrada wrote:

 All,
 
 I have successfully indexed a single entity, but when I try multiple entities
 the second is skipped altogether. Is there something wrong with my
 config file?
 
 <?xml version="1.0" encoding="utf-8" ?>
 <dataConfig>
   <dataSource type="JdbcDataSource"
     driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
     url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV"
     user="adam"
     password="password"/>
   <document name="events">
     <entity datasource="MISSIONS"
       query="SELECT IdMission AS id,
              CoreGroup AS cat,
              StrMissionname AS subject,
              strDescription AS description,
              DateCreated AS pubdate
              FROM dbo.tblMission">
       <field column="id" name="id" />
       <field column="cat" name="cat" />
       <field column="subject" name="subject" />
       <field column="description" name="description" />
       <field column="pubdate" name="date" />
     </entity>
     <entity datasource="EVENTS"
       query="SELECT strsubject AS subject,
              strsummary AS description,
              datecreated AS date,
              CoreGroup AS cat,
              idevent AS id
              FROM dbo.tblEvent">
       <field column="id" name="id" />
       <field column="cat" name="cat" />
       <field column="subject" name="subject" />
       <field column="description" name="description" />
       <field column="pubdate" name="date" />
     </entity>
   </document>
 </dataConfig>



Re: search over two independent tables

2010-10-14 Thread Allistair Crossley
your first example is correct

<document>
  <entity name="newsfeed">...</entity>
  <entity name="message">...</entity>
</document>

i have the same config for indexing 5 different tables

what you don't have from what i can see is a field name mapped to each column, 
e.g.

<field column="nf_text" />

i always have to provide the destination field in schema.xml, e.g.

<field column="nf_text" name="the_field" />
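putting that together, a sketch of what i'd expect to work (the_nf_field and 
the_m_field are placeholder field names you would declare in schema.xml):

<document>
  <entity name="newsfeeds" query="select id as nf_id, text as nf_text, url, note from newsfeeds">
    <field column="nf_text" name="the_nf_field" />
  </entity>
  <entity name="messages" query="select id as m_id, body from messages">
    <field column="body" name="the_m_field" />
  </entity>
</document>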


On Oct 14, 2010, at 5:22 AM, Anthony Maudry wrote:

 Hello,
 
 I'm using Solr with a PostgreSQL database. I need to search across two tables 
 with no link between them.
 
 i.e. I have got a messages table and a newsfeeds table, nothing linking 
 them.
 
 I tried to configure my data-config.xml to implement this but it seems that 
 tables can't be defined separately.
 
 The configuration I first tried was the following :
 
 <dataConfig>
   <dataSource driver="org.postgresql.Driver"
     url="jdbc:postgresql://host/database" user="user" password="password" />
   <document>
     <entity name="newsfeeds" query="select id as nf_id, text as nf_text, url, 
       note from newsfeeds">
       <field column="nf_text" />
     </entity>
     <entity name="messages" query="select id as m_id, body from messages">
       <field column="body" />
     </entity>
   </document>
 </dataConfig>
 
 Note that the two entities are at the same level. Only the first entity 
 (newsfeeds) will give results
 
 I then tried this config :
 
 <dataConfig>
   <dataSource driver="org.postgresql.Driver"
     url="jdbc:postgresql://host/database" user="user" password="password" />
   <document>
     <entity name="newsfeeds" query="select id as nf_id, text as nf_text, url, 
       note from newsfeeds">
       <field column="nf_text" />
       <entity name="messages" query="select id as m_id, body from messages">
         <field column="body" />
       </entity>
     </entity>
   </document>
 </dataConfig>
 
 As expected the results were crossed.
 
 I wonder how I could implement the search over two independent tables?
 
 Thanks for any answer.
 
 Anthony



Re: search over two independent tables

2010-10-14 Thread Allistair Crossley
actually your intention is unclear ... are you wanting to run a single search 
and get back results from BOTH newsfeed and message? or do you want one or the 
other? if you want one or the other you could use my strategy which is to store 
the entity type as a field when indexing, e.g. 

<entity><field column="type" name="type_field" /></entity>
<entity><field column="type" name="type_field" /></entity>

note, if you don't have a column for type, make one up in your query, e.g. 
select n.*, 'Newsfeed' as type from ...

then for silo'd searches you would want to ensure a filter of type:Newsfeed. 
another handy thing is to facet.field=type and search (without a filter) as 
then you'll get back counts for your Newsfeed and Message results too.
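a sketch of both modes as raw queries (q=scotland is just an example term; the 
params are standard):

/solr/select?q=scotland&fq=type:Newsfeed
/solr/select?q=scotland&facet=true&facet.field=type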


On Oct 14, 2010, at 5:44 AM, Allistair Crossley wrote:

 - snip
 



Re: search over two independent tables

2010-10-14 Thread Allistair Crossley
results from both tables with 1 search - your first suggestion with separate 
entities under document is right, or at least how i do it. things that i have 
often found ...

0. check stdout for SQL errors
1. verify that your SQL works when you run it direct on your database!
2. verify that your search would definitely match - choose a keyword only in a 
message, or index a type field as I mentioned and use the 
solr/select?q=type:Message strategy to look into the index and confirm whether 
anything is there
3. when you make changes to schema.xml or dataimport.xml make sure you restart 
solr and fully re-index your changes (this often caught me out)
4. are you checking this on a single server or do you have a stage and 
production server too (this caught me out sometimes)
5. make sure you are setting your unique ID field with a unique value from 
each entity, otherwise one will overwrite the other. I have 2 ID fields, one 
called id and one called uid. uid is my unique field and in each entity I 
prefix the row id with a letter, e.g. N1, M1. then i store the actual id too - 
you need to generate the uid in the sql, e.g. select id, concat('N', cast(id 
as char(50))) as uid from ... to make life easier (see the sketch below).
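the schema.xml side of point 5 looks something like this (a sketch - the 
string type is an assumption, adjust to your schema):

<field name="id" type="string" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<uniqueKey>uid</uniqueKey>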

allistair

On Oct 14, 2010, at 6:06 AM, Anthony Maudry wrote:

 Thanks for your quick answer.
 
 Actually I need to get results from both tables from a single search.
 
 I tried to define every field correctly as you told me in your previous 
 message, but I only get results from one table (actually Newsfeeds).
 
 
 On 14/10/2010 11:49, Allistair Crossley wrote:
 - snip
 



Re: check if field CONTAINS a value, as opposed to IS of a value

2010-10-14 Thread Allistair Crossley
i think you need to look at ngram tokenizing

On Oct 14, 2010, at 7:55 AM, PeterKerk wrote:

 
 I try to determine if a certain word occurs within a field.
 
 http://localhost:8983/solr/db/select/?indent=on&facet=true&fl=id,title&q=introtext:hi
 
 this works if an EXACT match was found on field introtext, i.e. the field
 value is just hi.
 
 But if the field value would be hi there, this is just some text, the above
 URL no longer finds this record.
 
 What is the query parameter to ask Solr to look inside the introtext field
 for a value (and even better, also for synonyms)?
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/check-if-field-CONTAINS-a-value-as-opposed-to-IS-of-a-value-tp1700495p1700495.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: check if field CONTAINS a value, as opposed to IS of a value

2010-10-14 Thread Allistair Crossley
actually no, you don't ... if you want hi in a sentence like hi there this is 
me, that is just normal tokenizing and should work ... check your field 
type/analysers
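for reference, a minimal word-tokenized field type along these lines would 
match individual words (a sketch - the stock Solr example schema ships richer 
text types):

<fieldType name="text_words" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>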

On Oct 14, 2010, at 7:59 AM, Allistair Crossley wrote:

 i think you need to look at ngram tokenizing
 
 On Oct 14, 2010, at 7:55 AM, PeterKerk wrote:
 
 
 - snip
 



Re: What is the maximum number of documents that can be indexed ?

2010-10-14 Thread Allistair Crossley
i think you answered the question yourself ... these questions usually get the 
response that there is no single answer. solr/lucene scale and distribute to 
whatever hardware you want to throw at them.

you probably want to turn the question around - what is the maximum number of 
documents that your system *wants/will need to* index over 1, 2, 5, 10 years? 
at what rate? at what load? design your architecture/hardware to the 
requirement.


On Oct 14, 2010, at 8:01 AM, Marco Ciaramella wrote:

 Hi all,
 I am working on a performance specification document on a Solr/Lucene-based
 application; this document is intended for the final customer. My question
 is: what is the maximum number of document I can index assuming 10 or
 20kbytes for each document?
 
 I could not find a precise answer to this question, and I tend to consider
 that Solr index can be virtually limited only by the JVM, the Operating
 System (limits to large file support), or by hardware constraints (mainly
 RAM, etc. ... ).
 
 Thanks
 Marco



Re: search over two independent tables

2010-10-14 Thread Allistair Crossley
super

On Oct 14, 2010, at 8:00 AM, Anthony Maudry wrote:

 Sorry for the late answer.
 
 It works now thanks to you, Allistair.
 
 I needed to use your uid field, common to the two entities but built in 
 different ways.
 
 here is the result in a sample of the data-config.xml file
 
 ...
 <document>
   <entity name="newsfeeds" query="select id as nf_id, 'newsfeed ' || cast(id 
     as char(50)) as nf_uid, text as nf_text, url, note from newsfeeds">
     <field column="nf_id" name="news_id" />
     <field column="nf_uid" name="uid" />
     ...
   </entity>
   <entity name="messages" query="select id as m_id, 'message ' || cast(id as 
     char(50)) as m_uid, body from messages">
     <field column="m_id" name="message_id" />
     <field column="m_uid" name="uid" />
     ...
   </entity>
 </document>
 ...
 
 uid is defined as the uniqueKey in the schema.xml file.
 
 Thank you for your help



Re: What is the maximum number of documents that can be indexed ?

2010-10-14 Thread Allistair Crossley
me also. great book, just wanted a bit more on complex DIH :)

On Oct 14, 2010, at 10:38 AM, Jason Brown wrote:

 Not related to the opening thread - but wanted to thank Eric for his book. It 
 clarified a lot of stuff and is very useful.
 
 
 -Original Message-
 From: Eric Pugh [mailto:ep...@opensourceconnections.com]
 Sent: Thu 14/10/2010 15:34
 To: solr-user@lucene.apache.org
 Subject: Re: What is the maximum number of documents that can be indexed ?
 
 I would recommend looking at the work the HathiTrust has done.  They have 
 published some really great blog articles about the work they have done in 
 scaling Solr, and have put in huge amounts of data.   
 
 The good news is that there isn't an exact number, because it depends. The 
 bad news is that there isn't an exact number because it depends!
 
 Eric
 
 
 
 On Oct 13, 2010, at 8:58 PM, Otis Gospodnetic wrote:
 
 Marco (use solr-u...@lucene list to follow up, please),
 
 There are no precise answers to such questions.  Solr can keep indexing.  
 The 
 limit is, I think, the available disk space.  I've never pushed Solr or 
 Lucene 
 to the point where Lucene index segments would become a serious pain, but 
 even 
 that can be controlled.  Same thing with number of open files, large file 
 support, etc.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 From: Marco Ciaramella ciaramellama...@gmail.com
 To: d...@lucene.apache.org
 Sent: Wed, October 13, 2010 6:19:15 PM
 Subject: What is the maximum number of documents that can be indexed ?
 
 - snip
 
 
 
 
 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
 http://www.opensourceconnections.com
 Co-Author: Solr 1.4 Enterprise Search Server available from 
 http://www.packtpub.com/solr-1-4-enterprise-search-server
 Free/Busy: http://tinyurl.com/eric-cal
 
 
 
 
 
 
 
 
 
 



Re: which schema.xml to modify ?

2010-10-14 Thread Allistair Crossley
you will find it in the distribution at example/solr/conf

On Oct 14, 2010, at 3:04 PM, Ibrahim Diop wrote:

 Hi All,
 
 I'm a new solr user and I just want to know which schema.xml file to modify 
 for this tutorial : http://lucene.apache.org/solr/tutorial.html
 
 Thanks,
 
 Ibrahim.



Re: Synchronizing Solr with a PostgreDB

2010-10-14 Thread Allistair Crossley
i would not cross-reference solr results with your database to merge unless you 
want to spank your database. nor would i load solr with all your data. what i 
have found is that the search results page is generally a small subset of data 
relating to the fuller document/result. therefore i store only the data 
required to present the search results wholly from solr. the user can choose to 
click into a specific result which then uses just the database to present it.

use data import handler - define an xml config to import as many entities into 
your document as you need and map columns to fields in schema.xml. use the Wiki 
page on DIH - it's all there, as well as example config in the solr distro.
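for (2), DIH can also keep the index in sync incrementally: give the entity a 
deltaQuery keyed off a last-modified column and ping the handler on a 
schedule. a sketch (updated_at and the items table are names i'm assuming; the 
handler path is the default):

<entity name="items"
  query="select id, title from items"
  deltaQuery="select id from items where updated_at > '${dataimporter.last_index_time}'"
  deltaImportQuery="select id, title from items where id = '${dataimporter.delta.id}'">
</entity>

then schedule a request to:

http://localhost:8983/solr/dataimport?command=delta-import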

allistair

On Oct 14, 2010, at 6:13 PM, Juan Manuel Alvarez wrote:

 Hello everyone! I am new to Solr and Lucene and I would like to ask
 you a couple of questions.
 
 I am working on an existing system that has the data saved in a
 Postgre DB and now I am trying to integrate Solr to use full-text
 search and faceted search, but I am having a couple of doubts about
 it.
 
 1) I see two ways of storing the data and make the search:
 - Duplicate all the DB data in Solr, so complete results are returned
 from a search query, or...
 - Put in Solr just the data that I need to search and, after finding
 the elements with a Solr query, use the result to make a more specific
 query to the DB.
 
 Which is the way this is normally done?
 
 2) How do I synchronize Solr and Postgre? Do I have to use the
 DataImportHandler or when I do the INSERT command into Postgre, I have
 to execute a command into Solr?
 
 Thanks for your time!
 
 Cheers!
 Juan M.



Getting an ngram fieldtype to work

2010-10-08 Thread Allistair Crossley
Morning all,

I would like to ngram a company name field in our index. I have read about the 
costs of doing so in the great David Smiley Solr 1.4 book and just to get 
started I have followed his example in setting up an ngram field type as 
follows:

<fieldType name="text_substring" class="solr.TextField"
  positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
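for completeness, the schema field that would use this type looks something 
like the following (name_ngram and the copyField source are placeholders of 
mine):

<field name="name_ngram" type="text_substring" indexed="true" stored="false" />
<copyField source="name" dest="name_ngram" />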

I have restarted/reindexed everything, but I still cannot search for

hoot

and get back the company named Shooter. Searching shooter is fine.

I have followed other examples on the internet regarding an ngram field type. 
Some examples seem to use an index analyzer that has an ngram tokenizer rather 
than a filter, if this makes a difference. But in all cases I am not seeing 
the expected result, just 0 results.

Is there anything else I should be considering here? I feel like I must be 
very close; it doesn't seem complicated, and yet it's not working like 
everything else I have done with Solr to date :)

Any guidance appreciated,

Allistair

Re: Getting an ngram fieldtype to work

2010-10-08 Thread Allistair Crossley
Hi,

Yep, I was just looking at the analyzer jsp. The ngrams *do* exist as expected, 
so it's not my configuration that is at fault (he says)

Index Analyzer
sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter 
shoote hooter

Query Analyzer
sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter 
shoote hooter


Yet, searching either

/solr/select?q=hoot

or

/solr/select?q=name:hoot

does not yield results.

When searching for shooter I see 2 results with names:

1. <str name="name">Shooters International Inc</str>
2. <str name="name">Hong Kong Shooter</str>

Yours, puzzled :)

On Oct 8, 2010, at 8:38 AM, Jan Høydahl / Cominvent wrote:

 Hi,
 
 The first thing I would try is to go to the analysis page, enter your test 
 data, and report back what each analysis stage prints out:
 http://localhost:8983/solr/admin/analysis.jsp
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 On 8. okt. 2010, at 14.19, Allistair Crossley wrote:
 
 - snip
 



Re: Getting an ngram fieldtype to work

2010-10-08 Thread Allistair Crossley
Oh my. I am basically being a total monkey. Every time I was changing my 
schema.xml to try new things out I was then reindexing our staging server's 
index instead of my local dev index so no changes were occurring locally.

Dear me. 

This is working now, surprise.

On Oct 8, 2010, at 8:53 AM, Markus Jelsma wrote:

 How come your query analyser spits out grams? It isn't configured to do so, or 
 you posted an older field definition. Anyway, do you actually search on your 
 new field?
 
 On Friday, October 08, 2010 02:46:08 pm Allistair Crossley wrote:
 - snip
 
 -- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: Getting an ngram fieldtype to work

2010-10-08 Thread Allistair Crossley
Well, a lot of this is working but not all.

Consider the company name Shooters Inc

My ngram field is able to match queries to the name for shoot and hoot and so 
on. This works.

However consider the company name

Location Scotland

If I query scot I get one result back - but it's for a company called Prescott 
Inc

I looked at the analyzer and realised that the NGramTokenizer was generating 
substrings from the start (left) of the *whole phrase*

location scotland

Because my max was set to 15 it was not generating a token for scot

So I figured I would change to a whitespace tokenizer first and then apply the 
ngram as a filter.

This now looks like it is generating scot in the tokens as shown below:
Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}

term position: 1, 2
term text: location, scotland
term type: word, word
source start,end: 0,8 / 9,17

org.apache.solr.analysis.NGramFilterFactory {maxGramSize=15, minGramSize=4}

term text (positions 1-30): loca ocat cati atio tion locat ocati catio ation 
locati ocatio cation locatio ocation location scot cotl otla tlan land scotl 
cotla otlan tland scotla cotlan otland scotlan cotland scotland
source start,end: 0,4 1,5 2,6 3,7 4,8 0,5 1,6 2,7 3,8 0,6 1,7 2,8 0,7 1,8 0,8 
9,13 10,14 11,15 12,16 13,17 9,14 10,15 11,16 12,17 9,15 10,16 11,17 9,16 
10,17 9,17


Query Analyzer

scot
scot

BUT it still returns no results for scot, though it does continue to return 
the Prescott result.

So ngramming is working, but it is not working when the query matches 
something far to the right of the indexed value. 

Is this another user-error or have I missed something else here?

Cheers


On Oct 8, 2010, at 9:02 AM, Allistair Crossley wrote:

 - snip

Re: Strategy for re-indexing

2010-10-08 Thread Allistair Crossley
Thanks for your time responding to this. I have also decided to go down the 
route of cron-scheduled Perl LWP pings to DIH + deltaQueries. This seems to 
work in line with what the business requires and for the index size.

Thanks again

On Oct 7, 2010, at 7:46 AM, Shawn Heisey wrote:

 On 10/6/2010 10:49 AM, Allistair Crossley wrote:
 Hi,
 
 I was interested in gaining some insight into how you guys schedule updates 
 for your Solr index (I have a single index).
 
 Right now during development I have added deltaQuery specifications to data 
 import entities to control the number of rows being queried on re-indexes.
 
 However in terms of *when* to reindex we have a lot going on in the system - 
 there are 4 sub-systems: custom application data, a CMS, a forum and a blog. 
 It's all being indexed and at any given time there will be users and 
 administrators all updating various parts of the sub-systems.
 
 For the time being during development I have been issuing reindexes to the 
 data import handler on each CRUD on any given sub-system. This has been 
 working fine to be honest. It does need to be as immediate as possible - a 
 scheduled update won't work for us. Even every 10 minutes is probably not 
 fast enough.
 
 So I wonder what others do. Is anyone else in a similar situation?
 
 And what happens if 4 users generate 4 different requests to the data import 
 handler to update for different types of data?  The DIH will be running 
 already let's say for request 1, then request 2 comes in - is it rejected? 
 Or is it queued?
 
 I need it to be queued and serviced because the request 1 re-index may have 
 already run its queries but missed the data added by the user for request 2. 
 Same then goes for the requests 3 and 4.
 
 I can't say whether the DIH will properly handle concurrent requests or not.  
 I figure it's always best to assume that things like this won't work and find 
 an elegant way to design around it.
 
 I wrote my build system in perl (using LWP and LWP::Simple), and assumed that 
 the DIH would not let me run concurrent delta-imports.  We settled on every 
 two minutes for our update frequency, and use cron for scheduling.  Two of my 
 servers (VMs, actually) are a heartbeat cluster running HAProxy for load 
 balancing, which I implemented purely for redundancy, not for scalability.  
 Whichever host in the heartbeat cluster is online is the one that runs the 
 cronjobs.
 
 I have the following processes and schedules:
 
 idxUpdate: Runs every two minutes.  This script imports new data, based on an 
 autoincrement primary key in the database, the field is DID.  From the 
 database perspective, changed data looks like new data - it gets its DID 
 updated but another unique field (TAG_ID) stays the same.  Solr uses TAG_ID 
 as its uniqueKey.  Updates go into an incremental shard that is relatively 
 small - usually less than 1GB and 500,000 documents.  At the top of the hour, 
 the update includes a call to optimize.
 
 idxDelete: Runs every ten minutes starting at xx:01.  This script gets the 
 list of newly deleted documents by DID.  Then, 1024 of them at a time, it 
 queries every shard for this list and issues a delete if they are found.  
 After the entire list is complete, it issues a commit to any shard that was 
 actually changed.  This increases the lifespan of indexSearchers and Solr 
 caches.  At the top of each hour, it reads the entire list of deletes instead 
 of new ones, and trims the delete list to the last 48 hours.
 
 idxRrdUpdate: Runs once an hour. This simply records the current MAX(DID) 
 from the database into an RRD database.  I keep it in both a counter and a 
 gauge.  One day I will track other statistical data about my system and make 
 it all into pretty graphs.
 
 idxDistribute: Runs once a day.  This uses the historical data in the RRD 
 database to decide which incremental data is older than one week.  Once it 
 has that information (a DID range), it distributes those records to each of 
 the six static index shards and deletes them from the incremental shard.  If 
 that process is successful, it updates the stored minimum DID value for the 
 incremental.  Each day, one of the static indexes (currently 13GB and 7.6 
 million records) is optimized.
 
 You might wonder how we deal with the fact that when a record is changed, the 
 old one might remain in the index for as long as 11 minutes before the delete 
 process finally removes it.  We assume that the incremental index, being less 
 than 10% of the size of the static indexes, will always respond faster.  
 Since the updated copy of the record will always be in the incremental, it 
 should respond first to the distributed query and therefore be the one that 
 is included in the results.  That assumption seems to be correct so far.
 
 Shawn
 



Re: multi level faceting

2010-10-04 Thread Allistair Crossley
I think that is just sending 2 fq facet queries through. In Solr PHP I would do 
that with, e.g.

$params['facet'] = true;
$params['facet.fields'] = array('Size');
$params['fq'] = array('sex' => array('Men', 'Women'));

but yes i think you'd have to send through what the current facet query is and 
add it to your next drill-down
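the raw request this amounts to is something like the following sketch (field 
names taken from the example below; the params are standard):

/solr/select?q=*:*&fq=sex:Men&facet=true&facet.field=Size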

On Oct 4, 2010, at 9:36 AM, Nguyen, Vincent (CDC/OD/OADS) (CTR) wrote:

 Hi,
 
 I was wondering if there's a way to display facet options based on
 previous facet values.  For example, I've seen many shopping sites where
 a user can facet by Mens or Womens apparel, then be shown sizes to
 facet by (for Men or Women only - whichever they chose).
 
 Is this something that would have to be handled at the application
 level?
 
 Vincent Vu Nguyen



Re: DIH sub-entity not indexing

2010-10-04 Thread Allistair Crossley
Hey,

Yes, that tool doesn't work too well for me. I can load it up and get the forms 
on the left, but when I run a debug the right-hand side tells me that the page 
is not found. I *think* this is because I use a custom query string parameter 
in my DIH XML for delta querying; its absence breaks the tool, and the tool 
doesn't support adding custom query string params.

Cheers, Allistair
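for context, the custom-parameter mechanism in question is DIH's 
${dataimporter.request.*} substitution - a sketch with a made-up parameter 
name (region):

query="select id from listings where region = '${dataimporter.request.region}'"

invoked as:

http://localhost:8983/solr/dataimport?command=full-import&region=emea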

On Oct 4, 2010, at 9:20 AM, Ephraim Ofir wrote:

 The closest you can get to debugging (without actually debugging...) is
 to look at the logs and use
 http://wiki.apache.org/solr/DataImportHandler#Interactive_Development_Mode
 
 Ephraim Ofir
 
 
 -Original Message-
 From: Allistair Crossley [mailto:a...@roxxor.co.uk] 
 Sent: Monday, October 04, 2010 3:09 PM
 To: solr-user@lucene.apache.org
 Subject: Re: DIH sub-entity not indexing
 
 Thanks Ephraim. I tried your suggestion with the ID but capitalising it
 did not work. 
 
 Indeed, I have a column that already works using a lower-case id. I wish
 I could debug it somehow - see the SQL? Something particular about this
 config it is not liking.
 
 I read the post you linked to. This is more a performance-related thing
 for him. I would be happy just to see low performance and my contacts
 populated right now!! :D
 
 Thanks again
 
 On Oct 4, 2010, at 9:00 AM, Ephraim Ofir wrote:
 
 Make sure you're not running into a case sensitivity problem; some stuff
 in DIH is case sensitive (and some stuff gets capitalized by the JDBC driver).
 Try using listing.ID instead of listing.id.
 On a side note, if you're using mysql, you might want to look at the
 CONCAT_WS function.
 You might also want to look into a different approach than
 sub-entities
 -
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E
 
 Ephraim Ofir
 
 -Original Message-
 From: Allistair Crossley [mailto:a...@roxxor.co.uk] 
 Sent: Monday, October 04, 2010 2:49 PM
 To: solr-user@lucene.apache.org
 Subject: Re: DIH sub-entity not indexing
 
 I have also tried a more elaborate join, following the features example from
 the bundled DIH example, but with the same result - the SQL works fine
 directly, but Solr is not indexing the array of full_names per Listing, e.g.
 
 <entity name="listing" ...>
 
   <entity name="listing_contact"
           query="select * from listing_contacts where listing_id = '${listing.id}'">
     <entity name="contact"
             query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
       <field name="contacts" column="full_name" />
     </entity>
   </entity>
 
 </entity>
 
 Am I missing the obvious?
 
 On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:
 
 Hello list,
 
 I've been successful with DIH to a large extent but a seemingly
 simple
 extra column I need is posing problems. In a nutshell I have 2
 entities
 let's say - Listing habtm Contact. I have copied the relevant parts of
 the configs below.
 
 I have run my SQL for the sub-entity Contact and this produces
 correct results. No errors are given by Solr on running the import.
 Yet
 no records are being set with the contacts array.
 
 I have taken out my sub-entity config and replaced it with a simple
 template value just to check and values then come through OK.
 
 So it certainly seems limited to my query or query config somehow. I
 roughly followed the bundled DIH example.
 
 DIH.xml
 ===
 
 <entity name="listing" ...>
 ...
   <entity name="contacts"
           query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
     <field name="contacts" column="full_name" />
   </entity>
 
 SCHEMA.XML
 
 <field name="contacts" type="text" indexed="true" stored="true"
        multiValued="true" required="false" />
 
 
 Any tips appreciated.
 
 



Re: DIH sub-entity not indexing

2010-10-04 Thread Allistair Crossley
Thanks Ephraim. I tried your suggestion with the ID but capitalising it did not 
work. 

Indeed, I have a column that already works using a lower-case id. I wish I 
could debug it somehow - see the SQL? There is something particular about this 
config that it is not liking.

I read the post you linked to. This is more a performance-related thing for 
him. I would be happy just to see low performance and my contacts populated 
right now!! :D

Thanks again

On Oct 4, 2010, at 9:00 AM, Ephraim Ofir wrote:

 Make sure you're not running into a case sensitivity problem; some stuff
 in DIH is case sensitive (and some stuff gets capitalized by the JDBC driver).
 Try using listing.ID instead of listing.id.
 On a side note, if you're using mysql, you might want to look at the
 CONCAT_WS function.
 You might also want to look into a different approach than sub-entities
 -
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E
 
 Ephraim Ofir
 
 -Original Message-
 From: Allistair Crossley [mailto:a...@roxxor.co.uk] 
 Sent: Monday, October 04, 2010 2:49 PM
 To: solr-user@lucene.apache.org
 Subject: Re: DIH sub-entity not indexing
 
 I have also tried a more elaborate join, following the features example from
 the bundled DIH example, but with the same result - the SQL works fine
 directly, but Solr is not indexing the array of full_names per Listing, e.g.
 
 <entity name="listing" ...>
 
   <entity name="listing_contact"
           query="select * from listing_contacts where listing_id = '${listing.id}'">
     <entity name="contact"
             query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
       <field name="contacts" column="full_name" />
     </entity>
   </entity>
 
 </entity>
 
 Am I missing the obvious?
 
 On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:
 
 Hello list,
 
 I've been successful with DIH to a large extent but a seemingly simple
 extra column I need is posing problems. In a nutshell I have 2 entities
 let's say - Listing habtm Contact. I have copied the relevant parts of
 the configs below.
 
 I have run my SQL for the sub-entity Contact and this produces
 correct results. No errors are given by Solr on running the import. Yet
 no records are being set with the contacts array.
 
 I have taken out my sub-entity config and replaced it with a simple
 template value just to check and values then come through OK.
 
 So it certainly seems limited to my query or query config somehow. I
 roughly followed the bundled DIH example.
 
 DIH.xml
 ===
 
 <entity name="listing" ...>
 ...
   <entity name="contacts"
           query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
     <field name="contacts" column="full_name" />
   </entity>
 
 SCHEMA.XML
 
 <field name="contacts" type="text" indexed="true" stored="true"
        multiValued="true" required="false" />
 
 
 Any tips appreciated.
 



DIH sub-entity not indexing

2010-10-04 Thread Allistair Crossley
Hello list,

I've been successful with DIH to a large extent but a seemingly simple extra 
column I need is posing problems. In a nutshell I have 2 entities let's say - 
Listing habtm Contact. I have copied the relevant parts of the configs below.

I have run my SQL for the sub-entity Contact and this produces correct 
results. No errors are given by Solr on running the import. Yet no records are 
being set with the contacts array.

I have taken out my sub-entity config and replaced it with a simple template 
value just to check and values then come through OK.

So it certainly seems limited to my query or query config somehow. I roughly 
followed the bundled DIH example.

DIH.xml
===

<entity name="listing" ...>
  ...
  <entity name="contacts"
          query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
    <field name="contacts" column="full_name" />
  </entity>

SCHEMA.XML

<field name="contacts" type="text" indexed="true" stored="true"
       multiValued="true" required="false" />


Any tips appreciated.

Re: DIH sub-entity not indexing

2010-10-04 Thread Allistair Crossley
Very clever thinking indeed. Well, that's certainly revealed the problem ... 
${listing.id} is empty on my sub-entity query ... 

And this is because I prefix the indexed ID with a letter:

<field column="id" name="id" template="L${listing.id}" />

This appears to modify the internal value of ${listing.id} for subsequent uses.

Well, I can work around this now. Thanks!

On Oct 4, 2010, at 9:35 AM, Stefan Matheis wrote:

 Allistair,
 
 Indeed, I have a column that already works using a lower-case id. I wish
 I could debug it somehow - see the SQL? There is something particular about
 this config that it is not liking.
 
 
 You may want to try the MySQL query log to check which queries are
 performed: http://dev.mysql.com/doc/refman/5.1/en/query-log.html



Re: solr-user

2010-10-04 Thread Allistair Crossley
I updated the SolrJ JAR requirements on the wiki page to be clearer, given how 
many of these SolrJ emails I have seen coming through since joining the list. I 
just created a test Java class and removed JARs until I had found the minimal 
set required.

On Oct 4, 2010, at 8:27 AM, Erick Erickson wrote:

 I suspect you're not actually including the path to those jars.
 SolrException should be in your solrj jar file. You can test this
 by executing jar -tf apacheBLAHBLAH.jar which will dump
 all the class names in the jar file. I'm assuming that you're
 really substituting the actual version for the * in the solrj jar
 name here.
 
 So I'd guess it's a classpath issue and you're not really including
 what you think you are.
 
 HTH
 Erick
 
 On Fri, Oct 1, 2010 at 11:28 PM, ankita shinde <ankitashinde...@gmail.com> wrote:
 
 -- Forwarded message --
 From: ankita shinde ankitashinde...@gmail.com
 Date: Sat, Oct 2, 2010 at 8:54 AM
 Subject: solr-user
 To: solr-user@lucene.apache.org
 
 
 hello,
 
 I am trying to use solrj for interfacing with solr. I am trying to run the
 SolrjTest example. I have included all the following  jar files-
 
 
  - commons-codec-1.3.jar
  - commons-fileupload-1.2.1.jar
  - commons-httpclient-3.1.jar
  - commons-io-1.4.jar
  - geronimo-stax-api_1.0_spec-1.0.1.jar
  - apache-solr-solrj-*.jar
  - wstx-asl-3.2.7.jar
  - slf4j-api-1.5.5.jar
  - slf4j-simple-1.5.5.jar
 
 
 
 
 But it's giving me the error 'NoClassDefFoundError:
 org/apache/solr/client/solrj/SolrServerException'.
 Can anyone tell me where I went wrong?
 



Re: DIH sub-entity not indexing

2010-10-04 Thread Allistair Crossley
I have also tried a more elaborate join, following the features example from 
the bundled DIH example, but with the same result - the SQL works fine directly, 
but Solr is not indexing the array of full_names per Listing, e.g.

<entity name="listing" ...>

    <entity name="listing_contact"
            query="select * from listing_contacts where listing_id = '${listing.id}'">
        <entity name="contact"
                query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
            <field name="contacts" column="full_name" />
        </entity>
    </entity>

</entity>

Am I missing the obvious?

On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:

 Hello list,
 
 I've been successful with DIH to a large extent but a seemingly simple extra 
 column I need is posing problems. In a nutshell I have 2 entities let's say - 
 Listing habtm Contact. I have copied the relevant parts of the configs below.
 
 I have run my SQL for the sub-entity Contact and this produces correct 
 results. No errors are given by Solr on running the import. Yet no records 
 are being set with the contacts array.
 
 I have taken out my sub-entity config and replaced it with a simple template 
 value just to check and values then come through OK.
 
 So it certainly seems limited to my query or query config somehow. I roughly 
 followed the bundled DIH example.
 
 DIH.xml
 ===
 
 <entity name="listing" ...>
 ...
   <entity name="contacts"
           query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
     <field name="contacts" column="full_name" />
   </entity>
 
 SCHEMA.XML
 
 <field name="contacts" type="text" indexed="true" stored="true"
        multiValued="true" required="false" />
 
 
 Any tips appreciated.



Re: solrj

2010-10-04 Thread Allistair Crossley
I rewrote the top JAR section at

http://wiki.apache.org/solr/Solrj

and the following code then runs fine.

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

class TestSolrQuery {

public static void main(String[] args) {

String url = "http://localhost:8983/solr";
SolrServer server = null;

try { 
server = new CommonsHttpSolrServer(url);
} catch (MalformedURLException e) {
System.out.println(e);
System.exit(1);
}

SolrQuery query = new SolrQuery();
query.setQuery("*:*");

QueryResponse rsp = null;
try { 
rsp = server.query(query);
} catch(SolrServerException e) {
System.out.println(e);
System.exit(1);
}

SolrDocumentList docs = rsp.getResults();
for (SolrDocument doc : docs) {
System.out.println(doc.toString());
}
}
}


On Oct 4, 2010, at 11:26 AM, Xin Li wrote:

 I asked the exact question the day before. If you or anyone else has a
 pointer to the solution, please share it on the mailing list. For now, I am
 using a Perl script instead to query the Solr server.
 
 Thanks,
 Xin
 
 -Original Message-
 From: ankita shinde [mailto:ankitashinde...@gmail.com] 
 Sent: Saturday, October 02, 2010 2:30 PM
 To: solr-user@lucene.apache.org
 Subject: solrj
 
 hello,
 
 I am trying to use solrj for interfacing with solr. I am trying to run
 the
 SolrjTest example. I have included all the following  jar files-
 
 
   - commons-codec-1.3.jar
   - commons-fileupload-1.2.1.jar
   - commons-httpclient-3.1.jar
   - commons-io-1.4.jar
   - geronimo-stax-api_1.0_spec-1.0.1.jar
   - apache-solr-solrj-*.jar
   - wstx-asl-3.2.7.jar
   - slf4j-api-1.5.5.jar
   - slf4j-simple-1.5.5.jar
 
 
 *My SolrjTest file is as follows:*
 
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 
 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.SolrServerException;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
 import org.apache.solr.client.solrj.response.FacetField;
 import org.apache.solr.client.solrj.response.QueryResponse;
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrDocumentList;
 
 class SolrjTest
 {
     public void query(String q)
     {
         CommonsHttpSolrServer server = null;
 
         try
         {
             server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
         }
         catch (Exception e)
         {
             e.printStackTrace();
         }
 
         SolrQuery query = new SolrQuery();
         query.setQuery(q);
         query.setQueryType("dismax");
         query.setFacet(true);
         query.addFacetField("lastname");
         query.addFacetField("locality4");
         query.setFacetMinCount(2);
         query.setIncludeScore(true);
 
         try
         {
             QueryResponse qr = server.query(query);
 
             SolrDocumentList sdl = qr.getResults();
 
             System.out.println("Found: " + sdl.getNumFound());
             System.out.println("Start: " + sdl.getStart());
             System.out.println("Max Score: " + sdl.getMaxScore());
             System.out.println();
 
             ArrayList<HashMap<String, Object>> hitsOnPage = new ArrayList<HashMap<String, Object>>();
 
             for (SolrDocument d : sdl)
             {
                 HashMap<String, Object> values = new HashMap<String, Object>();
 
                 for (Iterator<Map.Entry<String, Object>> i = d.iterator(); i.hasNext(); )
                 {
                     Map.Entry<String, Object> e2 = i.next();
 
                     values.put(e2.getKey(), e2.getValue());
                 }
 
                 hitsOnPage.add(values);
                 System.out.println(values.get("displayname") + " (" + values.get("displayphone") + ")");
             }
 
             List<FacetField> facets = qr.getFacetFields();
 
             for (FacetField facet : facets)
             {
                 List<FacetField.Count> facetEntries = facet.getValues();
 
                 for (FacetField.Count fcount : facetEntries)
                 {
                     System.out.println(fcount.getName() + ": " + fcount.getCount());
                 }
             }
         }
         catch (SolrServerException e)
         {
             e.printStackTrace();
         }
     }
 }

Re: Any way to append new text to an existing indexed field?

2010-10-01 Thread Allistair Crossley
I would say question and answer are 2 different entities. If you are using the 
data import handler, I would personally create them as separate entities with 
their own queries to the database, using the deltaQuery method to pick up only 
new rows. I guess it depends whether you need question + answers to actually 
come back out to be used for display (i.e. you stored their data), or whether 
it's good enough to match on question/answer separately and then just link to a 
question ID in your UI to drill down from the database.

Disclaimer: I am a Solr novice - just started, so I'd see what others think too 
;)

On Oct 1, 2010, at 7:38 AM, Andy wrote:

 I'm building a QA application. There's a Question database table and an 
 Answer table.
 
 For each question, I'm putting the question itself plus all the answers into 
 a single field text to be indexed and searched.
 
 Say I have a question that has 10 existing answers that are already indexed. 
 If a new answer is submitted for that question, is there any way I could just 
 append the new answer to the text field? Or is the only way to implement 
 this is to retrieve the original question and the 10 existing answers from 
 the database, combine them with the newly submitted 11th answer, and re-index 
 everything from scratch?
 
 The latter option just seems inefficient. Is there a better design that could 
 be used for this use case?
 
 Andy
 
 
 



Re: Any way to append new text to an existing indexed field?

2010-10-01 Thread Allistair Crossley
If your question + answers form a compound document then the whole document 
(with a given unique ID, e.g. the question ID) needs to be reindexed, I think. 
The best I could find with Google was this ...

https://issues.apache.org/jira/browse/SOLR-139
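
A rough sketch of that whole-document re-index with SolrJ - it assumes the text
field is stored, and the id value question-42 and both field names are made up
for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

class AppendAnswer {

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Fetch the stored copy of the question document.
        SolrDocument old = server.query(new SolrQuery("id:question-42"))
                                 .getResults().get(0);

        // Rebuild it with the new answer appended; re-adding a document with
        // the same unique key replaces the old one in the index.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", old.getFieldValue("id"));
        doc.addField("text", old.getFieldValue("text") + " the new 11th answer");
        server.add(doc);
        server.commit();
    }
}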

On Oct 1, 2010, at 8:23 AM, Andy wrote:

 Well I want to just display the title of the question in my search results 
 and users can then just click on it to see the details of the question and all 
 the answers.
 
 For example, say a question has the title "What is the meaning of life?" and 
 then one of the answers to that question is "solr". If someone searches for 
 "solr", I want to display the question title "What is the meaning of life?" 
 in the search results. If the user clicks on the question title and drills 
 down, he can then see that one of the answers is "solr".
 
 I'm not sure it makes sense to index the question and each answer separately 
 because I don't want to get duplicate questions in the search results. In the 
 above example, let's say there's another answer "solr is the meaning". If 
 each answer is indexed separately, I'd get two "What is the meaning of life?" 
 in my search results when someone searches for "solr".
 
 --- On Fri, 10/1/10, Allistair Crossley a...@roxxor.co.uk wrote:
 
 From: Allistair Crossley a...@roxxor.co.uk
 Subject: Re: Any way to append new text to an existing indexed field?
 To: solr-user@lucene.apache.org
 Date: Friday, October 1, 2010, 7:46 AM
 i would say question and answer are 2
 different entities. if you are using the data import
 handler, i would personally create them as separate entities
 with their own queries to the database using the deltaQuery
 method to pick up only new rows. i guess it depends if you
 need question + answers to actually come back out to be used
 for display (i.e. you stored their data), or whether it's
 good enough to match on question/answer separately and then
 just link to a question ID in your UI to drill-down from the
 database.
 
 disclaimer: i am a solr novice - just started, so i'd see
 what others think too ;)
 
 On Oct 1, 2010, at 7:38 AM, Andy wrote:
 
 I'm building a QA application. There's a
 Question database table and an Answer table.
 
 For each question, I'm putting the question itself
 plus all the answers into a single field text to be
 indexed and searched.
 
 Say I have a question that has 10 existing answers
 that are already indexed. If a new answer is submitted for
 that question, is there any way I could just append the
 new answer to the text field? Or is the only way to
 implement this is to retrieve the original question and the
 10 existing answers from the database, combine them with the
 newly submitted 11th answer, and re-index everything from
 scratch?
 
 The latter option just seems inefficient. Is there a
 better design that could be used for this use case?
 
 Andy
 
 
 
 
 
 
 
 



Re: any working SolrJ code example for Solr 1.4.1

2010-10-01 Thread Allistair Crossley
No example anyone gives you will solve your class-not-found exception ... you 
need to ensure the relevant jars (in dist) are included in your Solr instance's 
lib folder, I guess?

On Oct 1, 2010, at 10:50 AM, Xin Li wrote:

 Hi, there, 
 
 Just picked up SolrJ a few days ago. I have my Solr server set up, data
 loaded, and everything worked fine with the web admin page. Then a problem
 came when I was trying to use SolrJ to interact with the Solr server. I
 was stuck with a class-not-found exception yesterday. Being new to the
 domain is a factor, but SolrJ could really use some more updated
 documentation. 
 
 .. Long story short, does anyone have a minimal working SolrJ example
 interacting with Solr 1.4.1? It would be nice to know the JARs too since
 the errors I got were probably more related to JARs than the code
 itself. 
 
 Thanks,
 Xin 
 
 
 
 
 
 
 



Re: any working SolrJ code example for Solr 1.4.1

2010-10-01 Thread Allistair Crossley
Did you miss the page here: http://wiki.apache.org/solr/Solrj ? It tells you 
the jars required for your classpath as well as usage examples.

On Oct 1, 2010, at 11:57 AM, Xin Li wrote:

 That's precisely the reason I was asking about JARs too. It seems that I
 am in the minority that ran into SolrJ issues. If that's the case, I will
 grab the Perl solution and come back to SolrJ later.
 
 Thanks,
 Xin
 
 -Original Message-
 From: Allistair Crossley [mailto:a...@roxxor.co.uk] 
 Sent: Friday, October 01, 2010 11:52 AM
 To: solr-user@lucene.apache.org
 Subject: Re: any working SolrJ code example for Solr 1.4.1
 
 no example anyone gives you will solve your class not found exception ..
 you need to ensure the relevant jars (in dist) are included in your solr
 instance's lib folder i guess?
 
 On Oct 1, 2010, at 10:50 AM, Xin Li wrote:
 
 Hi, there, 
 
 Just picked up SolrJ a few days ago. I have my Solr server set up, data
 loaded, and everything worked fine with the web admin page. Then a problem
 came when I was trying to use SolrJ to interact with the Solr server. I
 was stuck with a class-not-found exception yesterday. Being new to the
 domain is a factor, but SolrJ could really use some more updated
 documentation. 
 
 .. Long story short, does anyone have a minimal working SolrJ example
 interacting with Solr 1.4.1? It would be nice to know the JARs too
 since
 the errors I got were probably more related to JARs than the code
 itself. 
 
 Thanks,
 Xin 
 
 
 
 
 
 
 



Re: SolrJ

2010-09-30 Thread Allistair Crossley
It's in the dist folder, with the name provided by the wiki page you refer to.

On Sep 30, 2010, at 3:01 PM, Christopher Gross wrote:

 Where can I get SolrJ?  The wiki makes reference to it, and says that it is
 a part of the Solr builds that you download, but I can't find it in the jars
 that come with it.  Can anyone shed some light on this for me?
 
 Thanks!
 
 -- Chris



Missing facet values for zero counts

2010-09-29 Thread Allistair Crossley
Hello list,

I am implementing a directory using Solr. The user is able to search with a 
free-text query or with 2 filters (provided as pick-lists), one of which is 
country. A directory entry only has one country.

I am using Solr facets for country and I use the facet counts generated 
initially by a *:* search to generate my pick-list.

This is working fairly well but there are a couple of issues I am facing.

Specifically the countries pick-list does not contain ALL possible countries. 
It only contains those that have been indexed against a document. 

I have looked at facet.missing but I cannot see how this will work - if no 
documents have a country of Sweden, then how would Solr know to generate a 
missing total of zero for Sweden - it's never heard of it.

I feel I am missing something - is there a way by which you tell Solr all 
possible countries rather than relying on counts generated from the index? 

The countries in question reside in a database table belonging to our 
application.

Thanks, Allistair

Re: Missing facet values for zero counts

2010-09-29 Thread Allistair Crossley
Hi,

For us this is a usability concern. Either you don't show Sweden in a pick-list 
called Country, and some users go away thinking you don't *ever* support Sweden 
(not true), or you allow a user to execute a search with empty results - but at 
least they know you do support Sweden.

It is, we believe, undesirable for a pick-list to change from day to day as the 
index changes - we have a category pick-list that acts the same way. One day a 
user could see Productions, the next day nothing. Regular users would see this 
as odd.

We believe that usability dictates we show all possible values, each with a 
zero count appended to discourage the user from executing empty searches, so 
that at least they see the possibilities. The best of both worlds, we hope.

I have solved this using the earlier suggestions of merging a database list 
query with the Solr facet counts. I like your idea though - good thinking - but 
the way I've done it is working great also :)
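
In case it helps anyone, a minimal sketch of that merge with SolrJ - the
country field name and the hard-coded list stand in for our real schema field
and database query:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

class CountryPickList {

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Stand-in for the full list from the application's country table.
        List<String> allCountries = Arrays.asList("Sweden", "Norway", "Denmark");

        // Start every country at zero ...
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (String c : allCountries) {
            counts.put(c, 0L);
        }

        // ... then overlay whatever the index actually reports.
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("country");
        FacetField ff = server.query(query).getFacetField("country");
        if (ff != null && ff.getValues() != null) {
            for (FacetField.Count count : ff.getValues()) {
                counts.put(count.getName(), count.getCount());
            }
        }

        System.out.println(counts); // e.g. {Sweden=0, Norway=12, Denmark=3}
    }
}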

Thanks and best wishes, Allistair

On 29 Sep 2010, at 14:08, kenf_nc wrote:

 
 I don't understand why you would want to show Sweden if it isn't in the
 index; what will your UI do if the user selects Sweden?
 
 However, one way to handle this would be to make a second document type.
 Have a field called type or some such, and make the new document type be
 'dummy' or 'system' or something like that. You can put documents in here
 with fields for any pick-lists you want to facet on and include all possible
 values from your database.
 
 Do your facets on either just this doc or all docs; either way should work.
 However, on your search queries always include fq=-type:system to
 exclude all documents of type system from all your searches.
 Messy, but should do what you want.



Re: Solr rate limiting / DoS attacks

2010-09-29 Thread Allistair Crossley
This kind of thing is not limited to Solr and you normally wouldn't solve it in 
software - it's more a network concern. I'd be looking at a web server solution 
such as Apache mod_evasive combined with a good firewall for more conventional 
DOS attacks. Just hide your Solr install behind the firewall and communicate 
with it locally from your web application or whatever.

Rate limiting sounds like something Solr should or could provide but I don't 
know the answer to that. 

Cheers

On Sep 29, 2010, at 2:52 PM, Ian Upright wrote:

 Hi, I'm curious as to what approaches one would take to defend against users
 attacking a Solr service, especially if exposed to the internet as opposed
 to an intranet.  I'm fairly new to Solr, is there anything built in?
 
 Is there anything in place to prevent the search engine from getting
 overwhelmed by a particular user or group of users, submitting loads of
 time-consuming queries as some form of a DoS attack?  
 
 Additionally, is there a way of rate-limiting it so that only a certain
 number of queries per user/per hour can be submitted, etc?  (for example, to
 prevent programmatic access to the search engine as opposed to a human user)
 
 Thanks, Ian