Optional filter queries
Evening all,

A subset of my documents have a field, filterMinutes, that other documents do not. filterMinutes stores a number. I often issue a query that contains a filter query range, e.g.

q=filterMinutes:[* TO 50]

I am finding that adding this filter excludes all documents that do not have the field, but what I want is for the filter query to act upon those documents that do have the field while also returning documents that don't have it at all. Is this a possibility?

Best,
Allistair
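One common idiom for this (a sketch, assuming the default lucene query parser and the filterMinutes field from the question) is to OR the range clause with a match-all-minus-exists clause; the *:* guard is needed because a purely negative subclause does not match on its own:

```
fq=filterMinutes:[* TO 50] OR (*:* -filterMinutes:[* TO *])
```

The second subclause selects every document that has no filterMinutes value at all, so the filter keeps both the in-range documents and the field-less ones.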
Same index is ranking differently on 2 machines
Hi,

I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search, we are seeing a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking.

I have a local machine with Solr and a version deployed on a production server. My local machine's Solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine so that my local machine and production have access to the very same data. I execute a total full-import on both.

Still, I see a different position for this document, which should surely rank in the same location, all else being equal. I diffed the debugQuery output to see how the scores were being computed; see the appendix at the foot of this email. As far as I can tell, every single query normalisation block of the debug is marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

which leads to final scores of

-2.286596 (local)
+1.0651637 (production)

which is enough to skew the results from correct to incorrect (in terms of what we expect to see). I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing?
Thank you for your time,
Allistair

- snip -

APPENDIX - debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-str name=L12411p
+str name=L12411
-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-    1.3198489 = (MATCH) max plus 0.01 times others of:
-      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-        0.011795795 = queryWeight(text:dubai^0.1), product of:
-          0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+    0.6151879 = (MATCH) max plus 0.01 times others of:
+      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+        0.05489459 = queryWeight(text:dubai), product of:
           5.520305 = idf(docFreq=65, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
           1.4142135 = tf(termFreq(text:dubai)=2)
           5.520305 = idf(docFreq=65, maxDocs=6063)
           0.25 = fieldNorm(field=text, doc=1551)
-      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
-        0.32609802 = queryWeight(profile:dubai^2.0), product of:
+      0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
+        0.15175761 = queryWeight(profile:dubai^2.0), product of:
           2.0 = boost
           7.6305184 = idf(docFreq=7, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
           1.4142135 = tf(termFreq(profile:dubai)=2)
           7.6305184 = idf(docFreq=7, maxDocs=6063)
           0.375 = fieldNorm(field=profile, doc=1551)
-    0.36931866 = (MATCH) max plus 0.01 times others of:
-      0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
-        0.003954251 = queryWeight(text:product^0.1), product of:
-          0.1 = boost
+    0.17194802 = (MATCH) max plus 0.01 times others of:
+      0.00851347 = (MATCH) weight(text:product in 1551), product of:
+        0.018402064 = queryWeight(text:product), product of:
           1.8505468 = idf(docFreq=2589, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of:
           1.0 = tf(termFreq(text:product)=1)
           1.8505468 = idf(docFreq=2589, maxDocs=6063)
           0.25 = fieldNorm(field=text, doc=1551)
-      0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of:
-        0.1725098 = queryWeight(profile:product^2.0), product of:
+      0.17186289 = (MATCH) weight(profile:product^2.0 in 1551), product of:
+        0.08028162 = queryWeight(profile:product^2.0), product of:
           2.0 = boost
           4.036637 = idf(docFreq=290, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         2.14075 = (MATCH) fieldWeight(profile:product in 1551), product of:
           1.4142135 = tf(termFreq(profile:product)=2)
           4.036637 = idf(docFreq=290, maxDocs=6063)
           0.375 = fieldNorm(field=profile, doc=1551)
-  0.59742856 = (MATCH) max plus 0.01 times others of:
-    0.59742856 = weight(profile:dubai product~10^0.5 in 1551), product of:
-      0.12465195 = queryWeight(profile:dubai product~10^0.5), product of:
+
Re: Same index is ranking differently on 2 machines
Thanks. Good to know, but even so my problem remains: the end score should not be different, and it is causing a dramatically different ranking of a document (3 versus 7 is dramatic for my client). This must be down to the scoring debug differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:

queryNorm is just a normalizing factor and is the same value across all the results for a query, just to make the scores comparable. So even if it varies between environments, you should not worry about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

Definition: queryNorm(q) is just a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:

- snip (original message and debugQuery appendix quoted in full above) -
Re: Same index is ranking differently on 2 machines
That's what I think; glad I am not going mad. I've spent half a day comparing the config files, checking out from SVN again, and ensuring the databases are identical. I cannot see what else I can do to make them equivalent. Both servers check out directly from SVN; I am convinced the files are the same. The database is definitely the same.

Not sure what you mean about having identical indices - that's my problem - I don't - or do you mean something else I've missed? But yes, everything else you mention is identical, as certain as I can be. I too think there must be a difference I have missed, but I have run out of ideas for what to check! Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote:

Yes, but the identical index with the identical solrconfig.xml and the identical query and the identical version of Solr on two different machines should produce identical results. So it's a legitimate question why it's not. But perhaps queryNorm isn't enough to answer that. Sorry, it's out of my league to try to figure it out. But are you absolutely sure you have identical indexes, identical solrconfig.xml, identical queries, and identical versions of Solr and any other installed Java libraries on both machines? One of these being different seems more likely than a bug in Solr, although that's possible.

On 3/9/2011 4:34 PM, Jayendra Patil wrote:

- snip (queryNorm explanation and original message quoted in full above) -
Re: Same index is ranking differently on 2 machines
Oh wow, how did I miss that? My apologies to anyone who read this post. I should have diffed my custom dismax handler. Looks like my SVN merge didn't work properly. Embarrassing. Thanks everyone ;)

On Mar 9, 2011, at 4:51 PM, Yonik Seeley wrote:

On Wed, Mar 9, 2011 at 4:49 PM, Jayendra Patil jayendra.patil@gmail.com wrote:

Are you sure you have the same config? The boost seems different for the field text: text:dubai^0.1 versus text:dubai.

Yep... Try adding echoParams=all and see all the parameters Solr is acting on.
http://wiki.apache.org/solr/CoreQueryParameters#echoParams

-Yonik
http://lucidimagination.com

- 0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-   0.011795795 = queryWeight(text:dubai^0.1), product of:
-     0.1 = boost
+ 0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+   0.05489459 = queryWeight(text:dubai), product of:

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:

- snip (earlier messages quoted in full above) -
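The echoParams check Yonik suggests can be issued as a plain request along these lines (host, port, and handler name are illustrative, not from the thread):

```
http://localhost:8983/solr/select?q=dubai+product&qt=dismax&debugQuery=on&echoParams=all
```

With echoParams=all, the responseHeader lists every parameter actually in effect, including defaults pulled from the handler definition in solrconfig.xml, so a mis-merged handler config (e.g. a qf boost like text^0.1 missing on one machine) shows up immediately when the two responses are diffed.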
Re: [Adding] Entities when indexing a DB
If mission.id and event.id share the same value, one indexed document will overwrite the other: your ids need to be unique across all documents. I usually have a field id_original that I map the table id to, and then for the per-entity id I prefix it with the entity name in the value mapped to the schema id field.

On 15 Dec 2010, at 20:49, Adam Estrada wrote:

All, I have successfully indexed a single entity, but when I try multiple entities the second is skipped altogether. Is there something wrong with my config file?

<?xml version="1.0" encoding="utf-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV"
              user="adam" password="password"/>
  <document name="events">
    <entity datasource="MISSIONS"
            query="SELECT IdMission AS id, CoreGroup AS cat, StrMissionname AS subject, strDescription AS description, DateCreated AS pubdate FROM dbo.tblMission">
      <field column="id" name="id" />
      <field column="cat" name="cat" />
      <field column="subject" name="subject" />
      <field column="description" name="description" />
      <field column="pubdate" name="date" />
    </entity>
    <entity datasource="EVENTS"
            query="SELECT strsubject AS subject, strsummary as description, datecreated as date, CoreGroup as cat, idevent as id FROM dbo.tblEvent">
      <field column="id" name="id" />
      <field column="cat" name="cat" />
      <field column="subject" name="subject" />
      <field column="description" name="description" />
      <field column="pubdate" name="date" />
    </entity>
  </document>
</dataConfig>
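The prefixing strategy described above might look like this in the DIH config (a sketch: entity names and the id_original field are illustrative, adapted from the poster's tables; SQL Server concatenates strings with +):

```xml
<document>
  <!-- Prefix each entity's row id so ids never collide across tables;
       keep the raw table id in a separate field for lookups. -->
  <entity name="missions"
          query="SELECT 'M' + CAST(IdMission AS varchar(50)) AS id,
                        IdMission AS id_original,
                        StrMissionname AS subject
                 FROM dbo.tblMission">
    <field column="id" name="id" />
    <field column="id_original" name="id_original" />
    <field column="subject" name="subject" />
  </entity>
  <entity name="events"
          query="SELECT 'E' + CAST(idevent AS varchar(50)) AS id,
                        idevent AS id_original,
                        strsubject AS subject
                 FROM dbo.tblEvent">
    <field column="id" name="id" />
    <field column="id_original" name="id_original" />
    <field column="subject" name="subject" />
  </entity>
</document>
```

With this, a mission row 7 indexes as M7 and an event row 7 as E7, so neither overwrites the other even though both tables start their ids at 1.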
Re: search over two independent tables
Your first example is correct:

<document>
  <entity name="newsfeed">...</entity>
  <entity name="message">...</entity>
</document>

I have the same config for indexing 5 different tables. What you don't have, from what I can see, is a field name mapped to each column, e.g. <field column="nf_text" />. I always have to provide the destination field in schema.xml, e.g. <field column="nf_text" name="the_field" />.

On Oct 14, 2010, at 5:22 AM, Anthony Maudry wrote:

Hello,

I'm using Solr with a PostgreSQL database. I need to search across two tables with no link between them, i.e. I have a messages table and a newsfeeds table, with nothing linking them. I tried to configure my data-config.xml to implement this, but it seems that tables can't be defined separately. The configuration I first tried was the following:

<dataConfig>
  <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://host/database" user="user" password="password" />
  <document>
    <entity name="newsfeeds" query="select id as nf_id, text as nf_text, url, note from newsfeeds">
      <field column="nf_text" />
    </entity>
    <entity name="messages" query="select id as m_id, body from messages">
      <field column="body" />
    </entity>
  </document>
</dataConfig>

Note that the two entities are at the same level. Only the first entity (newsfeeds) gives results. I then tried this config:

<dataConfig>
  <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://host/database" user="user" password="password" />
  <document>
    <entity name="newsfeeds" query="select id as nf_id, text as nf_text, url, note from newsfeeds">
      <field column="nf_text" />
      <entity name="messages" query="select id as m_id, body from messages">
        <field column="body" />
      </entity>
    </entity>
  </document>
</dataConfig>

As expected, the results were crossed. I wonder how I could implement the search over two independent tables? Thanks for any answer.

Anthony
Re: search over two independent tables
Actually, your intention is unclear: do you want to run a single search and get back results from BOTH newsfeed and message, or one or the other?

If you want one or the other, you could use my strategy, which is to store the entity type as a field when indexing, e.g.

<entity ...><field column="type" name="type_field" /></entity>

Note: if you don't have a column for the type, make one up in your query, e.g.

select n.*, 'Newsfeed' as type from ...

Then for siloed searches you would ensure a filter of type:Newsfeed. Another handy thing is to facet.field=type and search (without a filter), as then you'll get back counts for your Newsfeed and Message results too.

On Oct 14, 2010, at 5:44 AM, Allistair Crossley wrote:

- snip (previous message quoted in full above) -
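The type-field strategy sketched in the reply above might look like this end to end (field and value names are illustrative, adapted from the poster's tables):

```xml
<!-- data-config.xml: tag every row with its source table -->
<document>
  <entity name="newsfeeds"
          query="select id as nf_id, text as nf_text, 'Newsfeed' as type from newsfeeds">
    <field column="nf_text" name="nf_text" />
    <field column="type" name="type" />
  </entity>
  <entity name="messages"
          query="select id as m_id, body, 'Message' as type from messages">
    <field column="body" name="body" />
    <field column="type" name="type" />
  </entity>
</document>
```

A siloed search then adds fq=type:Newsfeed, while a combined search with facet=true&facet.field=type returns per-type counts alongside the merged results.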
Re: search over two independent tables
Results from both tables with 1 search: your first suggestion, with separate entities under document, is right, or at least it's how I do it. Things that have often caught me out:

0. Check stdout for SQL errors.
1. Verify that your SQL works when you run it directly on your database!
2. Verify that your search would definitely match: choose a keyword that appears only in a message, or index a type field as I mentioned and use the solr/select?q=type:Message strategy to see into the index and confirm whether anything is there at all.
3. When you make changes to schema.xml or dataimport.xml, make sure you restart Solr and fully re-index (this often caught me out).
4. Are you checking this on a single server, or do you have stage and production servers too? (This caught me out sometimes.)
5. Make sure you are setting your unique ID field with a unique value from each entity, otherwise one will overwrite the other. I have 2 ID fields, one called id and one called uid. uid is my unique field, and in each entity I prefix the row id with a letter, e.g. N1, M1. Then I store the actual id too; you need to generate the uid in the SQL, e.g. select id, concat('N', cast(id as char(50))) as uid from ..., to make life easier.

allistair

On Oct 14, 2010, at 6:06 AM, Anthony Maudry wrote:

Thanks for your quick answer. Actually I need to get results from both tables from a single search. I tried to define every field correctly as you told me in your previous message, but I only get results from one table (actually newsfeeds).

On 14/10/2010 11:49, Allistair Crossley wrote:

- snip (previous message quoted in full above) -
Re: check if field CONTAINS a value, as opposed to IS of a value
I think you need to look at ngram tokenizing.

On Oct 14, 2010, at 7:55 AM, PeterKerk wrote:

I'm trying to determine if a certain word occurs within a field.

http://localhost:8983/solr/db/select/?indent=on&facet=true&fl=id,title&q=introtext:hi

This works if an EXACT match is found on the field introtext, i.e. the field value is just "hi". But if the field value were "hi there, this is just some text", the above URL no longer finds the record. What is the query parameter to ask Solr to look inside the introtext field for a value (and, even better, also for synonyms)?

--
View this message in context: http://lucene.472066.n3.nabble.com/check-if-field-CONTAINS-a-value-as-opposed-to-IS-of-a-value-tp1700495p1700495.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: check if field CONTAINS a value, as opposed to IS of a value
Actually, no, you don't. If you want "hi" to match in a sentence like "hi there this is me", that's just normal tokenizing and should work. Check your field type/analysers.

On Oct 14, 2010, at 7:59 AM, Allistair Crossley wrote:

i think you need to look at ngram tokenizing

- snip (question quoted in full above) -
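If introtext is declared with an untokenized type such as string, only whole-value matches succeed, which would explain the behaviour described. A tokenized field type along these lines (a sketch of the standard example-schema pattern; type and field names may differ in your schema) makes individual words match:

```xml
<!-- schema.xml: a tokenized text type so q=introtext:hi
     matches "hi there, this is just some text" -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- add solr.SynonymFilterFactory here for synonym matching -->
  </analyzer>
</fieldType>

<field name="introtext" type="text" indexed="true" stored="true"/>
```

After changing the field type, restart Solr and re-index; the analysis page in the admin UI shows how a given value is tokenized.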
Re: What is the maximum number of documents that can be indexed ?
I think you answered the question yourself: these questions usually get the response that there is no answer. Solr/Lucene scale and distribute across whatever hardware you want to throw at them. You probably want to turn the question around: what is the maximum number of documents that your system *will need to* index over 1, 2, 5, 10 years? At what rate? At what load? Design your architecture/hardware to the requirement.

On Oct 14, 2010, at 8:01 AM, Marco Ciaramella wrote:

Hi all, I am working on a performance specification document for a Solr/Lucene-based application; this document is intended for the final customer. My question is: what is the maximum number of documents I can index, assuming 10 or 20 kbytes for each document? I could not find a precise answer to this question, and I tend to consider that a Solr index can be virtually limited only by the JVM, the operating system (limits to large file support), or hardware constraints (mainly RAM, etc.).

Thanks
Marco
Re: search over two independent tables
super

On Oct 14, 2010, at 8:00 AM, Anthony Maudry wrote:

Sorry for the late answer. It works now, thanks to you, Allistair. I needed to use your uid field, common to the two entities but built in different ways. Here is the result in a sample of the data-config.xml file:

<document>
  <entity name="newsfeeds" query="select id as nf_id, 'newsfeed ' || cast(id as char(50)) as nf_uid, text as nf_text, url, note from newsfeeds">
    <field column="nf_id" name="news_id" />
    <field column="nf_uid" name="uid" />
    ...
  </entity>
  <entity name="messages" query="select id as m_id, 'message ' || cast(id as char(50)) as m_uid, body from messages">
    <field column="m_id" name="message_id" />
    <field column="m_uid" name="uid" />
    ...
  </entity>
</document>

uid is defined as the uniqueKey in the schema.xml file. Thank you for your help.
Re: What is the maximum number of documents that can be indexed ?
me also. great book, just wanted a bit more on complex DIH :) On Oct 14, 2010, at 10:38 AM, Jason Brown wrote: Not related to the opening thread - but wante to thank Eric for his book. Clarified a lot of stuff and very useful. -Original Message- From: Eric Pugh [mailto:ep...@opensourceconnections.com] Sent: Thu 14/10/2010 15:34 To: solr-user@lucene.apache.org Subject: Re: What is the maximum number of documents that can be indexed ? I would recommend looking at the work the HathiTrust has done. They have published some really great blog articles about the work they have done in scaling Solr, and have put in huge amounts of data. The good news is that there isn't a exact number, because It depends. The bad news is that there isn't an exact number because it depends! Eric On Oct 13, 2010, at 8:58 PM, Otis Gospodnetic wrote: Marco (use solr-u...@lucene list to follow up, please), There are no precise answers to such questions. Solr can keep indexing. The limit is, I think, the available disk space. I've never pushed Solr or Lucene to the point where Lucene index segments would become a serious pain, but even that can be controlled. Same thing with number of open files, large file support, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Marco Ciaramella ciaramellama...@gmail.com To: d...@lucene.apache.org Sent: Wed, October 13, 2010 6:19:15 PM Subject: What is the maximum number of documents that can be indexed ? Hi all, I am working on a performance specification document on a Solr/Lucene-based application; this document is intended for the final customer. My question is: what is the maximum number of document I can index assuming 10 or 20kbytes for each document? I could not find a precise answer to this question, and I tend to consider that Solr index can be virtually limited only by the JVM, the Operating System (limits to large file support), or by hardware constraints (mainly RAM, etc. 
... ). Thanks, Marco

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal

If you wish to view the St. James's Place email disclaimer, please use the link below
http://www.sjp.co.uk/portal/internet/SJPemaildisclaimer
Re: which schema.xml to modify ?
You will find it in the distribution at example/solr/conf.

On Oct 14, 2010, at 3:04 PM, Ibrahim Diop wrote:

Hi all, I'm a new Solr user and I just want to know which schema.xml file to modify for this tutorial: http://lucene.apache.org/solr/tutorial.html

Thanks, Ibrahim.
Re: Synchronizing Solr with a PostgreDB
I would not cross-reference Solr results with your database to merge, unless you want to spank your database. Nor would I load Solr with all your data. What I have found is that the search results page is generally a small subset of data relating to the fuller document/result. Therefore I store only the data required to present the search results wholly from Solr. The user can choose to click into a specific result, which then uses just the database to present it.

Use the data import handler - define an XML config to import as many entities into your document as you need, and map columns to fields in schema.xml. Use the wiki page on DIH - it's all there, as well as example config in the Solr distro.

Allistair

On Oct 14, 2010, at 6:13 PM, Juan Manuel Alvarez wrote:

Hello everyone! I am new to Solr and Lucene and I would like to ask you a couple of questions. I am working on an existing system that has its data saved in a Postgres DB, and now I am trying to integrate Solr to use full-text search and faceted search, but I am having a couple of doubts about it.

1) I see two ways of storing the data and making the search:
- Duplicate all the DB data in Solr, so complete results are returned from a search query, or...
- Put in Solr just the data that I need to search and, after finding the elements with a Solr query, use the result to make a more specific query to the DB.
Which is the way this is normally done?

2) How do I synchronize Solr and Postgres? Do I have to use the DataImportHandler, or when I do an INSERT into Postgres do I also have to execute a command against Solr?

Thanks for your time! Cheers! Juan M.
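A minimal DIH data-config for the approach described above might look like the sketch below. This is illustrative only - the JDBC URL, table and column names are invented for the example, not taken from the thread:

```xml
<dataConfig>
  <!-- hypothetical connection settings; substitute your own -->
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- index only the columns needed to render a search results page -->
    <entity name="product" query="select id, title, summary from products">
      <field column="id"      name="id"/>
      <field column="title"   name="title"/>
      <field column="summary" name="summary"/>
    </entity>
  </document>
</dataConfig>
```

Full records would then be fetched from Postgres only when the user clicks into a result, as suggested above.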
Getting an ngram fieldtype to work
Morning all,

I would like to ngram a company name field in our index. I have read about the costs of doing so in the great David Smiley Solr 1.4 book and, just to get started, I have followed his example in setting up an ngram field type as follows:

  <fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

I have restarted/reindexed everything, but I still cannot search "hoot" and get back the company named Shooter; searching "shooter" is fine. I have followed other examples on the internet regarding an ngram field type. Some examples seem to use an index analyzer that has an ngram tokenizer rather than a filter, if this makes a difference, but in all cases I am not seeing the expected result, just 0 results.

Is there anything else I should be considering here? I feel like I must be very close - it doesn't seem complicated, but yet it's not working like everything else I have done with Solr to date :)

Any guidance appreciated, Allistair
Re: Getting an ngram fieldtype to work
Hi,

Yep, I was just looking at the analyzer JSP. The ngrams *do* exist as expected, so it's not my configuration that is at fault (he says).

Index Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Query Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Yet, searching either /solr/select?q=hoot or /solr/select?q=name:hoot does not yield results. When searching for "shooter" I see 2 results with names:

  1. <str name="name">Shooters International Inc</str>
  2. <str name="name">Hong Kong Shooter</str>

Yours, puzzled :)

On Oct 8, 2010, at 8:38 AM, Jan Høydahl / Cominvent wrote:

Hi, the first thing I would try is to go to the analysis page, enter your test data, and report back what each analysis stage prints out: http://localhost:8983/solr/admin/analysis.jsp

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. okt. 2010, at 14.19, Allistair Crossley wrote:

Morning all, I would like to ngram a company name field in our index.
I have read about the costs of doing so in the great David Smiley Solr 1.4 book and, just to get started, I have followed his example in setting up an ngram field type as follows:

  <fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

I have restarted/reindexed everything, but I still cannot search "hoot" and get back the company named Shooter; searching "shooter" is fine. I have followed other examples on the internet regarding an ngram field type. Some examples seem to use an index analyzer that has an ngram tokenizer rather than a filter, if this makes a difference, but in all cases I am not seeing the expected result, just 0 results. Is there anything else I should be considering here? I feel like I must be very close - it doesn't seem complicated, but yet it's not working like everything else I have done with Solr to date :)

Any guidance appreciated, Allistair
Re: Getting an ngram fieldtype to work
Oh my. I am basically being a total monkey. Every time I was changing my schema.xml to try new things out, I was then reindexing our staging server's index instead of my local dev index, so no changes were occurring locally. Dear me. This is working now, surprise.

On Oct 8, 2010, at 8:53 AM, Markus Jelsma wrote:

How come your query analyser spits out grams? It isn't configured to do so, or you posted an older field definition. Anyway, do you actually search on your new field?

On Friday, October 08, 2010 02:46:08 pm Allistair Crossley wrote:

Hi, yep, I was just looking at the analyzer JSP. The ngrams *do* exist as expected, so it's not my configuration that is at fault (he says).

Index Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Query Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Yet, searching either /solr/select?q=hoot or /solr/select?q=name:hoot does not yield results. When searching for "shooter" I see 2 results with names:

  1. <str name="name">Shooters International Inc</str>
  2. <str name="name">Hong Kong Shooter</str>

Yours, puzzled :)

On Oct 8, 2010, at 8:38 AM, Jan Høydahl / Cominvent wrote:

Hi, the first thing I would try is to go to the analysis page, enter your test data, and report back what each analysis stage prints out: http://localhost:8983/solr/admin/analysis.jsp

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. okt.
2010, at 14.19, Allistair Crossley wrote:

Morning all, I would like to ngram a company name field in our index. I have read about the costs of doing so in the great David Smiley Solr 1.4 book and, just to get started, I have followed his example in setting up an ngram field type as follows:

  <fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

I have restarted/reindexed everything, but I still cannot search "hoot" and get back the company named Shooter; searching "shooter" is fine. I have followed other examples on the internet regarding an ngram field type. Some examples seem to use an index analyzer that has an ngram tokenizer rather than a filter, if this makes a difference, but in all cases I am not seeing the expected result, just 0 results. Is there anything else I should be considering here? I feel like I must be very close - it doesn't seem complicated, but yet it's not working like everything else I have done with Solr to date :)

Any guidance appreciated, Allistair

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
Re: Getting an ngram fieldtype to work
Well, a lot of this is working, but not all.

Consider the company name "Shooters Inc". My ngram field is able to match queries such as "shoot" and "hoot" to the name. This works.

However, consider the company name "Location Scotland". If I query "scot" I get one result back - but it's for a company called "Prescott Inc". I looked at the analyzer and realised that the NGramTokenizer was generating substrings from the start (left) of the *whole phrase* "location scotland". Because my max was set to 15, it was not generating a token for "scot".

So I figured I would change to a whitespace tokenizer first and then apply the ngram as a filter. This now looks like it is generating "scot" in the tokens, as shown below:

Index Analyzer:

  org.apache.solr.analysis.WhitespaceTokenizerFactory {}
    term text: location scotland

  org.apache.solr.analysis.NGramFilterFactory {maxGramSize=15, minGramSize=4}
    term text: loca ocat cati atio tion locat ocati catio ation locati ocatio cation locatio ocation location
               scot cotl otla tlan land scotl cotla otlan tland scotla cotlan otland scotlan cotland scotland

Query Analyzer:

  scot

BUT it still returns no results for "scot", though it does continue to return the Prescott result. So ngramming is working, but it is not working when the query is something far to the right of the indexed value.

Is this another user error, or have I missed something else here?

Cheers

On Oct 8, 2010, at 9:02 AM, Allistair Crossley wrote:

Oh my. I am basically being a total monkey.
Every time I was changing my schema.xml to try new things out, I was then reindexing our staging server's index instead of my local dev index, so no changes were occurring locally. Dear me. This is working now, surprise.

On Oct 8, 2010, at 8:53 AM, Markus Jelsma wrote:

How come your query analyser spits out grams? It isn't configured to do so, or you posted an older field definition. Anyway, do you actually search on your new field?

On Friday, October 08, 2010 02:46:08 pm Allistair Crossley wrote:

Hi, yep, I was just looking at the analyzer JSP. The ngrams *do* exist as expected, so it's not my configuration that is at fault (he says).

Index Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Query Analyzer:
  sh ho oo ot te er sho hoo oot ote ter shoo hoot oote oter shoot hoote ooter shoote hooter
  (repeated for each analysis stage)

Yet, searching either /solr/select?q=hoot or /solr/select?q=name:hoot does not yield results. When searching for "shooter" I see 2 results with names:

  1. <str name="name">Shooters International Inc</str>
  2. <str name="name">Hong
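For reference, the whitespace-then-ngram indexing scheme described in this thread would be configured along these lines (a sketch reconstructed from the analyzer output shown, not a config posted verbatim in the thread):

```xml
<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split the phrase into words first, then ngram each word -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- query side stays un-ngrammed so "scot" matches an indexed gram -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this arrangement "scotland" yields the gram "scot" at index time, so a plain query for "scot" can match it.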
Re: Strategy for re-indexing
Thanks for your time responding to this. I have decided also to go down the route of cron-scheduled Perl LWP pings to DIH + deltaQueries. This seems to work in line with what the business requires and with the index size. Thanks again.

On Oct 7, 2010, at 7:46 AM, Shawn Heisey wrote:

On 10/6/2010 10:49 AM, Allistair Crossley wrote:

Hi, I was interested in gaining some insight into how you guys schedule updates for your Solr index (I have a single index). Right now during development I have added deltaQuery specifications to data import entities to control the number of rows being queried on re-indexes. However, in terms of *when* to reindex, we have a lot going on in the system - there are 4 sub-systems: custom application data, a CMS, a forum and a blog. It's all being indexed, and at any given time there will be users and administrators all updating various parts of the sub-systems.

For the time being during development I have been issuing reindexes to the data import handler on each CRUD operation on any given sub-system. This has been working fine, to be honest. It does need to be as immediate as possible - a scheduled update won't work for us. Even every 10 minutes is probably not fast enough. So I wonder what others do. Is anyone else in a similar situation? And what happens if 4 users generate 4 different requests to the data import handler to update different types of data? The DIH will already be running, let's say for request 1, then request 2 comes in - is it rejected? Or is it queued? I need it to be queued and serviced, because the request 1 re-index may have already run its queries but missed the data added by the user for request 2. The same goes for requests 3 and 4.

I can't say whether the DIH will properly handle concurrent requests or not. I figure it's always best to assume that things like this won't work and find an elegant way to design around it. I wrote my build system in Perl (using LWP and LWP::Simple), and assumed that the DIH would not let me run concurrent delta-imports. We settled on every two minutes for our update frequency, and use cron for scheduling.

Two of my servers (VMs, actually) are a heartbeat cluster running HAProxy for load balancing, which I implemented purely for redundancy, not for scalability. Whichever host in the heartbeat cluster is online is the one that runs the cronjobs. I have the following processes and schedules:

idxUpdate: Runs every two minutes. This script imports new data, based on an autoincrement primary key in the database; the field is DID. From the database perspective, changed data looks like new data - it gets its DID updated, but another unique field (TAG_ID) stays the same. Solr uses TAG_ID as its uniqueKey. Updates go into an incremental shard that is relatively small - usually less than 1GB and 500,000 documents. At the top of the hour, the update includes a call to optimize.

idxDelete: Runs every ten minutes starting at xx:01. This script gets the list of newly deleted documents by DID. Then, 1024 of them at a time, it queries every shard for this list and issues a delete if they are found. After the entire list is complete, it issues a commit to any shard that was actually changed. This increases the lifespan of index searchers and Solr caches. At the top of each hour, it reads the entire list of deletes instead of just the new ones, and trims the delete list to the last 48 hours.

idxRrdUpdate: Runs once an hour. This simply records the current MAX(DID) from the database into an RRD database. I keep it in both a counter and a gauge. One day I will track other statistical data about my system and make it all into pretty graphs.

idxDistribute: Runs once a day. This uses the historical data in the RRD database to decide which incremental data is older than one week. Once it has that information (a DID range), it distributes those records to each of the six static index shards and deletes them from the incremental shard. If that process is successful, it updates the stored minimum DID value for the incremental. Each day, one of the static indexes (currently 13GB and 7.6 million records) is optimized.

You might wonder how we deal with the fact that when a record is changed, the old one might remain in the index for as long as 11 minutes before the delete process finally removes it. We assume that the incremental index, being less than 10% of the size of the static indexes, will always respond faster. Since the updated copy of the record will always be in the incremental, it should respond first to the distributed query and therefore be the one that is included in the results. That assumption seems to be correct so far.

Shawn
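Shawn's idxDelete step - deleting 1024 documents at a time and committing only the shards that changed - hinges on simple batching. Here is a sketch of that batching in Python (the function names and the DID query shape are mine; the real scripts are Perl, and they also query each shard first and only delete where documents are found):

```python
def chunks(ids, size=1024):
    """Yield successive fixed-size batches from a list of document IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def build_delete_queries(deleted_dids, size=1024):
    """Turn a list of deleted DIDs into Solr delete-by-query strings,
    one query per batch of at most `size` IDs."""
    return [
        "DID:(" + " OR ".join(str(d) for d in batch) + ")"
        for batch in chunks(deleted_dids, size)
    ]
```

Each query string would then be posted to the shards that contain those IDs, followed by a single commit per changed shard.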
Re: multi level faceting
I think that is just sending 2 fq facet queries through. In Solr PHP I would do that with, e.g.

  $params['facet'] = true;
  $params['facet.fields'] = array('Size');
  $params['fq'] = array('sex' => array('Men', 'Women'));

but yes, I think you'd have to send through what the current facet query is and add it to your next drill-down.

On Oct 4, 2010, at 9:36 AM, Nguyen, Vincent (CDC/OD/OADS) (CTR) wrote:

Hi, I was wondering if there's a way to display facet options based on previous facet values. For example, I've seen many shopping sites where a user can facet by Mens or Womens apparel, then be shown sizes to facet by (for Men or Women only - whichever they chose). Is this something that would have to be handled at the application level?

Vincent Vu Nguyen
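Equivalently, at the raw HTTP level each previously chosen facet value becomes an fq parameter on the next request. A small sketch using Python's standard library to build such a drill-down URL (the Size and sex field names follow the example in the thread; the function name is mine):

```python
from urllib.parse import urlencode

def drilldown_url(base, q, facet_field, applied_filters):
    """Build a Solr select URL that facets on `facet_field` while keeping
    the facet values the user has already chosen as fq filters."""
    params = [("q", q), ("facet", "true"), ("facet.field", facet_field)]
    # each previously selected facet value becomes one fq parameter
    params += [("fq", f) for f in applied_filters]
    return base + "?" + urlencode(params)

# after the user picks Men in the sex facet, drill into sizes:
url = drilldown_url("http://localhost:8983/solr/select", "*:*", "Size", ["sex:Men"])
```

The application keeps track of the chosen values and replays them as fq on every subsequent request, which is the "handled at the application level" part.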
Re: DIH sub-entity not indexing
Hey,

Yes, that tool doesn't work too well for me. I can load it up and get the forms on the left, but when I run a debug the right-hand side tells me that the page is not found. I *think* this is because I use a custom query string parameter in my DIH XML for use with delta querying; this being missing is failing the tool, and it doesn't support adding custom query string params.

Cheers, Allistair

On Oct 4, 2010, at 9:20 AM, Ephraim Ofir wrote:

The closest you can get to debugging (without actually debugging...) is to look at the logs and use http://wiki.apache.org/solr/DataImportHandler#Interactive_Development_Mode

Ephraim Ofir

-Original Message-
From: Allistair Crossley [mailto:a...@roxxor.co.uk]
Sent: Monday, October 04, 2010 3:09 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH sub-entity not indexing

Thanks Ephraim. I tried your suggestion with the ID, but capitalising it did not work. Indeed, I have a column that already works using a lower-case id. I wish I could debug it somehow - see the SQL? There is something particular about this config it is not liking. I read the post you linked to; that is more a performance-related thing for him. I would be happy just to see low performance and my contacts populated right now!! :D

Thanks again

On Oct 4, 2010, at 9:00 AM, Ephraim Ofir wrote:

Make sure you're not running into a case sensitivity problem - some stuff in DIH is case sensitive (and some stuff gets capitalized by the JDBC driver). Try using listing.ID instead of listing.id. On a side note, if you're using MySQL, you might want to look at the CONCAT_WS function.
You might also want to look into a different approach than sub-entities - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E

Ephraim Ofir

-Original Message-
From: Allistair Crossley [mailto:a...@roxxor.co.uk]
Sent: Monday, October 04, 2010 2:49 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH sub-entity not indexing

I have tried a more elaborate join also, following the features example of the DIH example, but same result - the SQL works fine directly but Solr is not indexing the array of full_names per Listing, e.g.

  <entity name="listing" ...>
    <entity name="listing_contact" query="select * from listing_contacts where listing_id = '${listing.id}'">
      <entity name="contact" query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
        <field name="contacts" column="full_name"/>
      </entity>
    </entity>
  </entity>

Am I missing the obvious?

On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:

Hello list, I've been successful with DIH to a large extent, but a seemingly simple extra column I need is posing problems. In a nutshell I have 2 entities, let's say Listing habtm Contact. I have copied the relevant parts of the configs below. I have run my SQL for the sub-entity Contact and it produces correct results. No errors are given by Solr on running the import, yet no records are being set with the contacts array. I have taken out my sub-entity config and replaced it with a simple template value just to check, and values then come through OK. So it certainly seems limited to my query or query config somehow. I followed roughly the example of the bundled DIH example.

DIH.xml
===
  <entity name="listing" ...>
    ...
    <entity name="contacts" query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
      <field name="contacts" column="full_name"/>
    </entity>
  </entity>

SCHEMA.XML

  <field name="contacts" type="text" indexed="true" stored="true" multiValued="true" required="false"/>

Any tips appreciated.
Re: DIH sub-entity not indexing
Thanks Ephraim. I tried your suggestion with the ID, but capitalising it did not work. Indeed, I have a column that already works using a lower-case id. I wish I could debug it somehow - see the SQL? There is something particular about this config it is not liking. I read the post you linked to; that is more a performance-related thing for him. I would be happy just to see low performance and my contacts populated right now!! :D

Thanks again

On Oct 4, 2010, at 9:00 AM, Ephraim Ofir wrote:

Make sure you're not running into a case sensitivity problem - some stuff in DIH is case sensitive (and some stuff gets capitalized by the JDBC driver). Try using listing.ID instead of listing.id. On a side note, if you're using MySQL, you might want to look at the CONCAT_WS function. You might also want to look into a different approach than sub-entities - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E

Ephraim Ofir

-Original Message-
From: Allistair Crossley [mailto:a...@roxxor.co.uk]
Sent: Monday, October 04, 2010 2:49 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH sub-entity not indexing

I have tried a more elaborate join also, following the features example of the DIH example, but same result - the SQL works fine directly but Solr is not indexing the array of full_names per Listing, e.g.

  <entity name="listing" ...>
    <entity name="listing_contact" query="select * from listing_contacts where listing_id = '${listing.id}'">
      <entity name="contact" query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
        <field name="contacts" column="full_name"/>
      </entity>
    </entity>
  </entity>

Am I missing the obvious?

On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:

Hello list, I've been successful with DIH to a large extent, but a seemingly simple extra column I need is posing problems. In a nutshell I have 2 entities, let's say Listing habtm Contact.
I have copied the relevant parts of the configs below. I have run my SQL for the sub-entity Contact and it produces correct results. No errors are given by Solr on running the import, yet no records are being set with the contacts array. I have taken out my sub-entity config and replaced it with a simple template value just to check, and values then come through OK. So it certainly seems limited to my query or query config somehow. I followed roughly the example of the bundled DIH example.

DIH.xml
===

  <entity name="listing" ...>
    ...
    <entity name="contacts" query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
      <field name="contacts" column="full_name"/>
    </entity>
  </entity>

SCHEMA.XML

  <field name="contacts" type="text" indexed="true" stored="true" multiValued="true" required="false"/>

Any tips appreciated.
DIH sub-entity not indexing
Hello list, I've been successful with DIH to a large extent, but a seemingly simple extra column I need is posing problems. In a nutshell I have 2 entities, let's say Listing habtm Contact. I have copied the relevant parts of the configs below. I have run my SQL for the sub-entity Contact and it produces correct results. No errors are given by Solr on running the import, yet no records are being set with the contacts array. I have taken out my sub-entity config and replaced it with a simple template value just to check, and values then come through OK. So it certainly seems limited to my query or query config somehow. I followed roughly the example of the bundled DIH example.

DIH.xml
===

  <entity name="listing" ...>
    ...
    <entity name="contacts" query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
      <field name="contacts" column="full_name"/>
    </entity>
  </entity>

SCHEMA.XML

  <field name="contacts" type="text" indexed="true" stored="true" multiValued="true" required="false"/>

Any tips appreciated.
Re: DIH sub-entity not indexing
Very clever thinking indeed. Well, that's certainly revealed the problem ... ${listing.id} is empty in my sub-entity query. And this is because I prefix the indexed ID with a letter:

  <field column="id" name="id" template="L${listing.id}"/>

This appears to modify the internal value of ${listing.id} for subsequent uses. Well, I can work around this now. Thanks!

On Oct 4, 2010, at 9:35 AM, Stefan Matheis wrote:

Allistair,

Indeed, I have a column that already works using a lower-case id. I wish I could debug it somehow - see the SQL? Something particular about this config it is not liking.

you may want to try the MySQL Query-Log, to check which queries are performed? http://dev.mysql.com/doc/refman/5.1/en/query-log.html
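One possible workaround for the overwritten ${listing.id} is to select the raw id under a second alias and build the prefixed Solr id from that alias instead. This is an untested sketch; the raw_id alias is invented for illustration:

```xml
<entity name="listing" query="select id, id as raw_id, ... from listings">
  <!-- the template reads the untouched alias, so raw_id stays usable -->
  <field column="id" name="id" template="L${listing.raw_id}"/>
  <entity name="contacts"
          query="select ... where lc.listing_id = '${listing.raw_id}'">
    <field name="contacts" column="full_name"/>
  </entity>
</entity>
```

The sub-entity then joins on ${listing.raw_id}, which the TemplateTransformer never rewrites.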
Re: solr-user
I updated the SolrJ JAR requirements to be clearer on the wiki page, given how many of these SolrJ emails I have seen coming through since joining the list. I just created a test Java class and removed JARs one by one until I found the minimal set required.

On Oct 4, 2010, at 8:27 AM, Erick Erickson wrote:

I suspect you're not actually including the path to those jars. SolrException should be in your solrj jar file. You can test this by executing

  jar -tf apacheBLAHBLAH.jar

which will dump all the class names in the jar file. (I'm assuming that you're really including a versioned jar name rather than the literal * in the solrj jar name here.) So I'd guess it's a classpath issue and you're not really including what you think you are.

HTH
Erick

On Fri, Oct 1, 2010 at 11:28 PM, ankita shinde ankitashinde...@gmail.com wrote:

-- Forwarded message --
From: ankita shinde ankitashinde...@gmail.com
Date: Sat, Oct 2, 2010 at 8:54 AM
Subject: solr-user
To: solr-user@lucene.apache.org

hello, I am trying to use solrj for interfacing with solr. I am trying to run the SolrjTest example. I have included all the following jar files:

  - commons-codec-1.3.jar
  - commons-fileupload-1.2.1.jar
  - commons-httpclient-3.1.jar
  - commons-io-1.4.jar
  - geronimo-stax-api_1.0_spec-1.0.1.jar
  - apache-solr-solrj-*.jar
  - wstx-asl-3.2.7.jar
  - slf4j-api-1.5.5.jar
  - slf4j-simple-1.5.5.jar

But it's giving me the error 'NoClassDefFoundError: org/apache/solr/client/solrj/SolrServerException'. Can anyone tell me where I went wrong?
Re: DIH sub-entity not indexing
I have tried a more elaborate join also, following the features example of the DIH example, but same result - the SQL works fine directly but Solr is not indexing the array of full_names per Listing, e.g.

  <entity name="listing" ...>
    <entity name="listing_contact" query="select * from listing_contacts where listing_id = '${listing.id}'">
      <entity name="contact" query="select concat(first_name, concat(' ', last_name)) as full_name from contacts where id = '${listing_contact.contact_id}'">
        <field name="contacts" column="full_name"/>
      </entity>
    </entity>
  </entity>

Am I missing the obvious?

On Oct 4, 2010, at 8:22 AM, Allistair Crossley wrote:

Hello list, I've been successful with DIH to a large extent, but a seemingly simple extra column I need is posing problems. In a nutshell I have 2 entities, let's say Listing habtm Contact. I have copied the relevant parts of the configs below. I have run my SQL for the sub-entity Contact and it produces correct results. No errors are given by Solr on running the import, yet no records are being set with the contacts array. I have taken out my sub-entity config and replaced it with a simple template value just to check, and values then come through OK. So it certainly seems limited to my query or query config somehow. I followed roughly the example of the bundled DIH example.

DIH.xml
===

  <entity name="listing" ...>
    ...
    <entity name="contacts" query="select concat(c.first_name, concat(' ', c.last_name)) as full_name from contacts c inner join listing_contacts lc on c.id = lc.contact_id where lc.listing_id = '${listing.id}'">
      <field name="contacts" column="full_name"/>
    </entity>
  </entity>

SCHEMA.XML

  <field name="contacts" type="text" indexed="true" stored="true" multiValued="true" required="false"/>

Any tips appreciated.
Re: solrj
I rewrote the top JAR section at http://wiki.apache.org/solr/Solrj and the following code then runs fine:

  import java.net.MalformedURLException;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  class TestSolrQuery {
      public static void main(String[] args) {
          String url = "http://localhost:8983/solr";
          SolrServer server = null;
          try {
              server = new CommonsHttpSolrServer(url);
          } catch (MalformedURLException e) {
              System.out.println(e);
              System.exit(1);
          }
          SolrQuery query = new SolrQuery();
          query.setQuery("*:*");
          QueryResponse rsp = null;
          try {
              rsp = server.query(query);
          } catch (SolrServerException e) {
              System.out.println(e);
              System.exit(1);
          }
          SolrDocumentList docs = rsp.getResults();
          for (SolrDocument doc : docs) {
              System.out.println(doc.toString());
          }
      }
  }

On Oct 4, 2010, at 11:26 AM, Xin Li wrote:

I asked the exact same question the day before. If you or anyone else has a pointer to the solution, please share it on the mailing list. For now, I am using a Perl script instead to query the Solr server.

Thanks, Xin

-Original Message-
From: ankita shinde [mailto:ankitashinde...@gmail.com]
Sent: Saturday, October 02, 2010 2:30 PM
To: solr-user@lucene.apache.org
Subject: solrj

hello, I am trying to use solrj for interfacing with solr. I am trying to run the SolrjTest example.
I have included all the following jar files:
- commons-codec-1.3.jar
- commons-fileupload-1.2.1.jar
- commons-httpclient-3.1.jar
- commons-io-1.4.jar
- geronimo-stax-api_1.0_spec-1.0.1.jar
- apache-solr-solrj-*.jar
- wstx-asl-3.2.7.jar
- slf4j-api-1.5.5.jar
- slf4j-simple-1.5.5.jar

My SolrjTest file is as follows:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

class SolrjTest {
    public void query(String q) {
        CommonsHttpSolrServer server = null;
        try {
            server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
        } catch (Exception e) {
            e.printStackTrace();
        }
        SolrQuery query = new SolrQuery();
        query.setQuery(q);
        query.setQueryType("dismax");
        query.setFacet(true);
        query.addFacetField("lastname");
        query.addFacetField("locality4");
        query.setFacetMinCount(2);
        query.setIncludeScore(true);
        try {
            QueryResponse qr = server.query(query);
            SolrDocumentList sdl = qr.getResults();
            System.out.println("Found: " + sdl.getNumFound());
            System.out.println("Start: " + sdl.getStart());
            System.out.println("Max Score: " + sdl.getMaxScore());
            System.out.println();
            ArrayList<HashMap<String, Object>> hitsOnPage = new ArrayList<HashMap<String, Object>>();
            for (SolrDocument d : sdl) {
                HashMap<String, Object> values = new HashMap<String, Object>();
                for (Iterator<Map.Entry<String, Object>> i = d.iterator(); i.hasNext(); ) {
                    Map.Entry<String, Object> e2 = i.next();
                    values.put(e2.getKey(), e2.getValue());
                }
                hitsOnPage.add(values);
                System.out.println(values.get("displayname") + " (" + values.get("displayphone") + ")");
            }
            List<FacetField> facets = qr.getFacetFields();
            for (FacetField facet : facets) {
                List<FacetField.Count> facetEntries = facet.getValues();
                for (FacetField.Count fcount : facetEntries) {
                    System.out.println(fcount.getName() + ": " + fcount.getCount());
                }
            }
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
}
Re: Any way to append new text to an existing indexed field?
i would say question and answer are 2 different entities. if you are using the data import handler, i would personally create them as separate entities with their own queries to the database using the deltaQuery method to pick up only new rows. i guess it depends if you need question + answers to actually come back out to be used for display (i.e. you stored their data), or whether it's good enough to match on question/answer separately and then just link to a question ID in your UI to drill-down from the database. disclaimer: i am a solr novice - just started, so i'd see what others think too ;) On Oct 1, 2010, at 7:38 AM, Andy wrote: I'm building a QA application. There's a Question database table and an Answer table. For each question, I'm putting the question itself plus all the answers into a single field text to be indexed and searched. Say I have a question that has 10 existing answers that are already indexed. If a new answer is submitted for that question, is there any way I could just append the new answer to the text field? Or is the only way to implement this is to retrieve the original question and the 10 existing answers from the database, combine them with the newly submitted 11th answer, and re-index everything from scratch? The latter option just seems inefficient. Is there a better design that could be used for this use case? Andy
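A rough sketch of the separate-entities-with-deltaQuery idea in DIH config form. All table and column names here are illustrative guesses, not taken from this thread, and in practice each document type would also need its own unique id (e.g. a prefixed id) to avoid collisions:

```xml
<document>
  <!-- Illustrative sketch only: table/column names are guesses -->
  <entity name="question" pk="id"
          query="select id, title, body from questions"
          deltaQuery="select id from questions
                      where updated_at > '${dataimporter.last_index_time}'"
          deltaImportQuery="select id, title, body from questions
                            where id = '${dataimporter.delta.id}'">
    <field column="title" name="title"/>
    <field column="body" name="text"/>
  </entity>
  <entity name="answer" pk="id"
          query="select id, question_id, body from answers"
          deltaQuery="select id from answers
                      where updated_at > '${dataimporter.last_index_time}'"
          deltaImportQuery="select id, question_id, body from answers
                            where id = '${dataimporter.delta.id}'">
    <field column="question_id" name="question_id"/>
    <field column="body" name="text"/>
  </entity>
</document>
```

With this shape a delta-import only re-fetches rows changed since the last run, so a new answer touches one answer document rather than the whole question.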
Re: Any way to append new text to an existing indexed field?
if your question + answers form a compound document then the whole document (with a given unique id, e.g. the question id) needs to be reindexed, i think. best i could find with google was this ... https://issues.apache.org/jira/browse/SOLR-139 On Oct 1, 2010, at 8:23 AM, Andy wrote: Well I want to just display the title of the question in my search results and users can then just click on it to see the details of the question and all the answers. For example, say a question has the title "What is the meaning of life?" and then one of the answers to that question is "solr". If someone searches for "solr", I want to display the question title "What is the meaning of life?" in the search results. If the user clicks on the question title and drills down, he can then see that one of the answers is "solr". I'm not sure it makes sense to index the question and each answer separately because I don't want to get duplicate questions in the search results. In the above example, let's say there's another answer "solr is the meaning". If each answer is indexed separately, I'd get two "What is the meaning of life?" entries in my search results when someone searches for "solr". --- On Fri, 10/1/10, Allistair Crossley a...@roxxor.co.uk wrote: From: Allistair Crossley a...@roxxor.co.uk Subject: Re: Any way to append new text to an existing indexed field? To: solr-user@lucene.apache.org Date: Friday, October 1, 2010, 7:46 AM i would say question and answer are 2 different entities. if you are using the data import handler, i would personally create them as separate entities with their own queries to the database using the deltaQuery method to pick up only new rows. i guess it depends if you need question + answers to actually come back out to be used for display (i.e. you stored their data), or whether it's good enough to match on question/answer separately and then just link to a question ID in your UI to drill-down from the database.
disclaimer: i am a solr novice - just started, so i'd see what others think too ;) On Oct 1, 2010, at 7:38 AM, Andy wrote: I'm building a QA application. There's a Question database table and an Answer table. For each question, I'm putting the question itself plus all the answers into a single field text to be indexed and searched. Say I have a question that has 10 existing answers that are already indexed. If a new answer is submitted for that question, is there any way I could just append the new answer to the text field? Or is the only way to implement this is to retrieve the original question and the 10 existing answers from the database, combine them with the newly submitted 11th answer, and re-index everything from scratch? The latter option just seems inefficient. Is there a better design that could be used for this use case? Andy
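The re-index-from-scratch option Andy describes can at least be kept simple on the application side. A minimal pure-Java sketch of the rebuild step (the class and method names here are hypothetical, not SolrJ API): recompute the combined text field from the question plus all answers, then re-add the document under the same unique id so Solr replaces it wholesale.

```java
import java.util.List;

// Hypothetical helper: rebuild the single searchable "text" field for a
// question document from the question body plus every answer. Solr replaces
// the whole document on re-add, so the field must be rebuilt, not appended to.
class QuestionText {
    static String combined(String question, List<String> answers) {
        StringBuilder sb = new StringBuilder(question);
        for (String a : answers) {
            sb.append('\n').append(a); // newline-separate each answer
        }
        return sb.toString();
    }
}
```

The resulting string would then be set on a SolrInputDocument (alongside the question id) and posted, which overwrites the previously indexed version.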
Re: any working SolrJ code example for Solr 1.4.1
no example anyone gives you will solve your class not found exception .. you need to ensure the relevant jars (in dist) are included in your solr instance's lib folder i guess? On Oct 1, 2010, at 10:50 AM, Xin Li wrote: Hi, there, Just picked up SolrJ few days ago. I have my Solr Server set up, data loaded, and everything worked fine with the web admin page. Then problem came when I was trying to use SolrJ to interact with the Solr server. I was stuck with NoClassNotFoundException yesterday. Being new to the domain is a factor, but SolrJ could really use some more updated documentation. .. Long story short, does anyone have a minimal working SolrJ example interacting with Solr 1.4.1? It would be nice to know the JARs too since the errors I got were probably more related to JARs than the code itself. Thanks, Xin This electronic mail message contains information that (a) is or may be CONFIDENTIAL, PROPRIETARY IN NATURE, OR OTHERWISE PROTECTED BY LAW FROM DISCLOSURE, and (b) is intended only for the use of the addressee(s) named herein. If you are not an intended recipient, please contact the sender immediately and take the steps necessary to delete the message completely from your computer system. Not Intended as a Substitute for a Writing: Notwithstanding the Uniform Electronic Transaction Act or any other law of similar effect, absent an express statement to the contrary, this e-mail message, its contents, and any attachments hereto are not intended to represent an offer or acceptance to enter into a contract and are not otherwise intended to bind this sender, barnesandnoble.com llc, barnesandnoble.com inc. or any other person or entity.
Re: any working SolrJ code example for Solr 1.4.1
did you miss the page here http://wiki.apache.org/solr/Solrj ? this tells you the jars required for your classpath as well as usage examples On Oct 1, 2010, at 11:57 AM, Xin Li wrote: That's precisely the reason I was asking about JARs too. It seems that I am the minority that ran into SolrJ issue. If that's the case, I will grab Perl solution, and come back to SolrJ later. Thanks, Xin -Original Message- From: Allistair Crossley [mailto:a...@roxxor.co.uk] Sent: Friday, October 01, 2010 11:52 AM To: solr-user@lucene.apache.org Subject: Re: any working SolrJ code example for Solr 1.4.1 no example anyone gives you will solve your class not found exception .. you need to ensure the relevant jars (in dist) are included in your solr instance's lib folder i guess? On Oct 1, 2010, at 10:50 AM, Xin Li wrote: Hi, there, Just picked up SolrJ few days ago. I have my Solr Server set up, data loaded, and everything worked fine with the web admin page. Then problem came when I was trying to use SolrJ to interact with the Solr server. I was stuck with NoClassNotFoundException yesterday. Being new to the domain is a factor, but SolrJ could really use some more updated documentation. .. Long story short, does anyone have a minimal working SolrJ example interacting with Solr 1.4.1? It would be nice to know the JARs too since the errors I got were probably more related to JARs than the code itself. Thanks, Xin
Re: SolrJ
it's in the dist folder with the name provided by the wiki page you refer to On Sep 30, 2010, at 3:01 PM, Christopher Gross wrote: Where can I get SolrJ? The wiki makes reference to it, and says that it is a part of the Solr builds that you download, but I can't find it in the jars that come with it. Can anyone shed some light on this for me? Thanks! -- Chris
Missing facet values for zero counts
Hello list, I am implementing a directory using Solr. The user is able to search with a free-text query or 2 filters (provided as pick-lists) for country. A directory entry only has one country. I am using Solr facets for country and I use the facet counts generated initially by a *:* search to generate my pick-list. This is working fairly well but there are a couple of issues I am facing. Specifically the countries pick-list does not contain ALL possible countries. It only contains those that have been indexed against a document. I have looked at facet.missing but I cannot see how this will work - if no documents have a country of Sweden, then how would Solr know to generate a missing total of zero for Sweden - it's never heard of it. I feel I am missing something - is there a way by which you tell Solr all possible countries rather than relying on counts generated from the index? The countries in question reside in a database table belonging to our application. Thanks, Allistair
Re: Missing facet values for zero counts
Hi, For us this is a usability concern. You either don't show Sweden in a pick-list called Country, and some users go away thinking you don't *ever* support Sweden (not true), or you allow a user to execute an empty-result search, but at least they know you do support Sweden. It is, we believe, undesirable for a pick-list to change from day to day as the index changes - we have a category pick-list that acts the same way. One day a user could see Productions, the next day nothing. Regular users would see this as odd. We believe usability dictates we show all possible values and add a zero after each, to discourage the user from executing empty searches while still letting them see the possibilities. The best of both worlds, we hope. I have solved this using earlier suggestions of merging a database list query with the Solr facet counts. I like your idea though - good thinking - but the way I've done it is working great also :) Thanks and best wishes, Allistair On 29 Sep 2010, at 14:08, kenf_nc wrote: I don't understand why you would want to show Sweden if it isn't in the index; what will your UI do if the user selects Sweden? However, one way to handle this would be to make a second document type. Have a field called type or some such, and make the new document type be 'dummy' or 'system' or something like that. You can put documents in here with fields for any pick-lists you want to facet on and include all possible values from your database. Do your facets on either just this doc, or all docs, either way should work. However, on your search queries always include fq=-type:system, basically excluding all documents of type system from all your searches. Messy, but should do what you want. -- View this message in context: http://lucene.472066.n3.nabble.com/Missing-facet-values-for-zero-counts-tp1602276p1603893.html Sent from the Solr - User mailing list archive at Nabble.com.
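The merge-the-database-list-with-facet-counts approach Allistair mentions can be sketched in a few lines of plain Java. All names here are illustrative; in practice the counts would come from QueryResponse.getFacetFields() and the full list from the application's database:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: combine the complete country list (from the database)
// with Solr facet counts, zero-filling countries the index has never seen,
// so the pick-list always shows every possible value.
class FacetMerge {
    static Map<String, Integer> merge(Iterable<String> allCountries,
                                      Map<String, Integer> facetCounts) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (String country : allCountries) {
            // keep the indexed count where one exists, otherwise show zero
            merged.put(country, facetCounts.getOrDefault(country, 0));
        }
        return merged;
    }
}
```

Iterating the database list (rather than the facet list) is what guarantees a country like Sweden appears with a count of 0 even when no document carries it.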
Re: Solr rate limiting / DoS attacks
This kind of thing is not limited to Solr and you normally wouldn't solve it in software - it's more a network concern. I'd be looking at a web server solution such as Apache mod_evasive combined with a good firewall for more conventional DoS attacks. Just hide your Solr install behind the firewall and communicate with it locally from your web application or whatever. Rate limiting sounds like something Solr should or could provide but I don't know the answer to that. Cheers On Sep 29, 2010, at 2:52 PM, Ian Upright wrote: Hi, I'm curious as to what approaches one would take to defend against users attacking a Solr service, especially if exposed to the internet as opposed to an intranet. I'm fairly new to Solr, is there anything built in? Is there anything in place to prevent the search engine from getting overwhelmed by a particular user or group of users, submitting loads of time-consuming queries as some form of a DoS attack? Additionally, is there a way of rate-limiting it so that only a certain number of queries per user/per hour can be submitted, etc? (for example, to prevent programmatic access to the search engine as opposed to a human user) Thanks, Ian
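For the per-user query-quota part of Ian's question, one application-side option (outside Solr itself) is a simple per-client counter in front of the search endpoint. A minimal fixed-window sketch in plain Java follows; all names are illustrative, and a real deployment would more likely lean on the web server or firewall tools mentioned above:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative fixed-window rate limiter: each client id is allowed at most
// maxPerWindow queries per window; further requests in that window are refused.
class QueryRateLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    private final Map<String, long[]> state = new HashMap<>(); // {windowStart, count}

    QueryRateLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(String clientId, long nowMillis) {
        long[] s = state.get(clientId);
        if (s == null || nowMillis - s[0] >= windowMillis) {
            state.put(clientId, new long[] { nowMillis, 1 }); // start a new window
            return true;
        }
        if (s[1] < maxPerWindow) {
            s[1]++;
            return true;
        }
        return false; // over the limit for this window
    }
}
```

The web tier would call allow(clientIp, now) before forwarding a query to Solr and return an HTTP 429-style refusal when it comes back false; a token-bucket variant would smooth bursts better than this fixed window.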