Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
On 20.04.11 18:51, Robert Muir wrote: Hi, there is a proposed patch uploaded to the issue. Maybe you can help by reviewing/testing it?

If I succeed in compiling Solr, I can test the patch. Is this the right starting point for such an endeavour? http://wiki.apache.org/solr/HackingSolr

-robert

2011/4/20 Robert Gründler rob...@dubture.com: Hi all, I'm getting the following exception when using highlighting on a field that uses HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21. It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if a fix has been implemented in Solr yet? thanks! -robert
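For context, a minimal field type that exercises this code path might look like the sketch below (the type and field names are illustrative, not from the thread). The highlighter reads the stored, unstripped value, while the token offsets come from the stripped stream, which is where the mismatch in LUCENE-2208 arises:

    <fieldType name="html_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="body_html" type="html_text" indexed="true" stored="true"/>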
Re: Indexing 20M documents from MySQL with DIH
we're indexing around 10M records from a MySQL database into a single Solr core. The DataImportHandler needs to join 3 sub-entities to denormalize the data. We ran into some trouble on the first 2 attempts, but setting batchSize="-1" on the dataSource resolved the issues. Do you need a lot of complex joins to import the data from MySQL?

-robert

On 4/21/11 8:08 PM, Scott Bigelow wrote: I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields), and I have been running into issues with MySQL disconnects, right around 15M. I've tried several remedies I've found on blogs (changing autoCommit, batchSize, etc.), but none of them seem to have resolved the issue. It got me wondering: is this the way everyone does it? What about 100M records, up to 1B; are those all pulled using DIH and a single query? I've used Sphinx in the past, which uses multiple queries to pull out subsets of records in ranges based on the primary key; does Solr offer similar functionality? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
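For reference, a JdbcDataSource set up for row streaming might look like the sketch below (URL and credentials are placeholders). With batchSize="-1", DIH sets the JDBC fetch size to Integer.MIN_VALUE, which is MySQL Connector/J's signal to stream the result set instead of buffering it all in memory:

    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/mydb"
                user="solr" password="secret"
                batchSize="-1"/>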
HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException
Hi all, I'm getting the following exception when using highlighting on a field that uses HTMLStripCharFilterFactory: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ... exceeds length of provided text sized 21. It seems this is a known issue: https://issues.apache.org/jira/browse/LUCENE-2208 Does anyone know if a fix has been implemented in Solr yet? thanks! -robert
DataImportHandlerDeltaQueryViaFullImport and delete query
Hi, when using http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport to periodically run a delta-import, is it necessary to run a separate, normal delta-import after it to delete entries from the index (using deletedPkQuery)? If so, what's the point of using this method for running delta-imports? If not, how can I delete specific entries with this delta-import method? regards -robert
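For reference, the wiki recipe boils down to a single entity whose main query filters on the last index time, triggered via command=full-import&clean=false; a sketch (table and column names are illustrative):

    <entity name="item" pk="id"
            query="SELECT * FROM item
                   WHERE '${dataimporter.request.clean}' != 'false'
                      OR last_modified &gt; '${dataimporter.last_index_time}'">
      ...
    </entity>

invoked with /dataimport?command=full-import&clean=false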
Re: DataImportHandlerDeltaQueryViaFullImport and delete query
On 18.04.11 09:23, Bill Bell wrote: It runs delta imports faster. Normally you need to get the PKs that changed and then run them through query=, which is slow when you have a lot of IDs.

But the query= only adds/updates entries. I'm not sure how to delete entries by running a query like "select ... from ... where deleted = 1". As far as I understand, there's postImportDeleteQuery and deletedPkQuery to achieve this, where according to the wiki deletedPkQuery is only used by delta-imports, and postImportDeleteQuery is used after a full-import. From my understanding, using dataimport?command=full-import&clean=false matches neither of the two, or am I wrong about that?

thanks, -robert
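For comparison, this is how the delete hook is wired up in a regular delta-import entity (a sketch with illustrative names): deletedPkQuery returns the primary keys of rows to remove during a delta-import, while postImportDeleteQuery is a Solr delete query applied after a full-import:

    <entity name="item" pk="id"
            query="SELECT * FROM item WHERE deleted = 0"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT * FROM item WHERE id = '${dataimporter.delta.id}'"
            deletedPkQuery="SELECT id FROM item
                            WHERE deleted = 1
                              AND last_modified &gt; '${dataimporter.last_index_time}'">
    </entity>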
DisMaxQueryParser: Unknown function min in FunctionQuery
Hi all, I'm trying to implement a FunctionQuery using the bf parameter of the DisMaxQueryParser, but I'm getting an exception: Unknown function min in FunctionQuery('min(1,2)', pos=4) The request that causes the error looks like this: http://localhost:2345/solr/main/select?qt=dismax&qf=name^0.1&qf=name_exact^10.0&debugQuery=true&bf=min(1,2)&version=1.2&wt=json&json.nl=map&q=+foo&start=0&rows=3 I'm not sure where the pos=4 part of the FunctionQuery is coming from. My Solr version is 1.4.1. Does anyone have a hint why I'm getting this error? thanks! -robert
Conditional Scoring (was: Re: DisMaxQueryParser: Unknown function min in FunctionQuery)
sorry, didn't see that. Since the relevance functions are also only available in Solr 4.0 (http://wiki.apache.org/solr/FunctionQuery#Relevance_Functions), I'm not sure if I can solve our requirement in one query (I thought I could use a function query for this). Here's our problem. We have 3 fields:

1. exact_match (text)
2. fuzzy_match (text)
3. popularity (integer)

Our requirement looks as follows: All results which have a match in exact_match MUST score higher than results without a match in exact_match, regardless of the value in the popularity field. All results which have no match in exact_match should use the popularity field for scoring. Is this possible without using a function query? thanks. -robert

On 29.03.11 16:34, Erik Hatcher wrote: On Mar 29, 2011, at 10:01, Robert Gründler wrote: Hi all, I'm trying to implement a FunctionQuery using the bf parameter of the DisMaxQueryParser, but I'm getting an exception: Unknown function min in FunctionQuery('min(1,2)', pos=4) The request that causes the error looks like this: http://localhost:2345/solr/main/select?qt=dismax&qf=name^0.1&qf=name_exact^10.0&debugQuery=true&bf=min(1,2)&version=1.2&wt=json&json.nl=map&q=+foo&start=0&rows=3 I'm not sure where the pos=4 part of the FunctionQuery is coming from. My Solr version is 1.4.1. Does anyone have a hint why I'm getting this error? From http://wiki.apache.org/solr/FunctionQuery#min - min() is 3.2 (though I think that really means 3.1 now, right??). Definitely not in 1.4.1. Erik
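One workaround often suggested on 1.4.x (an assumption on my part, not something confirmed in this thread) is a dismax boost query with a boost large enough to dominate any popularity contribution. It approximates, but does not strictly guarantee, the "exact matches always first" ordering:

    http://localhost:8983/solr/select?qt=dismax&q=foo&qf=fuzzy_match&bq=exact_match:"foo"^10000&bf=popularity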
MySQL queries high when using delta-import
Hi, we have 3 Solr cores, each of them running a delta-import against a MySQL database every 2 minutes. We've noticed a significant increase in MySQL queries per second since we started the delta updates. Before that, the database server received between 50 and 100 queries per second; since the delta-imports, the query count has risen to 100 to 200 queries per second. I temporarily disabled the delta imports for 2 hours, and the queries per second immediately decreased again to 50-100. I followed the wiki entry which only uses one query for the delta import: http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport I did not expect the queries per second against the database to increase that much, so I'm wondering if others have experienced similar issues. cheers -robert
Dataimport performance
Hi, we're looking for some comparison benchmarks for importing large tables from a MySQL database (full-import). Currently, a full-import of ~8 million rows from a MySQL database takes around 3 hours, on a quad-core machine with 16 GB of RAM and a RAID 10 storage setup. Solr is running on an Apache Tomcat instance, where it is the only app. The Tomcat instance has the following memory-related JAVA_OPTS: -Xms4096M -Xmx5120M

The data-config.xml looks like this (only 1 entity):

    <entity name="track" transformer="TemplateTransformer"
            query="select t.id as id, t.title as title, l.title as label
                   from track t left join label l on (l.id = t.label_id)
                   where t.deleted = 0">
      <field column="title" name="title_t" />
      <field column="label" name="label_t" />
      <field column="id" name="sf_meta_id" />
      <field column="metaclass" template="Track" name="sf_meta_class"/>
      <field column="metaid" template="${track.id}" name="sf_meta_id"/>
      <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
      <entity name="artists"
              query="select a.name as artist from artist a
                     left join track_artist ta on (ta.artist_id = a.id)
                     where ta.track_id=${track.id}">
        <field column="artist" name="artists_t" />
      </entity>
    </entity>

We have the feeling that 3 hours for this import is quite long, given the performance of the server running Solr/MySQL. Are we wrong with that assumption, or do people experience similar import times with this amount of data? thanks! -robert
Re: Dataimport performance
What version of Solr are you using?

Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

-robert

Adam

2010/12/15 Robert Gründler rob...@dubture.com: Hi, we're looking for some comparison benchmarks for importing large tables from a MySQL database (full-import). Currently, a full-import of ~8 million rows from a MySQL database takes around 3 hours, on a quad-core machine with 16 GB of RAM and a RAID 10 storage setup. Solr is running on an Apache Tomcat instance, where it is the only app. The Tomcat instance has the following memory-related JAVA_OPTS: -Xms4096M -Xmx5120M

The data-config.xml looks like this (only 1 entity):

    <entity name="track" transformer="TemplateTransformer"
            query="select t.id as id, t.title as title, l.title as label
                   from track t left join label l on (l.id = t.label_id)
                   where t.deleted = 0">
      <field column="title" name="title_t" />
      <field column="label" name="label_t" />
      <field column="id" name="sf_meta_id" />
      <field column="metaclass" template="Track" name="sf_meta_class"/>
      <field column="metaid" template="${track.id}" name="sf_meta_id"/>
      <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
      <entity name="artists"
              query="select a.name as artist from artist a
                     left join track_artist ta on (ta.artist_id = a.id)
                     where ta.track_id=${track.id}">
        <field column="artist" name="artists_t" />
      </entity>
    </entity>

We have the feeling that 3 hours for this import is quite long, given the performance of the server running Solr/MySQL. Are we wrong with that assumption, or do people experience similar import times with this amount of data? thanks! -robert
Re: Dataimport performance
I've benchmarked the import already with 500k records, once without the artists subquery, and once without the join in the main query:

Without subquery: 500k in 3 min 30 sec
Without join and without subquery: 500k in 2 min 30 sec
With subquery and with left join: 320k in 6 min 30 sec

So the joins/subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to denormalize the relations of the SQL data for the index, so I fear I can't really get rid of the join/subquery. -robert

On Dec 15, 2010, at 15:43, Tim Heckman wrote: 2010/12/15 Robert Gründler rob...@dubture.com: The data-config.xml looks like this (only 1 entity): [config snipped, see above] So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import. For my own purposes, I wrote a custom data import using SolrJ to improve the performance (from 3 hours to 10 minutes). Just as a test, how long does it take if you comment out the artists entity?
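Two directions that could help here, both sketches rather than tested solutions. First, assuming MySQL's GROUP_CONCAT is available, the artist sub-entity can often be folded into the parent query and split back into a multivalued field with DIH's RegexTransformer, so only one query runs for the whole import (note that MySQL's group_concat_max_len, default 1024, may need raising):

    <entity name="track" transformer="TemplateTransformer,RegexTransformer"
            query="select t.id as id, t.title as title, l.title as label,
                          group_concat(a.name separator '|') as artist
                   from track t
                   left join label l on (l.id = t.label_id)
                   left join track_artist ta on (ta.track_id = t.id)
                   left join artist a on (a.id = ta.artist_id)
                   where t.deleted = 0
                   group by t.id">
      <field column="artist" name="artists_t" splitBy="\|" />
    </entity>

Second, a custom importer along the lines Tim describes might look roughly like this with SolrJ 1.4 (an untested sketch; URL, credentials, and field names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class TrackImport {
      public static void main(String[] args) throws Exception {
        // queue of 100 docs, 2 background threads pushing to Solr
        StreamingUpdateSolrServer solr =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 2);
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/music", "user", "pass");
        Statement st = con.createStatement(
            ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(Integer.MIN_VALUE); // stream rows instead of buffering
        ResultSet rs = st.executeQuery(
            "select t.id, t.title from track t where t.deleted = 0");
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("sf_unique_id", "Track_" + rs.getLong("id"));
          doc.addField("title_t", rs.getString("title"));
          solr.add(doc);
        }
        solr.commit();
        rs.close(); st.close(); con.close();
      }
    }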
Copying the index from one solr instance to another
Hi again, let's say you have 2 Solr instances which both have exactly the same configuration (schema, solrconfig, etc.). Could it cause any trouble if we import an index from a SQL database on Solr instance A, and copy the whole index to the data dir of Solr instance B (both instances run on different servers)? As far as I can tell, this should work, and instance B should have the exact same index as instance A after the copy process. Are we missing something, or is this workflow safe to go with? -robert
Re: Copying the index from one solr instance to another
thanks for your feedback. We can shut down both Solr servers for the duration of the copy process, and both instances run the same version, so we should be OK. I'll let you know if we encounter any trouble. -robert

On Dec 15, 2010, at 18:11, Shawn Heisey wrote: On 12/15/2010 10:05 AM, Robert Gründler wrote: Hi again, let's say you have 2 Solr instances which both have exactly the same configuration (schema, solrconfig, etc.). Could it cause any trouble if we import an index from a SQL database on Solr instance A, and copy the whole index to the data dir of Solr instance B (both instances run on different servers)? As far as I can tell, this should work, and instance B should have the exact same index as instance A after the copy process.

I believe this should work, but I would take a couple of precautions. I'd stop Solr before putting the new index into place. If you can't have it down for the entirety of the copy process, then copy it into an adjacent directory, shut down Solr, rename the directories, and restart Solr. If the Solr that built the index (specifically, the Lucene that comes with it) is newer than the one you are copying to, it won't work. If you've checked all that and you're still having trouble, let us know. Shawn
Dataimport destroys our harddisks
Hi, we have a serious harddisk problem, and it's definitely related to a full-import from a relational database into a Solr index. The first time it happened on our development server, where the RAID controller crashed during a full-import of ~8 million documents. This happened 2 weeks ago, and in this period 2 of the harddisks where the Solr index files are located stopped working (we needed to replace them). After the crash of the RAID controller, we decided to move the development of Solr/index related stuff to our local development machines. Yesterday I was running another full-import of ~10 million documents on my local development machine, and during the import a harddisk failure occurred. Since this failure, my harddisk activity seems to be at 100% all the time, even if no Solr server is running at all. I've been googling for the last 2 days to find some info about Solr-related harddisk problems, but I didn't find anything useful.

Are there any steps we need to take with respect to harddisk failures when doing a full-import? Right now, our steps look like this:

1. Delete the current index
2. Restart Solr, to load the updated schemas
3. Start the full-import

Initially, the Solr index and the relational database were located on the same harddisk. After the crash, we moved the index to a separate harddisk, but nevertheless this harddisk crashed too. I'd really appreciate any hints on what we might be doing wrong when importing data, as we can't release this on our production servers while there's a risk of harddisk failures. thanks. -robert
Re: Dataimport destroys our harddisks
The very first thing I'd ask is how much free space is on your disk when this occurs? Is it possible that you're simply filling up your disk?

No, I've checked that already. All disks have plenty of space (they have a capacity of 2TB, and are currently filled to about 20%).

Do note that an optimize may require up to 2X the size of your index if/when it occurs. Are you sure you aren't optimizing as you add items to your index?

Index size is not a problem in our case; our index is currently about 3GB. What do you mean by "optimizing as you add items to your index"?

But I've never heard of Solr causing hard disk crashes.

Neither did we, and Google is of the same opinion. One thing that I've found is the mergeFactor value: http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor Our sysadmin speculates that maybe the chunk size of our RAID/harddisks and the segment size of the Lucene index don't play well together. Does the Lucene segment size affect how the data is written to disk?

thanks for your help. -robert

Best Erick

2010/12/2 Robert Gründler rob...@dubture.com: Hi, we have a serious harddisk problem, and it's definitely related to a full-import from a relational database into a Solr index. The first time it happened on our development server, where the RAID controller crashed during a full-import of ~8 million documents. This happened 2 weeks ago, and in this period 2 of the harddisks where the Solr index files are located stopped working (we needed to replace them). After the crash of the RAID controller, we decided to move the development of Solr/index related stuff to our local development machines. Yesterday I was running another full-import of ~10 million documents on my local development machine, and during the import a harddisk failure occurred. Since this failure, my harddisk activity seems to be at 100% all the time, even if no Solr server is running at all. I've been googling the last 2 days to find some info about Solr-related harddisk problems, but I didn't find anything useful. Are there any steps we need to take with respect to harddisk failures when doing a full-import? Right now, our steps look like this: 1. Delete the current index 2. Restart Solr, to load the updated schemas 3. Start the full-import. Initially, the Solr index and the relational database were located on the same harddisk. After the crash, we moved the index to a separate harddisk, but nevertheless this harddisk crashed too. I'd really appreciate any hints on what we might be doing wrong when importing data, as we can't release this on our production servers while there's a risk of harddisk failures. thanks. -robert
Re: Dataimport destroys our harddisks
On Dec 2, 2010, at 15:43, Sven Almgren wrote: What RAID controller do you use, and what kernel version? (Assuming Linux.) We had problems during high load with a 3ware RAID controller and the current kernel for Ubuntu 10.04; we had to downgrade the kernel. The problem was a bug in the driver that only showed up under very high disk load (as is the case when doing imports).

We're running FreeBSD:

RAID controller: 3ware 9500S-8
Corrupt unit: RAID-10, 3725.27GB, 256K stripe size, without BBU
FreeBSD 7.2, UFS filesystem

/Sven

2010/12/2 Robert Gründler rob...@dubture.com: The very first thing I'd ask is how much free space is on your disk when this occurs? Is it possible that you're simply filling up your disk? No, I've checked that already. All disks have plenty of space (they have a capacity of 2TB, and are currently filled to about 20%). Do note that an optimize may require up to 2X the size of your index if/when it occurs. Are you sure you aren't optimizing as you add items to your index? Index size is not a problem in our case; our index is currently about 3GB. What do you mean by "optimizing as you add items to your index"? But I've never heard of Solr causing hard disk crashes. Neither did we, and Google is of the same opinion. One thing that I've found is the mergeFactor value: http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor Our sysadmin speculates that maybe the chunk size of our RAID/harddisks and the segment size of the Lucene index don't play well together. Does the Lucene segment size affect how the data is written to disk? thanks for your help. -robert

Best Erick 2010/12/2 Robert Gründler rob...@dubture.com: Hi, we have a serious harddisk problem, and it's definitely related to a full-import from a relational database into a Solr index. The first time it happened on our development server, where the RAID controller crashed during a full-import of ~8 million documents. This happened 2 weeks ago, and in this period 2 of the harddisks where the Solr index files are located stopped working (we needed to replace them). After the crash of the RAID controller, we decided to move the development of Solr/index related stuff to our local development machines. Yesterday I was running another full-import of ~10 million documents on my local development machine, and during the import a harddisk failure occurred. Since this failure, my harddisk activity seems to be at 100% all the time, even if no Solr server is running at all. I've been googling the last 2 days to find some info about Solr-related harddisk problems, but I didn't find anything useful. Are there any steps we need to take with respect to harddisk failures when doing a full-import? Right now, our steps look like this: 1. Delete the current index 2. Restart Solr, to load the updated schemas 3. Start the full-import. Initially, the Solr index and the relational database were located on the same harddisk. After the crash, we moved the index to a separate harddisk, but nevertheless this harddisk crashed too. I'd really appreciate any hints on what we might be doing wrong when importing data, as we can't release this on our production servers while there's a risk of harddisk failures. thanks. -robert
Is this sort order possible in a single query?
Hi, we have a requirement for one of our search results which has a quite complex sorting strategy. Let me explain the document first, using an example: The document is a book. It has several indexed text fields: Title, Author, Distributor. It has two integer columns, one reflecting the number of sold copies (num_copies), the other the number of comments on the website (num_comments). The requirement for the relevancy looks like this:

* Documents which have exact matches in the Author field should be ranked highest, disregarding their values in the num_copies and num_comments fields
* After the exact matches, the sorting should be based on the value in the field num_copies, but only for documents where this field is set
* After the num_copies matches, the sorting should be based on num_comments

I'm wondering if this kind of sort order can be implemented in a single query, or if I need to break it down into several queries and merge the results at application level. -robert
Re: Is this sort order possible in a single query?
thanks a lot for the explanation. I'm a little confused about Solr 1.5, especially after finding this wiki page: http://wiki.apache.org/solr/Solr1.5 Is there a stable build available for version 1.5, so I can test your suggestion using a function query? -robert

On Nov 24, 2010, at 1:53 PM, Geert-Jan Brits wrote: You could do it by sorting on a functionquery (which is supported from Solr 1.5): http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

Consider the search: http://localhost:8093/solr/select?author:'j.k.rowling'

Sorting like you specified would involve:

1. Introducing an extra field 'author_exact' of type 'string' which takes care of the exact matching. (You can populate it by defining it as a copyField of Author, so your indexing code doesn't change.)

2. Setting sortMissingLast="true" for 'num_copies' and 'num_comments', like: <fieldType name="num_copies" sortMissingLast="true" ...> This makes sure that documents which don't have the value set end up at the end of the sort when sorted on that particular field.

3. Constructing a functionquery that scores either 0 (no match) or x (not sure what x is (1?), but it should always be the same for all exact matches). This gives:

http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc

which scores all exact matches before all partial matches.

4. Now just concatenate the other sorts, giving:

http://localhost:8093/solr/select?author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc, num_copies desc, num_comments desc

That should do it. Please note that 'num_copies' and 'num_comments' still kick in to break the tie for documents that exactly match on 'author_exact'. I assume this is ok. I can't see a way to do it without functionqueries at the moment, which doesn't mean there isn't any. Hope that helps, Geert-Jan

query({!dismax qf=text v='solr rocks'})

2010/11/24 Robert Gründler rob...@dubture.com: Hi, we have a requirement for one of our search results which has a quite complex sorting strategy. Let me explain the document first, using an example: The document is a book. It has several indexed text fields: Title, Author, Distributor. It has two integer columns, one reflecting the number of sold copies (num_copies), the other the number of comments on the website (num_comments). The requirement for the relevancy looks like this: Documents which have exact matches in the Author field should be ranked highest, disregarding their values in the num_copies and num_comments fields. After the exact matches, the sorting should be based on the value in the field num_copies, but only for documents where this field is set. After the num_copies matches, the sorting should be based on num_comments. I'm wondering if this kind of sort order can be implemented in a single query, or if I need to break it down into several queries and merge the results at application level. -robert
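For reference, a sketch of the schema side of step 1, assuming the source field is called "author" (standard Solr copyField syntax; names are illustrative):

    <field name="author" type="text" indexed="true" stored="true"/>
    <field name="author_exact" type="string" indexed="true" stored="false"/>
    <copyField source="author" dest="author_exact"/>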
Respect token order in matches
Hi, is there a way to make Solr respect the order of token matches when the query is a multi-term string? Here's an example:

Query string: John C

Indexed strings:
- John Cage
- Cargill John

This will return both indexed strings as a result. However, "Cargill John" should not match in that case, because the order of the tokens is not the same as in the query. Here's the fieldtype:

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>

Is there a way to achieve this using this fieldtype? thanks!
LockReleaseFailedException
Hi, I'm suddenly getting a LockReleaseFailedException when starting a full-import using the DataImportHandler: org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component. This worked without problems until just now. Is there some lock file I can remove to unlock the index again? thanks. -robert
Re: EdgeNGram relevancy
thanks for the explanation. The results for the autocompletion are pretty good now, but we still have a small problem: when there are hits in the edgytext2 field, results which only have hits in the edgytext field should not be returned at all. Example:

Query: Martin Sco

Current results (in that order):
- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context only "Martin Scorsese" makes sense; the 2 others are logically not correct. I'm not sure if this can be solved on the Solr side, or if we should implement the logic in the application. thanks! -robert

On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote: Without the parens, the edgytext: only applied to "Mr"; the default field still applied to "Scorsese". The double quotes are necessary in the second case (rather than parens) because on a non-tokenized field the standard query parser will pre-tokenize on whitespace before sending individual whitespace-separated words to match the index. If the index includes multi-word tokens with internal whitespace, they will never match. With the double quotes the query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact.

Robert Gründler wrote: Did you run your query without using () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Re: EdgeNGram relevancy
it seems adding the '+' (required) operator to each term in a multi-term query does the trick: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+ i.e.: edgytext2:(+Martin +Sco) -robert

On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote: thanks for the explanation. The results for the autocompletion are pretty good now, but we still have a small problem: when there are hits in the edgytext2 field, results which only have hits in the edgytext field should not be returned at all. Example: Query: Martin Sco. Current results (in that order): Martin Scorsese, Martin Lawrence, Joseph Martin. However, in an autocompletion context only "Martin Scorsese" makes sense; the 2 others are logically not correct. I'm not sure if this can be solved on the Solr side, or if we should implement the logic in the application. thanks! -robert

On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote: Without the parens, the edgytext: only applied to "Mr"; the default field still applied to "Scorsese". The double quotes are necessary in the second case (rather than parens) because on a non-tokenized field the standard query parser will pre-tokenize on whitespace before sending individual whitespace-separated words to match the index. If the index includes multi-word tokens with internal whitespace, they will never match. With the double quotes the query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact.

Robert Gründler wrote: Did you run your query without using () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
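Putting the pieces of this thread together, the complete autocomplete request might then look like this sketch (field names from the thread; the boost value is illustrative):

    q=edgytext:(+Martin +Sco) OR edgytext2:"Martin Sco"^2.0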
EdgeNGram relevancy
Hi, consider the following fieldtype (used for autocompletion):

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird, though. Example:

Query string: Bill Cl

Result (in that order):
- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

"Bill Clinton" should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than those which match in either token? thanks! -robert
Re: EdgeNGram relevancy
thanks a lot, that setup works pretty well now. The only problem now is that the stopwords do not work that well anymore. I'll provide an example, but first the 2 fieldtypes:

    <!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->
    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>

    <!-- autocomplete field which finds startsWith matches only ("scor" matches only "Scorpio", not "Martin Scorsese") -->
    <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>

This setup now makes trouble regarding stopwords; here's an example. Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:"Mr Scorsese"^2.0

This way, the only result I get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why "Martin Scorsese" is not in the result at all? thanks again! -robert

On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first.

--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): [fieldtype as above] This works fine as long as the query string is a single word. For multiple words, the ranking is weird, though. Example: Query string: Bill Cl. Result (in that order): Clyde Phillips, Clay Rogers, Roger Cloud, Bill Clinton. "Bill Clinton" should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than those which match in either token? thanks! -robert
Re: Concatenate multiple tokens into one
I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but I realized that what I wanted to achieve is implemented more easily in another way (by using 2 separate field types). Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: EdgeNGram relevancy). best -robert

On Nov 11, 2010, at 5:27 PM, Nick Martin wrote: Hi Robert, All, I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/ I want to include stopword removal and lowercase the incoming terms, the idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory. If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful. Many thanks Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote: On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote: Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do.

In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.

And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place. Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.

I started out with the KeywordTokenizer, which worked well, except for the stopword problem. For now, I've come up with a quick-and-dirty custom ConcatFilter, which does what I'm after:

    public class ConcatFilter extends TokenFilter {

        private TokenStream tstream;

        protected ConcatFilter(TokenStream input) {
            super(input);
            this.tstream = input;
        }

        @Override
        public Token next() throws IOException {
            Token token = new Token();
            StringBuilder builder = new StringBuilder();
            TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
            TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
            boolean incremented = false;
            while (tstream.incrementToken()) {
                if (typeAttribute.type().equals("word")) {
                    builder.append(termAttribute.term());
                }
                incremented = true;
            }
            token.setTermBuffer(builder.toString());
            if (incremented == true)
                return token;
            return null;
        }
    }

I'm not sure if this is a safe way to do this, as I'm not familiar with the whole Solr/Lucene implementation after all. best -robert

Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it. If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do pre-tokenization; the 'field' query parser should work well for this.

Jonathan

From: Robert Gründler [rob...@dubture.com] Sent: Wednesday, November 10, 2010 6:39 PM To: solr-user@lucene.apache.org Subject: Concatenate multiple tokens into one

Hi, I've created the following filter chain in a field type; the idea is to use it for autocompletion purposes:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
    <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
    <!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filter chain, the EdgeNGramFilterFactory will receive multiple tokens for input strings with whitespace in them. This leads to the following results: Input query: George Cloo. Matches: George Harrison, John Clooridge, George Smith, George Clooney, etc.
Re: Concatenate multiple tokens into one
this is the full source code, but be warned, I'm not a Java developer, and I have no background in Lucene/Solr development:

    // ConcatFilter
    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class ConcatFilter extends TokenFilter {

        protected ConcatFilter(TokenStream input) {
            super(input);
        }

        @Override
        public Token next() throws IOException {
            Token token = new Token();
            StringBuilder builder = new StringBuilder();
            TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
            TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);
            boolean hasToken = false;
            while (input.incrementToken()) {
                if (typeAttribute.type().equals("word")) {
                    builder.append(termAttribute.term());
                    hasToken = true;
                }
            }
            if (hasToken == true) {
                token.setTermBuffer(builder.toString());
                return token;
            }
            return null;
        }
    }

    // ConcatFilterFactory
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    public class ConcatFilterFactory extends BaseTokenFilterFactory {
        @Override
        public TokenStream create(TokenStream stream) {
            return new ConcatFilter(stream);
        }
    }

And in your schema.xml, you can simply add the filter factory using this element:

    <filter class="com.example.ConcatFilterFactory" />

Jar files I have included in the build path (can be found in the Solr download package): apache-solr-core-1.4.1.jar, lucene-analyzers-2.9.3.jar, lucene-core-2.9.3.jar

good luck ;) -robert

On Nov 11, 2010, at 8:45 PM, Nick Martin wrote: Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention. Best Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote: I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but I realized that what I wanted to achieve is implemented more easily in another way (by using 2 separate field types). Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: EdgeNGram relevancy). best -robert

On Nov 11, 2010, at 5:27 PM, Nick Martin wrote: Hi Robert, All, I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/ I want to include stopword removal and lowercase the incoming terms, the idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory. If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful. Many thanks Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote: On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote: Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do. In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion. And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place. Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string. I started out with the KeywordTokenizer, which worked well, except for the stopword problem. For now, I've come up with a quick-and-dirty custom ConcatFilter, which does what I'm after: public class ConcatFilter extends TokenFilter { private TokenStream tstream; protected ConcatFilter(TokenStream input) { super(input); this.tstream = input; } @Override public Token next() throws IOException { Token token = new Token(); StringBuilder builder = new StringBuilder(); TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class); TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class); boolean incremented = false; while (tstream.incrementToken()) { if (typeAttribute.type().equals(word
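As an aside, the same idea can also be written against the incrementToken() API that Lucene 2.9 supports, which avoids the deprecated next() method. This is a minimal, untested sketch (it assumes, like the original, that only tokens of type "word" should be concatenated):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public final class ConcatFilter extends TokenFilter {

        private final TermAttribute termAtt;
        private final TypeAttribute typeAtt;
        private boolean done = false;

        public ConcatFilter(TokenStream input) {
            super(input);
            termAtt = addAttribute(TermAttribute.class);
            typeAtt = addAttribute(TypeAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (done) {
                return false;
            }
            StringBuilder sb = new StringBuilder();
            boolean sawToken = false;
            // consume the whole upstream token stream, appending "word" tokens
            while (input.incrementToken()) {
                if ("word".equals(typeAtt.type())) {
                    sb.append(termAtt.term());
                }
                sawToken = true;
            }
            done = true;
            if (!sawToken) {
                return false;
            }
            clearAttributes();
            termAtt.setTermBuffer(sb.toString()); // emit one concatenated token
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            done = false;
        }
    }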
Re: EdgeNGram relevancy
according to the fieldtype I posted previously, I think it's because of:

1. The WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips"
2. The EdgeNGramFilter gets the 2 tokens and creates an EdgeNGram for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl". -robert

On Nov 11, 2010, at 21:34, Andy wrote: Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks.

--- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first. --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): [fieldtype snipped, see the "EdgeNGram relevancy" post above] This works fine as long as the query string is a single word. For multiple words, the ranking is weird, though. Example: Query string: Bill Cl. Result (in that order): Clyde Phillips, Clay Rogers, Roger Cloud, Bill Clinton. "Bill Clinton" should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than those which match in either token? thanks! -robert
Re: EdgeNGram relevancy
Did you run your query without using () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Best practices to rebuild index on live system
Hi again, we're coming closer to the rollout of our newly created Solr/Lucene-based search, and I'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (i.e. A, B, C), where the largest one takes about 1.5 hours for a full data import from the relational database. The index is updated in realtime, through post-insert/update/delete events in our ORM. So far, I can only think of 2 scenarios for rebuilding the index if we need to update the schema after the rollout:

1. Create 3 more cores (A1, B1, C1), import the data from the database, and after importing switch the application to cores A1, B1, C1. This will most likely leave us with an inconsistent index, as during the 1.5 hours of indexing the database might get inserts/updates/deletes.

2. Put the live system in a read-only mode and rebuild the index during that time. This ensures data integrity in the index, with the drawback that users can't write to the app.

Does Solr provide any built-in approaches to this problem? best -robert
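For what it's worth, the cutover step in scenario 1 can be done with the CoreAdmin SWAP command, which exchanges two cores under their names in one operation (this only addresses the switch itself, not the writes that arrive during the 1.5-hour rebuild; those would still need to be replayed or re-imported):

    http://localhost:8983/solr/admin/cores?action=SWAP&core=A&other=A1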
Concatenate multiple tokens into one
Hi, I've created the following filter chain in a field type; the idea is to use it for autocompletion purposes:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
    <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
    <!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filter chain, the EdgeNGramFilterFactory will receive multiple tokens for input strings with whitespace in them. This leads to the following results:

Input query: George Cloo

Matches:
- George Harrison
- John Clooridge
- George Smith
- George Clooney
- etc.

However, only "George Clooney" should match in the autocompletion use case. Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory. Are there filters which can do such a thing? If not, are there examples of how to implement a custom TokenFilter? thanks! -robert
Re: Concatenate multiple tokens into one
On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote: Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do.

In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.

And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place. Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.

I started out with the KeywordTokenizer, which worked well, except for the stopword problem. For now, I've come up with a quick-and-dirty custom ConcatFilter, which does what I'm after:

    public class ConcatFilter extends TokenFilter {

        private TokenStream tstream;

        protected ConcatFilter(TokenStream input) {
            super(input);
            this.tstream = input;
        }

        @Override
        public Token next() throws IOException {
            Token token = new Token();
            StringBuilder builder = new StringBuilder();
            TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
            TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
            boolean incremented = false;
            while (tstream.incrementToken()) {
                if (typeAttribute.type().equals("word")) {
                    builder.append(termAttribute.term());
                }
                incremented = true;
            }
            token.setTermBuffer(builder.toString());
            if (incremented == true)
                return token;
            return null;
        }
    }

I'm not sure if this is a safe way to do this, as I'm not familiar with the whole Solr/Lucene implementation after all. best -robert

Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it. If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do pre-tokenization; the 'field' query parser should work well for this.

Jonathan

From: Robert Gründler [rob...@dubture.com] Sent: Wednesday, November 10, 2010 6:39 PM To: solr-user@lucene.apache.org Subject: Concatenate multiple tokens into one

Hi, I've created the following filter chain in a field type; the idea is to use it for autocompletion purposes:

    <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
    <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <!-- throw out stopwords -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> <!-- throw out everything except a-z -->
    <!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" /> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filter chain, the EdgeNGramFilterFactory will receive multiple tokens for input strings with whitespace in them. This leads to the following results:

Input query: George Cloo

Matches:
- George Harrison
- John Clooridge
- George Smith
- George Clooney
- etc.

However, only "George Clooney" should match in the autocompletion use case. Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory. Are there filters which can do such a thing? If not, are there examples of how to implement a custom TokenFilter? thanks! -robert
Dataimporthandler crashed raidcontroller
Hi all, we had a severe problem with the RAID controller on one of our servers today while importing a table with ~8 million rows into a Solr index. After importing about 4 million documents, our server shut down and failed to restart due to a corrupt RAID disk. The Solr data import was the only heavy process running on that machine during the crash. Has anyone experienced HDD/RAID-related problems while indexing large SQL databases into Solr? thanks! -robert