ANNOUNCE: Stump The Chump @ Lucene Revolution EU - Tomorrow
(Note: cross-posted announcement; please confine any replies to solr-user)

Hey folks,

On Wednesday, I'll be doing a Stump The Chump session at Lucene Revolution EU in Dublin, Ireland.

http://lucenerevolution.org/stump-the-chump

If you aren't familiar with Stump The Chump, it is a Q&A-style session where I (the Chump) get put on the hot seat to answer tough / interesting / unusual questions about Lucene and Solr -- live, on stage, in front of hundreds of people who are laughing at me, with judges who have all seen and thought about the questions in advance and get to mock me and make me look bad. It's really a lot of fun.

Even if you won't be at the conference, you can still participate by emailing your challenging question to st...@lucenerevolution.org. (Regardless of whether you already found a solution to a tough problem, you can still submit it and see what kind of creative solution I might come up with under pressure.)

Prizes will be awarded at the discretion of the judges, and video should be posted online at some point soon after the con -- more details and links to videos of past sessions are in my recent blog posts...

http://searchhub.org/tag/chump/

-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Twitter analyser
If the universe of items you want to match this way is small, consider something akin to synonyms. Your indexing process emits two tokens, with and without the @ or #, which should cover your situation.

FWIW,
Erick

On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll <stephane.nic...@gmail.com> wrote:

> Hi,
>
> I am building an application that indexes tweets and offers some basic
> search facilities on them. I am trying to find a combination where the
> following would work:
>
> * foo matches the word foo, the mention (@foo), or the hashtag (#foo)
> * @foo matches only the mention
> * #foo matches only the hashtag
>
> It should match complete words, so I used the WhitespaceAnalyzer for
> indexing.
>
> Any recommendation for this use case?
>
> Thanks!
> S.
>
> Sent from my iPhone
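[Editor's sketch] Erick's suggestion — emit each token both with and without its leading @ or # — can be sketched in plain Java. This is only the emission logic, not a Lucene TokenFilter; the class and method names here are illustrative, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "two tokens per term" idea: for a token that starts with
// @ or #, emit both the original form and the bare word. A query on "foo"
// then matches any form, while "@foo" / "#foo" still match exactly.
public class DualTokenEmitter {
    public static List<String> emit(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        if (token.length() > 1 && (token.charAt(0) == '@' || token.charAt(0) == '#')) {
            out.add(token.substring(1)); // bare form for un-prefixed queries
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("#foo")); // [#foo, foo]
        System.out.println(emit("@foo")); // [@foo, foo]
        System.out.println(emit("foo"));  // [foo]
    }
}
```

In a real analysis chain this logic would live in a custom TokenFilter that emits the bare form at the same position as the prefixed one (position increment 0), so phrase queries still work.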
Re: Twitter analyser
Hi,

Thanks for the reply. It's an index of tweets, so any word really is a target for this. That would mean a significant increase in the size of the index. My volumes are really small, so that shouldn't be a problem (but performance/scalability is a concern).

I do have control over the query. Another solution would be to translate a query on foo into foo OR #foo OR @foo. WDYT?

Thanks!
S.
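[Editor's sketch] Stéphane's query-side alternative — expanding a bare term into an OR over all three forms — can be sketched in plain Java. The class and method names are hypothetical, not Lucene APIs; in practice the same rewrite could build a BooleanQuery instead of a query string.

```java
// Query-side rewrite: a bare term expands to itself OR its hashtag OR its
// mention form; a term that already carries @ or # is left alone so it
// stays exact. Illustrative only, not a Lucene API.
public class TweetQueryExpander {
    public static String expand(String term) {
        if (term.startsWith("@") || term.startsWith("#")) {
            return term; // explicit prefix: match only that form
        }
        return "(" + term + " OR #" + term + " OR @" + term + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("foo"));  // (foo OR #foo OR @foo)
        System.out.println(expand("#foo")); // #foo
        System.out.println(expand("@foo")); // @foo
    }
}
```

This keeps the index small at the cost of slightly larger queries — the mirror image of Erick's index-time approach.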
Re: Twitter analyser
You have to get the values _into_ the index with the special characters; that's where the issue is. Depending on your analysis chain, special characters may or may not even be in your index to search in the first place. So it's not how many different words come after the special characters as much as how many special characters there are.

So what I'm thinking is that as you index documents, you detect #foo, #blah, #whatever and index #foo, foo, #blah, blah, etc. If all you have to do is specially handle tokens that start with just a few different characters, it's not very hard.

FWIW,
Erick
Re: Twitter analyser
You can specify custom character types with the word delimiter filter, so you could define @ and # as "digit" and set SPLIT_ON_NUMERICS. This would cause @foo to tokenize as two adjacent terms, and ditto for #foo. Unfortunately, a user name or tag that starts with a digit would not tokenize as desired, but that seems uncommon. A query on foo would match all three forms, since the @ or # would tokenize as a separate term.

Use:

public WordDelimiterFilter(TokenStream in, byte[] charTypeTable, int configurationFlags, CharArraySet protWords)

See:

http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

-- Jack Krupansky
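[Editor's sketch] The splitting behavior Jack describes — @ and # classified as a different character type so the filter breaks "@foo" into two adjacent terms — can be simulated in plain Java. This is a simulation of the described behavior for illustration, not the real WordDelimiterFilter or its charTypeTable format.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of split-on-character-type: @ and # belong to one class,
// everything else to another, and a token splits wherever the class
// changes. So "@foo" becomes the two adjacent terms "@" and "foo".
public class DelimiterSplitSim {
    static boolean isMarker(char c) { return c == '@' || c == '#'; }

    public static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : token.toCharArray()) {
            // class change between the previous char and this one => boundary
            if (cur.length() > 0 && isMarker(c) != isMarker(cur.charAt(cur.length() - 1))) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
            cur.append(c);
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("@foo")); // [@, foo]
        System.out.println(split("#foo")); // [#, foo]
        System.out.println(split("foo"));  // [foo]
    }
}
```

With this tokenization, a query on foo hits the "foo" term in all three cases, while @foo and #foo become two-term phrase-like queries that only match the prefixed forms.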
Corrupt Index with IndexWriter.addIndexes(IndexReader readers[])
Hello,

I got an index corruption in production, and was wondering if it might be a known bug (still with Lucene 3.1), or if my code is doing something wrong. It's a local-disk index. No known machine power loss. This isn't supposed to even happen, right?

The index that got corrupted is updated every 30 seconds, by adding to it a small delta index (using addIndexes()) that was replicated from another machine. The series of writer actions to update the index is:

1. writer.deleteDocuments(q);
2. writer.flush(false, true);
3. writer.addIndexes(reader);
4. writer.commit(map);

Is the index exposed to corruption only during commit, or is addIndexes() risky by itself (the docs say it's not)? LUCENE-2610 (https://issues.apache.org/jira/browse/LUCENE-2610) kind of looks in the neighborhood, though it's not a bug report. I'll add an ls -l output in a follow-up email.

Technically the first indication of problems is when calling flush, but it could be that the previous writer action left the index broken for flush to fail on. My stack trace is:

Caused by: java.io.FileNotFoundException: /disks/data1/opt/WAS/LotusConnections/Data/catalog/index/Places/index/_33gg.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:91)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
    at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:66)
    at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:113)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:578)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:684)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:659)
    at org.apache.lucene.index.BufferedDeletes.applyDeletes(BufferedDeletes.java:283)
    at org.apache.lucene.index.BufferedDeletes.applyDeletes(BufferedDeletes.java:191)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3358)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3296)
Modify the StandardTokenizerFactory to concatenate all words
Currently I'm using StandardTokenizerFactory, which tokenizes words based on spaces. For "Toy Story" it will create the tokens "toy" and "story". Ideally, I would want to extend the functionality of StandardTokenizerFactory to create the tokens "toy", "story", and "toy story". How do I do that?
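[Editor's sketch] The token set the poster wants — the whitespace unigrams plus the whole field joined as one token — can be sketched in plain Java. In a real analysis chain this would be a custom TokenFilter (or tokenizer); this is only the emission logic, not a Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the desired output: split the field on whitespace, lowercase,
// emit each word, then emit the full joined phrase as one extra token.
// "Toy Story" yields [toy, story, toy story]. Not a Lucene TokenFilter.
public class UnigramsPlusPhrase {
    public static List<String> tokens(String field) {
        List<String> out = new ArrayList<>();
        String[] words = field.trim().toLowerCase().split("\\s+");
        for (String w : words) out.add(w);
        if (words.length > 1) out.add(String.join(" ", words)); // full phrase
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Toy Story")); // [toy, story, toy story]
    }
}
```

Note that emitting the full concatenation only helps if queries produce the same single token; otherwise the follow-up question in the next message applies — something has to decide that "Toy Story" is one unit.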
Re: Modify the StandardTokenizerFactory to concatenate all words
How would you expect to recognize that "Toy Story" is a thing?