Re: Chinese chars are not indexed ?
I am using the sample, not deploying Solr in Tomcat. Is there a place I can modify this setting? Ah, okay, if you are using Jetty with java -jar start.jar then that is fine. But for Chinese you need a special tokenizer, since Chinese is written without spaces between words: <tokenizer class="solr.CJKTokenizerFactory"/> Or you can search with both a leading and a trailing star; q=*ChineseText* should return something.
Re: Chinese chars are not indexed ?
oh yes, *...* works. thanks. I saw the tokenizer is defined in schema.xml. There are a few places that define the tokenizer. Wondering if it is enough to define one for:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- this is the only one I need to modify? -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

thanks, canal
Re: Chinese chars are not indexed ?
oh yes, *...* works. thanks. I saw the tokenizer is defined in schema.xml. There are a few places that define the tokenizer. Wondering if it is enough to define one for: It is better to define a brand new field type specific to Chinese. http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean Something like, at index time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
and at query time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PositionFilterFactory"/>
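Assembling the index- and query-time analyzers suggested above into a complete field type, schema.xml could contain something like the following. This is a sketch: the name text_cjk is an illustrative choice, not anything from the thread.

```xml
<!-- Dedicated field type for Chinese/Japanese/Korean text.
     CJKTokenizer emits overlapping character bigrams, so no
     whitespace is needed between words. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- PositionFilter flattens token positions so the query parser
         does not turn the bigrams into a strict phrase -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
```

Fields holding Chinese content would then be declared with type="text_cjk" instead of the default text type.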
one to many denormalization approach
Hi, I have an architectural question about using Apache Solr/Lucene. I'm building a Solr index for searching a CV database. Basically every CV on there will have some fields like rate of pay, address, and title; these fields are straightforward. The area I need advice on is skills and job history. For skills, someone might add an entry like: Ruby - 5 Years, Java - 9 Years. For example:

CV: John Smith, 27
Skills:
  Java, 5 Years
  Sql, 4 Years
  Lucene, 1 Year
Jobs:
  1998-2004 Acme Search Ltd, Senior Java Developer, New York City, US
  2004-2009 Software Labs Ltd, Technical Architect, San Francisco, CA, US

So there's essentially N number of skills, each with a string name and an int number of years. I was thinking I could use a dynamic field, *_skill, and possibly add them like so: 1_skill: Ruby, 2_skill: Java. But how can I index the years of experience? Would I then add a dynamic field like 1_skill_years: 5, 2_skill_years: 9? How would I fit these into the index? Any help greatly appreciated. Regards
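One way to realize the numbered-dynamic-field idea from the question is to pair each *_skill string field with a *_skill_years int field under the same numeric prefix. A sketch of what that could look like (field and type names here are illustrative, not from an actual schema):

```xml
<!-- schema.xml: paired dynamic fields; the shared numeric prefix
     (1_, 2_, ...) is what correlates a skill with its years -->
<dynamicField name="*_skill"       type="string" indexed="true" stored="true"/>
<dynamicField name="*_skill_years" type="int"    indexed="true" stored="true"/>

<!-- a document using them -->
<doc>
  <field name="1_skill">Ruby</field>
  <field name="1_skill_years">5</field>
  <field name="2_skill">Java</field>
  <field name="2_skill_years">9</field>
</doc>
```

A query like 1_skill:Java OR 2_skill:Java would then find the CV regardless of slot, though range-filtering years per skill still requires knowing the slot number; another common approach is one Solr document per (CV, skill) pair.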
Question about the mailinglist (junk on my behalf)
Hello community, for a few days now I have been receiving daily mails with suspicious content. They say that some of my mails were rejected because of the file types of the mails' attachments, among other things. This puzzles me a lot, because I didn't send any mails with attachments, and even the email addresses that want to make me aware of my rejected mails are unknown to me. This is the first mailing list I have joined, and I know that there are a lot of bots out there crawling for email addresses to send junk to. However, I can't recognize any suspicious behaviour except those mails. The number of mails making me aware of this is 10 in a few days, maybe 15 but not more. And I do not get more junk than I normally get. Does anyone else receive suspicious emails on my behalf? Thank you. - Mitch
RE: is there a delete all command in updateHandler?
Hi Li, Yes, you can issue a delete-all with: curl http://your_solr_server:your_solr_port/solr/update -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>' Hope it helps. Cheers, Daniel -----Original Message----- From: Li Li [mailto:fancye...@gmail.com] Sent: 28 June 2010 03:41 To: solr-user@lucene.apache.org Subject: is there a delete all command in updateHandler? I want to delete all the index and rebuild the index frequently. I can't delete the index files directly because I want to use replication.
preside != president
Hi, It seems to me that because stemming does not produce grammatically correct stems in many cases, search anomalies can occur like the one I am seeing: I have a document with "president" in it, and it is returned when I search for "preside", a different word entirely. Is this correct or acceptable behavior? In previous discussions here on stemming I was told it's OK as long as all forms of a word reduce to the same stem, but when different words reduce to the same stem it seems to affect search results in a bad way. Darren
Search limit to the first 50 000 chars for one field
Hi, I use Solr 1.4 to search the contents of documents (pdf, doc, odt ...), indexed via the /update/extract module. When I search, matching is limited to the first 50,000 characters (approximately). Any word or sentence after that is not found (but the field has more than 50,000 characters when I retrieve it in a search result). I searched for anyone with the same problem, or for a setting to resolve this, but I found nothing. How can I increase this limit? The line of my schema.xml for the field in which I search: <field name="text" type="text" indexed="true" stored="true" multiValued="true" termPositions="true" termOffsets="true" compressed="true"/> I store the content to use the Highlighting module. And here are my search options (but without options, I have the same problem): /select?q=mySearch&start=0&rows=1250&fl=id&hl=on&hl.fl=text&omitHeader=true&hl.mergeContiguous=true&hl.snippets=5&hl.simple.pre=[PRE_FIND_START]&hl.simple.post=[PRE_FIND_END]&wt=phps Thank you in advance for your reply. Best regards, Julien
Re: Search limit to the first 50 000 chars for one field
How can I increase this limit? The maxFieldLength configuration can be done in solrconfig.xml: <maxFieldLength>2147483647</maxFieldLength>
Re: Data Import Handler Rich Format Documents
Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource It seems that DIH-Tika integration is not a part of the Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds. https://issues.apache.org/jira/browse/SOLR-1583 My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@whatever:12345:whatever" user="me" name="ds-db" password="secret"/>
  <dataSource type="BinURLDataSource" name="ds-url"/>
  <document>
    <entity name="my_database" dataSource="ds-db" query="select * from my_database where rownum &lt;= 2">
      <field column="CONTENT_ID" name="content_id"/>
      <field column="CMS_TITLE" name="cms_title"/>
      <field column="FORM_TITLE" name="form_title"/>
      <field column="FILE_SIZE" name="file_size"/>
      <field column="KEYWORDS" name="keywords"/>
      <field column="DESCRIPTION" name="description"/>
      <field column="CONTENT_URL" name="content_url"/>
    </entity>
    <entity name="my_database_url" dataSource="ds-url" query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
      <entity processor="TikaEntityProcessor" dataSource="ds-url" format="text" url="http://www.mysite.com/${my_database.content_url}">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I added the entity name="my_database_url" section to an existing (working) database entity to be able to have Tika index the content pointed to by content_url. Is there anything obviously wrong with what I've tried so far? I think you should move the Tika entity into the my_database entity and simplify the whole configuration:

<entity name="my_database" dataSource="ds-db" query="select * from my_database where rownum &lt;= 2">
  ...
  <field column="CONTENT_URL" name="content_url"/>
  <entity processor="TikaEntityProcessor" dataSource="ds-url" format="text" url="http://www.mysite.com/${my_database.content_url}">
    <field column="text"/>
  </entity>
</entity>
Strange query behavior
Hello, I have a title that says "3DVIA Studio & Virtools Maya and 3dsMax Exporters". The analysis tool for this field gives me these tokens: 3dvia, dvia, studio, virtool, maya, 3dsmax, ds, systèm, max, export. However, when I search for 3dsmax, I get no results :( Furthermore, if I search for dsmax I get the spellchecker suggesting 3dsmax, even though it doesn't find any results. If I search for any other token (3dvia, or max for example), the document is found. 3dsmax is the only token that doesn't seem to work!! :( Here is my schema for this field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Can anyone help me out please? :( PS: the ${Language} is set to en (for English) in this case...
Re: Search limit to the first 50 000 chars for one field
Ok thanks, it works. Best regards, Julien
Re: preside != president
Hi Darren, You might want to look at the KStemmer (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem) instead of the standard PorterStemmer. It essentially has a 'dictionary' of exception words where stemming stops if found, so in your case president won't be stemmed any further than president (but presidents will be stemmed to president). You will have to integrate it into solr yourself, but that's straightforward. HTH Brendan
Re: preside != president
Thanks for the tip. Yeah, I think the stemming confounds search results as it stands (porter stemmer). I was also thinking of using my dictionary of 500,000 words with their complete morphologies and conjugations to create a synonyms.txt that provides accurate English morphology. Is this a good idea? Darren
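The dictionary-to-synonyms idea above can be sketched in a few lines: flatten each headword and its forms into one comma-separated synonyms.txt group, which Solr's SynonymFilterFactory treats as equivalent terms. The tiny morphology dictionary here is a made-up stand-in for a real 500,000-word lexicon.

```python
# Flatten a morphology dictionary into Solr synonyms.txt groups.
# The dictionary below is a made-up stand-in for a real lexicon.
morphology = {
    "preside": ["presides", "presided", "presiding"],
    "president": ["presidents", "presidential"],
}

def to_synonym_lines(morph):
    # One line per headword: "headword,form1,form2,..." -
    # comma-separated terms on a synonyms.txt line are equivalent.
    return ["%s,%s" % (head, ",".join(forms))
            for head, forms in sorted(morph.items())]

lines = to_synonym_lines(morphology)
print("\n".join(lines))
```

Note that with an explicit synonyms file like this, preside and president stay in separate groups, which is exactly the distinction the Porter stemmer collapses.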
DataImportHandler $deleteDocById question
Hi all. I'm trying to get $deleteDocById working, but no documents are being deleted from my index. I'm using Full-Import (withOUT cleaning) and a script with: row.put('$deleteDocById', row.get('codAnuncio')); The script passes through this line for every document it processes (for testing purposes). The schema has: <uniqueKey>codanuncio</uniqueKey> What can be wrong? Thanks
custom core admin handler
Hi all, I have been using Solr for quite a while, but I never really got into looking at the code. Last week that all changed, I decided to write a custom core admin handler. I've posted something on my blog about it, along with a Drupal centric howto. I'd be interested to know what people think of it. The post is at http://davehall.com.au/blog/dave/2010/06/26/multi-core-apache-solr-ubuntu-1004-drupal-auto-provisioning It's been a while since I hacked on Java, so I am sure there are bits that can be improved. Feel free to email me on or off list, or post a comment on my blog. If there is interest in including this in Solr, I would be willing to relicense it. Cheers Dave
Re: Chinese chars are not indexed ?
What if Chinese is mixed with English? I have text that is entered by users and it could be a mix of Chinese, English, etc. What's the best way to handle that? Thanks.
Re: preside != president
The general consensus among people who run into the problem you have is to use a plurals-only stemmer, a synonyms file, or a combination of both (for irregular nouns etc.). If you search the archives you can find info on a plurals stemmer.
Re: Strange query behavior
splitOnCaseChange is creating multiple tokens from 3dsMax; disable it or enable catenateAll. Use the analysis page in the admin tool to see exactly how your text will be indexed by the analyzers without having to reindex your documents; once you have it right you can do a full reindex.
Re: questions about Solr shards
There is a first-pass query to retrieve all matching document IDs from every shard, along with the relevant sorting information; the document IDs are then sorted and limited to the number needed, and then a second query is sent for the rest of the documents' metadata. On Sun, Jun 27, 2010 at 7:32 PM, Babak Farhang farh...@gmail.com wrote: Otis, Belated thanks for your reply. 2. The index could change between stages, e.g. a document that matched a query and was subsequently changed may no longer match but will still be retrieved. 2. This describes the situation where, for instance, a document with ID=10 is updated between the 2 calls to the Solr instance/shard where that doc ID=10 lives. Can you explain why this happens? (I.e. does each query to the sharded index somehow involve 2 calls to each shard instance from the base instance?) -Babak On Thu, Jun 24, 2010 at 10:14 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Babak, 1. Yes, you are reading that correctly. 2. This describes the situation where, for instance, a document with ID=10 is updated between the 2 calls to the Solr instance/shard where that doc ID=10 lives. 3. Yup, orthogonal. You can have a master with multiple cores for sharded and non-sharded indices and you can have a slave with cores that hold complete indices or just their shards. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Babak Farhang farh...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, June 24, 2010 6:32:54 PM Subject: questions about Solr shards Hi everyone, There are a couple of notes on the limitations of this approach at http://wiki.apache.org/solr/DistributedSearch which I'm having trouble understanding. 1. When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones Received here is from the perspective of the base Solr instance at query time, right? I.e.
if you inadvertently indexed 2 versions of the document with the same unique ID but different contents to 2 shards, then at query time, the first document (putting aside for the moment what exactly first means) would win. Am I reading this right? 2. The index could change between stages, e.g. a document that matched a query and was subsequently changed may no longer match but will still be retrieved. I have no idea what this second statement means. And one other question about shards: 3. The examples I've seen documented do not illustrate sharded, multicore setups; only sharded monolithic cores. I assume sharding works with multicore as well (i.e. the two issues are orthogonal). Is this right? Any help on interpreting the above would be much appreciated. Thank you, -Babak
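The two-pass flow described in the reply above can be simulated in a few lines; the shard data and field names here are invented purely for illustration.

```python
# Simulate Solr's two-pass distributed query: pass 1 collects
# (score, id) sort info from every shard, the coordinator merges and
# truncates, and pass 2 fetches stored fields only for the winners.
shards = [
    {"a": (0.9, {"title": "doc a"}), "b": (0.4, {"title": "doc b"})},
    {"c": (0.7, {"title": "doc c"}), "d": (0.2, {"title": "doc d"})},
]

def distributed_search(shards, rows):
    # Pass 1: every shard returns matching ids plus sort info (score).
    candidates = []
    for shard in shards:
        candidates.extend((score, doc_id)
                          for doc_id, (score, _) in shard.items())
    top = sorted(candidates, reverse=True)[:rows]
    # Pass 2: fetch the metadata for just the merged top ids.
    results = []
    for _, doc_id in top:
        for shard in shards:
            if doc_id in shard:
                results.append({"id": doc_id, **shard[doc_id][1]})
                break
    return results

print(distributed_search(shards, rows=2))
```

This also makes point 2 concrete: if a shard's index changes between the two passes, a document selected in pass 1 is still fetched in pass 2 even though it may no longer match.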
Too Many Open Files
Hi all, When I send a delete query to Solr using SolrJ, I receive this exception: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files 11:53:06,964 INFO [HttpMethodDirector] I/O exception (java.net.SocketException) caught when processing request: Too many open files Can anyone help me? How can I solve this? Thanks
Re: Too Many Open Files
This probably means you're opening new readers without closing old ones. But that's just a guess. I'm guessing that this really has nothing to do with the delete itself, but the delete is what's finally pushing you over the limit. I know this has been discussed before; try searching the mail archive for TooManyOpenFiles and/or File Handles. You could get much better information by providing more details, see: http://wiki.apache.org/solr/UsingMailingLists?highlight=(most)|(users)|(list) Best Erick
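Besides finding the leak, it is worth checking the per-process file-descriptor limit on the Solr host. A rough Linux sketch (the 8192 value and the /proc inspection are illustrative, and raising the hard limit may need root):

```shell
# Show the current per-process open-file limit for this shell
ulimit -n

# Try to raise the soft limit for this shell (hard limit permitting),
# e.g. before starting Solr; fall back to a hint if it is refused
ulimit -n 8192 2>/dev/null || echo "raising the limit may require editing /etc/security/limits.conf"

# Count descriptors held by a running process; <pid> is a placeholder
# ls /proc/<pid>/fd | wc -l
```

If the count climbs steadily while the limit stays flat, the application is leaking descriptors and raising the limit only delays the exception.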
solr data config questions
Hi All, I am a new user of Solr. We are now trying to enable searching on the Digg dataset. It has story_id as the primary key, and each comment_id identifies a comment on a story_id, so story_id to comment_id is a one-to-many relationship. Each comment can in turn be replied to by several repliers, so comment_id to repliers is also one-to-many. The problem is that within a single returned document the search results show an array of comment_ids and an array of repliers, without knowing which repliers replied to which comment. For example: now we get comment_id:[c1,c2,...,cn], repliers:[r1,r2,r3...rm]. Can we get something like comment_id:[c1,c2,...,cn], repliers:[{r1,r2},{},...,{rm-1,rm}] so that {r1,r2} corresponds to c1? Our current data-config is attached:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" autoreconnect="true" netTimeoutForStreamingResults="1200" url="jdbc:mysql://localhost/diggdataset" batchSize="-1" user="root" password=""/>
  <document>
    <entity name="story" pk="story_id" query="select * from story"
            deltaImportQuery="select * from story where ID=='${dataimporter.delta.story_id}'"
            deltaQuery="select story_id from story where last_modified > '${dataimporter.last_index_time}'">
      <field column="link" name="link"/>
      <field column="title" name="title"/>
      <field column="description" name="story_content"/>
      <field column="digg" name="positiveness"/>
      <field column="comment" name="spreading_number"/>
      <field column="user_id" name="author"/>
      <field column="profile_view" name="user_popularity"/>
      <field column="topic" name="topic"/>
      <field column="timestamp" name="timestamp"/>
      <entity name="dugg_list" pk="story_id" query="select * from dugg_list where story_id='${story.story_id}'"
              deltaQuery="select SID from dugg_list where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select story_id from story where story_id=${dugg_list.story_id}">
        <field name="viewer" column="dugger"/>
      </entity>
      <entity name="commenttable" pk="comment_id" query="select * from commenttable where story_id='${story.story_id}'"
              deltaQuery="select SID from commenttable where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select story_id from story where story_id=${commenttable.story_id}">
        <field name="comment_id" column="comment_id"/>
        <field name="spreading_user" column="replier"/>
        <field name="comment_positiveness" column="up"/>
        <field name="comment_negativeness" column="down"/>
        <field name="user_comment" column="content"/>
        <field name="user_comment_timestamp" column="timestamp"/>
        <entity name="replytable" query="select * from replytable where comment_id='${commenttable.comment_id}'"
                deltaQuery="select SID from replytable where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select comment_id from commenttable where comment_id=${replytable.comment_id}">
          <field name="replier_id" column="replier_id"/>
          <field name="reply_content" column="content"/>
          <field name="reply_positiveness" column="up"/>
          <field name="reply_negativeness" column="down"/>
          <field name="reply_timestamp" column="timestamp"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

Please help me on this. Many thanks Vivian
Re: preside != president
Hi, You might also want to check out the new Lucene-Hunspell stemmer at http://code.google.com/p/lucene-hunspell/ It uses OpenOffice dictionaries with known stems in combination with a large set of language specific rules. It handles your example, but it is an early release, so test it thoroughly before deploying in production :) -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com
Re: SweetSpotSimilarity
iorixxx wrote: it is in schema.xml: <similarity class="org.apache.lucene.search.SweetSpotSimilarity"/> How would you configure the tfBaselineTfFactors and LengthNormFactors when configuring via schema.xml? Do I have to create a subclass that hardcodes these values?
Re: SweetSpotSimilarity
How would you configure the tfBaselineTfFactors and LengthNormFactors when configuring via schema.xml? CustomSimilarityFactory that extends org.apache.solr.schema.SimilarityFactory should do it. There is an example CustomSimilarityFactory.java under src/test/org...
Re: Spatial types and DIH
On Jun 24, 2010, at 12:32 AM, Eric Angel wrote: I'm using solr 4.0-2010-06-23_08-05-33 and can't figure out how to add the spatial types (LatLon, Point, GeoHash or SpatialTile) using DataImportHandler. My lat/lngs from the database are in separate fields. Does anyone know how to do this? Can you concat the two fields together as part of your SQL statement?
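A sketch of what that suggestion could look like in data-config.xml, assuming MySQL columns named lat and lng and a LatLon-typed destination field named store — all names here are illustrative, not from the original post:

```xml
<!-- data-config.xml sketch: build "lat,lng" in SQL so it arrives in the
     comma-separated form a LatLonType field expects. Column and field
     names are assumptions. -->
<entity name="place"
        query="SELECT id, CONCAT(lat, ',', lng) AS store FROM places">
  <field column="id" name="id"/>
  <field column="store" name="store"/>
</entity>
```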
Re: Spatial types and DIH
Yes. For now, I've gone back to Lucene 1.4 and installed Local Lucene. I just couldn't get the sfilt to work. I'm sure I was probably missing something, but I think I'll just wait until 1.5 is ready to be shipped. On Jun 28, 2010, at 12:02 PM, Grant Ingersoll wrote: On Jun 24, 2010, at 12:32 AM, Eric Angel wrote: I'm using solr 4.0-2010-06-23_08-05-33 and can't figure out how to add the spatial types (LatLon, Point, GeoHash or SpatialTile) using DataImportHandler. My lat/lngs from the database are in separate fields. Does anyone know how to do this? Can you concat the two fields together as part of your SQL statement?
Re: SweetSpotSimilarity
iorixxx wrote: CustomSimilarityFactory that extends org.apache.solr.schema.SimilarityFactory should do it. There is an example CustomSimilarityFactory.java under src/test/org... This is exactly what I was looking for... this is very similar (no pun intended ;)) to the updateProcessorFactory configuration in solrconfig.xml. The wiki should probably include this information. Side question: how would I know if a configuration option can also take a factory class, like in this instance? -- View this message in context: http://lucene.472066.n3.nabble.com/SweetSpotSimilarity-tp922546p928862.html
spellcheckcomponent and frequency thresholds
Hi, I'm adding the spellCheckComponent to my current configuration of Solr, and I was wondering if there was a way to set a minimum frequency threshold for the IndexBasedSpellChecker through Solr, like there is in the deprecated Spell Check Request Handler. I know that you can fix most problems by changing the 'accuracy' field, but there are small anomalies that I'd like to remove from the dictionary entirely, and a simple way to do this would be using a frequency threshold. I've looked around for this and I haven't found anything recent. Thanks, Matt csn | stores shop easy Software Development Phone: 617-502-7694
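For what it's worth, the spellcheck component does accept a thresholdTokenFrequency option (the fraction of documents a term must occur in to enter the dictionary) in versions that support it — worth checking against your release. A solrconfig.xml sketch, with the field name and threshold value as illustrative assumptions:

```xml
<!-- solrconfig.xml sketch: thresholdTokenFrequency (if supported by your
     Solr version) keeps terms below the given document-frequency fraction
     out of the spelling dictionary. "spell" and .01 are assumptions. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>
```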
Re: Too Many Open Files
Hi Anderson, If you are using SolrJ, it's recommended to reuse the same instance per Solr server. http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer But there are other scenarios which may cause this situation: 1. Another application running in the same Solr JVM which doesn't properly close sockets or file handles. 2. The open-files limit configuration is low. Check your limits; read it from the JVM process info: cat /proc/1234/limits (where 1234 is your process ID) Cheers, Michel Bottan On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.com wrote: This probably means you're opening new readers without closing old ones. But that's just a guess. I'm guessing that this really has nothing to do with the delete itself, but the delete is what's finally pushing you over the limit. I know this has been discussed before; try searching the mail archive for TooManyOpenFiles and/or File Handles. You could get much better information by providing more details, see: http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29 Best Erick On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos anderson.v...@gmail.com wrote: Hi all When I send a delete query to Solr using SolrJ, I received this exception: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files 11:53:06,964 INFO [HttpMethodDirector] I/O exception (java.net.SocketException) caught when processing request: Too many open files Can anyone help me? How can I solve this? Thanks
Re: solr data config questions
Hi, You can add an additional commentreplyjoin entity to the story entity, i.e.

<entity name="story" ...>
  ...
  <entity name="commenttable" ...>
    ...
    <entity name="replytable" ...>
      ...
    </entity>
  </entity>
  <entity name="commentreplyjoin" query="select concat(comment_id, ',', replier_id) as commentreply from commenttable left join replytable on replytable.comment_id=commenttable.comment_id where commenttable.story_id='${story.story_id}'">
    <field name="commentreply" column="commentreply"/>
  </entity>
</entity>

Thus, you will have a multivalued field commentreply that contains the list of related (comment_id, reply_id) pairs (just comment_id, if you don't have any related replies for that entry). You can retrieve all values of that field, process them on the client, and build a complex data structure. HTH, Alex On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei wei.p...@xerox.com wrote: Hi All, I am a new user of Solr. We are now trying to enable searching on the Digg dataset. It has story_id as the primary key, and comment_id is the id of a comment on that story, so story_id and comment_id have a one-to-many relationship. These comment_ids can be replied to by some repliers, so comment_id and repliers also have a one-to-many relationship. The problem is that within a single returned document the search results show an array of comment_ids and an array of repliers without knowing which repliers replied to which comment. For example: now we get comment_id:[c1,c2,...,cn], repliers:[r1,r2,r3,...,rm]. Can we get something like comment_id:[c1,c2,...,cn], repliers:[{r1,r2},{},{r3},...,{rm-1,rm}] so that {r1,r2} corresponds to c1?
Our current data-config is attached:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" autoreconnect="true" netTimeoutForStreamingResults="1200" url="jdbc:mysql://localhost/diggdataset" batchSize="-1" user="root" password=""/>
  <document>
    <entity name="story" pk="story_id" query="select * from story" deltaImportQuery="select * from story where ID=='${dataimporter.delta.story_id}'" deltaQuery="select story_id from story where last_modified &gt; '${dataimporter.last_index_time}'">
      <field column="link" name="link"/>
      <field column="title" name="title"/>
      <field column="description" name="story_content"/>
      <field column="digg" name="positiveness"/>
      <field column="comment" name="spreading_number"/>
      <field column="user_id" name="author"/>
      <field column="profile_view" name="user_popularity"/>
      <field column="topic" name="topic"/>
      <field column="timestamp" name="timestamp"/>
      <entity name="dugg_list" pk="story_id" query="select * from dugg_list where story_id='${story.story_id}'" deltaQuery="select SID from dugg_list where last_modified &gt; '${dataimporter.last_index_time}'" parentDeltaQuery="select story_id from story where story_id=${dugg_list.story_id}">
        <field name="viewer" column="dugger"/>
      </entity>
      <entity name="commenttable" pk="comment_id" query="select * from commenttable where story_id='${story.story_id}'" deltaQuery="select SID from commenttable where last_modified &gt; '${dataimporter.last_index_time}'" parentDeltaQuery="select story_id from story where story_id=${commenttable.story_id}">
        <field name="comment_id" column="comment_id"/>
        <field name="spreading_user" column="replier"/>
        <field name="comment_positiveness" column="up"/>
        <field name="comment_negativeness" column="down"/>
        <field name="user_comment" column="content"/>
        <field name="user_comment_timestamp" column="timestamp"/>
        <entity name="replytable" query="select * from replytable where comment_id='${commenttable.comment_id}'" deltaQuery="select SID from replytable where last_modified &gt; '${dataimporter.last_index_time}'" parentDeltaQuery="select comment_id from commenttable where comment_id=${replytable.comment_id}">
          <field name="replier_id" column="replier_id"/>
          <field name="reply_content" column="content"/>
          <field name="reply_positiveness" column="up"/>
          <field name="reply_negativeness" column="down"/>
          <field name="reply_timestamp" column="timestamp"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

Please help me on this. Many thanks Vivian
Very basic questions: Indexing text
Hi everyone, I'm looking for a way to index a bunch of (potentially large) text files. I would love to see results like Google, so I went through a few tutorials, but I've still got questions: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. 2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right? 3) Is there a nice front-end example anywhere? Something that would return results kind of like Google? Thanks for your time - Solr / Lucene seem to be very powerful. -Pete
Re: Very basic questions: Indexing text
1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters 2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right? Probably yes. 3) Is there a nice front-end example anywhere? Something that would return results kind of like Google? http://wiki.apache.org/solr/PublicServers http://search-lucene.com/
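The snippet behaviour can be passed per request (hl.fl, hl.snippets, hl.fragsize) or baked into the handler defaults in solrconfig.xml. A sketch of the latter — the field name "content" and the sizes are assumptions, not from the original post:

```xml
<!-- solrconfig.xml sketch: Google-style snippets on by default.
     "content" must be a stored field; snippet count/size are
     illustrative. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">content</str>
    <str name="hl.snippets">2</str>
    <str name="hl.fragsize">150</str>
  </lst>
</requestHandler>
```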
Re: Very basic questions: Indexing text
Great, thanks for the pointers. Thanks, Peter On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters 2) There are one or two fields at the beginning of the file that I would like to search on, so these should be indexed differently, right? Probably yes. 3) Is there a nice front-end example anywhere? Something that would return results kind of like Google? http://wiki.apache.org/solr/PublicServers http://search-lucene.com/
DIH and denormalizing
I am trying to do some denormalizing with DIH from a MySQL source. Here's part of my data-config.xml:

<entity name="dataTable" pk="did" query="SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">
  <entity name="ncdat_wt" query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}'"/>
</entity>

The relationship between features in ncdat and webtable in ncdat_wt (via featurecode) will be many-many. The wt field in schema.xml is set up as multivalued. It seems that ${ncdat.feature} is not being set. I saw a query happening on the server and it was SELECT webtable as wt FROM ncdat_wt WHERE featurecode='' - that last part is an empty string with single quotes around it. From what I can tell, there are no entries in ncdat where feature is blank. I've tried this with both a 1.5-dev checked out months ago (which we are using in production) and a 3.1-dev checked out today. Am I doing something wrong? Thanks, Shawn
Re: Too Many Open Files
Thanks for the responses. I instantiate one instance per request (per delete query, in my case). I have a lot of concurrent processes. If I reuse the same instance (to send, delete and remove data) in Solr, will I have trouble? My concern is that if I do this, Solr will commit documents with data from other transactions. Thanks 2010/6/28 Michel Bottan freakco...@gmail.com Hi Anderson, If you are using SolrJ, it's recommended to reuse the same instance per Solr server. http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer But there are other scenarios which may cause this situation: 1. Another application running in the same Solr JVM which doesn't properly close sockets or file handles. 2. The open-files limit configuration is low. Check your limits; read it from the JVM process info: cat /proc/1234/limits (where 1234 is your process ID) Cheers, Michel Bottan On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.com wrote: This probably means you're opening new readers without closing old ones. But that's just a guess. I'm guessing that this really has nothing to do with the delete itself, but the delete is what's finally pushing you over the limit.
I know this has been discussed before; try searching the mail archive for TooManyOpenFiles and/or File Handles. You could get much better information by providing more details, see: http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29 Best Erick On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos anderson.v...@gmail.com wrote: Hi all When I send a delete query to Solr using SolrJ, I received this exception: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files 11:53:06,964 INFO [HttpMethodDirector] I/O exception (java.net.SocketException) caught when processing request: Too many open files Can anyone help me? How can I solve this? Thanks
Optimizing cache
Here is a screen shot for our cache from New Relic. http://s4.postimage.org/mmuji-31d55d69362066630eea17ad7782419c.png Query cache: 55-65% Filter cache: 100% Document cache: 63% Cache size is 512 for above 3 caches. How do I interpret this data? What are some optimal configuration changes given the above stats? -- View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-cache-tp929156p929156.html
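For context, those three caches are sized in solrconfig.xml. A sketch of what experimenting with larger query/document caches (the two below 100% hit rate here) might look like — the sizes and autowarm counts below are illustrative starting points, not recommendations, and should be re-checked against the hit-rate and eviction stats afterwards:

```xml
<!-- solrconfig.xml sketch: cache sizes/autowarm counts are assumptions
     to experiment with, validated against hit-rate and eviction stats. -->
<filterCache      class="solr.LRUCache" size="512"  initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="2048" initialSize="512" autowarmCount="256"/>
<documentCache    class="solr.LRUCache" size="2048" initialSize="512"/>
```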
RE: DIH and denormalizing
In your query query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}'", instead of ${ncdat.feature} use ${dataTable.feature}, where dataTable is your parent entity name. From: Shawn Heisey-4 [via Lucene] [mailto:ml-node+929151-1527242139-124...@n3.nabble.com] Sent: Monday, June 28, 2010 2:24 PM To: caman Subject: DIH and denormalizing I am trying to do some denormalizing with DIH from a MySQL source. Here's part of my data-config.xml:

<entity name="dataTable" pk="did" query="SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">
  <entity name="ncdat_wt" query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}'"/>
</entity>

The relationship between features in ncdat and webtable in ncdat_wt (via featurecode) will be many-many. The wt field in schema.xml is set up as multivalued. It seems that ${ncdat.feature} is not being set. I saw a query happening on the server and it was SELECT webtable as wt FROM ncdat_wt WHERE featurecode='' - that last part is an empty string with single quotes around it. From what I can tell, there are no entries in ncdat where feature is blank. I've tried this with both a 1.5-dev checked out months ago (which we are using in production) and a 3.1-dev checked out today. Am I doing something wrong? Thanks, Shawn -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-denormalizing-tp929151p929168.html
unknown handler dataimport
Hi, I am trying to get db indexing up and running, but I am having trouble getting it working. In the solrconfig.xml file, I added

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

I defined a couple of fields in schema.xml:

<field name="media_id" type="long" stored="true"/>
<field name="artist_name" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="song_title" type="text" indexed="true" stored="true" multiValued="true"/>

media_id is defined as the unique key. I added the dataconfig to the data-config.xml file:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/media" user="xxxd" password="*"/>
  <document name="media">
    <entity name="video" query="select mediaId, name, title FROM Media">
      <field column="mediaId" name="media_id" type="integer" stored="true"/>
      <field column="name" name="artist_name" type="string" indexed="true" stored="true"/>
      <field column="title" name="song_title" type="string" indexed="true" stored="true"/>
    </entity>
  </document>
</dataConfig>

When I start the server, I can see it is loading the dataimport handler: Jun 28, 2010 8:52:32 PM org.apache.solr.handler.dataimport.DataImportHandler processConfiguration INFO: Processing configuration from solrconfig.xml: {config=data-config.xml} Jun 28, 2010 8:52:32 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig INFO: Data Configuration loaded successfully When I go to http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport, on the right side, I see the message unknown handler: /dataimport I do see a BindException: Address already in use when I restart the solr process, but I don't see any other errors. Since the dataimport config was successfully loaded, I don't think that is the reason /dataimport is unknown. Did I forget to add something to the configurations? Is there another log file I should be checking for errors? Regards, L. Hill
Re: DIH and denormalizing
On 6/28/2010 3:28 PM, caman wrote: In your query 'query=SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}' .. instead of ${ncdat.feature} use ${dataTable.feature} where dataTable is your parent entity name. I knew it would be something stupid like that. I thought I changed everything, looks like I forgot one. Thank you! From what I can tell now, it's working. Sure is a lot slower now that it's got to do another query for every item. Shawn
Re: DIH and denormalizing
It seems that ${ncdat.feature} is not being set. Try ${dataTable.feature} instead. On Tue, Jun 29, 2010 at 1:22 AM, Shawn Heisey s...@elyograg.org wrote: I am trying to do some denormalizing with DIH from a MySQL source. Here's part of my data-config.xml:

<entity name="dataTable" pk="did" query="SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did &gt; ${dataimporter.request.minDid} AND did &lt;= ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">
  <entity name="ncdat_wt" query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}'"/>
</entity>

The relationship between features in ncdat and webtable in ncdat_wt (via featurecode) will be many-many. The wt field in schema.xml is set up as multivalued. It seems that ${ncdat.feature} is not being set. I saw a query happening on the server and it was SELECT webtable as wt FROM ncdat_wt WHERE featurecode='' - that last part is an empty string with single quotes around it. From what I can tell, there are no entries in ncdat where feature is blank. I've tried this with both a 1.5-dev checked out months ago (which we are using in production) and a 3.1-dev checked out today. Am I doing something wrong? Thanks, Shawn
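Both replies make the same point: the DIH variable must be qualified by the parent *entity* name ("dataTable"), not the table name ("ncdat"). The corrected child entity would read:

```xml
<!-- data-config.xml: reference the parent entity by its name attribute,
     not by the underlying table name. -->
<entity name="ncdat_wt"
        query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${dataTable.feature}'"/>
```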
Re: Too Many Open Files
Another question: why doesn't SolrJ close the StringWriter and OutputStreamWriter? thanks 2010/6/28 Anderson vasconcelos anderson.v...@gmail.com Thanks for the responses. I instantiate one instance per request (per delete query, in my case). I have a lot of concurrent processes. If I reuse the same instance (to send, delete and remove data) in Solr, will I have trouble? My concern is that if I do this, Solr will commit documents with data from other transactions. Thanks 2010/6/28 Michel Bottan freakco...@gmail.com Hi Anderson, If you are using SolrJ, it's recommended to reuse the same instance per Solr server. http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer But there are other scenarios which may cause this situation: 1. Another application running in the same Solr JVM which doesn't properly close sockets or file handles. 2. The open-files limit configuration is low. Check your limits; read it from the JVM process info: cat /proc/1234/limits (where 1234 is your process ID) Cheers, Michel Bottan On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.com wrote: This probably means you're opening new readers without closing old ones. But that's just a guess. I'm guessing that this really has nothing to do with the delete itself, but the delete is what's finally pushing you over the limit.
I know this has been discussed before; try searching the mail archive for TooManyOpenFiles and/or File Handles. You could get much better information by providing more details, see: http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29 Best Erick On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos anderson.v...@gmail.com wrote: Hi all When I send a delete query to Solr using SolrJ, I received this exception: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files 11:53:06,964 INFO [HttpMethodDirector] I/O exception (java.net.SocketException) caught when processing request: Too many open files Can anyone help me? How can I solve this? Thanks
Re: Very basic questions: Indexing text
On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo "- Committing"; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial. -Pete
Re: Very basic questions: Indexing text
try adding hl.fl=text to specify your highlight field. I don't understand why you're only getting the ID field back though. Do note that the highlighting is after the docs, related by the ID. Try a (non-highlighting) query of just * to verify that you're pointing at the index you think you are. It's possible that you've modified a different index with SolrJ than your web server is pointing at. Also, Solr has no way of knowing you've modified your index with SolrJ, so it may not be automatically reopening an IndexReader, and your recent changes may not be visible until you force the Solr reader to reopen. HTH Erick On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote: On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters Here's how I commit my documents:

J=0; for i in `find . -name \*.txt`; do (( J++ )); curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F myfi...@$i; done; echo "- Committing"; curl "http://localhost:8983/solr/update/extract?commit=true"

Then, I try to query using http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and html files" tutorial. -Pete
What is the proper procedure to reopen closed bugs?
I'd like to reopen bug SOLR-1960 (https://issues.apache.org/jira/browse/SOLR-1960), "http://wiki.apache.org/solr/ : non-English users get generic MoinMoin page instead of the desired information", as I submitted a patch. But JIRA won't let me do it. Do I have to clone it? Teruhiko Kuro Kurosaka, 415-227-9600 x122 RLP + Lucene Solr = powerful search for global contents
AutoSuggest Question
Hi, I've read some on the autosuggest and I would like to know if the following is possible with my current configuration. I'm using solr 1.4.

<field name="title" type="text" indexed="true" stored="true" required="true"/>
<field name="titleac3" type="autocomplete3" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
<copyField source="title" dest="titleac3"/>

<fieldType name="autocomplete3" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Results: http://localhost:8984/solr/core/select/?q=titleac3:%22secret%22&version=2.2&start=0&rows=10&indent=on&fl=title

I currently get the following results:

<doc><str name="title">Pasajes Secretos</str></doc>
<doc><str name="title">Secret Agent Zero</str></doc>
<doc><str name="title">Secretos de la Ciudad</str></doc>
<doc><str name="title">Back to the Secret Garden</str></doc>
<doc><str name="title">The Making Of: The Secret Life of Bees</str></doc>
<doc><str name="title">Sexy Celebrity Secrets</str></doc>
<doc><str name="title">The Secrets of the Battle of the Bulge</str></doc>
<doc><str name="title">The Secret Life of Bees</str></doc>
<doc><str name="title">Ancient Secrets of the Bible</str></doc>
<doc><str name="title">Secrets of the Submarine War</str></doc>

I'd like a way for the results to be sorted so it looks like this:

<doc><str name="title">Secret Agent Zero</str></doc> (found in 1st word)
<doc><str name="title">The Secrets of the Battle of the Bulge</str></doc> (found in 1st word)
<doc><str name="title">The Secret Life of Bees</str></doc> (found in 1st word)
<doc><str name="title">Secrets of the Submarine War</str></doc> (found in 1st word)
<doc><str name="title">Secretos de la Ciudad</str></doc> (found in 1st word)
<doc><str name="title">Ancient Secrets of the Bible</str></doc> (found in 2nd word)
<doc><str name="title">Back to the Secret Garden</str></doc> (found in 2nd word)
<doc><str name="title">The Making Of: The Secret Life of Bees</str></doc> (found in 2nd word)
<doc><str name="title">Pasajes Secretos</str></doc> (found in 2nd word)
<doc><str name="title">Sexy Celebrity Secrets</str></doc> (found in 3rd word)

So I'd like the first matches to be where "secret" occurs in the first word or first leading sub-word, grouped alphabetically; next, where it occurs in the second word or second leading sub-word, grouped alphabetically; etc. My specific rule is that if the first word is a stop word, it is ignored in the sorting. Is there a way I can get solr to order my results as such? Also, are there any drawbacks to using the solr.LetterTokenizerFactory? I assume the maxGramSize refers to the max length of a gram, and making it larger than 25 really is not helpful? Is there a better way to do the autosuggest technique above aside from using the autocomplete3 field I've defined, given what I'm trying to accomplish? Thanks, Neil
Re: Very basic questions: Indexing text
On 28.06.2010 23:00 Ahmet Arslan wrote: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters I didn't know this was possible and am also interested in this feature, but even after reading the given Wiki page I cannot make out which parameter to use. The only parameter that looks similar is 'hl.maxAlternateFieldLength', where it is possible to give a length to return, but according to the description that is for the no-match case. And there is hl.fragmentsBuilder, but with no explanation (the referred page SolrFragmentsBuilder does not yet exist). Could you give an example? E.g. let's say I have a field 'title' and a field 'fulltext' and my search term is 'solr'. What would be the right set of parameters to get back the whole title field but only a snippet of 50 words (or three sentences or whatever the unit) from the fulltext field? Thanks -Michael