How can I get a correct stemmed query?
Hi~. I'm a beginner building a search system with Solr 1.4.1 and Lucene 2.9.2. My custom Analyzer and filter produce the correct Lucene query from the given input, but no results are displayed. Here is my Analyzer source.

--
public class KLTQueryAnalyzer extends Analyzer {
    public static final Version LUCENE_VERSION = Version.LUCENE_29;
    public static int QUERY_MIN_LEN_WORD_FILTER = 1;
    public static int QUERY_MAX_LEN_WORD_FILTER = 40;
    public int elapsedTime = 0;

    @Override
    public TokenStream tokenStream(String paramString, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(
            du.utas.mcrdr.ir.lucene.WebDocIR.LUCENE_VERSION, reader);
        TokenStream tokenStream = new LengthFilter(tokenizer,
            QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER);
        tokenStream = new LowerCaseFilter(tokenStream);

        // My custom stemmer
        KLTSingleWordStemmer stemer = new KLTSingleWordStemmer(
            QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER);

        // My custom analyzer filter. This filter returns a sub-merged query.
        // e.g. INPUT: flyaway  ->  RETURN VALUE: fly +body:away
        tokenStream = new KLTQueryStemFilter(tokenStream, stemer, this);
        return tokenStream;
    }
}
--

Example:
User input query: +body:flyaway
Expected analyzed query: +body:fly +body:away
Indexed data: body = "fly away"

I'm expecting 1 document returned from the index, but I get no results. My custom flow:
1. User input query: +body:flyaway
2. The Analyzer returns: fly +body:away
3. Solr attaches the search field prefix (+body) to the filter's returned query, as defined in schema.xml (default operator AND).
4. I indexed 1 document whose "body" field contains the phrase "fly away".
5. I expect 1 document for the query +body:fly +body:away, but 0 documents are returned.

What is the problem? Any help is appreciated~ :

--
View this message in context: http://lucene.472066.n3.nabble.com/How-can-i-get-collect-stemmed-query-tp1723055p1723055.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I get a correct stemmed query?
Are you using KLTQueryAnalyzer outside of Solr (as a pre-process)? Or did you define a fieldType in schema.xml that uses KLTQueryAnalyzer? Can you append debugQuery=on to your search URL and paste the output?

--- On Mon, 10/18/10, Jerad <ag...@naver.com> wrote:
From: Jerad <ag...@naver.com>
Subject: How can I get a correct stemmed query?
To: solr-user@lucene.apache.org
Date: Monday, October 18, 2010, 9:15 AM
Re: How can I use the SolrJ binary format for indexing?
Hi, you can try to parse the XML in Java yourself and then push the resulting SolrInputDocuments to Solr via SolrJ. Setting the format to binary and using the streaming update server should improve performance, but I am not sure... and performant (and memory-friendly!) XML reading in Java is another topic ;-) Regards, Peter.

Hi all, I have a huge number of XML files to index. I want to index using the SolrJ binary format to get a performance gain, because I heard that indexing from XML files is quite slow. But I don't know how to index through the SolrJ binary format and can't find examples. Please give some help. Thanks. -- http://jetwick.com twitter search prototype
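As a rough sketch of the "parse the XML yourself" half of this advice: the snippet below streams one record in Solr's update-XML style (<field name="...">value</field>) into a plain map, using only the JDK's StAX API. Each map would then be copied field-by-field into a SolrInputDocument and sent with SolrJ (in the 1.4-era API, e.g. StreamingUpdateSolrServer plus a BinaryRequestWriter). The class name and the exact element layout here are illustrative assumptions, not code from the thread.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class XmlToFields {
    // Stream a <doc><field name="...">value</field>...</doc> record into a map.
    // Each resulting map would become one SolrInputDocument pushed via SolrJ.
    static Map<String, String> parseDoc(String xml) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        String name = null;
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("field".equals(r.getLocalName())) {
                        name = r.getAttributeValue(null, "name");
                        text.setLength(0); // start collecting this field's value
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("field".equals(r.getLocalName()) && name != null) {
                        fields.put(name, text.toString());
                        name = null;
                    }
                    break;
            }
        }
        return fields;
    }
}
```

Streaming (StAX) rather than DOM parsing is what keeps memory flat on "a huge amount of xml files", which is the "+less mem" point above.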
Re: query between two dates
You'll have to supply your dates in the format Solr expects (e.g. 2010-10-19T08:29:43Z, not 2010-10-19). If you don't need millisecond granularity you can use the DateMath syntax to specify that. Please also check http://wiki.apache.org/solr/SolrQuerySyntax.

On 17 October 2010 10:54, nedaha <neda...@gmail.com> wrote:

Hi there, at first I have to explain the situation. I have two indexed fields named tdm_avail1 and tdm_avail2 that are arrays of different dates. This is a sample doc:

<arr name="tdm_avail1">
  <date>2010-10-21T08:29:43Z</date>
  <date>2010-10-22T08:29:43Z</date>
  <date>2010-10-25T08:29:43Z</date>
  <date>2010-10-26T08:29:43Z</date>
  <date>2010-10-27T08:29:43Z</date>
</arr>
<arr name="tdm_avail2">
  <date>2010-10-19T08:29:43Z</date>
  <date>2010-10-20T08:29:43Z</date>
  <date>2010-10-21T08:29:43Z</date>
  <date>2010-10-22T08:29:43Z</date>
</arr>

My search form has two fields, check-in date and check-out date. I want Solr to compare the range the user enters in the search form with the values of tdm_avail1 and tdm_avail2, and return the doc if all dates between the check-in and check-out dates match the tdm_avail1 or tdm_avail2 values. For example, if the user enters check-in date 2010-10-19 and check-out date 2010-10-21, which matches tdm_avail2, the doc must be returned; but if the user enters check-in date 2010-10-25 and check-out date 2010-10-29, the doc must not be returned. So I want the query that gives me the mentioned result. Could you help me please? Thanks in advance.

--
View this message in context: http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1718566.html
Re: How do you programmatically create new cores?
An HTTP GET call is made simply by entering the URL into your browser, as shown in the example on the wiki:

http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path_to_instance_directory&config=config_file_name.xml&schema=schema_file_name.xml&dataDir=data

-----Original Message-----
From: Tharindu Mathew [mailto:mcclou...@gmail.com]
Sent: Sunday, 17 October 2010 18:07
To: solr-user@lucene.apache.org
Cc: solr-user@lucene.apache.org
Subject: Re: How do you programmatically create new cores?

Hi Marc, thanks for the reply. So as I understand it, I need to make an HTTP GET call with the action parameter set to CREATE to dynamically create a core? I do not see an API to do this anywhere.

On Oct 17, 2010, at 3:54 PM, Marc Sturlese <marc.sturl...@gmail.com> wrote:

You have to create the core's folder, with its conf inside, in the Solr home. Once that's done you can call the create action of the admin handler: http://wiki.apache.org/solr/CoreAdmin#CREATE If you need to dynamically create, start and stop lots of cores there's this patch, but I don't know its current state: http://wiki.apache.org/solr/LotsOfCores

--
View this message in context: http://lucene.472066.n3.nabble.com/How-do-you-programatically-create-new-cores-tp1706487p1718648.html
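For the programmatic side of the question, the same CREATE call can be issued from any HTTP client by assembling the URL shown on the wiki. A JDK-only sketch (the host, port, and paths are placeholder assumptions; a real client would then GET the returned URL):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CoreAdminCreate {
    // Assemble the CoreAdmin CREATE url from the wiki example.
    // A real client would then issue a GET on this url.
    static String createCoreUrl(String solrBase, String core, String instanceDir) {
        return solrBase + "/admin/cores?action=CREATE"
                + "&name=" + URLEncoder.encode(core, StandardCharsets.UTF_8)
                + "&instanceDir=" + URLEncoder.encode(instanceDir, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(createCoreUrl("http://localhost:8983/solr", "coreX", "/var/solr/coreX"));
    }
}
```

URL-encoding the parameters matters once instance directories contain slashes or spaces; the browser example above works only because the values happen to be URL-safe.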
Re: query between two dates
Thanks for your reply. I know about the Solr date format! Check-in and check-out dates are in a user-friendly format in the search form for the system's users, and I convert the format in code before sending them to Solr. What I want to know is how to build a query that compares the range between the check-in and check-out dates with the separate individual dates I have in the Solr index. For example: the check-in date is 2010-10-19T00:00:00Z and the check-out date is 2010-10-21T00:00:00Z. When I build a query from my application I have a date range, but in the Solr index I have separate dates. So how can I compare them to get the appropriate result?

--
View this message in context: http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1723752.html
Re: How can I get a correct stemmed query?
Oops, I'm sorry! I found a mistake in the previously posted source (the main class name was wrong :). This is the correct analyzer source.

---
public class MyCustomQueryAnalyzer extends Analyzer {
    public static final Version LUCENE_VERSION = Version.LUCENE_29;
    public static int QUERY_MIN_LEN_WORD_FILTER = 1;
    public static int QUERY_MAX_LEN_WORD_FILTER = 40;
    public int elapsedTime = 0;

    @Override
    public TokenStream tokenStream(String paramString, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(
            du.utas.mcrdr.ir.lucene.WebDocIR.LUCENE_VERSION, reader);
        TokenStream tokenStream = new LengthFilter(tokenizer,
            QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER);
        tokenStream = new LowerCaseFilter(tokenStream);

        // My custom stemmer
        MyCustomSingleWordStemmer stemer = new MyCustomSingleWordStemmer(
            QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER);

        // My custom analyzer filter. This filter returns a sub-merged query.
        // e.g. INPUT: flyaway  ->  RETURN VALUE: fly +body:away
        tokenStream = new KLTQueryStemFilter(tokenStream, stemer, this);
        return tokenStream;
    }
}
---

[Additional info]

1. MyCustomQueryAnalyzer was built outside of Solr. I made this analyzer outside the Solr package, packaged it as a .jar, and placed it at ~/Solr/example/work/Jetty_0_0_0_0_8982_solr.war__solr__-2c5peu/webapp/WEB-INF/lib

2. I edited the field type and field name to be searched in schema.xml:

<field name="body" type="textTp" indexed="true" stored="true" omitNorms="true"/>

<fieldType name="textTp" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

This is my custom schema.xml and custom search field type.

3. This is the XML result when I append debugQuery=on to my search URL:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="debugQuery">on</str>
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">+body:flyaway</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">+body:flyaway</str>
    <str name="querystring">+body:flyaway</str>
    <str name="parsedquery">+body:fly +body:away</str>
    <str name="parsedquery_toString">+body:fly +body:away</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">0.0</double>
      <lst name="prepare">
        <double name="time">0.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process">
        <double name="time">0.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
    </lst>
  </lst>
</response>

I really appreciate your advice~ :)

--
View this message in context: http://lucene.472066.n3.nabble.com/How-can-i-get-collect-search-result-from-custom-filtered-query-tp1723055p1723815.html
Boosting documents based on the vote count
Hello all, I have a field in my schema which holds the number of votes a document has. How can I boost documents based on that number? Something like: the document with the maximum count gets a boost of 10, the one with the smallest count gets 0.5, and the values in between are calculated automatically. Thanks, Alexandru Badiu
Re: query between two dates
OK, maybe I don't get this right... are you trying to match check-in date > 2010-10-19T00:00:00Z AND check-out date < 2010-10-21T00:00:00Z, *or* check-in date >= 2010-10-19T00:00:00Z AND check-out date <= 2010-10-21T00:00:00Z?

On 18 October 2010 10:05, nedaha <neda...@gmail.com> wrote:
When I build a query from my application I have a date range, but in the Solr index I have separate dates. So how can I compare them to get the appropriate result?
solr requirements
Hi All, I am planning to have a separate server for Solr, and regarding hardware requirements I have a question about what configuration is needed. I know it is hard to say exactly, but I just need the minimum requirements for the following situation: 1) There are 1000 regular users of Solr, and every day each user indexes 10 files of 1 KB each, which adds up to about 10 MB per day, and it keeps growing. 2) How much RAM does Solr use in general? Thanks, satya
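The numbers in the question imply a back-of-envelope growth estimate worth making explicit; the index itself is usually some multiple of the raw input, and RAM depends mostly on caches and sort fields, so treat this only as a lower bound on storage, not a sizing answer:

```java
public class IndexGrowth {
    // Raw input volume per day, in KB: users x files/user/day x KB/file.
    static long kbPerDay(int users, int filesPerUserPerDay, int fileSizeKb) {
        return (long) users * filesPerUserPerDay * fileSizeKb;
    }

    public static void main(String[] args) {
        long kb = kbPerDay(1000, 10, 1);            // the numbers from the question
        double gbPerYear = kb / 1_000_000.0 * 365;  // decimal units, uncompressed raw input
        System.out.println(kb + " KB/day, ~" + gbPerYear + " GB/year of raw input");
    }
}
```

So roughly 10 MB/day, under 4 GB of raw input per year; at that scale almost any modern single server is storage-adequate, and memory tuning dominates.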
Re: query between two dates
The exact query that I want is: check-in date >= 2010-10-19T00:00:00Z AND check-out date <= 2010-10-21T00:00:00Z. But because of the structure I had to index, I don't have a specific start date and end date in my Solr index to compare with the check-in and check-out range; I have a set of dates that are available to reserve! Could you please help me? :)

--
View this message in context: http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1724062.html
Re: API for using Multi cores with SolrJ
I asked this myself; here are some pointers: http://lucene.472066.n3.nabble.com/SolrJ-and-Multi-Core-Set-up-td1411235.html http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-in-Single-Core-td475238.html

Hi everyone, I'm trying to write some code for creating and using multiple cores. Is there a method available for this purpose, or do I have to make an HTTP call to a URL such as http://localhost:8983/solr/admin/cores?action=STATUScore=core0 ? For example, if I want to create a new core named core01, then check its status, and then insert a document into core01's index, how do I do it? Any help or a document would help greatly. Thanks in advance. -- Regards, Tharindu -- http://jetwick.com twitter search prototype
Re: query between two dates
OK, I see now... well, the only query that comes to mind is something like: check-in date:[2010-10-19T00:00:00Z TO *] AND check-out date:[* TO 2010-10-21T00:00:00Z]. Would something like that work?

On 18 October 2010 11:04, nedaha <neda...@gmail.com> wrote:
The exact query that I want is: check-in date >= 2010-10-19T00:00:00Z AND check-out date <= 2010-10-21T00:00:00Z, but because of the structure I had to index I don't have a specific start date and end date in my Solr index; I have a set of dates that are available to reserve.
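Since the index stores the individual available dates rather than a start/end pair, another option (my suggestion, not something proposed in the thread) is for the application to enumerate every date of the stay and require each one to be present in the multi-valued field. A JDK-only sketch of building such a query string, reusing the tdm_avail2 field name from the sample doc:

```java
import java.time.LocalDate;
import java.util.StringJoiner;

public class AvailabilityQuery {
    // Require every date from check-in to check-out (inclusive) to appear in
    // the multi-valued date field; the explicit ANDs make all days mandatory.
    static String buildQuery(String field, LocalDate checkIn, LocalDate checkOut) {
        StringJoiner clauses = new StringJoiner(" AND ");
        for (LocalDate d = checkIn; !d.isAfter(checkOut); d = d.plusDays(1)) {
            clauses.add(field + ":\"" + d + "T00:00:00Z\"");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("tdm_avail2",
                LocalDate.parse("2010-10-19"), LocalDate.parse("2010-10-21")));
    }
}
```

Note the sample docs store T08:29:43Z timestamps, so exact-term matching on T00:00:00Z assumes the dates are reindexed truncated to midnight; otherwise each clause would need a per-day range like tdm_avail2:[2010-10-19T00:00:00Z TO 2010-10-20T00:00:00Z].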
Re: How can I get a correct stemmed query?
rawquerystring = +body:flyaway, parsedquery = +body:fly +body:away shows that your custom filter is working as you expected. However, you are using different tokenizers at query time (StandardTokenizer, hard-coded) and index time (WhitespaceTokenizer). That may cause numFound=0. For example, if your indexed document contains 'fly, away' in its body field, your query won't return it, because of the comma. admin/analysis.jsp shows the indexed tokens. You can issue a *:* query to see if that document really exists: q=*:*&fl=body

Your query analyzer definition should look like:

<analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer"/>

You cannot have both an analyzer class and a tokenizer at the same time. Once you get this working, in your case it is better to write a custom filter factory plug-in and define the query analyzer using it (for performance reasons). And you can load your plug-in more easily: http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LengthFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="com.testsolr.ir.KLTQueryStemFilter"/>
</analyzer>

--- On Mon, 10/18/10, Jerad <ag...@naver.com> wrote:
From: Jerad <ag...@naver.com>
Subject: Re: How can I get a correct stemmed query?
To: solr-user@lucene.apache.org
Date: Monday, October 18, 2010, 12:14 PM
Re: Boosting documents based on the vote count
I have a field in my schema which holds the number of votes a document has. How can I boost documents based on that number? you can do it with http://wiki.apache.org/solr/FunctionQuery
Re: Boosting documents based on the vote count
I know but I can't figure out what functions to use. :) On Mon, Oct 18, 2010 at 1:38 PM, Ahmet Arslan iori...@yahoo.com wrote: I have a field in my schema which holds the number of votes a document has. How can I boost documents based on that number? you can do it with http://wiki.apache.org/solr/FunctionQuery
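A common pattern here (my suggestion, not from the thread) is to damp the raw count with a logarithm, e.g. a dismax boost function such as bf=log(sum(votes,1)); Solr's log() function is base 10, so the boost grows slowly as votes increase. A tiny JDK sketch of the same math, to see how the scaling behaves:

```java
public class VoteBoost {
    // Mirrors the function query log(sum(votes,1)); Solr's log() is base 10,
    // so 0 votes -> 0.0, 9 votes -> 1.0, 99 votes -> 2.0, and so on.
    static double boost(long votes) {
        return Math.log10(votes + 1.0);
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, 9, 99, 999}) {
            System.out.println(v + " votes -> boost " + boost(v));
        }
    }
}
```

To hit specific endpoints like the 0.5-to-10 range in the original question, you would wrap this in a linear rescaling (linear() / scale() style function queries) using the known minimum and maximum vote counts.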
Implementing Search Suggestion on Solr
Hi! I'm trying to implement a kind of search suggestion on a search engine I have implemented. These search suggestions should not be automatic like the ones described for the SpellCheckComponent [1]. I'm looking for something like: "SAS oppositions" => "Public job offers for some-company". So I will have to define it manually. I was thinking about synonyms [2], but I don't know if that's the proper way to do it, because semantically those terms are not synonyms. Any ideas or suggestions? Regards, [1] http://wiki.apache.org/solr/SpellCheckComponent [2] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Re: Term is duplicated when updating a document
Thanks. Not really the answer I wanted to hear, but at least I know this is not my fault ;) Regards, Thomas

Erick Erickson, 15.10.2010 20:42: This is actually known behavior. The problem is that when you update a document, it's deleted and re-added; the original is only marked as deleted. The terms aren't touched, so both the original and the new document's terms are counted. It would be hard, very hard, to remove the terms from the inverted index efficiently. But when you optimize, all the deleted documents (and their associated terms) are physically removed from the files, and thus your term counts change. HTH, Erick

On Fri, Oct 15, 2010 at 10:05 AM, Thomas Kellerer <spam_ea...@gmx.net> wrote: Thanks for the answer. "Which fields are modified when the document is updated/replaced?" Only one field was changed, but it was not the one the auto-suggest term comes from. "Are there any differences in the content of the fields that you are using for the AutoSuggest?" No. "Have you changed your schema.xml file recently? If you have, then there may have been changes in the way these fields are analyzed and broken down into terms." No, I did a complete index rebuild to rule out things like that. Then after startup I did a search, then updated the document, and searched again. Regards, Thomas

"This may be a bug if you did not change the field or the schema file but the term count is changing." On Fri, Oct 15, 2010 at 9:14 AM, Thomas Kellerer <spam_ea...@gmx.net> wrote: Hi, we update our documents (which represent products in our shop) when a dealer modifies them, by calling SolrServer.add(SolrInputDocument) with the updated document. My understanding is that there is no other way to update an existing document. However, we also use a term query to autocomplete the search field for the user, and each time a document is updated (added) the term count is incremented. So after starting with a new index the count is e.g. 1; then the document that contains that term is updated and the count is 2; the next update sets it to 3, and so on. Once the index is optimized (by calling SolrServer.optimize()) the count is correct again. Am I missing something, or is this a bug in Solr/Lucene? Thanks in advance, Thomas
Re: Virtual field, Statistics
Hello Lance, thank you for your reply. I created the following JIRA issue, as suggested: https://issues.apache.org/jira/browse/SOLR-2171. Can you tell me how new issues are handled by the development team, and whether there's a way I could help/contribute? -- Tanguy

2010/10/16 Lance Norskog <goks...@gmail.com>: Please add a JIRA issue requesting this. A bunch of things are not supported for functions: returning as a field value, for example.

On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal <tanguy.m...@gmail.com> wrote: Dear solr-user folks, I would like to use the stats module to perform very basic statistics (mean, min and max), which is actually working just fine. Nevertheless I found a little limitation that bothers me a tiny bit: how to perform the exact same statistics on the result of a function query rather than on a field.

Example schema:
- string: id
- float: width
- float: height
- float: depth
- string: color
- float: price

What I'd like to do is something like:
select?q=price:[45.5 TO 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width, height), depth)}

I would expect to obtain:

<lst name="stats">
  <lst name="stats_fields">
    <lst name="product(product(width,height),depth)">
      <double name="min">...</double>
      <double name="max">...</double>
      <double name="sum">...</double>
      <long name="count">...</long>
      <long name="missing">...</long>
      <double name="sumOfSquares">...</double>
      <double name="mean">...</double>
      <double name="stddev">...</double>
      <lst name="facets">
        <lst name="color">
          <lst name="white">
            <double name="min">...</double>
            <double name="max">...</double>
            <double name="sum">...</double>
            <long name="count">...</long>
            <long name="missing">...</long>
            <double name="sumOfSquares">...</double>
            <double name="mean">...</double>
            <double name="stddev">...</double>
          </lst>
          <lst name="red">
            <double name="min">...</double>
            <double name="max">...</double>
            <double name="sum">...</double>
            <long name="count">...</long>
            <long name="missing">...</long>
            <double name="sumOfSquares">...</double>
            <double name="mean">...</double>
            <double name="stddev">...</double>
          </lst>
          <!-- Other facets on other colors go here -->
        </lst>
      </lst><!-- end of statistical facets on volumes -->
    </lst><!-- end of stats on volumes -->
  </lst><!-- end of stats_fields node -->
</lst>

Of course computing the volume can be done before indexing the data, but defining virtual fields on the fly from an arbitrary function is powerful, and I am comfortable with the idea that many others would appreciate it, especially for BI needs and so on :-D Is there an easy way to do this that I haven't been able to find, or is it actually impossible? Thank you very much in advance for your help. -- Tanguy

-- Lance Norskog goks...@gmail.com
Re: SOLR DateTime and SortableLongField field type problems
Just following up to see if anybody might have some words of wisdom on the issue? Thank you, Ken

It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy

On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley <doh...@gmail.com> wrote:

Hello all, I am using Solr 1.4.1 with the DataImportHandler, and I am trying to follow the advice from http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about converting date fields to SortableLong fields for better memory efficiency. However, whenever I try to do this using the DateFormatTransformer, I get exceptions when indexing for every row that tries to create my sortable fields.

In my schema.xml, I have the following definitions for the fieldType and dynamicField:

<fieldType name="sdate" class="solr.SortableLongField" indexed="true" stored="false" sortMissingLast="true" omitNorms="true"/>
<dynamicField name="sort_date_*" type="sdate" stored="false" indexed="true"/>

In my dih.xml, I have the following definitions:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="xml_stories" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="legacy_stories.*\.xml$"
            recursive="false" baseDir="/usr/local/extracts"
            newerThan="${dataimporter.xml_stories.last_index_time}">
      <entity name="stories" pk="id" dataSource="xml_stories"
              processor="XPathEntityProcessor" url="${xml_stories.fileAbsolutePath}"
              forEach="/RECORDS/RECORD" stream="true"
              transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
              onError="continue">
        <field column="_modified_date" xpath="/RECORDS/RECORD/PROP[@NAME='R_ModifiedTime']/PVAL"/>
        <field column="modified_date" sourceColName="_modified_date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
        <field column="_df_date_published" xpath="/RECORDS/RECORD/PROP[@NAME='R_StoryDate']/PVAL"/>
        <field column="df_date_published" sourceColName="_df_date_published" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
        <field column="sort_date_modified" sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss"/>
        <field column="sort_date_published" sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss"/>
      </entity>
    </entity>
  </document>
</dataConfig>

The fields in question are in the formats:

<RECORDS>
  <RECORD>
    <PROP NAME="R_StoryDate">
      <PVAL>2001-12-04T00:00:00Z</PVAL>
    </PROP>
    <PROP NAME="R_ModifiedTime">
      <PVAL>2001-12-04T19:38:01Z</PVAL>
    </PROP>
  </RECORD>
</RECORDS>

The exception that I am receiving is:

Oct 15, 2010 6:23:24 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow WARNING: Could not parse a Date field java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007 at java.text.DateFormat.parse(DateFormat.java:337) at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89) at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

I know that it has to be the SortableLong fields, because if I remove just those two lines from my dih.xml, everything imports as I expect it to. Am I doing something wrong? Misusing the SortableLong and/or the DateFormatTransformer? Is this not supported in my version of Solr? I'm not very experienced with Java, so digging into the code would be a lost cause for me right now. I was hoping that somebody here might be able to help point me in the right direction. It should be noted that the modified_date and df_date_published fields index just fine (so long as I do it as I've defined above). Thank you, - Ken
Re: indexing mysql database
Also, the little-advertised DIH debug page can help; see: solr/admin/dataimport.jsp

Best
Erick

On Sun, Oct 17, 2010 at 11:56 AM, William Pierce evalsi...@hotmail.com wrote:

Two suggestions: a) Noticed that your dih spec in the solrconfig.xml seems to refer to db-data-config.xml, but you said that your file was db-config.xml. You may want to check this to make sure that your file names are correct. b) What does your log say when you ran the import process?

- Bill

-----Original Message----- From: do3do3 Sent: Sunday, October 17, 2010 8:29 AM To: solr-user@lucene.apache.org Subject: indexing mysql database

I try to index a table in a MySQL database. First I create a db-config.xml file which contains:

<dataSource type="JdbcDataSource" name="1stTrial" Driver="com.mysql.jdbc.Driver" encoding="UTF-8" url="jdbc:mysql://localhost:3306/(database name)" user="(user)" password="(password)" batchSize="-1"/>

followed by:

<entity dataSource="1stTrial" name="(table name)" pk="id" query="select * from (table name)">

and the definition of the table fields, like:

<field column="id" name="ID"/>
<field column="Text1" name="(field name)"/>

Second, I add these fields to the schema.xml file, and finally point to the db-config.xml file in solrconfig.xml as:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

I found an index folder which contains only the segments.gen and segments_1 files, and when I try to search I get no results. Can anybody help? Thanks in advance

-- View this message in context: http://lucene.472066.n3.nabble.com/indexing-mysql-database-tp1719883p1719883.html Sent from the Solr - User mailing list archive at Nabble.com.
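For reference, a complete minimal DIH configuration file has this overall shape (a sketch only: the database, table, and field names below are placeholders, not the original poster's). Note that the file name on disk must match the one named in the config parameter in solrconfig.xml:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" name="1stTrial"
              driver="com.mysql.jdbc.Driver" encoding="UTF-8"
              url="jdbc:mysql://localhost:3306/mydb"
              user="myuser" password="mypassword" batchSize="-1"/>
  <document>
    <entity dataSource="1stTrial" name="mytable" pk="id"
            query="select * from mytable">
      <field column="id" name="ID"/>
      <field column="Text1" name="text1"/>
    </entity>
  </document>
</dataConfig>
```

After running a full-import, the dataimport.jsp debug page mentioned above shows how many rows were fetched and how many documents were actually added, which quickly narrows down whether the problem is the SQL, the field mapping, or the config file name.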
RE: SOLR DateTime and SortableLongField field type problems
I think if you look closely you'll find the date quoted in the exception report doesn't match any of the declared formats in the schema. I would suggest, as a first step, hunting through your data to see where that date is coming from.

-Mike

-----Original Message----- From: Ken Stanley [mailto:doh...@gmail.com] Sent: Monday, October 18, 2010 7:40 AM To: solr-user@lucene.apache.org Subject: Re: SOLR DateTime and SortableLongField field type problems

Just following up to see if anybody might have some words of wisdom on the issue?

Thank you, Ken

It looked like something resembling white marble, which was probably what it was: something resembling white marble. -- Douglas Adams, The Hitchhikers Guide to the Galaxy

On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley doh...@gmail.com wrote:

Hello all, I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to follow the advice from http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html about converting date fields to SortableLong fields for better memory efficiency. However, whenever I try to do this using the DateFormatTransformer, I get exceptions when indexing for every row that tries to create my sortable fields.
In my schema.xml, I have the following definitions for the fieldType and dynamicField:

<fieldType name="sdate" class="solr.SortableLongField" indexed="true" stored="false" sortMissingLast="true" omitNorms="true" />
<dynamicField name="sort_date_*" type="sdate" stored="false" indexed="true" />

In my dih.xml, I have the following definitions:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="xml_stories" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="legacy_stories.*\.xml$"
            recursive="false" baseDir="/usr/local/extracts"
            newerThan="${dataimporter.xml_stories.last_index_time}">
      <entity name="stories" pk="id" dataSource="xml_stories"
              processor="XPathEntityProcessor" url="${xml_stories.fileAbsolutePath}"
              forEach="/RECORDS/RECORD" stream="true"
              transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
              onError="continue">
        <field column="_modified_date" xpath="/RECORDS/RECORD/PROP[@NAME='R_ModifiedTime']/PVAL" />
        <field column="modified_date" sourceColName="_modified_date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
        <field column="_df_date_published" xpath="/RECORDS/RECORD/PROP[@NAME='R_StoryDate']/PVAL" />
        <field column="df_date_published" sourceColName="_df_date_published" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
        <field column="sort_date_modified" sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss" />
        <field column="sort_date_published" sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss" />
      </entity>
    </entity>
  </document>
</dataConfig>

The fields in question are in the formats:

<RECORDS>
  <RECORD>
    <PROP NAME="R_StoryDate">
      <PVAL>2001-12-04T00:00:00Z</PVAL>
    </PROP>
    <PROP NAME="R_ModifiedTime">
      <PVAL>2001-12-04T19:38:01Z</PVAL>
    </PROP>
  </RECORD>
</RECORDS>

The exception that I am receiving is:

Oct 15, 2010 6:23:24 PM org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Wed Nov 28 21:39:05 EST 2007
        at java.text.DateFormat.parse(DateFormat.java:337)
        at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
        at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

I know that it has to be the SortableLong fields, because if I remove just those two lines from my dih.xml, everything imports as I expect it to.
Re: how can i use solrj binary format for indexing?
Hi, Gora. I haven't yet tried indexing a huge amount of XML files through curl or pure Java (like post.jar). Is indexing through XML really fast? How many files did you index? And how did you do it (using curl or pure Java)?

Thanks, Gora

-- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1724645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr requirements
Well, always get the biggest, fastest machine you can <G>... On a serious note, you're right, there's not much info to go on here. And even if there were more info, Solr performance depends on how you search your data as well as how much data you have... About the only way you can really tell is to set your system up and use the admin/statistics page to monitor your system. In particular, monitor your cache evictions etc. This page may also help: http://wiki.apache.org/solr/SolrPerformanceFactors

Best
Erick

On Mon, Oct 18, 2010 at 5:59 AM, satya swaroop satya.yada...@gmail.com wrote:

Hi All, I am planning to have a separate server for Solr, and regarding hardware requirements I have a doubt about what configuration is needed. I know it will be hard to tell, but I just need a minimum requirement for the particular situation as follows:

1) There are 1000 regular users using Solr, and every day each user indexes 10 files of 1KB each; in total that comes to 10MB per day, and it goes on...???
2) How much RAM is used by Solr in general???

Thanks,
satya
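To put rough numbers on the scenario above (a back-of-the-envelope sketch only; actual index size depends on analysis, stored fields, and norms, and RAM usage is driven mostly by caches and sorting):

```java
public class IndexGrowthEstimate {
    // 1000 users x 10 files/day x 1 KB per file
    public static long bytesPerDay() {
        return 1000L * 10 * 1024;
    }

    public static long bytesPerYear() {
        return bytesPerDay() * 365;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerDay());  // 10240000 -- roughly 10 MB/day of raw input
        System.out.println(bytesPerYear()); // 3737600000 -- roughly 3.7 GB/year
    }
}
```

Even a few years of this workload fits comfortably on commodity hardware, which is why monitoring the live admin statistics page beats trying to size the box up front.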
Re: Virtual field, Statistics
The beauty/problem with open source is that issues are picked up when somebody thinks they're important enough and has the time/energy to work on them. And that person can be you <G>... What usually happens is that someone submits a patch, various people comment on it, look it over, ask for changes or provide other feedback (e.g. "Have you considered XYZ", or "You do realize that if we implement this patch, the universe will end, don't you? <G>"). Then, after a bunch of back-and-forths, one of the committers decides that it's ready to be included in the trunk and/or the branches. The chances of the particular change you need being included in trunk go up dramatically if you provide a patch. And keep pushing (gently) on the issue.

One tip, though. Before investing a lot of time and energy in creating a patch, figure out how you expect to change the code and ask some questions (by commenting on the JIRA issue) about what you're thinking about doing. You'll often get some really valuable feedback before investing lots of time... See: http://wiki.apache.org/solr/HowToContribute for the details of getting the source, compiling, running unit tests, setting up your IDE, etc.

Best
Erick

On Mon, Oct 18, 2010 at 6:59 AM, Tanguy Moal tanguy.m...@gmail.com wrote:

Hello Lance, thank you for your reply. I created the following JIRA issue: https://issues.apache.org/jira/browse/SOLR-2171, as suggested. Can you tell me how new issues are handled by the development teams, and whether there's a way I could help/contribute?

-- Tanguy

2010/10/16 Lance Norskog goks...@gmail.com:

Please add a JIRA issue requesting this. A bunch of things are not supported for functions: returning as a field value, for example.

On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal tanguy.m...@gmail.com wrote:

Dear solr-user folks, I would like to use the stats module to perform very basic statistics (mean, min and max), which is actually working just fine.
Nevertheless I found a little limitation that bothers me a tiny bit: how to perform the exact same statistics, but on the result of a function query rather than on a field.

Example schema:
- string : id
- float : width
- float : height
- float : depth
- string : color
- float : price

What I'd like to do is something like:

select?q=price:[45.5 TO 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width, height), depth)}

I would expect to obtain:

<lst name="stats">
  <lst name="stats_fields">
    <lst name="(product(product(width,height),depth))">
      <double name="min">...</double>
      <double name="max">...</double>
      <double name="sum">...</double>
      <long name="count">...</long>
      <long name="missing">...</long>
      <double name="sumOfSquares">...</double>
      <double name="mean">...</double>
      <double name="stddev">...</double>
      <lst name="facets">
        <lst name="color">
          <lst name="white">
            <double name="min">...</double>
            <double name="max">...</double>
            <double name="sum">...</double>
            <long name="count">...</long>
            <long name="missing">...</long>
            <double name="sumOfSquares">...</double>
            <double name="mean">...</double>
            <double name="stddev">...</double>
          </lst>
          <lst name="red">
            <!-- same statistics as above -->
          </lst>
          <!-- Other facets on other colors go here -->
        </lst>
      </lst><!-- end of statistical facets on volumes -->
    </lst><!-- end of stats on volumes -->
  </lst><!-- end of stats_fields node -->
</lst>

Of course computing the volume can be performed before indexing the data, but defining virtual fields on the fly given an arbitrary function is powerful, and I am comfortable with the idea that many others would appreciate it. Especially for BI needs and so on... :-D Is there a way to do it easily that I have not been able to find, or is it actually impossible? Thank you very much in advance for your help.

-- Tanguy

-- Lance Norskog goks...@gmail.com
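Until a virtual-field feature like the one requested in the JIRA issue exists, the usual workaround is to precompute the derived value (here the volume) at index time and point stats.field at that real field. A small standalone sketch of the min/max/mean aggregation the stats component would then be doing (class and method names are mine, not Solr's):

```java
public class VolumeStats {
    // Each row is {width, height, depth}; volume = width * height * depth,
    // precomputed per document at index time.
    public static double[] minMaxMean(double[][] dims) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double[] d : dims) {
            double volume = d[0] * d[1] * d[2];
            min = Math.min(min, volume);
            max = Math.max(max, volume);
            sum += volume;
        }
        return new double[] { min, max, sum / dims.length };
    }

    public static void main(String[] args) {
        double[][] dims = { {2, 3, 4}, {1, 1, 1}, {5, 2, 1} };
        double[] s = minMaxMean(dims);
        System.out.println(s[0] + " " + s[1] + " " + s[2]); // min 1.0, max 24.0, mean ~11.67
    }
}
```

The trade-off is the one the thread already names: precomputation burns the function into the index, so changing the formula means reindexing.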
Re: Boosting documents based on the vote count
I know, but I can't figure out what functions to use. :)

Oh, I see. Why not just use {!boost b=log(vote)}? Or maybe scale(vote,0.5,10)?
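For intuition on why log() is a reasonable dampener here: Solr's log function is base 10, so even huge vote counts only nudge the boost (a plain-Java illustration, not Solr code; in practice you would also guard against a vote count of 0, e.g. by boosting on vote+1, since log of 0 is undefined):

```java
public class LogBoost {
    // Solr's log() function query is base 10.
    public static double boost(double votes) {
        return Math.log10(votes);
    }

    public static void main(String[] args) {
        System.out.println(boost(10));   // 1.0
        System.out.println(boost(1000)); // 3.0 -- 100x the votes, only 3x the boost
    }
}
```

scale(vote,0.5,10), by contrast, linearly maps the observed min/max vote counts onto [0.5, 10], so one runaway document with a huge vote count compresses everything else toward the bottom of the range.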
Re: how can i use solrj binary format for indexing?
On Mon, Oct 18, 2010 at 5:26 PM, Jason, Kim hialo...@gmail.com wrote:

Hi, Gora. I haven't yet tried indexing a huge amount of XML files through curl or pure Java (like post.jar). Is indexing through XML really fast? How many files did you index? And how did you do it (using curl or pure Java)? [...]

We did it through curl. There were some 3.5 million XML files, and some 60 fields in the Solr schema, with minor tokenising, though with some facets. A total of about 40GB of data. We used five Solr instances, with five cores on each instance. From what I recall, it took 6h, though we might well have been limited by the read speed of a slow network drive that held the data. If done in this way, one might need to merge the data from the various cores, a task which took us about 1.5h.

Regards,
Gora
Re: solr requirements
Hi, here is some more info about it. I use Solr to output only the file names (file ids). Here I enclose the fields in my schema.xml; presently I have only about 40MB of indexed data.

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="textTight" indexed="true" stored="false" omitNorms="true"/>
<field name="name" type="textgen" indexed="true" stored="false"/>
<field name="manu" type="textgen" indexed="true" stored="false" omitNorms="true"/>
<field name="cat" type="text_ws" indexed="true" stored="false" multiValued="true" omitNorms="true" />
<field name="features" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="includes" type="text" indexed="true" stored="false" termVectors="true" termPositions="true" termOffsets="true" />
<field name="weight" type="float" indexed="true" stored="false"/>
<field name="price" type="float" indexed="true" stored="false"/>
<field name="popularity" type="int" indexed="true" stored="false" />
<field name="inStock" type="boolean" indexed="true" stored="false" />
<!-- The following store examples are used to demonstrate the various ways one might
     _CHOOSE_ to implement spatial. It is highly unlikely that you would ever have ALL
     of these fields defined. -->
<field name="store" type="location" indexed="true" stored="false"/>
<field name="store_lat_lon" type="latLon" indexed="true" stored="false"/>
<field name="store_hash" type="geohash" indexed="true" stored="false"/>
<!-- Common metadata fields, named specifically to match up with SolrCell metadata when
     parsing rich documents such as Word, PDF. Some fields are multiValued only because
     Tika currently may return multiple values for them. -->
<field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text" indexed="true" stored="false"/>
<field name="description" type="text" indexed="true" stored="false"/>
<field name="comments" type="text" indexed="true" stored="false"/>
<field name="author" type="textgen" indexed="true" stored="false"/>
<field name="keywords" type="textgen" indexed="true" stored="false"/>
<field name="category" type="textgen" indexed="true" stored="false"/>
<field name="content_type" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="false"/>
<field name="links" type="string" indexed="true" stored="false" multiValued="true"/>
<!-- added here content satya -->
<field name="content" type="spell" indexed="true" stored="false" multiValued="true"/>
<!-- catchall field, containing all other searchable text fields (implemented via
     copyField further on in this schema) -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true" termVectors="true"/>
<!-- catchall text field that indexes tokens both normally and in reverse for efficient
     leading wildcard queries. here satya -->
<field name="text_rev" type="text_rev" indexed="true" stored="false" multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or group results by
     manufacturer. copied from manu via copyField here satya -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="spell" type="spell" indexed="true" stored="false" multiValued="true"/>
<!-- heere changed -->
<field name="payloads" type="payloads" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="false" default="NOW" multiValued="false"/>

Regards,
satya
RE: query between two date
Recommend using the pdate format for faster range queries. Here's how (or one way) to do a range query in Solr:

defType=lucene&q=some_field:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]

Does that answer your question? I don't really understand what you're trying to do with your two dates. You can of course combine range queries with operators with the standard/lucene query parser:

defType=lucene&q=some_field:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z] AND other_field:[whatever TO whatever]

There are also ways to make a query comparing the values of two fields, using function queries. But it's slightly confusing and I'm not sure that's what you want to do; I'm not really sure what you want to do. Want to give an example of exactly what input you have (from your application), and what question you are trying to answer from your index?

From: nedaha [neda...@gmail.com] Sent: Monday, October 18, 2010 5:05 AM To: solr-user@lucene.apache.org Subject: Re: query between two date

Thanks for your reply. I know about the Solr date format!! Check-in and check-out dates are in a user-friendly format that we use in our search form for the system's users; I change the format via code and then send them to Solr. I want to know how I can make a query to compare the range between check-in and check-out date with some separate different days that I have in the Solr index. For example: check-in date is 2010-10-19T00:00:00Z and check-out date is 2010-10-21T00:00:00Z. When I want to build a query from my application I have a date range, but in the Solr index I have separate dates. So how can I compare them to get the appropriate result?

-- View this message in context: http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1723752.html Sent from the Solr - User mailing list archive at Nabble.com.
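When the application converts its user-friendly check-in/check-out dates for Solr, the classic pitfall is formatting in the server's local timezone; Solr date strings are always UTC. A sketch of the conversion (the field name checkin_date is illustrative, not from the poster's schema):

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class SolrDateRange {
    // Build a Solr range query clause from two application-side dates.
    public static String rangeQuery(String field, Calendar in, Calendar out) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // Solr dates are always UTC
        return field + ":[" + fmt.format(in.getTime())
                + " TO " + fmt.format(out.getTime()) + "]";
    }

    public static void main(String[] args) {
        Calendar in = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        in.clear(); in.set(2010, Calendar.OCTOBER, 19);
        Calendar out = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        out.clear(); out.set(2010, Calendar.OCTOBER, 21);
        System.out.println(rangeQuery("checkin_date", in, out));
        // checkin_date:[2010-10-19T00:00:00Z TO 2010-10-21T00:00:00Z]
    }
}
```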
Re: SOLR DateTime and SortableLongField field type problems
On Mon, Oct 18, 2010 at 7:52 AM, Michael Sokolov soko...@ifactory.com wrote:

I think if you look closely you'll find the date quoted in the exception report doesn't match any of the declared formats in the schema. I would suggest, as a first step, hunting through your data to see where that date is coming from. -Mike

[Note: RE-sending this because apparently in my sleepy stupor, I clicked the wrong Reply button and never sent this to the list (It's a Monday) :)]

I've noticed that date anomaly as well, and I've discovered that it is one of the gotchas of DIH: it seems to modify my date to that format. All of the dates in the data are in the correct yyyy-MM-dd'T'hh:mm:ss'Z' format. Once a value is run through dateTimeFormat, I assume it is converted into a Date object; trying to use that Date object in any other transform (i.e., using a template, or even another dateTimeFormat) results in the exception I've described (displaying the date in the incorrect format).

Thanks,
Ken Stanley
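Ken's gotcha can be reproduced with nothing but SimpleDateFormat: once the first dateTimeFormat turns the string into a java.util.Date, any later pass sees Date.toString() output (the "Wed Nov 28 21:39:05 EST 2007" style from the exception), which the ISO pattern no longer matches. A minimal standalone demonstration (class name is mine; the pattern mirrors the one from the thread):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DihDateDemo {
    // Returns true if the second parse fails, reproducing the DIH warning.
    public static boolean secondParseFails() {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss'Z'");
        try {
            // First transform: String -> java.util.Date (succeeds).
            Date d = iso.parse("2001-12-04T19:38:01Z");
            // Any later transform stringifies the Date, yielding
            // Date.toString() form, e.g. "Tue Dec 04 19:38:01 EST 2001".
            String asString = d.toString();
            iso.parse(asString); // second transform: throws ParseException
            return false;
        } catch (ParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondParseFails()); // true
    }
}
```

This supports the chained-transformer diagnosis: sort_date_modified reads the already-converted modified_date column, so the second dateTimeFormat never sees an ISO string.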
Re: API for using Multi cores with SolrJ
Thanks Peter. That helps a lot. It's weird that this is not documented anywhere. :(

On Mon, Oct 18, 2010 at 3:42 PM, Peter Karich peat...@yahoo.de wrote:

I asked this myself... here could be some pointers:

http://lucene.472066.n3.nabble.com/SolrJ-and-Multi-Core-Set-up-td1411235.html
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-in-Single-Core-td475238.html

Hi everyone, I'm trying to write some code for creating and using multi cores. Is there a method available for this purpose, or do I have to do an HTTP request to a URL such as http://localhost:8983/solr/admin/cores?action=STATUS&core=core0 ? Is there an API available for this purpose? For example, if I want to create a new core named core01, then check its status, and then insert a document into the index of core01, how do I do it? Any help or a document would help greatly. Thanks in advance.

-- Regards, Tharindu

-- http://jetwick.com twitter search prototype

-- Regards, Tharindu
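For future readers of the archive: in SolrJ the core-admin operations asked about are exposed through CoreAdminRequest. The snippet below is an untested sketch against the Solr 1.4-era API (server URL and core names are examples; it requires solrj on the classpath and a running multicore Solr, so treat every call here as an assumption to verify against the javadoc):

```java
// Untested sketch -- needs a live multicore Solr at this URL.
CommonsHttpSolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");

// Create a core named core01, then query its status.
CoreAdminRequest.createCore("core01", "core01", admin);
CoreAdminResponse status = CoreAdminRequest.getStatus("core01", admin);

// Talk to the new core directly to index a document.
CommonsHttpSolrServer core01 = new CommonsHttpSolrServer("http://localhost:8983/solr/core01");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
core01.add(doc);
core01.commit();
```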
Re: API for using Multi cores with SolrJ
On Mon, Oct 18, 2010 at 10:12 AM, Tharindu Mathew mcclou...@gmail.com wrote: Thanks Peter. That helps a lot. It's weird that this not documented anywhere. :( Feel free to edit the wiki :)
Re: how can i use solrj binary format for indexing?
Do you already have the files as Solr XML? If so, I don't think you need solrj. If you need to build SolrInputDocuments from your existing structure, solrj is a good choice. If you are indexing lots of stuff, check the StreamingUpdateSolrServer: http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

On Sun, Oct 17, 2010 at 11:01 PM, Jason, Kim hialo...@gmail.com wrote:

Hi all, I have a huge amount of xml files for indexing. I want to index using the solrj binary format to get a performance gain, because I heard that using xml files to index is quite slow. But I don't know how to index through the solrj binary format and can't find examples. Please give some help. Thanks,

-- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1722612.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how can i use solrj binary format for indexing?
Thank you for the reply, Gora. But I still have several questions. Did you use separate indexes? If so, you indexed 0.7 million XML files per instance and merged them. Is that right? Please let me know how multiple instances and cores worked in your case.

Regards,

-- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1725679.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Disable (or prohibit) per-field overrides
You know about the 'invariant' that can be set in the request handler, right? Not sure if that will do for you or not, but it sounds related. Added recently to some wiki page somewhere, although the feature has been there for a long time. Let's see if I can find the wiki page... Ah yes: http://wiki.apache.org/solr/SearchHandler#Configuration

Markus Jelsma wrote:

Hi, Thanks for the suggestion and pointer. We've implemented it using a single regex in Nginx for now. Cheers,

: Anyone know a useful method to disable or prohibit the per-field override
: features for the search components? If not, where to start to make it
: configurable via solrconfig and attempt to come up with a working patch?

If your goal is to prevent *clients* from specifying these (while you're still allowed to use them in your defaults), then the simplest solution is probably something external to Solr -- along the lines of mod_rewrite. Internally... that would be tough. You could probably write a SearchComponent (configured to run first) that does it fairly easily -- just wrap the SolrParams in an impl that returns null anytime a component asks for a param name that starts with "f." (and excludes those param names when asked for a list of the param names). It could probably be generalized to support arbitrary rules in a way that might be handy for other folks, but it would still just be wrapping all of the params, so it would prevent you from using them in your config as well. Ultimately I think a general solution would need to be in RequestHandlerBase... where it wraps the request params using the defaults and invariants... you'd want the custom exclusion rules to apply only to the request params from the client.

-Hoss
Re: Disable (or prohibit) per-field overrides
Thanks for your reply. But as I replied to Erick's suggestion, which is quite the same: yes, we're using it, but the problem is that there can be many fields, and that means quite a large list of parameters to set for each request handler, and there can be many request handlers. It's not very practical for us to maintain such a big set of invariants. It's much easier for us to maintain a very short whitelist than a huge blacklist.

Cheers

On Monday, October 18, 2010 04:59:09 pm Jonathan Rochkind wrote:

You know about the 'invariant' that can be set in the request handler, right? Not sure if that will do for you or not, but it sounds related. Added recently to some wiki page somewhere, although the feature has been there for a long time. Let's see if I can find the wiki page... Ah yes: http://wiki.apache.org/solr/SearchHandler#Configuration

Markus Jelsma wrote:

Hi, Thanks for the suggestion and pointer. We've implemented it using a single regex in Nginx for now. Cheers,

: Anyone know a useful method to disable or prohibit the per-field
: override features for the search components? If not, where to start
: to make it configurable via solrconfig and attempt to come up with a
: working patch?

If your goal is to prevent *clients* from specifying these (while you're still allowed to use them in your defaults), then the simplest solution is probably something external to Solr -- along the lines of mod_rewrite. Internally... that would be tough. You could probably write a SearchComponent (configured to run first) that does it fairly easily -- just wrap the SolrParams in an impl that returns null anytime a component asks for a param name that starts with "f." (and excludes those param names when asked for a list of the param names). It could probably be generalized to support arbitrary rules in a way that might be handy for other folks, but it would still just be wrapping all of the params, so it would prevent you from using them in your config as well. Ultimately I think a general solution would need to be in RequestHandlerBase... where it wraps the request params using the defaults and invariants... you'd want the custom exclusion rules to apply only to the request params from the client.

-Hoss

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
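Hoss's "wrap the SolrParams" idea, combined with Markus's preference for a short whitelist, can be prototyped with plain collections before touching Solr internals: drop any parameter whose name starts with "f." unless it is explicitly allowed. A standalone sketch of just the filtering rule (class and method names are mine, not Solr's):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PerFieldParamFilter {
    // Keep a param if it is not an f.* per-field override,
    // or if it is explicitly whitelisted.
    public static Map<String, String> filter(Map<String, String> params,
                                             Set<String> whitelist) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!e.getKey().startsWith("f.") || whitelist.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "ipod");
        params.put("f.title.facet.limit", "50");
        params.put("f.body.hl.snippets", "3");
        Set<String> allow = Collections.singleton("f.title.facet.limit");
        System.out.println(filter(params, allow).keySet());
        // prints: [q, f.title.facet.limit]
    }
}
```

Inside a SearchComponent the same predicate would live in a SolrParams wrapper's getParams/getParameterNamesIterator methods, applied only to the client-supplied params as Hoss describes.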
query pending commits?
I have an indexing pipeline that occasionally needs to check if a document is already in the index (even if not committed yet). Any suggestions on how to do this without calling <commit/> before each check? I have a list of document ids and need to know which ones are in the index (actually I need to know which ones are not in the index). I figured I would write a custom RequestHandler that would check the main Reader and the UpdateHandler reader, but it now looks like 'update' is handled directly within IndexWriter. Any ideas?

thanks
ryan
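Short of a custom handler that can see the uncommitted UpdateHandler state, one pragmatic workaround is for the indexing pipeline itself to remember the ids it has sent since the last commit; a document is then "present" if it is either in the committed index or in that pending set. A sketch of the bookkeeping (application-side code, not a Solr API; it assumes a single pipeline process):

```java
import java.util.HashSet;
import java.util.Set;

public class PendingTracker {
    private final Set<String> pending = new HashSet<String>();

    // Record an id sent to Solr; returns false if it was already pending.
    public boolean markSent(String id) {
        return pending.add(id);
    }

    public boolean isPending(String id) {
        return pending.contains(id);
    }

    // Call after a successful commit: those docs are now
    // visible to ordinary searches, so the set can be reset.
    public void onCommit() {
        pending.clear();
    }

    public static void main(String[] args) {
        PendingTracker t = new PendingTracker();
        System.out.println(t.markSent("doc1"));  // true  -- newly pending
        System.out.println(t.markSent("doc1"));  // false -- already pending
        t.onCommit();
        System.out.println(t.isPending("doc1")); // false -- committed
    }
}
```

The "not in the index" list then falls out as: ids that are neither returned by a query against the committed index nor present in the pending set.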
Commits on service after shutdown
Hi, I'm new to the mailing list. I'm implementing Solr at my current job, and I'm having some problems. I was testing the consistency of commits. I found, for example, that if we add X documents to the index (without committing) and then restart the service, the documents are committed: they show up in the results. To me this looks like an error. But when we add X documents to the index (without committing) and then kill the process and start it again, the documents don't appear. This is the behaviour I want. Is there any param to avoid the auto-committing of documents after a shutdown? Is there any param to keep those un-committed documents alive after a kill?

Thanks!

-- __ Ezequiel. Http://www.ironicnet.com
Re: Commits on service after shutdown
The documents should be implicitly committed when the Lucene index is closed. When you perform a graceful shutdown, the Lucene index gets closed and the documents get committed implicitly. When the shutdown is abrupt, as in a KILL -9, this does not happen and the updates are lost. You can use the auto-commit parameter when sending your updates so that the changes are saved right away, though this could slow down the indexing speed considerably. But I do not believe there are parameters to keep those un-committed documents alive after a kill.

On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara ezech...@gmail.com wrote:

Hi, I'm new to the mailing list. I'm implementing Solr at my current job, and I'm having some problems. I was testing the consistency of commits. I found, for example, that if we add X documents to the index (without committing) and then restart the service, the documents are committed: they show up in the results. To me this looks like an error. But when we add X documents to the index (without committing) and then kill the process and start it again, the documents don't appear. This is the behaviour I want. Is there any param to avoid the auto-committing of documents after a shutdown? Is there any param to keep those un-committed documents alive after a kill? Thanks!

-- __ Ezequiel. Http://www.ironicnet.com

-- °O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
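For completeness, the related knob: the autoCommit block in solrconfig.xml triggers commits automatically by pending-document count or by elapsed time (the values below are illustrative, not recommendations). Note it addresses the opposite problem -- it does not change the implicit commit that happens when the index is closed on graceful shutdown:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many uncommitted docs -->
    <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```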
RE: how can i use solrj binary format for indexing?
Hi all, I have a huge amount of xml files for indexing. I want to index using the solrj binary format to get a performance gain, because I heard that using xml files to index is quite slow. But I don't know how to index through the solrj binary format and can't find examples. Please give some help. Thanks,

You might want to take a look at this section of the wiki too -- http://wiki.apache.org/solr/Solrj#Setting_the_RequestWriter

-Jon

-----Original Message----- From: Jason, Kim [mailto:hialo...@gmail.com] Sent: Monday, October 18, 2010 7:52 AM To: solr-user@lucene.apache.org Subject: Re: how can i use solrj binary format for indexing?

Thank you for the reply, Gora. But I still have several questions. Did you use separate indexes? If so, you indexed 0.7 million XML files per instance and merged them. Is that right? Please let me know how multiple instances and cores worked in your case.

Regards,

-- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1725679.html Sent from the Solr - User mailing list archive at Nabble.com.
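Putting the RequestWriter wiki pointer into code: switching SolrJ from XML to the javabin wire format is one call on the server object. The snippet below is an untested sketch for the Solr 1.4-era SolrJ API (URL and field values are examples; it requires solrj on the classpath and a running Solr, so verify the class names against the javadoc):

```java
// Untested sketch -- needs a live Solr at this URL.
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
server.setRequestWriter(new BinaryRequestWriter()); // send javabin instead of XML

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("text", "fly away");
server.add(doc);
server.commit();
```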
ApacheCon Atlanta Meetup
Is there interest in having a Meetup at ApacheCon? Who's going? Would anyone like to present? We could do something less formal, too, and just have drinks and Q&A/networking. Thoughts?

-Grant
Spell checking question from a Solr novice
Hi, I am looking for a quick solution to improve a search engine's spell-checking performance. I was wondering if anyone has tried to integrate the Google spell-check API with the Solr search engine (if possible). Google spellcheck came to my mind for two reasons. First, it is costly to clean up the data to be used as a spell-check baseline. Secondly, Google probably has the most complete set of misspelled search terms. That's why I would like to know if it is a feasible way to go.

Thanks,
Xin
Re: Commits on service after shutdown
I understand, but I want to have control over what is committed or not. In our scenario, we want to add documents to the index and maybe trigger the commit an hour later. If in the middle we have a server shutdown, or any process sends a shutdown signal to the process, I don't want those documents being committed. Should I file a bug issue or an enhancement issue? Thanks

On Mon, Oct 18, 2010 at 3:54 PM, Israel Ekpo israele...@gmail.com wrote: The documents should be implicitly committed when the Lucene index is closed. When you perform a graceful shutdown, the Lucene index gets closed and the documents get committed implicitly. When the shutdown is abrupt, as in a KILL -9, this does not happen and the updates are lost. You can use the autocommit parameter when sending your updates so that the changes are saved right away, though this could slow down the indexing speed considerably; I do not believe there are parameters to keep those uncommitted documents alive after a kill.

On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara ezech...@gmail.com wrote: Hi, I'm new on the mailing list. I'm implementing Solr at my current job, and I'm having some problems. I was testing the consistency of commits. I found, for example, that if we add X documents to the index (without committing) and then restart the service, the documents are committed: they show up in the results. To me this looks like an error. But when we add X documents to the index (without committing) and then kill the process and start it again, the documents don't appear. This is the behaviour I want. Is there any param to avoid the auto-committing of documents after a shutdown? Is there any param to keep those uncommitted documents alive after a kill? Thanks! -- Ezequiel http://www.ironicnet.com

-- "Good Enough" is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/ -- Ezequiel http://www.ironicnet.com
Re: Commits on service after shutdown
No... you would just turn autocommit off and have the thread that is doing updates to your indexes commit every hour. I'd think that this would take care of the scenario you are describing. Matt

On 10/18/2010 3:50 PM, Ezequiel Calderara wrote: [...]
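For context, the autocommit behaviour being discussed lives in solrconfig.xml under the update handler. A sketch of the relevant section (the maxDocs/maxTime values are illustrative, not from the thread); with the autoCommit block absent or commented out, which is the default, nothing is committed until the client explicitly issues a commit:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Uncomment to let Solr commit on its own; leave it out so that
       nothing is committed until the indexing thread says so.
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>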
Re: Commits on service after shutdown
But if something happens within that hour, I will have either lost the documents or committed them to the index outside of the schedule. How can I handle this scenario? I think that Solr (or Lucene) should ensure the durability (http://en.wikipedia.org/wiki/Durability_(database_systems)) of the data even when it is in an uncommitted state.

On Mon, Oct 18, 2010 at 4:53 PM, Matthew Hall mh...@informatics.jax.org wrote: [...]
RE: Spell checking question from a Solr novice
Oops, never mind. I just read the Google API policy: a limit of 1000 queries per day, and for non-commercial use only.

-Original Message- From: Xin Li Sent: Monday, October 18, 2010 3:43 PM To: solr-user@lucene.apache.org Subject: Spell checking question from a Solr novice [...]
Re: Commits on service after shutdown
I'll see if I can resolve this by adding an extra core with the same schema to hold these documents. Core0 will act as a queue and Core1 will be the real index, and a commit on Core0 will trigger an add to Core1 and its commit. That way I can be sure of not losing data. It surprises me that Solr doesn't have this feature built in. I still have to verify the performance, but it looks good to me. Anyway, any help would be appreciated.

On Mon, Oct 18, 2010 at 5:05 PM, Ezequiel Calderara ezech...@gmail.com wrote: [...]

-- Ezequiel http://www.ironicnet.com
Re: Spell checking question from a Solr novice
In general, the benefit of the built-in Solr spellcheck is that it can use a dictionary based on your actual index. If you want to use some external API, you certainly can, in your actual client app -- but then it doesn't really need to involve Solr at all anymore, does it? Is there any benefit I'm not thinking of to doing that on the Solr side, instead of just in your client app? I think Yahoo (and maybe Microsoft?) have similar APIs with more generous terms of service, but I haven't looked in a while.

Xin Li wrote: [...]
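For reference, the index-based spellchecker Jonathan mentions is wired up in solrconfig.xml roughly like this (the field name and index directory below are illustrative, not from the thread):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- build the dictionary from terms already indexed in this field -->
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>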
Re: Spell checking question from a Solr novice
I think a spellchecker based on your index has clear advantages. You can spellcheck words specific to your domain which may not be available in an outside dictionary. You can always dump the list from WordNet to get a starter English dictionary. But then it also means that misspelled words from your domain become the suggested "correct" words. Hmmm... you'll need a way to prune out such words. Even then, your own domain-based dictionary is a total go.

On Mon, Oct 18, 2010 at 1:55 PM, Jonathan Rochkind rochk...@jhu.edu wrote: [...]
Re: Spell checking question from a Solr novice
If you know the misspellings you could prevent them from being added to the dictionary with a StopFilterFactory, like so:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="misspelled_words.txt"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.LengthFilterFactory" min="2" max="50"/>
  </analyzer>
</fieldType>

where misspelled_words.txt contains the misspellings.

On Mon, Oct 18, 2010 at 5:14 PM, Pradeep Singh pksing...@gmail.com wrote: [...]
Schema required?
We need to index documents where the fields in the document can change frequently. It appears that we would need to update our Solr schema definition before we can reindex using new fields. Is there any way to make the Solr schema optional? --frank
I need to indexing the first character of a field in another field
Hello guys, I need to index the first character of the field "autor" in another field, "inicialautor". Example: autor = Mark Webber, inicialautor = M. I wrote a JavaScript function in the dataimport, but the field inicialautor is indexed empty. The function:

function InicialAutor(linha) {
    var aut = linha.get("autor");
    if (aut != null) {
        if (aut.length() > 0) {
            var ch = aut.charAt(0);
            linha.put("inicialautor", ch);
        } else {
            linha.put("inicialautor", "");
        }
    } else {
        linha.put("inicialautor", "");
    }
    return linha;
}

What's wrong? Thanks, Renato Wesenauer
RE: Schema required?
Hi Frank, Check out the Dynamic Fields option from here http://wiki.apache.org/solr/SchemaXml Tim -Original Message- From: Frank Calfo [mailto:fca...@aravo.com] Sent: Monday, October 18, 2010 5:25 PM To: solr-user@lucene.apache.org Subject: Schema required? We need to index documents where the fields in the document can change frequently. It appears that we would need to update our Solr schema definition before we can reindex using new fields. Is there any way to make the Solr schema optional? --frank
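To sketch the idea from that wiki page: dynamic fields let any incoming field whose name matches a pattern be accepted without declaring it individually in schema.xml. The suffix conventions below are the common ones from the example schema; adjust to taste:

<!-- Fields arriving with these suffixes need no individual declaration -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
<dynamicField name="*_t" type="text"   indexed="true" stored="true"/>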
Admin for spellchecker?
Do we need an admin screen for spellchecker? Where you can browse the words and delete the ones you don't like so that they don't get suggested?
Re: Spell checking question from a Solr novice
You can cross-check the new words against a dictionary and keep them in the file as Jason described... What Pradeep said is true: it is always better to have suggestions related to your index than to have suggestions with no results...

On Mon, Oct 18, 2010 at 6:24 PM, Jason Blackerby jblacke...@gmail.com wrote: [...]

-- Ezequiel http://www.ironicnet.com
Re: I need to indexing the first character of a field in another field
How are you declaring the transformer in the dataconfig?

On Mon, Oct 18, 2010 at 6:31 PM, Renato Wesenauer renato.wesena...@gmail.com wrote: [...]

-- Ezequiel http://www.ironicnet.com
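For reference, a DataImportHandler script transformer is declared roughly like this; the field and function names are taken from the question, while the entity name and query are placeholders:

<dataConfig>
  <script><![CDATA[
    function InicialAutor(linha) {
      var aut = linha.get("autor");
      linha.put("inicialautor",
                (aut != null && aut.length() > 0) ? aut.charAt(0) : "");
      return linha;
    }
  ]]></script>
  <document>
    <!-- dataSource and real query omitted; only the wiring is shown -->
    <entity name="livro" transformer="script:InicialAutor" query="...">
      <field column="autor" name="autor"/>
      <field column="inicialautor" name="inicialautor"/>
    </entity>
  </document>
</dataConfig>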
Re: I need to indexing the first character of a field in another field
You can use the regular-expression-based or template transformer without writing a separate function. It's pretty easy to use.

On Mon, Oct 18, 2010 at 2:31 PM, Renato Wesenauer renato.wesena...@gmail.com wrote: [...]
Re: Admin for spellchecker?
I was thinking about that too; you would also need a way to mark a word as valid, so it doesn't get flagged as wrong.

On Mon, Oct 18, 2010 at 6:37 PM, Pradeep Singh pksing...@gmail.com wrote: [...]

-- Ezequiel http://www.ironicnet.com
Re: Schema required?
Frank Calfo wrote: [...]

No. But you can design your schema more flexibly than you are designing it now: design it in a more abstract way, so it doesn't in fact need to change when external factors change. I mean, every time you change your schema you are going to have to change any client applications that use your Solr index to look things up with the new fields too, right? You don't want to be changing your schema all the time; you want to design your schema so it doesn't need to change. Solr is not an RDBMS. You do not need to 'normalize' your data, or design your schema the same way you would for an RDBMS. Design your schema to feed your actual and potential client apps. Jonathan
Re: I need to indexing the first character of a field in another field
You can just do this with a copyField in your schema.xml instead. Copy to a field whose analyzer uses a regex filter (or similar) to keep only the first non-whitespace character (and perhaps force uppercase too if you want). That's what I'd do; it's easier, and it will also work if you index to Solr from something other than the dataimport. Renato Wesenauer wrote: Hello guys, I need to index the first character of the field autor in another field inicialautor. Example: autor = Mark Webber, inicialautor = M. I wrote a JavaScript function in the dataimport, but the field inicialautor is indexed empty. The function: function InicialAutor(linha) { var aut = linha.get('autor'); if (aut != null) { if (aut.length > 0) { var ch = aut.charAt(0); linha.put('inicialautor', ch); } else { linha.put('inicialautor', ''); } } else { linha.put('inicialautor', ''); } return linha; } What's wrong? Thanks, Renato Wesenauer
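A sketch of that copyField approach; the field and type names are made up for the example, and the optional upcasing step is left out:

```xml
<!-- Illustrative schema.xml fragments. -->
<fieldType name="first_letter" class="solr.TextField">
  <analyzer>
    <!-- Treat the whole value as one token... -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- ...then keep only the first non-whitespace character. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\s*(\S).*$" replacement="$1"/>
  </analyzer>
</fieldType>

<field name="autor" type="string" indexed="true" stored="true"/>
<field name="inicialautor" type="first_letter" indexed="true" stored="true"/>
<copyField source="autor" dest="inicialautor"/>
```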
Re: I need to indexing the first character of a field in another field
This exact topic was just discussed a few days ago... http://search.lucidimagination.com/search/document/7b6e2cc37bbb95c8/faceting_and_first_letter_of_fields#3059a28929451cb4 My comments on when/where it makes sense to put this logic... http://search.lucidimagination.com/search/document/7b6e2cc37bbb95c8/faceting_and_first_letter_of_fields#7b6e2cc37bbb95c8 : Date: Mon, 18 Oct 2010 19:31:28 -0200 : From: Renato Wesenauer renato.wesena...@gmail.com : Reply-To: solr-user@lucene.apache.org : To: solr-user@lucene.apache.org : Subject: I need to indexing the first character of a field in another field : : Hello guys, : : I need to indexing the first character of the field autor in another field : inicialautor. : Example: :autor = Mark Webber :inicialautor = M : : I did a javascript function in the dataimport, but the field inicialautor : indexing empty. : : The function: : : function InicialAutor(linha) { : var aut = linha.get('autor'); : if (aut != null) { : if (aut.length > 0) { : var ch = aut.charAt(0); : linha.put('inicialautor', ch); : } : else { : linha.put('inicialautor', ''); : } : } : else { : linha.put('inicialautor', ''); : } : return linha; : } : : What's wrong? : : Thanks, : : Renato Wesenauer : -Hoss
Removing Common Web Page Header and Footer from All Content Fetched by Nutch
Hi All, I am indexing a web application with approximately 9500 distinct URLs and their contents using Nutch and Solr. I use Nutch to fetch the URLs and links and to crawl the entire web application, extracting the content of all pages. Then I run the solrindex command to send the content to Solr. The problem I have now is that the first 1000 or so characters of some pages and the last 400 characters of the pages are showing up in the search results. These are the contents of the common header and footer used throughout the site. The only workaround I have now is to index everything and then go through each document one at a time, removing the first 1000 characters if the Levenshtein distance between the first 1000 characters of the page and the common header is less than a certain value. The same applies to the footer content common to all pages. Is there a way to ignore certain "stop phrases", so to speak, in the Nutch configuration, based on Levenshtein or Jaro-Winkler distance, so that the parts of the fetched data matching those stop phrases are not parsed? Any useful pointers would be highly appreciated. Thanks in advance. -- °O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
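For what it's worth, the edit-distance check behind that workaround can be sketched in a few lines of plain Java; the class, method, and threshold names here are illustrative, not Nutch or Solr API:

```java
// Illustrative sketch: strip a near-duplicate header from fetched page text
// using Levenshtein distance. Names and threshold are placeholders.
class HeaderStripper {

    // Classic dynamic-programming edit distance, two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Drop the leading chunk when it is "close enough" to the known header.
    static String stripHeader(String page, String header, int maxDistance) {
        int n = Math.min(header.length(), page.length());
        String prefix = page.substring(0, n);
        return levenshtein(prefix, header) <= maxDistance ? page.substring(n) : page;
    }
}
```

The same check, run against the tail of the page, would handle the common footer.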
Re: How can i get collect stemmed query?
Thanks for your reply :) 1. I tested q=*:*&fl=body, and 1 doc was returned as the result, as I expected. 2. I edited my schema.xml as you instructed: <analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer"/> <!-- no filter declarations --> but no result was returned. 3. I wonder about this... Typically the tokenizer and filter flow is: 1) the input stream provides a text stream to the tokenizer or filter; 2) the tokenizer or filter gets a token, and the processed token and its offset attribute info are returned; 3) the offset attributes carry the token's position information. This is part of a typical filter source as I understand it:

public class CustomStemFilter extends TokenFilter {
    private MyCustomStemmer stemmer;
    private TermAttribute termAttr;
    private OffsetAttribute offsetAttr;
    private TypeAttribute typeAttr;
    private Hashtable<String,String> reserved = new Hashtable<String,String>();
    private int offSet = 0;

    public CustomStemFilter(TokenStream tokenStream, boolean isQuery, MyCustomStemmer stemmer) {
        super(tokenStream);
        this.stemmer = stemmer;
        termAttr = (TermAttribute) addAttribute(TermAttribute.class);
        offsetAttr = (OffsetAttribute) addAttribute(OffsetAttribute.class);
        typeAttr = (TypeAttribute) addAttribute(TypeAttribute.class);
        addAttribute(PositionIncrementAttribute.class);
        // Some of my custom logic here.
    }

    public boolean incrementToken() throws IOException {
        clearAttributes();
        if (!input.incrementToken())
            return false;
        StringBuffer queryBuffer = new StringBuffer();
        // Stemming logic here; the generated query string is appended to queryBuffer.
        termAttr.setTermBuffer(queryBuffer.toString(), 0, queryBuffer.length());
        offsetAttr.setOffset(0, queryBuffer.length());
        offSet += queryBuffer.length();
        typeAttr.setType("word");
        return true;
    }
}

※ MyCustomStemmer analyzes the input string flyaway into the query string fly +body:away and returns it. At index time, contents to be searched are normally analyzed and indexed as below.
a) Contents to be indexed: fly away. b) The token fly, with length of fly = 3 (set via the offset attribute method), is returned by the filter or analyzer. c) The next token away, with length of away = 4, is returned. I think that's the general index flow. But my custom filter generates a query string, not a token, so in the process the offset value is set to the query's length, not a single token's length. I wonder: does the value set by the offsetAttr.setOffset() method influence search results when using Solr? (I tested this in the query input box on the main admin page at http://localhost:8983/solr/admin/ ) -- View this message in context: http://lucene.472066.n3.nabble.com/How-can-i-get-collect-search-result-from-custom-filtered-query-tp1723055p1729717.html Sent from the Solr - User mailing list archive at Nabble.com.
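As an aside on that flow: the token-stream contract expects each incrementToken() call to emit one term, so a filter that decompounds flyaway would typically buffer the parts and hand them out one per call, rather than setting the whole rewritten query string as a single term. A plain-Java sketch of that buffering idea (no Lucene classes; the compound split is hard-coded purely for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative stand-in for a decompounding filter's buffering logic:
// push() plays the role of consuming an input token, next() plays the
// role of incrementToken() emitting exactly one term per call.
class DecompoundQueue {
    private final Deque<String> pending = new ArrayDeque<String>();

    // Stand-in for the stemmer: split a known compound into its parts.
    void push(String token) {
        if (token.equals("flyaway")) {
            pending.add("fly");
            pending.add("away");
        } else {
            pending.add(token);
        }
    }

    // One term per call; null once the buffer is exhausted.
    String next() {
        return pending.poll();
    }
}
```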
Setting solr home directory in websphere
I've installed Solr a hundred times using Tomcat (on Windows) but now need to get it going with WebSphere (on Windows). For whatever reason this seems to be black magic :) I've installed the war file but have no idea how to set Solr home to let WebSphere know where the index and config files are. Can someone enlighten me on how to do this please?
Re: Setting solr home directory in websphere
You need to make sure that the following system property is one of the values specified in the JAVA_OPTS environment variable: -Dsolr.solr.home=path_to_solr_home On Mon, Oct 18, 2010 at 10:20 PM, Kevin Cunningham kcunning...@telligent.com wrote: I've installed Solr a hundred times using Tomcat (on Windows) but now need to get it going with WebSphere (on Windows). For whatever reason this seems to be black magic :) I've installed the war file but have no idea how to set Solr home to let WebSphere know where the index and config files are. Can someone enlighten me on how to do this please?
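If setting JVM arguments is awkward in WebSphere, Solr also looks up its home directory via JNDI (java:comp/env/solr/home), so an env-entry in the deployed WAR's web.xml is an alternative; the path value below is a placeholder:

```xml
<!-- web.xml fragment inside the Solr WAR; the value is a placeholder path. -->
<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>C:/solr/home</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
```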
snapshot-4.0 and maven
I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is this possible to do? If so, could someone give me a tip or two on getting started? Thanks, Matt
Re: snapshot-4.0 and maven
Once you've built the solr 4.0 jar, you can use mvn's install command like this: mvn install:install-file -DgroupId=org.apache -DartifactId=solr -Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar -DgeneratePom=true @tommychheng On 10/18/10 7:28 PM, Matt Mitchell wrote: I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is this possible to do? If so, could someone give me a tip or two on getting started? Thanks, Matt
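After that, the locally installed jar can be referenced from a project's pom.xml using the same coordinates passed to install-file above:

```xml
<dependency>
  <groupId>org.apache</groupId>
  <artifactId>solr</artifactId>
  <version>4.0-SNAPSHOT</version>
</dependency>
```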
Re: Spell checking question from a Solr novice
The first question to ask is will it work for you. The SECOND question is do you want google to know what's in your data? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Mon, 10/18/10, Xin Li x...@book.com wrote: From: Xin Li x...@book.com Subject: Spell checking question from a Solr novice To: solr-user@lucene.apache.org Date: Monday, October 18, 2010, 12:43 PM Hi, I am looking for a quick solution to improve a search engine's spell checking performance. I was wondering if anyone tried to integrate Google SpellCheck API with Solr search engine (if possible). Google spellcheck came to my mind because of two reasons. First, it is costly to clean up the data to be used as spell check baseline. Secondly, google probably has the most complete set of misspelled search terms. That's why I would like to know if it is a feasible way to go. Thanks, Xin
Re: ApacheCon Atlanta Meetup
I would love to go, but funds are low right now. NEXT year, I'd have something to demo though :-) Dennis Gearon --- On Mon, 10/18/10, Grant Ingersoll gsing...@apache.org wrote: From: Grant Ingersoll gsing...@apache.org Subject: ApacheCon Atlanta Meetup To: solr-user@lucene.apache.org Date: Monday, October 18, 2010, 11:58 AM Is there interest in having a Meetup at ApacheCon? Who's going? Would anyone like to present? We could do something less formal, too, and just have drinks and Q&A/networking. Thoughts? -Grant
'Advertising' a site
When I get my site which uses Solr/Lucene going, is it considered polite to post a small paragraph about it with a link? Dennis Gearon
Re: 'Advertising' a site
Hi Dennis, There is a PoweredBy page on the Wiki that's good for that. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, October 18, 2010 11:35:09 PM Subject: 'Advertising' a site When I get my site which uses Solr/Lucene going, is it considered polite to post a small paragraph about it with a link? Dennis Gearon
Re: Schema required?
Solr requires a schema. But Lucene does not! :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Frank Calfo fca...@aravo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Mon, October 18, 2010 5:25:27 PM Subject: Schema required? We need to index documents where the fields in the document can change frequently. It appears that we would need to update our Solr schema definition before we can reindex using new fields. Is there any way to make the Solr schema optional? --frank
Re: 'Advertising' a site
Cool, thanks! Dennis Gearon --- On Mon, 10/18/10, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: From: Otis Gospodnetic otis_gospodne...@yahoo.com Subject: Re: 'Advertising' a site To: solr-user@lucene.apache.org Date: Monday, October 18, 2010, 9:28 PM Hi Dennis, There is a PoweredBy page on the Wiki that's good for that. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, October 18, 2010 11:35:09 PM Subject: 'Advertising' a site When I get my site which uses Solr/Lucene going, is it considered polite to post a small paragraph about it with a link? Dennis Gearon
Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch
Hi Israel, You can use this: http://search-lucene.com/?q=boilerpipefc_project=Tika Not sure if it's built into Nutch, though... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Israel Ekpo israele...@gmail.com To: solr-user@lucene.apache.org; u...@nutch.apache.org Sent: Mon, October 18, 2010 9:01:50 PM Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch Hi All, I am indexing a web application with approximately 9500 distinct URLs and their contents using Nutch and Solr. I use Nutch to fetch the URLs and links and to crawl the entire web application, extracting the content of all pages. Then I run the solrindex command to send the content to Solr. The problem I have now is that the first 1000 or so characters of some pages and the last 400 characters of the pages are showing up in the search results. These are the contents of the common header and footer used throughout the site. The only workaround I have now is to index everything and then go through each document one at a time, removing the first 1000 characters if the Levenshtein distance between the first 1000 characters of the page and the common header is less than a certain value. The same applies to the footer content common to all pages. Is there a way to ignore certain "stop phrases", so to speak, in the Nutch configuration, based on Levenshtein or Jaro-Winkler distance, so that the parts of the fetched data matching those stop phrases are not parsed? Any useful pointers would be highly appreciated. Thanks in advance.
count(*) equivilent in Solr/Lucene
Is there something in Solr/Lucene that could give me the equivalent of: SELECT COUNT(*) WHERE date_column1 > :start_date AND date_column2 > :end_date; providing I take into account deleted documents, of course (i.e., do some sort of averaging or tracking function over time). Dennis Gearon
Re: 'Advertising' a site
: There is a PoweredBy page on the Wiki that's good for that. Even better is a post to the list telling folks about your use case, index size, hardware, etc. A lot of new users find that information really helpful for comparison. -Hoss
Re: count(*) equivilent in Solr/Lucene
: : SELECT : COUNT(*) : WHERE : date_column1 > :start_date AND : date_column2 > :end_date; q=*:*&fq=column1:[start TO *]&fq=column2:[end TO *]&rows=0 ...every result includes a total count. -Hoss