Code for getting distinct facet counts across shards(Distributed Process).
In Solr 1.4.1, for getting the distinct facet terms count across shards, the piece of code added for the distributed process is as follows:

Class: FacetComponent.java
Function: finishStage(ResponseBuilder rb)

for (DistribFieldFacet dff : fi.facets.values()) {
  // just after this line of code
  else { // TODO: log error or throw exception?
    counts = dff.getLexSorted();
    int namedistint = 0;
    namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
        FacetParams.FACET_NAMEDISTINCT, 0);
    if (namedistint == 0)
      facet_fields.add(dff.getKey(), fieldCounts);
    if (namedistint == 1)
      facet_fields.add(numfacetTerms, counts.length);
    if (namedistint == 2) {
      NamedList resCount = new NamedList();
      resCount.add(numfacetTerms, counts.length);
      resCount.add(counts, fieldCounts);
      facet_fields.add(dff.getKey(), resCount);
    }

Is this flow correct? I have worked with a few test cases and it has worked fine, but I want to know if there are any bugs that can creep in here. (My concern is that this piece of code should not affect the rest of the logic.)

*Code flow with comments for reference:*

Function: finishStage(ResponseBuilder rb)

// in this for loop
for (DistribFieldFacet dff : fi.facets.values()) {
  // just after this line of code
  else { // TODO: log error or throw exception?
    counts = dff.getLexSorted();
    int namedistint = 0; // default
    // get the value of facet.numTerms from the input query
    namedistint = rb.req.getParams().getFieldInt(dff.getKey().toString(),
        FacetParams.FACET_NAMEDISTINCT, 0);
    // based on the value of facet.numTerms == 0, 1 or 2, if conditions:
    // get only facet field counts
    if (namedistint == 0) {
      facet_fields.add(dff.getKey(), fieldCounts);
    }
    // get only the distinct facet term count
    if (namedistint == 1) {
      facet_fields.add(numfacetTerms, counts.length);
    }
    // get facet field count and distinct term count.
    if (namedistint == 2) {
      NamedList resCount = new NamedList();
      resCount.add(numfacetTerms, counts.length);
      resCount.add(counts, fieldCounts);
      facet_fields.add(dff.getKey(), resCount);
    }

Regards,
Rajani

On Fri, May 27, 2011 at 1:14 PM, rajini maski rajinima...@gmail.com wrote:
No such issues. Successfully integrated with 1.4.1 and it works across a single index. For the f.2.facet.numFacetTerms=1 parameter it will give the distinct count result; for f.2.facet.numFacetTerms=2 it will give the counts as well as the results for facets. But this is working only across a single index, not the distributed process. The conditions you have added in SimpleFacets.java (the namedistinct count == 0, 1 and 2 conditions) - should they be added in the distributed process function to enable it to work across shards? Rajani

On Fri, May 27, 2011 at 12:33 PM, Bill Bell billnb...@gmail.com wrote:
I am pretty sure it does not yet support distributed shards... but the patch was written for 4.0, so there might be issues with running it on 1.4.1.

On 5/26/11 11:08 PM, rajini maski rajinima...@gmail.com wrote:
The patch SOLR-2242 for getting the count of distinct facet terms doesn't work for distributedProcess (https://issues.apache.org/jira/browse/SOLR-2242). The error log says:

HTTP ERROR 500 Problem accessing /solr/select.
Reason: For input string: numFacetTerms

java.lang.NumberFormatException: For input string: numFacetTerms
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:403)
    at java.lang.Long.parseLong(Long.java:461)
    at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:331)
    at org.apache.solr.schema.TrieField.toInternal(TrieField.java:344)
    at org.apache.solr.handler.component.FacetComponent$DistribFieldFacet.add(FacetComponent.java:619)
    at org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:265)
    at org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:235)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at
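Stripped of the Solr plumbing, the branching Rajani describes reduces to: given a parameter value of 0, 1, or 2, return the plain per-term counts, only the number of distinct terms, or both. A minimal self-contained sketch of that decision logic, not the actual Solr API (SimpleEntry stands in for Solr's NamedList, and the class and key names are illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FacetResponseSketch {

    /** Builds the facet response section for one field, mirroring the
     *  namedistint == 0 / 1 / 2 branches from the patched finishStage(). */
    static List<Map.Entry<String, Object>> build(String key, int[] fieldCounts, int namedistint) {
        List<Map.Entry<String, Object>> out = new ArrayList<>();
        int distinct = fieldCounts.length;          // counts.length in the patch
        if (namedistint == 0) {                     // only the per-term counts
            out.add(new SimpleEntry<>(key, fieldCounts));
        } else if (namedistint == 1) {              // only the distinct-term count
            out.add(new SimpleEntry<>("numFacetTerms", distinct));
        } else if (namedistint == 2) {              // both, nested under the field key
            List<Map.Entry<String, Object>> res = new ArrayList<>();
            res.add(new SimpleEntry<>("numFacetTerms", distinct));
            res.add(new SimpleEntry<>("counts", fieldCounts));
            out.add(new SimpleEntry<>(key, res));
        }
        return out;
    }

    public static void main(String[] args) {
        int[] counts = {5, 3, 1};                   // three distinct terms
        System.out.println(build("city", counts, 1).get(0)); // numFacetTerms=3
    }
}
```

Whatever shape the real patch takes, isolating the branch selection like this makes it easy to unit-test the three response shapes without a running shard.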
Re: Displaying highlights in formatted HTML document
--- On Thu, 6/9/11, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: From: Bryan Loofbourrow bloofbour...@knowledgemosaic.com Subject: Displaying highlights in formatted HTML document To: solr-user@lucene.apache.org Date: Thursday, June 9, 2011, 2:14 AM Here is my use case: I have a large number of HTML documents, sizes in the 0.5K-50M range, most around, say, 10M. I want to be able to present the user with the formatted HTML document, with the hits tagged, so that he may iterate through them, and see them in the context of the document, with the document looking as it would be presented by a browser; that is, fully formatted, with its tables and italics and font sizes and all. This is something that the user would explicitly request from within a set of search results, not something I’d expect to have returned from an initial search – the initial search merely returns the snippets around the hits. But if the user wants to dive into one of the returned results and see them in context, I need to be able to go get that. We are currently solving this problem by using an entirely separate search engine (dtSearch), which performs the tagging of the hits in the HTML just fine. But the solution is unsatisfactory because there are Solr searches that dtSearch’s capabilities cannot reasonably match. Can anyone suggest a good way to use Solr/Lucene for this instead? I’m thinking a separate core for this purpose might make sense, so as not to burden the primary search core with the full contents of the document. But after that, I’m stuck. How can I get Solr to express the highlighting in the context of the formatted HTML document? If Solr does not do this currently, and anyone can suggest ways to add the feature, any tips on how this might best be incorporated into the implementation would be welcome. 
I am doing the same thing (Solr trunk) using the following field type:

<fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

In your separate core - which is queried when the user wants to dive into one of the returned results - feed your HTML files into this field. You may want to increase the max analyzed chars too:

<int name="hl.maxAnalyzedChars">147483647</int>
wrong index version of solr3.2?
After switching to Solr 3.2 and building a new index from scratch I ran check_index, which reports:

Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1]

Why do I get FORMAT_3_1 and Lucene 3.1 - is anything wrong with my index?

From my schema.xml:
<schema name="my_solr320_schema" version="1.3">

From my solrconfig.xml:
<luceneMatchVersion>LUCENE_32</luceneMatchVersion>

Regards, Bernd
Re: Multiple Values not getting Indexed
Pawan, just separating multiple values by a comma does not make them multi-valued in Solr-speak. But if you're already using DIH, you may try the RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer) to 'splitBy' the field and get the expected field values. Regards, Stefan

On Thu, Jun 9, 2011 at 6:14 AM, Pawan Darira pawan.dar...@gmail.com wrote:
Hi, I am trying to index 2 fields with multiple values, but it is only putting 1 value for each, ignoring the rest of the values after the comma (,). I am fetching the query through DIH. It works fine if I have only 1 value in each of the 2 fields. E.g.:

Field1 - 150,178,461,151,310,306,305,179,137,162
Field2 - Chandigarh,Gurgaon,New Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others

*Schema.xml*
<field name="city_type" type="text" indexed="true" stored="true"/>
<field name="city_desc" type="true" indexed="true" stored="true"/>

p.s. I tried multiValued="true" but of no help. -- Thanks, Pawan Darira
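For reference, a RegexTransformer setup in DIH's data-config.xml looks roughly like the sketch below. The entity name, SQL query, and column names are assumptions modeled on Pawan's two fields, not his actual config; the corresponding schema fields must also be declared multiValued="true" for the split values to be kept.

```xml
<!-- Sketch: split comma-separated columns into multiple field values -->
<entity name="city" transformer="RegexTransformer"
        query="SELECT city_type, city_desc FROM cities">
  <!-- splitBy applies a regex and emits one value per piece -->
  <field column="city_type" splitBy=","/>
  <field column="city_desc" splitBy=","/>
</entity>
```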
Re: Code for getting distinct facet counts across shards(Distributed Process).
I have coded and tested this and it appears to work. Are you having any problems?

On 6/9/11 12:35 AM, rajini maski rajinima...@gmail.com wrote:
In solr 1.4.1, for getting distinct facet terms count across shards, The piece of code added for getting count of distinct facet terms across distributed process is as followed [...]
Re: Multiple Values not getting Indexed
Is there a way to splitBy and then trim the field after splitting? I know I can do it with JavaScript in DIH, but how about using the regex parser?

On 6/9/11 1:18 AM, Stefan Matheis matheis.ste...@googlemail.com wrote:
Pawan, just separating multiple values by comma does not make them multi-value in solr-speak. But if you're already using DIH, you may try the http://wiki.apache.org/solr/DataImportHandler#RegexTransformer to 'splitBy' the field and get the expected field-values [...]
Re: Multiple Values not getting Indexed
You have to take the input and splitBy something like "," to get it into an array, and repost it back to Solr... I believe others have suggested that?

On 6/8/11 10:14 PM, Pawan Darira pawan.dar...@gmail.com wrote:
Hi, I am trying to index 2 fields with multiple values, but it is only putting 1 value for each, ignoring the rest of the values after the comma [...]
Solr monitoring: Newrelic
Hello, I found this tool to monitor Solr queries, cache etc.: http://newrelic.com/

I have some problems with the installation of it. I get the following errors:

Could not locate a Tomcat, Jetty or JBoss instance in /var/www/sites/royr
Try re-running the install command from AppServerRootDirectory/newrelic.
If that doesn't work, locate and edit the start script manually.
Generated New Relic configuration file /var/www/sites/royr/newrelic/newrelic.yml
* Install incomplete

Does anybody have experience with New Relic in combination with Solr?
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-monitoring-Newrelic-tp3042889p3042889.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr monitoring: Newrelic
You need to install the newrelic folder under the Tomcat folder, in case the app server is Tomcat. Then from the command line you need to run the install command given on the New Relic site from your newrelic folder. Once this is done, restart the app server and you should be able to see a log file created under the newrelic folder, if all went well. Regards, Sujatha

On Thu, Jun 9, 2011 at 1:27 PM, roySolr royrutten1...@gmail.com wrote:
Hello, I found this tool to monitor solr querys, cache etc. http://newrelic.com/ [...]
Re: Solr monitoring: Newrelic
I use Jetty; it's standard in the Solr package. Where can I find the jetty folder? Then I can start this command: java -jar newrelic.jar install
Re: Displaying highlights in formatted HTML document
Hi Bryan, how do you index your HTML files? I mean, do you create fields for the different parts of your document (for different stop word lists, stemming, etc.)? With DIH, or SolrJ, or something else? iorixxx, could you please explain your solution a bit more, because I don't see how it could give exact highlighting, given the different analysis applied to each field. I developed this week a new highlighter module which transfers the field highlighting back to the original document (XML in my case); I use payloads to store the offsets and lengths of fields in the index. This way, I use the right analyzers to do the highlighting correctly, and then I replace the different field parts in the document with the highlighted parts. It is not finished yet, but I already have some good results. This is a client request too. Let me know if iorixxx's solution is not enough for your particular use case. Ludovic.
- Jouve France.
Re: Solr monitoring: Newrelic
There is no jetty folder in the standard package, but the Jetty war file is under the example/lib folder, so this is where you need to put the newrelic folder, I guess. Regards, Sujatha

On Thu, Jun 9, 2011 at 2:03 PM, roySolr royrutten1...@gmail.com wrote:
I use Jetty, it's standard in the solr package. Where can i find the jetty folder? [...]
Re: Solr monitoring: Newrelic
Yes, that's the problem. There is no jetty folder. I have tried the example/lib directory; it's not working. There is no Jetty war file, only jetty-***.jar files. Same error: could not locate a Jetty instance.
Re: Displaying highlights in formatted HTML document
iorixxx, could you please explain a bit more your solution, because I don't see how your solution could give an exact highlighting, I mean with the different fields analysis for each fields.

It does not work for your use case (e.g. different synonyms applied to different parts of the HTML/XML, etc.).
ExtractingRequestHandler - renaming tika generated fields
Hi, I post a PDF from a CMS client, which has metadata about the document. One of those metadata fields is the title. I trust the title from the CMS more than the title extracted from the PDF, but I cannot find a way to both send literal.title=CMS-Title and change the name of the title field generated by Tika/SolrCell. If I do fmap.title=tika_title then my literal.title also changes name. Any ideas? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
Re: how to Index and Search non-Eglish Text in solr
Can I specify multiple languages in the filter tags in schema.xml? Like below:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
  </analyzer>
</fieldType>

On 8 June 2011 18:47, Erick Erickson erickerick...@gmail.com wrote:
This page is a handy reference for individual languages: http://wiki.apache.org/solr/LanguageAnalysis But the usual approach, especially for Chinese/Japanese/Korean (CJK), is to index the content in different fields with language-specific analyzers and then spread your search across the language-specific fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords in particular give surprising results if you put words from different languages in the same field. Best, Erick

On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq shariqn...@gmail.com wrote:
Hi, I had set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in English, but my requirement extends to indexing news in other languages too.
This is how my schema looks:

<field name="news" type="text" indexed="true" stored="false" required="false"/>

And the text field type in schema.xml looks like:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

My problem is: now I want to index news articles in other languages too, e.g. Chinese and Japanese. How can I modify my text field so that I can index news in other languages and make it searchable?

Thanks, Shariq -- Thanks and Regards, Mohammad Shariq
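Erick's per-language-fields suggestion, made concrete: a sketch of what the schema additions could look like. The field and type names (news_en, news_cjk, text_cjk) are illustrative assumptions, not from the thread; CJKTokenizerFactory ships with Solr 1.4.

```xml
<!-- One field per language, each with language-appropriate analysis -->
<field name="news_en" type="text" indexed="true" stored="false"/>
<field name="news_cjk" type="text_cjk" indexed="true" stored="false"/>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CJKTokenizer emits overlapping bigrams suited to Chinese/Japanese text -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

At query time, search across news_en and news_cjk (e.g. via the dismax qf parameter) rather than mixing languages in one field.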
Re: Solr monitoring: Newrelic
Try the RPM support accessed from the account support page, giving all details; they are very helpful. Regards, Sujatha

On Thu, Jun 9, 2011 at 2:33 PM, roySolr royrutten1...@gmail.com wrote:
Yes, that's the problem. There is no jetty folder. [...]
Re: AW: How to deal with many files using solr external file field
Hi, as I'm also involved in this issue (on Sven's side) I created a patch that replaces the float array with a map that stores score by doc, so it contains only as many entries as the external scoring file contains lines, and no more. I created an issue for this: https://issues.apache.org/jira/browse/SOLR-2583 It would be great if someone could have a look at it and comment. Thanks for your feedback, cheers, Martin

On 06/08/2011 12:22 PM, Bohnsack, Sven wrote:
Hi, I could not provide a stack trace, and IMHO it wouldn't provide much useful information. But we've made good progress in the analysis. We took a deeper look at what happens when an external-file-field request is sent to Solr:

* Solr looks to see whether there is a file for the requested query, e.g. "trousers".
* If so, Solr loads the trousers file and generates a HashMap entry consisting of a FileFloatSource object and a float array with the size of the number of documents in the Solr index. Every document matched by the query gains the score value provided in the external score file. For every(!) other document Solr writes a zero in that float array.
* If Solr does not find a file for the query request, it still generates a HashMap entry with score zero for every document.

In our case we have about 8.5 million documents in our index, and one of those arrays occupies about 34MB of heap space. Having e.g. 100 different queries and using external file field for sorting the result, Solr occupies about 3.4GB of heap space. The problem might be the use of WeakHashMap [1], which prevents the garbage collector from cleaning up unused keys. What do you think could be a possible solution for this whole problem? (Except "don't use external file fields" ;) Regards, Sven

[1]: A hashtable-based Map implementation with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use.
More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded, its entry is effectively removed from the map, so this class behaves somewhat differently from other Map implementations.

-----Original Message-----
From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On behalf of Simon Rosenthal
Sent: Wednesday, 8 June 2011 03:56
To: solr-user@lucene.apache.org
Subject: Re: How to deal with many files using solr external file field

Can you provide a stack trace for the OOM exception?

On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven sven.bohns...@shopping24.de wrote:
Hi all, we're using Solr 1.4 and external file field ([1]) for sorting our search results. We have about 40,000 terms for which we use this sorting option. Currently we're running into massive OutOfMemory problems and we're not quite sure what's the matter. It seems that the garbage collector stops working or some processes are going wild; however, Solr starts to allocate more and more RAM until we experience this OutOfMemory exception. We noticed the following: for some terms one could see in the Solr log that there appear some java.io.FileNotFoundExceptions, when Solr tries to load an external file for a term for which there is no such file, e.g. Solr tries to load the external score file for "trousers" but there is none in the /solr/data folder. Question: is it possible that those exceptions are responsible for the OutOfMemory problem, or could it be due to the large(?) number of 40k terms for which we want to sort the result via external file field? I'm looking forward to your answers, suggestions and ideas :) Regards, Sven

[1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html

-- Martin Grotzke http://twitter.com/martin_grotzke
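Sven's heap figures check out arithmetically: a float[] sized to the index costs 4 bytes per document, so 8.5 million documents is about 34MB per cached query and about 3.4GB for 100 cached queries. A minimal sketch of that arithmetic (class and method names are illustrative; the document count is taken from the mail, the query count from his example):

```java
public class ExternalFileFieldHeap {

    /** Bytes for one cached float[] sized to the index: 4 bytes per document. */
    static long perQueryBytes(long numDocs) {
        return numDocs * 4L;
    }

    public static void main(String[] args) {
        long numDocs = 8500000L;                 // index size from Sven's mail
        long perQuery = perQueryBytes(numDocs);  // one array per cached query
        System.out.printf("%.0f MB per cached query%n", perQuery / 1e6);
        System.out.printf("%.1f GB for 100 queries%n", 100 * perQuery / 1e9);
    }
}
```

This is why Martin's patch, which stores only the scores actually present in the external file, caps memory by file size rather than by index size.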
Re: Tokenising based on known words?
We've played with HyphenationCompoundWordTokenFilterFactory. It works better than maintaining a word dictionary to split on (although we ended up not using it, for reasons I can't recall). See http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html

On 9 June 2011 06:42, Gora Mohanty g...@mimirtech.com wrote:
On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote:
Not sure if this is possible, but figured I would ask the question. Basically, we have some users who do some pretty ridiculous things ;o) Rather than writing "red jacket", they write "redjacket", which obviously returns no results. [...]
Have you tried using synonyms? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory It seems like they should fit your use case. Regards, Gora
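For reference, wiring that filter into an analyzer chain looks roughly like the sketch below. The file names are illustrative assumptions: the factory needs a TeX-style hyphenation grammar, and optionally a dictionary of known words to constrain which subwords are emitted (so "redjacket" can split into "red" and "jacket" if both are in the dictionary).

```xml
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyphenation.xml"
        dictionary="words.txt"
        minWordSize="5"
        onlyLongestMatch="true"/>
```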
Boost or sort a query with range values
Hello, I try to boost a query with a range of values but I can't find the correct syntax. This is OK: bq=myfield:-1^5 but I want to do something like this: bq=myfield:-1 to 1^5 (boost values from -1 to 1). Thanks
Re: Boost or sort a query with range values
[* TO *]^5

On 9 June 2011 11:31, jlefebvre jlefeb...@allocine.fr wrote:
Hello I try to boost a query with a range values but I can't find the correct syntax [...]
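Applied to the -1-to-1 example from the question, that bracketed range syntax would look like this (assuming myfield is a numeric field):

```
bq=myfield:[-1 TO 1]^5
```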
Re: Boost or sort a query with range values
Thanks, it's OK. Another question: how do I do a condition in bq? Something like bq=iif(myfield1 = 0 AND myfield2 = 1; 1; 0). Thanks
Re: Boost or sort a query with range values
Check the new if() function in trunk, SOLR-2136. You could then use it in bf= or boost=. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.05, jlefebvre wrote:
thanks it's ok another question how to do a condition in bq ? [...]
Re: Boost or sort a query with range values
Btw. your example is a simple boolean query, and this will also work: bq=(myfield1:0 AND myfield2:1)^100.0 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com

On 9. juni 2011, at 13.31, Jan Høydahl wrote:
Check the new if() function in Trunk, SOLR-2136. You could then use it in bf= or boost= [...]
Re: London open source search social - 13th June
Just a quick reminder that we're meeting on Monday. Come along if you're around. On 1 June 2011 13:27, Richard Marr richard.m...@gmail.com wrote: Hi guys, Just to let you know we're meeting up to talk all-things-search on Monday 13th June. There's usually a good mix of backgrounds and experience levels so if you're free and in the London area then it'd be good to see you there. Details: 7pm - The Elgin - 96 Ladbrooke Grove http://www.meetup.com/london-search-social/events/20387881/ Greetings search geeks! We've booked the next meetup for the 13th June. As usual, the plan is to meet up and geek out over a friendly beer. I know my co-organiser René has been working on some interesting search projects, and I've recently left Empora to work on my own project so by June I should hopefully have some war stories about using @elasticsearch in production. The format is completely open though so please bring your own topics if you've got them. Hope to see you there! -- Richard Marr
[Mahout] Integration with Solr
Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
Re: Tokenising based on known words?
Synonyms really wouldn't work for every possible combination of words in our index. Thanks for the idea though. Mark On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote: On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote: Not sure if this is possible, but figured I would ask the question. Basically, we have some users who do some pretty ridiculous things ;o) Rather than writing red jacket, they write redjacket, which obviously returns no results. [...] Have you tried using synonyms, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory It seems like they should fit your use case. Regards, Gora -- E: mark.man...@gmail.com T: http://www.twitter.com/neurotic W: www.compoundtheory.com cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia http://www.cfobjective.com.au Hands-on ColdFusion ORM Training www.ColdFusionOrmTraining.com
Edismax sorting help
Hi, everyone. I have these fields: text fields: name, title, text; boolean field: isflag (true / false); int field: popularity (0 to 9). Now I do this query: defType=edismax start=0 rows=20 fl=id,name q=lg optimus fq= qf=name^3 title text^0.3 sort=score desc pf=name bf=isflag sqrt(popularity) mm=100% debugQuery=on If I do a query like Samsung I want to see the most relevant results with isflag:true and higher popularity first, but if I do a query like Nokia 6500 and the document has isflag:false, it should still rank higher because of the exact match. I tried different combinations, but didn't find one that suits me; I only got isflag/popularity sorting or isflag/relevancy sorting working.
Re: tika integration exception and other related queries
Naveen, Not sure our requirement matches yours, but one of the things we index is a comment item that can have one or more files attached to it. To index the whole thing as a single Solr document we create a zipfile containing a file with the comment details in it and any additional attached files. This is submitted to Solr as a TEXT field in an XML doc, along with other meta-data fields from the comment. In our schema the TEXT field is indexed but not stored, so when we search and get a match back it doesn't contain all of the contents from the attached files etc., only the stored fields in our schema. Admittedly, the user can therefore get back a comment match with no indication as to WHERE the match occurred (ie. was it in the meta-data or the contents of the attached files), but at the moment we're only interested in getting appropriate matches, not explaining where the match is. Hope that helps. Kind regards, Gary. On 09/06/2011 03:00, Naveen Gupta wrote: Hi Gary, It started working. Though I did not test Zip files, it is working fine for rar files. The only thing I wanted to do is index the metadata (text mapped to content), not store the data. Also, in the search results I want to filter the content out, and that is working fine now. I don't want to show the extracted content to the end user, since the way the information is extracted is not very helpful; although we can apply a few analyzers and filters to remove the unnecessary tags, the information would still not be of much use. Looking for your opinion: what did you do to filter out the content, or are you showing the extracted content to the end user? And if we do show the text to the end user, how can I limit the number of characters returned with the search results? Is there any feature for this, along the lines of a snippet?
Thanks Naveen On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor g...@inovem.com wrote: Naveen, For indexing Zip files with Tika, take a look at the following thread : http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html I got it to work with the 3.1 source and a couple of patches. Hope this helps. Regards, Gary. On 08/06/2011 04:12, Naveen Gupta wrote: Hi Can somebody answer this ... 3. can somebody tell me an idea how to do indexing for a zip file ? 1. while sending docx, we are getting the following error.
Re: [Mahout] Integration with Solr
I don't know much of it, but I know Grant Ingersoll posted about that: http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/ On Thu, Jun 9, 2011 at 9:24 AM, Adam Estrada estrada.adam.gro...@gmail.comwrote: Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
RE: Tokenising based on known words?
Hi Mark, Are you familiar with shingles aka token n-grams? http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html Use the empty string for the tokenSeparator to get wordstogether style tokens in your index. I think you'll want to apply this filter only at index-time, since the users will supply the shingles all by themselves :). Steve -Original Message- From: Mark Mandel [mailto:mark.man...@gmail.com] Sent: Thursday, June 09, 2011 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Tokenising based on known words? Synonyms really wouldn't work for every possible combination of words in our index. Thanks for the idea though. Mark On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote: On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote: Not sure if this is possible, but figured I would ask the question. Basically, we have some users who do some pretty ridiculous things ;o) Rather than writing red jacket, they write redjacket, which obviously returns no results. [...] Have you tried using synonyms, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory It seems like they should fit your use case. Regards, Gora -- E: mark.man...@gmail.com T: http://www.twitter.com/neurotic W: www.compoundtheory.com cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia http://www.cfobjective.com.au Hands-on ColdFusion ORM Training www.ColdFusionOrmTraining.com
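A minimal sketch of the index-time shingle setup Steve describes; the field type name is hypothetical, and the empty tokenSeparator is what glues adjacent words into wordstogether-style tokens:

```
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <!-- index-time only: users type the "shingles" themselves at query time -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, red jacket indexes the tokens red, jacket, and redjacket, so a user query for redjacket still matches.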
how can I return function results in my query?
I want to be able to run a query like idf(text, 'term') and have that data returned with my search results. I've searched the docs, but I'm unable to find how to do it. Is this possible, and how can I do it?
Re: how can I return function results in my query?
I want to be able to run a query like idf(text, 'term') and have that data returned with my search results. I've searched the docs,but I'm unable to find how to do it. Is this possible and how can I do that ? http://wiki.apache.org/solr/FunctionQuery#idf
Re: how to Index and Search non-English Text in solr
No, you'd have to create multiple fieldTypes, one for each language Best Erick On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq shariqn...@gmail.com wrote: Can I specify multiple language in filter tag in schema.xml ??? like below fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr. WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.SnowballPorterFilterFactory language=Dutch / filter class=solr.SnowballPorterFilterFactory language=English / filter class=solr.SnowballPorterFilterFactory language=Chinese / tokenizer class=solr.WhitespaceTokenizerFactory/ tokenizer class=solr.CJKTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/filter class=solr.SnowballPorterFilterFactory language=Hungarian / On 8 June 2011 18:47, Erick Erickson erickerick...@gmail.com wrote: This page is a handy reference for individual languages... http://wiki.apache.org/solr/LanguageAnalysis But the usual approach, especially for Chinese/Japanese/Korean (CJK) is to index the content in different fields with language-specific analyzers then spread your search across the language-specific fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords particularly give surprising results if you put words from different languages in the same field. Best Erick On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq shariqn...@gmail.com wrote: Hi, I had setup solr( solr-1.4 on Ubuntu 10.10) for indexing news articles in English, but my requirement extend to index the news of other languages too. 
This is how my schema looks : field name=news type=text indexed=true stored=false required=false/ And the text Field in schema.xml looks like : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English protected=protwords.txt/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English protected=protwords.txt/ /analyzer /fieldType My Problem is : Now I want to index the news articles in other languages too, e.g. Chinese, Japanese. How can I modify my text field so that I can index the news in other languages too and make it searchable? Thanks Shariq -- Thanks and Regards Mohammad Shariq
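A sketch of the per-language-field approach Erick describes, rather than stacking several stemmers in one analyzer. The field names (news_en, news_zh) and the text_cjk type are hypothetical, assuming the CJKTokenizerFactory already mentioned in this thread:

```
<field name="news_en" type="text"     indexed="true" stored="false"/>
<field name="news_zh" type="text_cjk" indexed="true" stored="false"/>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

At query time you then spread the search across the language fields, e.g. qf=news_en news_zh with dismax, as Erick suggests.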
Re: Edismax sorting help
2011/6/9 Denis Kuzmenok forward...@ukr.net: Hi, everyone. I have fields: text fields: name, title, text boolean field: isflag (true / false) int field: popularity (0 to 9) Now i do query: defType=edismax start=0 rows=20 fl=id,name q=lg optimus fq= qf=name^3 title text^0.3 sort=score desc pf=name bf=isflag sqrt(popularity) mm=100% debugQuery=on If i do query like Samsung i want to see prior most relevant results with isflag:true and bigger popularity, but if i do query like Nokia 6500 and there is isflag:false, then it should be higher because of exact match. Tried different combinations, but didn't found one that suites me. Just got isflag/popularity sorting working or isflag/relevancy sorting. Multiplicative boosts tend to be more stable... Perhaps try replacing bf=isflag sqrt(popularity) with bq=isflag:true^10 // vary the boost to change how much isflag counts vs the relevancy score of the main query boost=sqrt(popularity) // this will multiply the result by sqrt(popularity)... assumes that every document has a non-zero popularity You could get more creative in trunk where booleans have better support in function queries. -Yonik http://www.lucidimagination.com
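Spelled out as request parameters, Yonik's suggestion applied to the original query would look roughly like this; the boost values are illustrative, not tuned:

```
defType=edismax&start=0&rows=20&fl=id,name
q=lg optimus
qf=name^3 title text^0.3
pf=name
mm=100%
bq=isflag:true^10
boost=sqrt(popularity)
```

bq adds an additive relevancy bump for flagged documents, while boost multiplies the whole score by sqrt(popularity), which assumes every document has a non-zero popularity.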
Re: Solr monitoring: Newrelic
It sounds like roySolr is running embedded Jetty, launching solr using the start.jar If so, then there's no app container where Newrelic can be installed. -- Ken On Jun 9, 2011, at 2:28am, Sujatha Arun wrote: Try the RPM support accessed from the accout support page ,Giving all details ,they are very helpful. Regards Sujatha On Thu, Jun 9, 2011 at 2:33 PM, roySolr royrutten1...@gmail.com wrote: Yes, that's the problem. There is no jetty folder. I have try the example/lib directory, it's not working. There is no jetty war file, only jetty-***.jar files Same error, could not locate a jetty instance. -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
Re: Edismax sorting help
Your solution seems to work fine; not perfect, but much better than mine :) Thanks! If i do query like Samsung i want to see prior most relevant results with isflag:true and bigger popularity, but if i do query like Nokia 6500 and there is isflag:false, then it should be higher because of exact match. Tried different combinations, but didn't found one that suites me. Just got isflag/popularity sorting working or isflag/relevancy sorting. Multiplicative boosts tend to be more stable... Perhaps try replacing bf=isflag sqrt(popularity) with bq=isflag:true^10 // vary the boost to change how much isflag counts vs the relevancy score of the main query boost=sqrt(popularity) // this will multiply the result by sqrt(popularity)... assumes that every document has a non-zero popularity You could get more creative in trunk where booleans have better support in function queries. -Yonik http://www.lucidimagination.com
Re: Does MultiTerm highlighting work with the fastVectorHighlighter?
(11/06/09 4:24), Burton-West, Tom wrote: We are trying to implement highlighting for wildcard (MultiTerm) queries. This seems to work fine with the regular highlighter, but when we try to use the fastVectorHighlighter we don't see any results in the highlighting section of the response. Appended below are the parameters we are using. That is by design in FVH: it supports TermQuery, PhraseQuery, BooleanQuery and DisjunctionMaxQuery, and queries constructed from those. koji -- http://www.rondhuit.com/en/
Re: [Mahout] Integration with Solr
Hello Adam, I've managed to create a small POC of integrating Mahout with Solr for a clustering task, do you want to use it for clustering only or possibly for other purposes/algorithms? More generally speaking, I think it'd be nice if Solr could be extended with a proper API for integrating clustering engines in it so that one can plug and exchange engines flawlessly (just need an Adapter). Regards, Tommaso 2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
Indexing data from multiple datasources
Hello all, I have checked the forums to see if it is possible to create an index from multiple datasources. I have found references to SOLR-1358, but I don't think this fits my scenario. In short, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db for each document. If I run a dataimporthandler after the extract phase, I am afraid that updating the doc in the index by its id will just cause me to overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first then recreating the info). Anyone have any pointers? Is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields? Thanks in advance Greg
[Free Text] Field Tokenizing
All, I am at a bit of a loss here, so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs' worth of free text. What I would really like to do is pull out phrases like Joe's coffee shop rather than the 3 individual words. I have tried the KeywordTokenizerFactory and that does seem to do what I want for the most part, but it is not actually tokenizing anything, so it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typically just view them in Solritas as facets, which may be part of the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases, if they are there, from the data I described earlier? Am I even going about this the right way? I am using today's trunk build of Solr and here is what I have munged together this morning:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks, Adam
Re: [Mahout] Integration with Solr
Thanks for the reply, Tommaso! I would like to see tighter integration like in the way Nutch integrates with Solr. There is a single param that you set which points to the Solr instance. My interest in Mahout is with it's abitlity to handle large data and find frequency, co-location of data, clustering, etc...All the algorithms that are in the core build are great and I am just now wrapping my head around how to use them all. Adam On Thu, Jun 9, 2011 at 10:33 AM, Tommaso Teofili tommaso.teof...@gmail.comwrote: Hello Adam, I've managed to create a small POC of integrating Mahout with Solr for a clustering task, do you want to use it for clustering only or possibly for other purposes/algorithms? More generally speaking, I think it'd be nice if Solr could be extended with a proper API for integrating clustering engines in it so that one can plug and exchange engines flawlessly (just need an Adapter). Regards, Tommaso 2011/6/9 Adam Estrada estrada.adam.gro...@gmail.com Has anyone integrated Mahout with Solr? I know that Carrot2 is part of the core build but the docs say that it's not very good for very large indexes. Anyone have thoughts on this? Thanks, Adam
RE: Does MultiTerm highlighting work with the fastVectorHighlighter?
Hi Koji, Thank you for your reply. It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and DisjunctionMaxQuery and Query constructed by those queries. Sorry, I'm not sure I understand. Are you saying that FVH supports MultiTerm highlighting? Tom
Re: ExtractingRequestHandler - renaming tika generated fields
One solution to this problem is to change the order of field operations (http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations) to first do fmap.*= processing, then add the fields from literal.*=. Why would anyone want to rename a field they just have explicitly named anyway? Another solution that would work for me is an option to let ALL tika generated fields be prefixed, e.g. tprefix=tika_. But I need Extracting handler to output to fields which do not exist in schema.xml. This is because later in the UpdateChain I do field choosing and renaming in another UpdateProcessor, so the field names coming from ExtractingHandler are only temporary and will not be sent to Solr. Thus, an option to skip the schema check would be useful, perhaps in the form of a whitelist for uprefix uprefix.whitelist=fielda,other-non-existing-field, causing uprefix not to rename those. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 9. juni 2011, at 11.26, Jan Høydahl wrote: Hi, I post a PDF from a CMS client, which has metadata about the document. One of those metadata is the title. I trust the title of the CMS more than the title extracted from the PDF, but I cannot find a way to both send literal.title=CMS-Title as well as changing the name of the title field generated by Tika/SolrCell. If I do fmap.title=tika_title then my literal.title also also changes name. Any ideas? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
Re: Does MultiTerm highlighting work with the fastVectorHighlighter?
(11/06/10 0:14), Burton-West, Tom wrote: Hi Koji, Thank you for your reply. It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery and DisjunctionMaxQuery and Query constructed by those queries. Sorry, I'm not sure I understand. Are you saying that FVH supports MultiTerm highlighting? Tom, I'm sorry but FVH doesn't cover MultiTermQuery. koji -- http://www.rondhuit.com/en/
Re: Indexing data from multiple datasources
Hmmm, when you say you use Tika, are you using some custom Java code? Because if you are, the best thing to do is query your database at that point and add whatever information you need to the document. If you're using DIH to do the crawl, consider implementing a Transformer to do the database querying and modify the document as necessary. This is pretty simple to do; we can chat a bit more depending on whether either approach makes sense. Best Erick On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote: Hello all, I have checked the forums to see if it is possible to create and index from multiple datasources. I have found references to SOLR-1358, but I don't think this fits my scenario. In all, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db in the index for each document. If I run a dataimporthandler after the extract phase I am afraid that by updating the doc in the index by its id will just cause that I overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first then recreating the info). Anyone have any pointers, is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields? Thanks in advance Greg
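As a sketch of the Transformer idea Erick mentions. This is not the real DIH API: a deployable version would extend org.apache.solr.handler.dataimport.Transformer and run a JDBC query inside transformRow; the class name, field names, and in-memory lookup below are all hypothetical stand-ins:

```java
import java.util.Map;

/**
 * Sketch of a DIH-style transformer that merges DB metadata (category,
 * type, ...) into the row built for a document, keyed on the doc id.
 * The in-memory map stands in for a JDBC lookup against the database.
 */
public class DbMetadataTransformer {

    private final Map<String, Map<String, Object>> dbMetadata;

    public DbMetadataTransformer(Map<String, Map<String, Object>> dbMetadata) {
        this.dbMetadata = dbMetadata;
    }

    /** Adds the DB columns to the row; existing (Tika-extracted) fields win. */
    public Map<String, Object> transformRow(Map<String, Object> row) {
        Map<String, Object> extra = dbMetadata.get(String.valueOf(row.get("id")));
        if (extra != null) {
            for (Map.Entry<String, Object> e : extra.entrySet()) {
                // putIfAbsent so the extract-handler fields are not clobbered
                row.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return row;
    }
}
```

In a DIH config you would reference such a class via the transformer attribute on the entity (e.g. transformer="com.example.DbMetadataTransformer", a hypothetical package name).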
Re: [Free Text] Field Tokenizing
The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what #you# think are phrases, since it'll be unique to your situation. If you have a list of phrases you care about, you could substitute a single token for the phrases you care about... But the overriding question is what determines a phrase you're interested in? Is it a list or is there some heuristic you want to apply? Or could you just recognize them at query time and make them into a literal phrase (i.e. with quotation marks)? Best Erick On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs worth of free text. What I would really like to do is pull out phrases like Joe's coffee shop rather than the 3 individual words. I have tried the KeywordTokenizerFactory and that does seem to do what I want it to do but it is not actually tokenizing anything so it does what I want it to for the most part but it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typically just view them in Solritas as facets which may be the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases if they are there from the data I described earlier. Am I even going about this the right way? I am using today's trunk build of Solr and here is what I have munged together this morning.
fieldType name=text_ws class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer charFilter class=solr.HTMLStripCharFilterFactory/ charFilter class=solr.MappingCharFilterFactory mapping=mapping-ISOLatin1Accent.txt/ tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.ShingleFilterFactory maxShingleSize=4 outputUnigrams=true outputUnigramIfNoNgram=false/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ filter class=solr.ASCIIFoldingFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ filter class=solr.TrimFilterFactory/ /analyzer /fieldType Thanks, Adam
Re: [Free Text] Field Tokenizing
Erick, I totally understand that BUT the keyword tokenizer factory does a really good job extracting phrases (or what look like phrases from) from my data. I don't know why exactly but it does do it. I am going to continue working through it to see if I can't figure it out ;-) Adam On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson erickerick...@gmail.comwrote: The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what #you# think are phrases, since it'll be unique to your situation. If you have a list of phrases you care about, you could substitute a single token for the phrases you care about... But the overriding question is what determines a phrase you're interested in? Is it a list or is there some heuristic you want to apply? Or could you just recognize them at query time and make them into a literal phrase (i.e. with quotationmarks)? Best Erick On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs worth of free text. What I would really like to do is pull out phrases like Joe's coffee shop rather than the 3 individual words. I have tried the KeywordTokenizerFactory and that does seem to do what I want it to do but it is not actually tokenizing anything so it does what I want it to for the most part but it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typlically just view them in Solitas as facets which may be the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases if they are there from the data I described earlier. Am I even going about this the right way? 
I am using today's trunk build of Solr and here is what I have munged together this morning. fieldType name=text_ws class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer charFilter class=solr.HTMLStripCharFilterFactory/ charFilter class=solr.MappingCharFilterFactory mapping=mapping-ISOLatin1Accent.txt/ tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.ShingleFilterFactory maxShingleSize=4 outputUnigrams=true outputUnigramIfNoNgram=false/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ filter class=solr.ASCIIFoldingFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ filter class=solr.TrimFilterFactory/ /analyzer /fieldType Thanks, Adam
Re: [Free Text] Field Tokenizing
The KeywordTokenizer doesn't do anything to break up the input stream, it just treats the whole input to the field as a single token. So I don't think you'll be able to extract anything starting with that tokenizer. Look at the admin/analysis page to see a step-by-step breakdown of what your analyzer chain does. Be sure to check the verbose checkbox Best Erick On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Erick, I totally understand that BUT the keyword tokenizer factory does a really good job extracting phrases (or what look like phrases from) from my data. I don't know why exactly but it does do it. I am going to continue working through it to see if I can't figure it out ;-) Adam On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson erickerick...@gmail.comwrote: The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what #you# think are phrases, since it'll be unique to your situation. If you have a list of phrases you care about, you could substitute a single token for the phrases you care about... But the overriding question is what determines a phrase you're interested in? Is it a list or is there some heuristic you want to apply? Or could you just recognize them at query time and make them into a literal phrase (i.e. with quotationmarks)? Best Erick On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: All, I am at a bit of a loss here so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs worth of free text. What I would really like to do is pull out phrases like Joe's coffee shop rather than the 3 individual words. 
I have tried the KeywordTokenizerFactory and that does seem to do what I want it to do but it is not actually tokenizing anything so it does what I want it to for the most part but it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typlically just view them in Solitas as facets which may be the problem for me too. Does anyone have an example fieldType they can share with me that shows how to extract phrases if they are there from the data I described earlier. Am I even going about this the right way? I am using today's trunk build of Solr and here is what I have munged together this morning. fieldType name=text_ws class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer charFilter class=solr.HTMLStripCharFilterFactory/ charFilter class=solr.MappingCharFilterFactory mapping=mapping-ISOLatin1Accent.txt/ tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true/ filter class=solr.ShingleFilterFactory maxShingleSize=4 outputUnigrams=true outputUnigramIfNoNgram=false/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.EnglishPossessiveFilterFactory/ filter class=solr.EnglishMinimalStemFilterFactory/ filter class=solr.ASCIIFoldingFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ filter class=solr.TrimFilterFactory/ /analyzer /fieldType Thanks, Adam
RE: Indexing data from multiple datasources
Hello Erick, Thanks for the response. No, I am using the extract handler to extract the data from my text files. In your second approach, you say I could use a DIH to update the index which would have been created by the extract handler in the first phase. My worry is that if, let's say, I get info from the DB and update the index by document ID, will I overwrite the data and lose the initial data from the extract handler phase?

Thanks
Greg

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 9 juin 2011 12:15
To: solr-user@lucene.apache.org
Subject: Re: Indexing data from multiple datasources

Hmmm, when you say you use Tika, are you using some custom Java code? Because if you are, the best thing to do is query your database at that point and add whatever information you need to the document. If you're using DIH to do the crawl, consider implementing a Transformer to do the database querying and modify the document as necessary. This is pretty simple to do; we can chat a bit more depending on whether either approach makes sense.

Best
Erick

On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote:

Hello all, I have checked the forums to see if it is possible to create an index from multiple datasources. I have found references to SOLR-1358, but I don't think this fits my scenario. In all, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db in the index for each document.
If I run a DataImportHandler after the extract phase, I am afraid that updating the doc in the index by its id will just cause me to overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first, then recreating the info). Does anyone have any pointers? Is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields?

Thanks in advance
Greg
Re: Indexing data from multiple datasources
How are you using it? Streaming the files to Solr via HTTP? You can use Tika on the client to extract the various bits from the structured documents, and use SolrJ to assemble various bits of that data Tika exposes into a Solr document that you then send to Solr. At the point you're transferring data from the Tika parse to the Solr document, you could add any data from your database that you wanted. The result is that you'd be indexing the complete Solr document only once. You're right that updating a document in Solr overwrites the previous version and any data in the previous version is lost Best Erick On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote: Hello Erick, Thanks for the response. No, I am using the extract handler to extract the data from my text files. In your second approach, you say I could use a DIH to update the index which would have been created by the extract handler in the first phase. I thought that lets say I get info from the DB and update the index with the document ID, will I overwrite the data and lose the initial data from the extract handler phase? Thanks Greg -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 9 juin 2011 12:15 To: solr-user@lucene.apache.org Subject: Re: Indexing data from multiple datasources Hmmm, when you say you use Tika, are you using some custom Java code? Because if you are, the best thing to do is query your database at that point and add whatever information you need to the document. If you're using DIH to do the crawl, consider implementing a Transformer to do the database querying and modify the document as necessary This is pretty simple to do, we can chat a bit more depending on whether either approach makes sense. Best Erick On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote: Hello all, I have checked the forums to see if it is possible to create and index from multiple datasources. 
I have found references to SOLR-1358, but I don't think this fits my scenario. In all, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db in the index for each document. If I run a dataimporthandler after the extract phase I am afraid that by updating the doc in the index by its id will just cause that I overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first then recreating the info). Anyone have any pointers, is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields? Thanks in advance Greg
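Erick's single-pass suggestion (assemble one document from the Tika output plus the database fields, then index it once) can be sketched with plain Java maps standing in for SolrJ's SolrInputDocument. The field names and values below are hypothetical placeholders; in real code the two input maps would come from a Tika parse and a JDBC query, and the merged map would be copied into a SolrInputDocument and sent with SolrJ:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeSketch {
    /** Merge Tika-extracted metadata and DB fields into one document map. */
    static Map<String, Object> buildDoc(Map<String, Object> tikaFields,
                                        Map<String, Object> dbFields) {
        Map<String, Object> doc = new LinkedHashMap<>(tikaFields);
        doc.putAll(dbFields); // DB values win on key collisions
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> tika = new LinkedHashMap<>();
        tika.put("id", "doc-42");
        tika.put("content", "body text extracted by Tika");

        Map<String, Object> db = new LinkedHashMap<>();
        db.put("category", "contracts");

        // One complete document, indexed once: no second pass, no overwrite.
        Map<String, Object> doc = buildDoc(tika, db);
        System.out.println(doc.keySet()); // [id, content, category]
    }
}
```

The point of the design is that the merge happens on the client before anything reaches Solr, so the delete-and-replace behavior of Solr updates never has a chance to drop fields.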
RE: Indexing data from multiple datasources
This thread got me thinking a bit... Does SOLR support the concept of partial updates to documents? By this I mean updating a subset of fields in a document that already exists in the index, and without having to resubmit the entire document. An example would be storing/indexing user tags associated with documents. These tags will not be available when the document is initially presented to SOLR, and may or may not come along at a later time. When that time comes, can we just submit the tag data (and document identifier I'd imagine), or do we have to import the entire document? new to SOLR... Date: Thu, 9 Jun 2011 14:00:43 -0400 Subject: Re: Indexing data from multiple datasources From: erickerick...@gmail.com To: solr-user@lucene.apache.org How are you using it? Streaming the files to Solr via HTTP? You can use Tika on the client to extract the various bits from the structured documents, and use SolrJ to assemble various bits of that data Tika exposes into a Solr document that you then send to Solr. At the point you're transferring data from the Tika parse to the Solr document, you could add any data from your database that you wanted. The result is that you'd be indexing the complete Solr document only once. You're right that updating a document in Solr overwrites the previous version and any data in the previous version is lost Best Erick On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote: Hello Erick, Thanks for the response. No, I am using the extract handler to extract the data from my text files. In your second approach, you say I could use a DIH to update the index which would have been created by the extract handler in the first phase. I thought that lets say I get info from the DB and update the index with the document ID, will I overwrite the data and lose the initial data from the extract handler phase? 
Thanks Greg -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 9 juin 2011 12:15 To: solr-user@lucene.apache.org Subject: Re: Indexing data from multiple datasources Hmmm, when you say you use Tika, are you using some custom Java code? Because if you are, the best thing to do is query your database at that point and add whatever information you need to the document. If you're using DIH to do the crawl, consider implementing a Transformer to do the database querying and modify the document as necessary This is pretty simple to do, we can chat a bit more depending on whether either approach makes sense. Best Erick On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote: Hello all, I have checked the forums to see if it is possible to create and index from multiple datasources. I have found references to SOLR-1358, but I don't think this fits my scenario. In all, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db in the index for each document. If I run a dataimporthandler after the extract phase I am afraid that by updating the doc in the index by its id will just cause that I overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first then recreating the info). Anyone have any pointers, is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields? Thanks in advance Greg
RE: Indexing data from multiple datasources
No; from what I understand, the way Solr does an update is to delete the document and then recreate all the fields. There is no partial updating of the document, maybe because of performance issues or locking?

-----Original Message-----
From: David Ross [mailto:davidtr...@hotmail.com]
Sent: 9 juin 2011 15:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from multiple datasources

This thread got me thinking a bit... Does SOLR support the concept of partial updates to documents? By this I mean updating a subset of fields in a document that already exists in the index, and without having to resubmit the entire document. An example would be storing/indexing user tags associated with documents. These tags will not be available when the document is initially presented to SOLR, and may or may not come along at a later time. When that time comes, can we just submit the tag data (and document identifier I'd imagine), or do we have to import the entire document?

new to SOLR...

Date: Thu, 9 Jun 2011 14:00:43 -0400
Subject: Re: Indexing data from multiple datasources
From: erickerick...@gmail.com
To: solr-user@lucene.apache.org

How are you using it? Streaming the files to Solr via HTTP? You can use Tika on the client to extract the various bits from the structured documents, and use SolrJ to assemble various bits of that data Tika exposes into a Solr document that you then send to Solr. At the point you're transferring data from the Tika parse to the Solr document, you could add any data from your database that you wanted. The result is that you'd be indexing the complete Solr document only once. You're right that updating a document in Solr overwrites the previous version, and any data in the previous version is lost.

Best
Erick

On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges greg.geor...@biztree.com wrote:

Hello Erick, Thanks for the response. No, I am using the extract handler to extract the data from my text files.
In your second approach, you say I could use a DIH to update the index which would have been created by the extract handler in the first phase. I thought that lets say I get info from the DB and update the index with the document ID, will I overwrite the data and lose the initial data from the extract handler phase? Thanks Greg -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 9 juin 2011 12:15 To: solr-user@lucene.apache.org Subject: Re: Indexing data from multiple datasources Hmmm, when you say you use Tika, are you using some custom Java code? Because if you are, the best thing to do is query your database at that point and add whatever information you need to the document. If you're using DIH to do the crawl, consider implementing a Transformer to do the database querying and modify the document as necessary This is pretty simple to do, we can chat a bit more depending on whether either approach makes sense. Best Erick On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges greg.geor...@biztree.com wrote: Hello all, I have checked the forums to see if it is possible to create and index from multiple datasources. I have found references to SOLR-1358, but I don't think this fits my scenario. In all, we have an application where we upload files. On the file upload, I use the Tika extract handler to save metadata from the file (_attr, literal values, etc..). We also have a database which has information on the uploaded files, like the category, type, etc.. I would like to update the index to include this information from the db in the index for each document. If I run a dataimporthandler after the extract phase I am afraid that by updating the doc in the index by its id will just cause that I overwrite the old information with the info from the DB (what I understand is that Solr updates its index by ID by deleting first then recreating the info). 
Anyone have any pointers, is there a clean way to do this, or must I find a way to pass the db metadata to the extract handler and save it as literal fields? Thanks in advance Greg
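Since this Solr version has no partial update, the client has to re-send every stored field when the late-arriving data (tags, DB metadata) shows up. A hedged illustration of the full re-posted update message; the field names and values are made up for the example:

```xml
<add>
  <doc>
    <field name="id">doc-42</field>
    <!-- fields from the original extract-handler pass must be re-sent in full -->
    <field name="content">full extracted text goes here again</field>
    <!-- plus the newly available database fields -->
    <field name="category">contracts</field>
    <field name="type">pdf</field>
  </doc>
</add>
```

This is why keeping the source data around (or storing every field so it can be read back) matters in this setup: without it there is nothing to re-send when the document has to be rebuilt.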
Processing/Indexing CSV
Hi, there seems to be no way to index CSV using the DataImportHandler. Using a combination of LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor) and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer) as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/ is not working for real-world CSV files. E.g. many CSV files have double quotes enclosing some but not all columns; there is no elegant way to segment this using a simple regular expression. As CSV is still very common, esp. in E-Commerce scenarios, I propose that Solr provide a CSVEntityProcessor that:

1) Handles CSV files with all, some, or no columns enclosed in double quotes
2) Allows for a configurable column separator (';', ',', '\t', etc.)
3) Allows for a leading row containing column headings
4) If there is a leading row with column headings, provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor)
5) Auto-detects the encoding of the file (UTF-8 etc.)

This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works I will develop one ... so please let me know.

Regards
Re: Processing/Indexing CSV
Hi, to make my point more clear: if the CSV has a fixed schema / column layout, using the RegexTransformer is of course a possibility (however awkward). But if you want to implement a (more or less) schema free shopping search engine ... regards On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, there seems to be no way to index CSV using the DataImportHandler. Using a combination of LineEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor and RegexTransformerhttp://wiki.apache.org/solr/DataImportHandler#RegexTransformer as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is not working for real world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common esp. in E-Commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';',',','\t' etc.) 3) Allows for a leading row containing column headings 4) If there is a leading row with column headings provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works i will develop one ... So please let me know. Regards
Unique Results from Edgy Text
I am using the guide found here (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/) to build an autocomplete search capability, but in my data set I have some documents which have the same value for the field that is being returned. So, for instance, I have the following being returned:

A test document to see how this works
A test document to see how this works
A test document to see how this works
A test document to see how this works
A test document to see how this works

I'm wondering if there is something I can specify to say that I want only unique results to come back. I know I can do some post-processing of the results to make sure that only unique items come back, but I was hoping there was something that could be done in the query. Any thoughts?
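Absent a query-side answer, the post-processing route mentioned above is small. A sketch in plain Java, assuming the suggestion values arrive as an ordered list of strings:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSuggestions {
    /** Keep the first occurrence of each suggestion, preserving rank order. */
    static List<String> dedup(List<String> suggestions) {
        return new ArrayList<>(new LinkedHashSet<>(suggestions));
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
            "A test document to see how this works",
            "A test document to see how this works",
            "Another suggestion");
        System.out.println(dedup(raw)); // one copy of each value, in order
    }
}
```

Server side, the field-collapsing work tracked in SOLR-236 (which became the grouping feature) is the usual way to avoid duplicate field values in results, but as far as I know it was not part of a released Solr at the time of this thread.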
RE: Processing/Indexing CSV
Helmut, I recently submitted SOLR-2549 (https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width and delimited flat files. To be honest, I only needed fixed-width support for my app, so this might not support everything you mention for delimited files, but it should be a good start. In particular, you might need to enhance this to handle the double quotes (I had thought a delimiter regex along these lines might handle it: (?:["]?[,]|["]$) ... note this is a sample I just cooked up quick and no doubt has errors, and maybe, as you say, a simple regex might not work at all) ... I also didn't do anything with encodings, but I'm not sure this will be an issue either...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com]
Sent: Thursday, June 09, 2011 2:32 PM
To: solr-user@lucene.apache.org
Subject: Processing/Indexing CSV

Hi, there seems to be no way to index CSV using the DataImportHandler. Using a combination of LineEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor) and RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer) as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/ is not working for real world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common esp. in E-Commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';',',','\t' etc.)
3) Allows for a leading row containing column headings 4) If there is a leading row with column headings provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works i will develop one ... So please let me know. Regards
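The reason a delimiter regex struggles here is that the delimiter itself can appear inside a quoted column, so the split decision depends on state the regex does not carry. A tiny state-machine sketch of the quoted-column handling being discussed (not the SOLR-2549 code, just an illustration of the technique):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {
    /** Split one CSV line, honoring double-quote enclosed columns. */
    static List<String> split(String line, char sep) {
        List<String> cols = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); i++;      // doubled quote "" is an escaped quote
                } else {
                    inQuotes = !inQuotes;      // toggle quoted section
                }
            } else if (c == sep && !inQuotes) {
                cols.add(cur.toString());      // separator outside quotes ends a column
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        cols.add(cur.toString());
        return cols;
    }

    public static void main(String[] args) {
        System.out.println(split("val1,\"val2,with comma\",val3", ','));
        // [val1, val2,with comma, val3]
    }
}
```

The same loop works unchanged for ';' or '\t' separators, which covers the configurable-separator wish as well.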
RE: Displaying highlights in formatted HTML document
> Ludovic, how do you index your html files? I mean, do you create fields for different parts of your document (for different stop words lists, stemming, etc.)? With DIH or solrj or something else?

We are sending them over http, and using Tika to strip the HTML, at present. We do not split the document itself into separate fields, but what we index includes a bunch of metadata that has been extracted by processes earlier in the pipeline. These fields don't enter into the HTML-hit-highlighting question.

> I developed this week a new highlighter module which transfers the fields highlighting to the original document (xml in my case) (I use payloads to store offsets and lengths of fields in the index). This way, I use the good analyzers to do the highlighting correctly and then I replace the different field parts in the document with the highlighted parts. It is not finished yet, but I already have some good results.

Yes, I have been thinking along very similar lines. If you arrive at something you're happy with, I encourage you to share it. This is a client request too.

> Let me know if the iorixxx's solution is not enough for your particular use case.

I'm enough of a Solr newb that I'll need to study his suggestion for a bit, to figure out what it does and does not do. When I've done so, I'll respond to his message.

Thanks,

-- Bryan
Re: Processing/Indexing CSV
On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, there seems to be no way to index CSV using the DataImportHandler. Looking over the features you want, it looks like you're starting from a CSV file (as opposed to CSV stored in a database). Is there a reason that you need to use DIH and can't directly use the CSV loader? http://wiki.apache.org/solr/UpdateCSV -Yonik http://www.lucidimagination.com Using a combination of LineEntityProcessorhttp://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor and RegexTransformerhttp://wiki.apache.org/solr/DataImportHandler#RegexTransformer as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is not working for real world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common esp. in E-Commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';',',','\t' etc.) 3) Allows for a leading row containing column headings 4) If there is a leading row with column headings provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works i will develop one ... So please let me know. Regards
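For what it's worth, several items on the wishlist quoted above map onto existing parameters of the CSV handler Yonik points to: separator, encapsulator (for the double-quote case), header, and fieldnames (for column-to-field mapping). A sketch of a request, with the host, core, file path, and field names as made-up examples:

```
http://localhost:8983/solr/update/csv?commit=true
    &separator=%3B            (use ';' instead of ',' as the column separator)
    &encapsulator=%22         (columns may be enclosed in double quotes)
    &header=false
    &fieldnames=id,name,price (map columns to these Solr fields)
    &stream.file=/tmp/products.csv
```

Note that stream.file requires remote streaming to be enabled in solrconfig.xml; the CSV body can also be POSTed directly. Encoding auto-detection is the one wishlist item with no counterpart here, as far as I can tell.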
RE: Displaying highlights in formatted HTML document
I am not (yet) a Tika user; perhaps the iorixxx's solution is good for you. We will share the highlighter module and 2 other developments soon (have to see how to do that).

Ludovic.

-----
Jouve
France.

--
View this message in context: http://lucene.472066.n3.nabble.com/Displaying-highlights-in-formatted-HTML-document-tp3041909p3045654.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Processing/Indexing CSV
Hi, just looked at your code. Definitely an improvement :-) The problem with the double-quotes is, that the delimiter (let's say ',') might be part of the column value. The goal is to process something like this without any tricky configuration name1,name2,name3 val1,val2,...,val3 ... The user should not have to provide and before-hand knowledge regarding the column layout or the encoding of the CSV file. Ideally the only thing that has to be specified is firstLineHasFieldnames=true separator=;. Autodetecting the separator and encoding would be even more elegant. If nobody else has this in the works I will start building such a patch next week. Best Regards On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James james.d...@ingrambook.comwrote: Helmut, I recently submitted SOLR-2549 ( https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width and delimited flat files. To be honest, I only needed fixed-width support for my app so this might not support everything you mention for delimited files, but it should be a good start. In particular, you might need to enhance this to handle the double quotes (I had though a delimiter regex along these lines might handle it: (?:[\]?[,]|[\]$) ... note this is a sample I just cooked up quick and no doubt has errors, and maybe as you say a simple regex might not work at all ) ... I also didn't do anything with encodings but I'm not sure this will be an issue either... James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com] Sent: Thursday, June 09, 2011 2:32 PM To: solr-user@lucene.apache.org Subject: Processing/Indexing CSV Hi, there seems to be no way to index CSV using the DataImportHandler. 
Using a combination of LineEntityProcessor http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor and RegexTransformer http://wiki.apache.org/solr/DataImportHandler#RegexTransformer as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is not working for real world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common esp. in E-Commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';',',','\t' etc.) 3) Allows for a leading row containing column headings 4) If there is a leading row with column headings provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works i will develop one ... So please let me know. Regards
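On wishlist item 5 (encoding auto-detection): a byte-order-mark check covers only the easy cases, since most UTF-8 files carry no BOM; real detection needs statistical heuristics or a library such as ICU4J's CharsetDetector. A minimal stdlib sketch of the BOM part:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniff {
    /** Guess a charset from a byte-order mark; fall back to a default. */
    static Charset detect(byte[] head, Charset fallback) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;
        return fallback; // no BOM: needs heuristics or a detection library
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detect(utf8Bom, StandardCharsets.ISO_8859_1)); // UTF-8
    }
}
```

Reading the first few bytes of the uploaded file through something like this before constructing the Reader would let the proposed CSVEntityProcessor pick a sensible charset automatically in the BOM cases, while still honoring an explicit override.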
Re: Processing/Indexing CSV
s/provide and/provide any/ig ,-) On Thu, Jun 9, 2011 at 10:01 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, just looked at your code. Definitely an improvement :-) The problem with the double-quotes is, that the delimiter (let's say ',') might be part of the column value. The goal is to process something like this without any tricky configuration name1,name2,name3 val1,val2,...,val3 ... The user should not have to provide and before-hand knowledge regarding the column layout or the encoding of the CSV file. Ideally the only thing that has to be specified is firstLineHasFieldnames=true separator=;. Autodetecting the separator and encoding would be even more elegant. If nobody else has this in the works I will start building such a patch next week. Best Regards On Thu, Jun 9, 2011 at 9:45 PM, Dyer, James james.d...@ingrambook.comwrote: Helmut, I recently submitted SOLR-2549 ( https://issues.apache.org/jira/browse/SOLR-2549) to handle both fixed-width and delimited flat files. To be honest, I only needed fixed-width support for my app so this might not support everything you mention for delimited files, but it should be a good start. In particular, you might need to enhance this to handle the double quotes (I had though a delimiter regex along these lines might handle it: (?:[\]?[,]|[\]$) ... note this is a sample I just cooked up quick and no doubt has errors, and maybe as you say a simple regex might not work at all ) ... I also didn't do anything with encodings but I'm not sure this will be an issue either... James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Helmut Hoffer von Ankershoffen [mailto:helmut...@googlemail.com] Sent: Thursday, June 09, 2011 2:32 PM To: solr-user@lucene.apache.org Subject: Processing/Indexing CSV Hi, there seems to be no way to index CSV using the DataImportHandler. 
Using a combination of LineEntityProcessor http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor and RegexTransformer http://wiki.apache.org/solr/DataImportHandler#RegexTransformer as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/is not working for real world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common esp. in E-Commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';',',','\t' etc.) 3) Allows for a leading row containing column headings 4) If there is a leading row with column headings provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for E-Commerce scenarios. If there is no such entity processor in the works i will develop one ... So please let me know. Regards
RE: Displaying highlights in formatted HTML document
-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Wednesday, June 08, 2011 11:56 PM To: solr-user@lucene.apache.org Subject: Re: Displaying highlights in formatted HTML document --- On Thu, 6/9/11, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: From: Bryan Loofbourrow bloofbour...@knowledgemosaic.com Subject: Displaying highlights in formatted HTML document To: solr-user@lucene.apache.org Date: Thursday, June 9, 2011, 2:14 AM Here is my use case: I have a large number of HTML documents, sizes in the 0.5K-50M range, most around, say, 10M. I want to be able to present the user with the formatted HTML document, with the hits tagged, so that he may iterate through them, and see them in the context of the document, with the document looking as it would be presented by a browser; that is, fully formatted, with its tables and italics and font sizes and all. This is something that the user would explicitly request from within a set of search results, not something I'd expect to have returned from an initial search - the initial search merely returns the snippets around the hits. But if the user wants to dive into one of the returned results and see them in context, I need to be able to go get that. We are currently solving this problem by using an entirely separate search engine (dtSearch), which performs the tagging of the hits in the HTML just fine. But the solution is unsatisfactory because there are Solr searches that dtSearch's capabilities cannot reasonably match. Can anyone suggest a good way to use Solr/Lucene for this instead? I'm thinking a separate core for this purpose might make sense, so as not to burden the primary search core with the full contents of the document. But after that, I'm stuck. How can I get Solr to express the highlighting in the context of the formatted HTML document? 
If Solr does not do this currently, and anyone can suggest ways to add the feature, any tips on how this might best be incorporated into the implementation would be welcome.

I am doing the same thing (solr trunk) using the following field type:

<fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

In your separate core - which will be queried when the user wants to dive into one of the returned results - feed your html files into this field. You may want to increase max analyzed chars too:

<int name="hl.maxAnalyzedChars">147483647</int>

OK, I think I see what you're up to. Might be pretty viable for me as well. Can you talk about anything in your mappings.txt files that is an important part of the solution? Also, isn't there another piece? Don't you need to force it to return the whole document, rather than its usual context chunks? Or are you somehow able to map the returned chunks into the separately-stored documents?
We have another requirement I forgot to mention, about wanting to associate a sequence number with each hit, but I imagine I can deal with that by putting some sort of identifiable char sequence in a custom prefix for the highlighting, then replacing that with a sequence number in postprocessing. I'm also wondering about the performance of this approach with large documents, vs. something like what Ludovic is talking about, where you would just get positions back from Solr, and fetch the document separately from a filestore. -- Bryan
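Bryan's sequence-number idea (an identifiable custom prefix in the highlight tag, replaced with a running number in postprocessing) could be sketched like this; the `@@HIT@@` sentinel and `hit-N` ids are invented for illustration, not part of any Solr API:

```python
import re

def number_hits(html, prefix="@@HIT@@", tag="em"):
    """Replace a sentinel highlight prefix with sequential ids.

    Assumes Solr was configured with a highlight pre-tag like
    <em>@@HIT@@ (hypothetical); each occurrence becomes <em id="hit-N">
    so the UI can jump from hit to hit.
    """
    counter = {"n": 0}

    def repl(match):
        counter["n"] += 1
        return '<%s id="hit-%d">' % (tag, counter["n"])

    # Match the literal pre-tag plus sentinel and rewrite it in order.
    return re.sub(re.escape("<%s>%s" % (tag, prefix)), repl, html)

out = number_hits("<p><em>@@HIT@@foo</em> bar <em>@@HIT@@baz</em></p>")
```

The postprocessing cost is a single linear pass over the returned document, so it should not add much on top of the highlighting itself.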
Re: Processing/Indexing CSV
Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. DIH is flexible enough for building the importing part of such a thing but misses elegant handling of CSV data ... Regards

On Thu, Jun 9, 2011 at 9:50 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jun 9, 2011 at 3:31 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, there seems to be no way to index CSV using the DataImportHandler. Looking over the features you want, it looks like you're starting from a CSV file (as opposed to CSV stored in a database). Is there a reason that you need to use DIH and can't directly use the CSV loader? http://wiki.apache.org/solr/UpdateCSV -Yonik http://www.lucidimagination.com

Using a combination of LineEntityProcessor http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor and RegexTransformer http://wiki.apache.org/solr/DataImportHandler#RegexTransformer as proposed in http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/ is not working for real-world CSV files. E.g. many CSV files have double-quotes enclosing some but not all columns - there is no elegant way to segment this using a simple regular expression. As CSV is still very common, esp. in e-commerce scenarios, I propose that Solr provides a CSVEntityProcessor that: 1) Handles the case of CSV files with/without and with some double-quote enclosed columns 2) Allows for a configurable column separator (';', ',', '\t' etc.) 3) Allows for a leading row containing column headings 4) If there is a leading row with column headings, provides a possibility to address columns by their column names and map them to Solr fields (similar to the XPathEntityProcessor) 5) Auto-detects the encoding of the file (UTF-8 etc.) This would make it A LOT easier to use Solr for e-commerce scenarios. If there is no such entity processor in the works I will develop one ...
So please let me know. Regards
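The mixed-quoting problem described above (double-quotes enclosing some but not all columns) is exactly what a real CSV parser handles and a naive regex split does not. A minimal Python illustration; the sample feed line and the `;` separator are made up:

```python
import csv
import io

# A shop-feed style line: some columns quoted, some not, with the
# separator character appearing inside a quoted column.
line = 'sku123;"red, wool jacket";19.99;"size: M; color: red"'

# A naive split on the separator breaks the quoted column apart:
naive = line.split(";")  # yields 5 pieces instead of the intended 4

# A CSV parser honors the quoting and yields the intended 4 columns:
row = next(csv.reader(io.StringIO(line), delimiter=";"))
```

This is why the proposal above asks for a proper CSV-aware entity processor rather than LineEntityProcessor plus RegexTransformer.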
Re: Processing/Indexing CSV
On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com
Re: Processing/Indexing CSV
Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different csv files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in the solr schema. If I understand the capabilities of the CSVLoader correctly (sorry, I am completely new to Solr, started work on it today) this is not possible - is it? Best Regards On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com
RE: Displaying highlights in formatted HTML document
OK, I think I see what you're up to. Might be pretty viable for me as well. Can you talk about anything in your mappings.txt files that is an important part of the solution? It is not important. I just copied it. Plus, the HTML strip char filter does not have a mapping parameter; it was a copy-paste mistake. Also, isn't there another piece? Don't you need to force it to return the whole document, rather than its usual context chunks? Yes, you are right. hl.fragsize=0 is needed. We have another requirement I forgot to mention, about wanting to associate a sequence number with each hit, but I imagine I can deal with that by putting some sort of identifiable char sequence in a custom prefix for the highlighting, then replacing that with a sequence number in postprocessing. I'm also wondering about the performance of this approach with large documents, vs. something like what Ludovic is talking about, where you would just get positions back from Solr, and fetch the document separately from a filestore. Highlighting large documents takes time. Storing term vectors can be used to speed it up. I don't know the answer to the performance comparison. Perhaps someone familiar with highlighting can answer this.
RE: Displaying highlights in formatted HTML document
OK, I think I see what you're up to. Might be pretty viable for me as well. Can you talk about anything in your mappings.txt files that is an important part of the solution? It is not important. I just copied it. Plus, the HTML strip char filter does not have a mapping parameter; it was a copy-paste mistake. Yes, I asked the wrong question. What I was subconsciously getting at is this: how are you avoiding the possibility of getting hits in the HTML elements? Is that accomplished by putting tag names in your stopwords, or by some other mechanism? -- Bryan
RE: solr Invalid Date in Date Math String/Invalid Date String
: Here is the error message: : : Fieldtype: tdate (I use the default one in solr schema.xml) : Field value(Index): 2006-12-22T13:52:13Z : Field value(query): [2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z] : with '[' and ']' : : And it generates the result below: I think the piece of info people were overlooking here is that you are describing input to the analysis.jsp page. You can't enter arbitrary query expressions on this page -- just *values* for the analyzer of the specified field (or field type). DateField doesn't know anything about the [... TO ...] syntax -- that is syntax of the query parser. All DateField knows is that what you have entered into the Field Value text box is not a date value, and it is not a date math value either. -Hoss
Re: Processing/Indexing CSV
On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different csv files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in the solr schema. If I understand the capabilities of the CSVLoader correctly (sorry, I am completely new to Solr, started work on it today) this is not possible - is it? As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the names/positions of fields in the CSV file, and ignore fieldnames. So this seems like it would solve your requirement, as each different layout could specify its own such mapping during import. It could be handy to provide a fieldname map (versus the value map that UpdateCSV supports). Then you could use the header, and just provide a mapping from header fieldnames to schema fieldnames. -- Ken On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
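Ken's fieldname-map suggestion can be prototyped outside Solr by rewriting the header row before handing the file to UpdateCSV, so one shared map covers many shops regardless of column order. A sketch; the shop header names and the FIELD_MAP contents are hypothetical:

```python
import csv
import io

# Hypothetical mapping from the many shop-specific header names to
# canonical Solr schema field names; several inputs may map to one field.
FIELD_MAP = {
    "artikelnummer": "id", "sku": "id",
    "preis": "price", "price": "price",
    "bezeichnung": "title", "name": "title",
}

def remap_header(csv_text, delimiter=";"):
    """Rewrite the header row via FIELD_MAP; body rows pass through."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header = [FIELD_MAP.get(h.strip().lower(), h) for h in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter)
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()

remapped = remap_header("SKU;Preis;Name\n123;9.99;Buch\n")
```

The remapped text can then be posted to the UpdateCSV handler as-is, with no per-shop `fieldnames` parameter needed.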
Re: Solr Indexing Patterns
Very informative links and statement, Jonathan. Thank you. On 6 June 2011 20:55, Jonathan Rochkind rochk...@jhu.edu wrote: This is a start, for many common best practices: http://wiki.apache.org/solr/SolrRelevancyFAQ Many of the questions in there have an answer that involves de-normalizing, as an example. Even if your specific problem isn't in there, I myself found that reading through it gave me a general sense of common patterns in Solr. ( It's certainly true that some things are hard to do in Solr. It turns out that an RDBMS is a remarkably flexible thing -- but when it doesn't do something you need well, and you turn to a specialized tool like Solr instead, you certainly give up some things. One of the biggest areas of limitation involves hierarchical or relationship data, definitely. There are a variety of features, some more fully baked than others, some not yet in a Solr release, meant to provide tools to get at different aspects of this. Including pivot faceting, join (https://issues.apache.org/jira/browse/SOLR-2272), and field-collapsing. Each, IMO, is trying to deal with different aspects of dealing with hierarchical or multi-class data, or data that is entities with relationships. ). On 6/6/2011 3:43 PM, Judioo wrote: I do think that Solr would be better served if there was a *best practice section* of the site. Looking at the majority of emails to this list, they revolve around how do I do X?. Seems like tutorials with real world examples would serve Solr no end of good. I still do not have an example of the best method to approach my problem, although Erick has helped me understand the limitations of Solr. Just thought I'd say. On 6 June 2011 20:26, Judioo cont...@judioo.com wrote: Thanks On 6 June 2011 19:32, Erick Erickson erickerick...@gmail.com wrote: #Everybody# (including me) who has any RDBMS background doesn't want to flatten data, but that's usually the way to go in Solr.
Part of whether it's a good idea or not depends on how big the index gets, and unfortunately the only way to figure that out is to test. But that's the first approach I'd try. Good luck! Erick

On Mon, Jun 6, 2011 at 11:42 AM, Judioo cont...@judioo.com wrote: On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote: See: http://wiki.apache.org/solr/SchemaXml By adding multiValued="true" to the field, you can add the same field multiple times in a doc, something like:

<add><doc><field name="mv">value1</field><field name="mv">value2</field></doc></add>

I can't see how that would work, as one would need to associate the right start / end dates and price. As I understand it, using multivalued and thus flattening the discounts would result in: { name:The Book, price:$9.99, price:$3.00, price:$4.00, synopsis:thanksgiving special, starts:11-24-2011, starts:10-10-2011, ends:11-25-2011, ends:10-11-2011, synopsis:Canadian thanksgiving special, }, How does one differentiate the different offers? But there's no real ability in Solr to store sub documents, so you'd have to get creative in how you encoded the discounts... This is what I'm asking :) What is the best / recommended / known pattern for doing this? But I suspect a better approach would be to store each discount as a separate document. If you're on the trunk version, you could then group results by, say, ISBN and get responses grouped together... This is an option but seems sub optimal. So say I store the discounts in multiple documents with ISBN as an attribute and also store the title again with ISBN as an attribute. To get all books currently discounted requires 2 requests: * get all discounts currently active * get all books using ISBNs retrieved from the above search. Not that bad. However what happens when I want all books that are currently on discount in the horror genre containing the word 'elm' in the title.
The only way I can see to cater for the above search is to duplicate all searchable fields in my book document in my discount document. Coming from an RDBMS background this seems wrong. Is this the correct approach to take? Best Erick On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote: Hi, Discounts can change daily. Also there can be a lot of them (over time and in a given time period). Could you give an example of what you mean by multi-valuing the field. Thanks On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote: How often are the discounts changed? Because you can simply re-index the book information with a multiValued discounts field and get something similar to your example (wt=json) Best Erick On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote: What is the best practice method to index the following in Solr: I'm attempting to use solr for a book store site.
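Erick's store-each-discount-as-a-separate-document suggestion amounts to a flattening step at index time, duplicating the searchable book fields onto each discount document so a single filtered query (genre, title word, active discount) can match in one request. A sketch of that denormalization; the field names are illustrative, not a fixed schema:

```python
def flatten(book):
    """Emit one 'book' doc plus one doc per discount.

    The searchable book fields (here just title) are copied onto each
    discount doc, which is the duplication the thread is debating.
    """
    docs = [{"id": book["isbn"], "type": "book",
             "title": book["title"], "price": book["price"]}]
    for i, d in enumerate(book.get("discounts", [])):
        docs.append({"id": "%s-d%d" % (book["isbn"], i),
                     "type": "discount",
                     "isbn": book["isbn"],
                     "title": book["title"],  # duplicated for search
                     "price": d["price"],
                     "starts": d["starts"], "ends": d["ends"]})
    return docs

docs = flatten({"isbn": "978-1", "title": "The Book", "price": 9.99,
                "discounts": [{"price": 3.00, "starts": "2011-11-24",
                               "ends": "2011-11-25"}]})
```

Querying then becomes e.g. `type:discount AND title:elm AND starts:[* TO NOW] AND ends:[NOW TO *]`, grouped by isbn on versions that support grouping.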
Re: Processing/Indexing CSV
On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler kkrugler_li...@transpac.comwrote: On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different csv files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in the solr schema. If I understand the capabilities of the CSVLoader correctly (sorry, I am completely new to Solr, started work on it today) this is not possible - is it? As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the names/positions of fields in the CSV file, and ignore fieldnames. So this seems like it would solve your requirement, as each different layout could specify its own such mapping during import. Sure, but the requirement (to keep the process of integrating new shops efficient) is not to have one mapping per import (cp. the Email regarding more or less schema free) but to enhance one mapping that maps common field names to defined fields disregarding order of known fields/columns. As far as I understand that is not a problem at all with DIH, however DIH and CSV are not a perfect match ,-) It could be handy to provide a fieldname map (versus the value map that UpdateCSV supports). Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH ... Then you could use the header, and just provide a mapping from header fieldnames to schema fieldnames. That's the idea -) = what's the best way to progress. Either someone enhances the CSVLoader by a field mapper (with multipel input field names mapping to one field name in the Solr schema) or someone enhances the DIH with a robust CSV loader ,-). As I am completely new to this Community, please give me the direction to go (or wait :-). 
best regards -- Ken On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
Re: Processing/Indexing CSV
Hi, btw: there seems to somewhat of a non-match regarding efforts to Enhance DIH regarding the CSV format (James Dyer) and the effort to maintain the CSVLoader (Ken Krugler). How about merging your efforts and migrating the CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-) Best Regards On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler kkrugler_li...@transpac.comwrote: On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different csv files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in the solr schema. If I understand the capabilities of the CSVLoader correctly (sorry, I am completely new to Solr, started work on it today) this is not possible - is it? As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the names/positions of fields in the CSV file, and ignore fieldnames. So this seems like it would solve your requirement, as each different layout could specify its own such mapping during import. Sure, but the requirement (to keep the process of integrating new shops efficient) is not to have one mapping per import (cp. the Email regarding more or less schema free) but to enhance one mapping that maps common field names to defined fields disregarding order of known fields/columns. As far as I understand that is not a problem at all with DIH, however DIH and CSV are not a perfect match ,-) It could be handy to provide a fieldname map (versus the value map that UpdateCSV supports). Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH ... Then you could use the header, and just provide a mapping from header fieldnames to schema fieldnames. 
That's the idea -) = what's the best way to progress. Either someone enhances the CSVLoader by a field mapper (with multipel input field names mapping to one field name in the Solr schema) or someone enhances the DIH with a robust CSV loader ,-). As I am completely new to this Community, please give me the direction to go (or wait :-). best regards -- Ken On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
RE: Displaying highlights in formatted HTML document
Yes, I asked the wrong question. What I was subconsciously getting at is this: how are you avoiding the possibility of getting hits in the HTML elements? Is that accomplished by putting tag names in your stopwords, or by some other mechanism? HtmlStripCharFilter removes html tags. After it only textual content remains. It is the same as extracting text from html/xml. admin/analysis.jsp is great tool visualizing analysis chain. You can try it. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
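The point above - that HtmlStripCharFilter removes the markup before the tokenizer runs, so tag names can never produce hits - can be mimicked with the Python stdlib parser. This is an analogy to illustrate the behavior, not the Lucene implementation:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only character data, dropping tags and attributes -
    roughly what HTMLStripCharFilterFactory does before tokenizing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    parser = TextOnly()
    parser.feed(html)
    return "".join(parser.chunks)

# The tag and attribute names ("table", "em") never reach the output:
text = strip_html('<table class="em">red <em>jacket</em></table>')
```

Since only the remaining text is tokenized and indexed, no stopword list of tag names is needed.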
Re: Processing/Indexing CSV
On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote: Hi, btw: there seems to somewhat of a non-match regarding efforts to Enhance DIH regarding the CSV format (James Dyer) and the effort to maintain the CSVLoader (Ken Krugler). How about merging your efforts and migrating the CSVLoader to a CSVEntityProcessor (cp. my initial email)? :-) While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not involved in any active development/maintenance of that piece of code. If James or you can make progress on merging support for CSV into DIH, that's great. -- Ken On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler kkrugler_li...@transpac.comwrote: On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote: Hi, ... that would be an option if there is a defined set of field names and a single column/CSV layout. The scenario however is different csv files (from different shops) with individual column layouts (separators, encodings etc.). The idea is to map known field names to defined field names in the solr schema. If I understand the capabilities of the CSVLoader correctly (sorry, I am completely new to Solr, started work on it today) this is not possible - is it? As per the documentation on http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the names/positions of fields in the CSV file, and ignore fieldnames. So this seems like it would solve your requirement, as each different layout could specify its own such mapping during import. Sure, but the requirement (to keep the process of integrating new shops efficient) is not to have one mapping per import (cp. the Email regarding more or less schema free) but to enhance one mapping that maps common field names to defined fields disregarding order of known fields/columns. 
As far as I understand that is not a problem at all with DIH, however DIH and CSV are not a perfect match ,-) It could be handy to provide a fieldname map (versus the value map that UpdateCSV supports). Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH ... Then you could use the header, and just provide a mapping from header fieldnames to schema fieldnames. That's the idea -) = what's the best way to progress. Either someone enhances the CSVLoader by a field mapper (with multipel input field names mapping to one field name in the Solr schema) or someone enhances the DIH with a robust CSV loader ,-). As I am completely new to this Community, please give me the direction to go (or wait :-). best regards -- Ken On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen helmut...@googlemail.com wrote: Hi, yes, it's about CSV files loaded via HTTP from shops to be fed into a shopping search engine. The CSV Loader cannot map fields (only field values) etc. You can provide your own list of fieldnames and optionally ignore the first line of the CSV file (assuming it contains the field names). http://wiki.apache.org/solr/UpdateCSV#fieldnames -Yonik http://www.lucidimagination.com -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom data mining solutions
SolrCloud questions
I'm exploring SolrCloud for a new project, and have some questions based upon what I've found so far. The setup I'm planning is going to have a number of multicore hosts, with cores being moved between hosts, and potentially with cores merging as they get older (cores are time based, so once today has passed, they don't get updated). First question: The solr/conf dir gets uploaded to Zookeeper when you first start up, and using system properties you can specify a name to be associated with those conf files. How do you handle it when you have a multicore setup, and different configs for each core on your host? Second question: Can you query collections when using multicore? On single core, I can query: http://localhost:8983/solr/collection1/select?q=blah On a multicore system I can query: http://localhost:8983/solr/core1/select?q=blah but I cannot work out a URL to query collection1 when I have multiple cores. Third question: For replication, I'm assuming that replication in SolrCloud is still managed in the same way as non-cloud Solr, that is as ReplicationHandler config in solrconfig? In which case, I need a different config setup for each slave, as each slave has a different master (or can I delegate the decision as to which host/core is its master to zookeeper?) Thanks for any pointers. Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Tokenising based on known words?
Thanks for the feedback! This definitely gives me some options to work on! Mark On Thu, Jun 9, 2011 at 11:21 PM, Steven A Rowe sar...@syr.edu wrote: Hi Mark, Are you familiar with shingles aka token n-grams? http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html Use the empty string for the tokenSeparator to get wordstogether style tokens in your index. I think you'll want to apply this filter only at index-time, since the users will supply the shingles all by themselves :). Steve -Original Message- From: Mark Mandel [mailto:mark.man...@gmail.com] Sent: Thursday, June 09, 2011 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Tokenising based on known words? Synonyms really wouldn't work for every possible combination of words in our index. Thanks for the idea though. Mark On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty g...@mimirtech.com wrote: On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel mark.man...@gmail.com wrote: Not sure if this possible, but figured I would ask the question. Basically, we have some users who do some pretty rediculous things ;o) Rather than writing red jacket, they write redjacket, which obviously returns no results. [...] Have you tried using synonyms, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymF ilterFactory It seems like they should fit your use case. Regards, Gora -- E: mark.man...@gmail.com T: http://www.twitter.com/neurotic W: www.compoundtheory.com cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia http://www.cfobjective.com.au Hands-on ColdFusion ORM Training www.ColdFusionOrmTraining.com -- E: mark.man...@gmail.com T: http://www.twitter.com/neurotic W: www.compoundtheory.com cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia http://www.cfobjective.com.au Hands-on ColdFusion ORM Training www.ColdFusionOrmTraining.com
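Steven's suggestion - index-time shingles joined with an empty tokenSeparator so that "red jacket" also produces the token "redjacket" - can be sketched as follows. This is a toy version of the idea, not the Lucene ShingleFilter code:

```python
def shingles(tokens, max_size=2, sep=""):
    """Emit the original tokens plus joined n-grams up to max_size,
    mimicking ShingleFilterFactory with tokenSeparator="" and
    unigrams kept in the output."""
    out = list(tokens)
    for n in range(2, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

toks = shingles(["red", "jacket"])
```

With these tokens indexed, a user query of "redjacket" matches directly, with no query-time analysis change needed.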
Where to find the Log file
Where can I find the log file of solr? Is it turned on by default? (I use Jetty) Thanks Ruixiang
Re: Boosting result on query.
Hi, Thank you for your answer. But... I cannot use a boost calculated offline, since the boost will change depending on the query made. Each query will boost the results differently. Any other ideas? Jeff -- View this message in context: http://lucene.472066.n3.nabble.com/Boosting-result-on-query-tp3037649p3046859.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Where to find the Log file
On Jun 9, 2011, at 5:45 PM, Ruixiang Zhang wrote: Where can I find the log file of solr? (I use Jetty) By default, it's in yourapp/solr/logs/solr.log Is it turned on by default? Yes. Oh, yes. Very much so. Uh-huh, you betcha. -==- Jack Repenning Technologist Codesion Business Unit CollabNet, Inc. 8000 Marina Boulevard, Suite 600 Brisbane, California 94005 office: +1 650.228.2562 twitter: http://twitter.com/jrep PGP.sig Description: This is a digitally signed message part
Re: Where to find the Log file
Here's help on how to setup logging http://skybert.wordpress.com/2009/07/22/how-to-get-solr-to-log-to-a-log-file/ - Morris - Original Message - From: Ruixiang Zhang rxzh...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, June 9, 2011 8:45:30 PM GMT -05:00 US/Canada Eastern Subject: Where to find the Log file Where can I find the log file of solr? Is it turned on by default? (I use Jetty) Thanks Ruixiang
Re: tika integration exception and other related queries
Hi Gary, We are doing a similar thing, but we are not creating an XML doc; rather we are leaving Tika to extract the content and depend on dynamic fields. We are not storing the text either, but not sure if in future that would be the case. What about Microsoft Office 2007 and later attachment formats? Is this working for you? We are always getting a number format exception. I posted in the community as well, but till now no response has come. Thanks Naveen

On Thu, Jun 9, 2011 at 6:43 PM, Gary Taylor g...@inovem.com wrote: Naveen, Not sure our requirement matches yours, but one of the things we index is a comment item that can have one or more files attached to it. To index the whole thing as a single Solr document we create a zipfile containing a file with the comment details in it and any additional attached files. This is submitted to Solr as a TEXT field in an XML doc, along with other meta-data fields from the comment. In our schema the TEXT field is indexed but not stored, so when we search and get a match back it doesn't contain all of the contents from the attached files etc., only the stored fields in our schema. Admittedly, the user can therefore get back a comment match with no indication as to WHERE the match occurred (ie. was it in the meta-data or the contents of the attached files), but at the moment we're only interested in getting appropriate matches, not explaining where the match is. Hope that helps. Kind regards, Gary.

On 09/06/2011 03:00, Naveen Gupta wrote: Hi Gary, It started working .. though I did not test for Zip files, but for rar files it is working fine .. the only thing I wanted to do is to index the metadata (text mapped to content), not store the data. Also in the search result, I want to filter the stuff ... and it started working fine .. I don't want to show the content to the end user, since the way it extracts the information is not very helpful to the user ..
although we can apply few of the analyzers and filters to remove the unnecessary tags ..still the information would not be of much help .. looking for your opinion ... what you did in order to filter out the content or are you showing the content extracted to the end user? Even in case, we are showing the text part to the end user, how can i limit the number of characters while querying the search results ... is there any feature where we can achieve this ... the concept of snippet kind of thing ... Thanks Naveen On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylorg...@inovem.com wrote: Naveen, For indexing Zip files with Tika, take a look at the following thread : http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html I got it to work with the 3.1 source and a couple of patches. Hope this helps. Regards, Gary. On 08/06/2011 04:12, Naveen Gupta wrote: Hi Can somebody answer this ... 3. can somebody tell me an idea how to do indexing for a zip file ? 1. while sending docx, we are getting following error.
ERROR on posting update request using CURL in php
Hi, This is my document in php:

$xmldoc = '<add><doc><field name="id">F_146</field><field name="userid">74</field><field name="groupuseid">gmail.com</field><field name="attachment_size">121</field><field name="attachment_name">sample.pptx</field></doc></add>';
$ch = curl_init("http://localhost:8080/solr/update");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml"));
curl_setopt($ch, CURLOPT_POSTFIELDS, $xmldoc);
$result = curl_exec($ch);
if (!curl_errno($ch)) {
    $info = curl_getinfo($ch);
    $header = substr($response, 0, $info['header_size']);
    echo 'Took ' . $info['total_time'] . ' seconds to send a request to ' . $info['url'];
} else {
    print_r('no idea');
}
println('result of query' . ' ' . ' - ' . $result);

It is throwing this error (Tomcat error page, markup stripped):

HTTP Status 400 - Unexpected character ''' (code 39) in prolog; expected '<' at [row,col {unknown-source}]: [1,1]
description: The request sent by the client was syntactically incorrect.
Apache Tomcat/6.0.18

Thanks Naveen
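The 400 error above says the request body began with a `'` (code 39) where the XML prolog's `<` was expected, i.e. the document never reached Solr as well-formed XML. Building the body with an XML library guarantees a `<`-leading, properly escaped payload. A Python sketch of constructing such an update body (field values taken from the message above; posting it is left out):

```python
import xml.etree.ElementTree as ET

def update_xml(fields):
    """Build a Solr <add><doc>...</doc></add> update body.

    ElementTree escapes field values and guarantees the serialized
    body starts with '<', avoiding the 'in prolog' parse error.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

body = update_xml({"id": "F_146", "userid": 74,
                   "attachment_name": "sample.pptx"})
```

The resulting string can then be POSTed with Content-Type: text/xml, exactly as the PHP above attempts.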
Re: how to Index and Search non-English Text in solr
Thanks Erick for your help. I have another silly question. Suppose I created multiple fieldTypes, e.g. news_English, news_Chinese, news_Japanese etc. After creating these fields, can I copy all of them to a copyField destination "defaultquery", like below:

<copyField source="news_English" dest="defaultquery"/>
<copyField source="news_Chinese" dest="defaultquery"/>
<copyField source="news_Japanese" dest="defaultquery"/>

and my defaultquery looks like:

<field name="defaultquery" type="query_text" indexed="false" stored="false" multiValued="true"/>

Is this the right way to deal with multiple-language indexing and searching?

On 9 June 2011 19:06, Erick Erickson erickerick...@gmail.com wrote:
No, you'd have to create multiple fieldTypes, one for each language.
Best
Erick

On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq shariqn...@gmail.com wrote:
Can I specify multiple languages in the filter tags in schema.xml? Like below:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
  </analyzer>
</fieldType>

On 8 June 2011 18:47, Erick Erickson erickerick...@gmail.com wrote:
This page is a handy reference for individual languages...
http://wiki.apache.org/solr/LanguageAnalysis

But the usual approach, especially for Chinese/Japanese/Korean (CJK), is to index the content in different fields with language-specific analyzers and then spread your search across the language-specific fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords in particular give surprising results if you put words from different languages in the same field.
Best
Erick

On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq shariqn...@gmail.com wrote:
Hi, I had set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in English, but my requirement extends to indexing news in other languages too. This is how my schema looks:

<field name="news" type="text" indexed="true" stored="false" required="false"/>

And the text fieldType in schema.xml looks like:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

My problem is: now I want to index news articles in other languages too, e.g. Chinese, Japanese.
How can I modify my text field so that I can index the news in other languages too and make it searchable?
Thanks
Shariq

--
Thanks and Regards
Mohammad Shariq
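Erick's per-language-field approach can be mechanized on the client side. A hedged sketch of building a dismax-style qf parameter that spreads one query across the language-specific fields (the news_English/news_Chinese/news_Japanese names come from this thread; the ^2.0 boost and parameter shape are illustrative assumptions, not from the thread):

```python
# Map language codes to the language-specific Solr fields from the thread.
LANG_FIELDS = {
    "en": "news_English",
    "zh": "news_Chinese",
    "ja": "news_Japanese",
}

def build_qf(boost_lang=None):
    """Return a dismax-style qf string over all language fields.

    If the query's language is known, that field gets an assumed ^2.0 boost
    so matches in the user's language rank first.
    """
    parts = []
    for lang, field in sorted(LANG_FIELDS.items()):
        boost = "^2.0" if lang == boost_lang else ""
        parts.append(field + boost)
    return " ".join(parts)

# Parameters you might send to /select with defType=dismax:
params = {"defType": "dismax", "q": "earthquake", "qf": build_qf(boost_lang="en")}
print(params["qf"])  # news_English^2.0 news_Japanese news_Chinese
```

This keeps each field under its own analyzer, which is the point of Erick's advice: the alternative of copyField-ing everything into one field runs all languages through a single analysis chain.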
Re: Multiple Values not getting Indexed
it did not work :(

On Thu, Jun 9, 2011 at 12:53 PM, Bill Bell billnb...@gmail.com wrote:
You have to take the input and splitBy something like "," to get it into an array and repost it back to Solr... I believe others have suggested that?

On 6/8/11 10:14 PM, Pawan Darira pawan.dar...@gmail.com wrote:
Hi, I am trying to index 2 fields with multiple values. But it is only putting 1 value for each, ignoring the rest of the values after the comma (,). I am fetching the query through DIH. It works fine if I have only 1 value in each of the 2 fields.

E.g.
Field1 - 150,178,461,151,310,306,305,179,137,162
Field2 - Chandigarh,Gurgaon,New Delhi,Ahmedabad,Rajkot,Surat,Mumbai,Nagpur,Pune,India - Others

Schema.xml:
<field name="city_type" type="text" indexed="true" stored="true"/>
<field name="city_desc" type="text" indexed="true" stored="true"/>

p.s. I tried multiValued="true" but it was of no help.
--
Thanks,
Pawan Darira
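Bill's suggestion corresponds to DIH's RegexTransformer with a splitBy="," on the column (plus multiValued="true" on the schema field). A small sketch of the splitting the transformer performs on such a column, assuming plain comma-delimited input like the Field1/Field2 examples above; note that "India - Others" contains a hyphen and spaces, so splitting strictly on the comma is what keeps it intact:

```python
def split_multivalue(raw, sep=","):
    """Mimic what DIH's RegexTransformer splitBy="," does to a column:
    one delimited string becomes a list of trimmed values."""
    return [v.strip() for v in raw.split(sep) if v.strip()]

field1 = split_multivalue("150,178,461,151,310,306,305,179,137,162")
field2 = split_multivalue("Chandigarh,Gurgaon,New Delhi,Ahmedabad,Rajkot,Surat,"
                          "Mumbai,Nagpur,Pune,India - Others")
print(len(field1), len(field2))  # 10 10
```

In the DIH data-config this is the field-level transformer (entity transformer="RegexTransformer", field column="city_type" splitBy=","); without it, the whole comma string is indexed as one value, which matches the symptom Pawan describes.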
Re: Multiple Values not getting Indexed
On Fri, Jun 10, 2011 at 10:36 AM, Pawan Darira pawan.dar...@gmail.com wrote: it did not work :( [...] Please provide more details of what you tried, what was the error, and any error messages that you got. Just saying that it did not work makes it pretty much impossible for anyone to help you. You might take a look at http://wiki.apache.org/solr/UsingMailingLists Regards, Gora
Re: ERROR on posting update request using CURL in php
Hi,

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'

Regards
Naveen

On Fri, Jun 10, 2011 at 10:18 AM, Naveen Gupta nkgiit...@gmail.com wrote:
[...]
Re: ERROR on posting update request using CURL in php
Hi,

Basically I need to post something like this using curl in php (the php example was explained in the earlier thread):

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">testdoc</field></doc></add>'

Do we need to create a temp file and use a PUT command, or can we do it using POST?

Regards
Naveen
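To the temp-file question: no file is needed; the XML can be sent directly as the POST body, which is exactly what curl's --data-binary and PHP's CURLOPT_POSTFIELDS both do. A sketch (Python for illustration) of assembling that body programmatically so quoting mistakes cannot leak into the XML prolog; the field names mirror the earlier message, and the URL/port are the example's own assumptions:

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc_fields):
    """Build a Solr <add><doc>...</doc></add> update body.

    Serializing via ElementTree handles escaping of &, <, > in values,
    so the body is guaranteed to start with '<' as the XML prolog requires.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        f = ET.SubElement(doc, "field", name=name)
        f.text = str(value)
    return ET.tostring(add, encoding="unicode")

body = build_add_xml({"id": "F_146", "userid": "74"})
print(body)  # <add><doc><field name="id">F_146</field><field name="userid">74</field></doc></add>
# To send: POST `body` to http://localhost:8080/solr/update with header
# Content-Type: text/xml, then commit (e.g. update?commit=true).
```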
Re: SolrCloud questions
I am also planning to move to SolrCloud; since it's still under development, I am not sure about its behavior in production. Please update us once you find it stable.

On 10 June 2011 03:56, Upayavira u...@odoko.co.uk wrote:
I'm exploring SolrCloud for a new project, and have some questions based upon what I've found so far. The setup I'm planning is going to have a number of multicore hosts, with cores being moved between hosts, and potentially with cores merging as they get older (cores are time-based, so once today has passed, they don't get updated).

First question: the solr/conf dir gets uploaded to ZooKeeper when you first start up, and using system properties you can specify a name to be associated with those conf files. How do you handle it when you have a multicore setup, and different configs for each core on your host?

Second question: can you query collections when using multicore? On a single core, I can query: http://localhost:8983/solr/collection1/select?q=blah On a multicore system I can query: http://localhost:8983/solr/core1/select?q=blah but I cannot work out a URL to query collection1 when I have multiple cores.

Third question: for replication, I'm assuming that replication in SolrCloud is still managed the same way as in non-cloud Solr, that is as ReplicationHandler config in solrconfig? In which case, I need a different config setup for each slave, as each slave has a different master (or can I delegate the decision as to which host/core is its master to ZooKeeper?)

Thanks for any pointers.

Upayavira
---
Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source

--
Thanks and Regards
Mohammad Shariq