Re: Is it possible to exclude results from other languages?
On Wed, Feb 10, 2010 at 10:09 AM, Lance Norskog wrote:
>
> Thanks for the pointer to ngramj (LGPL license), which then leads to
> another contender, http://tcatng.sourceforge.net/ (BSD license). The
> latter would make a great DIH Transformer that could go into contrib/
> (hint hint).

SOLR-1768 :)

--
Regards,
Shalin Shekhar Mangar.
Re: Question on Tokenizing email address
Thank you! It works very well. I think that the field type suggested by you will also index words like DOT, AT and com. In order to prevent these words from getting indexed, I have changed the field type and added the words dot and com to the stoplist file (at was already there). Is this correct?
--
View this message in context: http://old.nabble.com/Question-on-Tokenizing-email-address-tp27518673p27527033.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: "after flush: fdx size mismatch" on query during writes
We need more information. How big is the index in disk space? How many documents? How many fields? What's the schema? What OS? What Java version? Do you run this on a local hard disk or is it over an NFS mount? Does this software commit before shutting down? If you run with asserts on, do you get errors before this happens? Pass -ea:org.apache.lucene... as a JVM argument.

On Tue, Feb 9, 2010 at 5:08 PM, Acadaca wrote:
>
> We are using Solr 1.4 in a multi-core setup with replication.
>
> Whenever we write to the master we get the following exception:
>
> java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs 0
> length in bytes of _gqg.fdx file exists?=false
>         at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
>         at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
>
> Has anyone had any success debugging this one?
>
> thx.
> --
> View this message in context:
> http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Lance Norskog
goks...@gmail.com
Re: Solr/Drupal Integration - Query Question
The admin/form.jsp is supposed to prepopulate fl= with '*,score', which means bring back all fields and the calculated relevance score.

This is the Drupal search, decoded. I changed the %2B to + signs for readability. Have a look at the filter query fq= and the facet date range. Also, in Solr 1.4 the 'rord' function has become very slow, so the Drupal integration needs some updating anyway.

INFO: [] webapp=/solr path=/select params={spellcheck=true&
spellcheck.q=video&
fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&
bf=recip(rord(created),4,19,19)^200.0&
hl.simple.post=&hl.simple.pre=&hl=&version=1.2&
hl.fragsize=&hl.fl=&hl.snippets=&
facet=true&facet.limit=20&
facet.field=uid&facet.field=type&facet.field=language&
facet.mincount=1&
fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&
qf=name^3.0&facet.date=changed&
json.nl=map&wt=json&
f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&
f.changed.facet.date.end=2010-02-09T17:44:16Z+1HOUR/HOUR&
f.changed.facet.date.gap=+1HOUR&
rows=10&start=0&facet.sort=true&
q=video} hits=0 status=0 QTime=0

On Tue, Feb 9, 2010 at 1:28 PM, jaybytez wrote:
>
> I know this is not Drupal, but thought this question may be more around the
> Solr query.
>
> For instance, I pulled down LucidImagination's Solr install, just like the
> apache solr install, ran the example solr and loaded the documents from
> the exampledocs.
>
> I can go to:
>
> http://localhost:8983/solr/admin/
>
> and search for video and get responses.
>
> But on my solr, if I go to the full interface and use the defaults, I get no
> results back because of search fields, etc.
> http://localhost:8983/solr/admin/form.jsp
>
> So my admin Solr search query looks like this when searching "video":
>
> Feb 9, 2010 1:25:49 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={explainOther=&fl=&indent=on&start=0&q=video&hl.fl=&qt=&wt=&fq=&version=2.2&rows=10}
> hits=2 status=0 QTime=0
>
> But if I go into Drupal and search "video", this is the query and no results
> come back:
>
> Feb 9, 2010 1:27:33 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=video&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&bf=recip(rord(created),4,19,19)^200.0&f.changed.facet.date.gap=%2B1HOUR&hl.simple.post=&facet.field=uid&facet.field=type&facet.field=language&fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&hl.fragsize=&facet.mincount=1&qf=name^3.0&facet.date=changed&hl.fl=&json.nl=map&wt=json&f.changed.facet.date.end=2010-02-09T17:44:16Z%2B1HOUR/HOUR&rows=10&hl.snippets=&start=0&facet.sort=true&q=video}
> hits=0 status=0 QTime=0
>
> Any thoughts on the search query that gets generated by the Drupal/Solr
> module?
>
> Thanks...jay
> --
> View this message in context:
> http://old.nabble.com/Solr-Drupal-Integration---Query-Question-tp27522362p27522362.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Lance Norskog
goks...@gmail.com
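As a side note, Lance's manual decoding of %2B can be checked mechanically. A small Python sketch (the parameter fragment is copied from the log line above) showing why the + in Solr date math must be sent percent-encoded: in URL encoding, %2B decodes to a literal '+', while a raw '+' in a query string decodes to a space.

```python
from urllib.parse import parse_qs, unquote_plus

# %2B is the percent-encoding of '+'; a bare '+' in a query string means space.
assert unquote_plus("%2B1HOUR") == "+1HOUR"
assert unquote_plus("+1HOUR") == " 1HOUR"

# Decoding a fragment of the Drupal query from the log above:
params = parse_qs("f.changed.facet.date.gap=%2B1HOUR&q=video&rows=10")
print(params["f.changed.facet.date.gap"])  # ['+1HOUR']
```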
Re: Is it possible to exclude results from other languages?
That's what I was going to look up :) The nutch thing works reasonably well. It comes with a training database from various languages. It had some UTF-8 problems in the files. The trick here is to come up with a balanced volume of text for all languages so that one language's patterns do not overwhelm the others.

Thanks for the pointer to ngramj (LGPL license), which then leads to another contender, http://tcatng.sourceforge.net/ (BSD license). The latter would make a great DIH Transformer that could go into contrib/ (hint hint).

On Tue, Feb 9, 2010 at 7:21 AM, Jan Høydahl / Cominvent wrote:
> Much more efficient to tag documents with language at index time. Look for
> language identification tools such as
> http://www.sematext.com/products/language-identifier/index.html or
> http://ngramj.sourceforge.net/ or
> http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
>
> --
> Jan Høydahl - search architect
> Cominvent AS - www.cominvent.com
>
> On 9. feb. 2010, at 05.19, Lance Norskog wrote:
>
>> There is
>>
>> On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch wrote:
>>>
>>> Yes, it's true that we could do it at index time if we had a way to know. I
>>> was thinking of some solution at search time, maybe measuring the % of
>>> stopwords in each document. Normally, a document in another language won't
>>> have any stopwords of the index's main language.
>>>
>>> If you know some external software to detect the language of a source text,
>>> it would be useful too.
>>>
>>> Thanks,
>>> Raimon Bosch.
>>>
>>> Ahmet Arslan wrote:
>>>> > In our indexes, sometimes we have some documents written in
>>>> > other languages, different from the index's most common language. Is
>>>> > there any way to give less boosting to these documents?
>>>>
>>>> If you are aware of those documents, at index time you can boost those
>>>> documents with a value less than 1.0:
>>>> // document written in other languages
>>>> ...
>>>> ...
>>>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>> --
>> Lance Norskog
>> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
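Raimon's search-time idea (measure the percentage of stopwords from the index's main language in each document) can be sketched in a few lines. This is a hedged illustration, not any of the tools linked above; the stopword list is a tiny stand-in for a real one:

```python
# Heuristic language check: what fraction of a document's tokens are
# stopwords of the index's main language? Documents in another language
# should score near zero. Tiny illustrative stopword list, not a real one.
ENGLISH_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}

def stopword_ratio(text: str, stopwords=ENGLISH_STOPWORDS) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)

english = "the cat is in the garden and it is raining"
spanish = "el gato esta en el jardin y esta lloviendo"
assert stopword_ratio(english) > 0.5
assert stopword_ratio(spanish) == 0.0
```

At query time such a score could drive a boost below 1.0 for out-of-language documents, as Ahmet suggests doing at index time.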
Re: Autosuggest and highlighting
To select the whole string, I think you want hl.fragmenter=regex and to create a regex pattern for your entire strings:

http://www.lucidimagination.com/search/document/CDRG_ch07_7.9?q=highlighter+multi-valued

This will let you select the entire string field. But I don't know how to avoid the non-matching prefixes. That's a really interesting quirk of highlighting.

On Tue, Feb 9, 2010 at 6:18 AM, gwk wrote:
> On 2/9/2010 2:57 PM, Ahmet Arslan wrote:
>>> I'm trying to improve the search box on our website by adding an
>>> autosuggest field. The dataset is a set of properties in the world
>>> (mostly Europe) and the searchbox is intended to be filled with a
>>> country-, region- or city name. To do this I've created a separate,
>>> simple core with one document per geographic location; for example the
>>> document for the country "France" contains several fields, including
>>> the number of properties (so we can show the approximate amount of
>>> results in the autosuggest box), the name of the country France in
>>> several languages and some other bookkeeping information. The name of
>>> the property is stored in two fields: "name", which simply contains the
>>> canonical name of the country, region or city, and "names", which is a
>>> multivalued field containing the name in several different languages.
>>> Both fields use an EdgeNGramFilter during analysis so the query "Fr"
>>> can match "France".
>>>
>>> This all seems to work, the autosuggest box gives appropriate suggestions.
>>> But when I turn on highlighting, the results are less than desirable;
>>> for example the query "rho" using dismax (and hl.snippets=5) returns
>>> the following:
>>>
>>> Région Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>>
>>> Région Rhône-Alpes
>>>
>>> Département du Rhône
>>> Département du Rhône
>>> Rhône
>>> Département du Rhône
>>> Rhône
>>>
>>> Département du Rhône
>>>
>>> As you can see, no matter where the match is, the first 3 characters
>>> are highlighted. Obviously not correct for many of the fields. Is this
>>> because of the NGramFilterFactory or am I doing something wrong?
>>
>> I used https://issues.apache.org/jira/browse/SOLR-357 for this some time
>> ago. It was giving correct highlights.
>
> I just ran a test with the NGramFilter removed (and reindexing) which did
> give correct highlighting results, but I had to query using the whole word.
> I'll try the PrefixingFilterFactory next, although according to the comments
> it's nothing but a subset of the EdgeNGramFilterFactory, so unless I'm
> configuring it wrong it should yield the same results...
>
>> However, we are now using
>> http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It
>> automatically makes matching characters bold without using solr
>> highlighting.
>
> Using a pure javascript-based solution isn't really an option for us, as that
> wouldn't work for the diacritical marks without a lot of transliteration
> brouhaha.
>
> Regards,
>
> gwk

--
Lance Norskog
goks...@gmail.com
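A hedged sketch of the request parameters Lance is describing (the field name `names` and the catch-all pattern are illustrative choices, not from the thread; `hl.fragmenter=regex` and `hl.regex.pattern` are the regex fragmenter's parameters in the Solr 1.4 highlighter):

```
q=rho&hl=true&hl.fl=names&hl.fragmenter=regex&hl.regex.pattern=.*&hl.snippets=5
```

With a pattern that matches the whole stored value, each highlighted fragment becomes the entire string rather than a window around the match.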
analysing wildcarded terms
Hello *, quick question: what would I have to change in the query parser to allow wildcarded terms to go through text analysis?
Re: Distributed search and haproxy and connection build up
This goes through the Apache Commons HTTP client library: http://hc.apache.org/httpclient-3.x/

We used 'balance' on another project and did not have any problems.

On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor wrote:
> I have been using distributed search with haproxy but noticed that I am
> suffering a little from tcp connections building up waiting for the OS level
> closing/time out:
>
> netstat -a
> ...
> tcp6  1  0 10.0.16.170%34654:53789  10.0.16.181%363574:8893  CLOSE_WAIT
> tcp6  1  0 10.0.16.170%34654:43932  10.0.16.181%363574:8890  CLOSE_WAIT
> tcp6  1  0 10.0.16.170%34654:43190  10.0.16.181%363574:8895  CLOSE_WAIT
> tcp6  0  0 10.0.16.170%346547:8984  10.0.16.181%36357:53770  TIME_WAIT
> tcp6  1  0 10.0.16.170%34654:41782  10.0.16.181%363574:      CLOSE_WAIT
> tcp6  1  0 10.0.16.170%34654:52169  10.0.16.181%363574:8890  CLOSE_WAIT
> tcp6  1  0 10.0.16.170%34654:55947  10.0.16.181%363574:8887  CLOSE_WAIT
> tcp6  0  0 10.0.16.170%346547:8984  10.0.16.181%36357:54040  TIME_WAIT
> tcp6  1  0 10.0.16.170%34654:40030  10.0.16.160%363574:8984  CLOSE_WAIT
> ...
>
> Digging a little into the haproxy documentation, it seems that they do not
> support persistent connections.
>
> Does solr normally persist the connections between shards (would this
> problem happen even without haproxy)?
>
> Ian.

--
Lance Norskog
goks...@gmail.com
Re: Posting pdf file and posting from remote
stream.file= means read a local file from the server that Solr runs on. It has to be a complete path that works from that server.

To load the file over HTTP you have to use @filename to have curl open it. This path has to work from the machine you run curl on, and relative paths work.

Also, Tika does not save the PDF binary; it only pulls words out of the PDF and stores those. There's a Tika example in solr/trunk/example/exampleDIH in the current Solr trunk. (I don't remember if it's in the Solr 1.4 release.) With this you can save the PDF binary in one field and save the extracted text in another field. I'm doing this now with HTML.

On Tue, Feb 9, 2010 at 2:08 AM, alendo wrote:
>
> Ok, I'm going ahead (maybe :).
> I tried another curl command to send the file from remote:
>
> http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf
>
> and the behaviour has changed: now I get an error in the solr log file:
>
> HTTP Status 500 - files/attach-8514.pdf (No such file or directory)
> java.io.FileNotFoundException: files/attach-8514.pdf (No such file or directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.(FileInputStream.java:106)
>         at org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108)
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>         at ...
>
> etc etc...
> --
> View this message in context:
> http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512952.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Lance Norskog
goks...@gmail.com
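Lance's two options can be written as curl commands. This is a hedged sketch, not the thread's exact commands: the host, port, absolute path and form-field name are illustrative assumptions; only /update/extract, stream.file, stream.contentType and curl's @filename syntax come from the discussion.

```
# Option 1: stream.file= -- the path is resolved ON THE SOLR SERVER,
# so it must be a complete path that exists there.
curl "http://localhost:8983/solr/update/extract?literal.id=8514&stream.contentType=application/pdf&stream.file=/data/files/attach-8514.pdf"

# Option 2: upload the file over HTTP from the machine running curl;
# @filename makes curl open and send the local file (relative paths work here).
curl "http://localhost:8983/solr/update/extract?literal.id=8514" \
     -F "myfile=@files/attach-8514.pdf;type=application/pdf"
```

The FileNotFoundException in the quoted log is the first case: Solr itself tried to open the relative path files/attach-8514.pdf on its own filesystem.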
Re: Indexing / querying multiple data types
A couple of minor problems: the qt parameter (Que Tee) selects the request handler for the q (Q for query) parameter. I think you mean 'qf': http://wiki.apache.org/solr/DisMaxRequestHandler#qf_.28Query_Fields.29

Another problem with atomID, atomId, atomid: Solr field names are case-sensitive. I don't know how this plays out.

Now, to the main part: the entity definition does not create a column named name1. The two queries only populate the same namespace of four fields: id, atomID, name, description. If you want data from each entity to have a constant field distinguishing it, you have to create a new field with a constant value. You do this with the TemplateTransformer: http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer

Add transformer="TemplateTransformer" as an attribute to both entities, and add a column with a constant template value to each entity ("name1" for one, "name2" for the other). You may have to do something else for these to appear in the document.

On Tue, Feb 9, 2010 at 12:41 AM, wrote:
> Sven
>
> In my data-config.xml I have the following
>
> In my schema.xml I have
> />
> required="true" />
>
> And in my solrconfig.xml I have
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> data-config.xml
>
> dismax
> explicit
> 0.01
> name^1.5 description^1.0
>
> dismax
> explicit
> 0.01
> name^1.5 description^1.0
>
> And the
>
> Has been untouched
>
> So when I run
> http://localhost:7001/solr/select/?q=food&qt=name1
> I was expecting to get results from the data that had been indexed by name="name1"
>
> Regards
> Stefan Maric

--
Lance Norskog
goks...@gmail.com
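A hedged sketch of the data-config change Lance describes (the entity names follow Stefan's setup; the SQL queries and the column name entityName are illustrative assumptions, and the field/template syntax is the DataImportHandler's TemplateTransformer form):

```xml
<!-- Illustrative DIH entities: TemplateTransformer stamps a constant value
     into a new column so documents from each entity can be told apart. -->
<entity name="name1" transformer="TemplateTransformer"
        query="select id, atomID, name, description from table1">
  <field column="entityName" template="name1"/>
</entity>
<entity name="name2" transformer="TemplateTransformer"
        query="select id, atomID, name, description from table2">
  <field column="entityName" template="name2"/>
</entity>
```

The entityName column would also need a matching field (or dynamicField) in schema.xml to appear in the indexed document.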
Copying dynamic fields into default text field messing up fieldNorm?
Hi All,

I'm trying to create an index of documents where, for each document, I associate a set of related keywords, each with an individual boost value that I compute externally. E.g.:

Document Title: Democrats
related keywords:
  liberal: 4.0
  politics: 1.5
  obama: 2.0
  etc. (hundreds of related keywords)

Since boosts in Solr are per field instead of per field-instance, I am trying to get around this by creating dynamic fields for each related keyword and setting boost values accordingly. To be able to surface this document by searching the related keywords, I have the schema set up to copy these related keyword fields into the default text field. But when I query any of these related keywords, I get back fieldNorms with the max value:

1.5409492E10 = (MATCH) weight(text:liberal in 11), product of:
  0.8608541 = queryWeight(text:liberal), product of:
    1.6840147 = idf(docFreq=109, maxDocs=218)
    0.51119155 = queryNorm
  1.79002368E10 = (MATCH) fieldWeight(text:liberal in 11), product of:
    1.4142135 = tf(termFreq(text:liberal)=2)
    1.6840147 = idf(docFreq=109, maxDocs=218)

According to this email exchange between Koji and Mat Brown, http://www.mail-archive.com/solr-user@lucene.apache.org/msg23759.html, the boost value from copyFields shouldn't be accumulated into the boost for the text field. Can anyone else verify this? This seems to go against what I'm observing. When I turn off copyField, the fieldNorm goes back to normal (in the single-digit range).

Any idea what could be causing this? I'm running Solr 1.4 in case that matters. Any pointers/advice would be greatly appreciated!

Thanks,
Yu-Shan
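Working backwards from the explain output above shows how large the implied fieldNorm is. A small sketch, assuming Lucene's DefaultSimilarity decomposition fieldWeight = tf * idf * fieldNorm (the numbers are copied from the trace):

```python
import math

# Values copied from the explain trace above.
field_weight = 1.79002368e10          # fieldWeight(text:liberal in 11)
tf = 1.4142135                        # sqrt(termFreq=2)
idf = 1.6840147                       # idf(docFreq=109, maxDocs=218)

# fieldWeight = tf * idf * fieldNorm, so the norm Lucene applied is:
implied_norm = field_weight / (tf * idf)
print(implied_norm)                   # ~7.5e9 -- enormous

# For comparison, a normal lengthNorm is 1/sqrt(number of terms),
# i.e. always <= 1.0; e.g. a 100-term field:
print(1 / math.sqrt(100))             # 0.1
```

A norm around 7.5e9 instead of <= 1.0 is consistent with the per-field boosts being multiplied together somewhere during the copy into the text field.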
Re: Bigram term vectors and weights possible with Solr?
Thank you Ahmet, this is exactly what I was looking for. It looks like the shingle filter can produce 3+-gram terms as well, which is great. I'm going to try this with both western and CJK language tokenizers and see how it turns out.

On Tue, Feb 9, 2010 at 5:07 PM, Ahmet Arslan wrote:
>> I've been looking at the Solr TermVectorComponent
>> (http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
>> something similar to this, but it looks to me like this is a component
>> that is processed at query time (?) and is limited to 1-gram terms.
>
> If you use the shingle filter with outputUnigrams="false" it can give you
> info about 2-gram terms.
>
>> Also, the tf/idf scores are a little different as they come
>> back in integer values as separate components.
>
> In the wiki example output, only tf and df values - which are integers - are
> displayed. You can calculate tf*idf (double) with these parameters:
>
> &qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true
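As a side note, a tf*idf weight can be derived from the integer tf and df values the component returns. A hedged sketch using a textbook formulation (sublinear tf, smoothed idf); this is illustrative and not necessarily the exact formula behind Solr's tv.tf_idf output:

```python
import math

def tf_idf(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Textbook tf*idf from the integer counts the TermVectorComponent
    returns. Illustrative formula; Solr's own tv.tf_idf output may be
    computed differently."""
    tf = math.sqrt(term_freq)                        # sublinear tf
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # smoothed idf
    return tf * idf

# A rarer term should outweigh a common one at equal term frequency:
common = tf_idf(2, 109, 218)
rare = tf_idf(2, 5, 218)
assert rare > common > 0.0
```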
Re: Solr usage with Auctions/Classifieds?
The class was added in 2007 and hasn't changed. I don't know if anyone uses it. Presumably sort-by-function will use it.

On Tue, Feb 9, 2010 at 5:59 AM, Jan Høydahl / Cominvent wrote:
> With the new sort by function in 1.5
> (http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function), will it now be
> possible to include the ExternalFileField value in the sort formula? If so,
> we could sort on last bid price or last bid time without updating the
> document itself.
>
> However, to display the result with the fresh values, we need to go to the DB,
> or is there someone working on the possibility to return ExternalFileField
> values for result view?
>
> --
> Jan Høydahl - search architect
> Cominvent AS - www.cominvent.com
>
> On 4. feb. 2010, at 06.25, Lance Norskog wrote:
>
>> Oops, forgot to add the link:
>>
>> http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4
>>
>> On Wed, Feb 3, 2010 at 9:17 PM, Andy wrote:
>>> How do I set up and use this external file?
>>>
>>> Can I still use such a field in fq or boost?
>>>
>>> Can you point me to the right documentation? Thanks
>>>
>>> --- On Wed, 2/3/10, Lance Norskog wrote:
>>>
>>> From: Lance Norskog
>>> Subject: Re: Solr usage with Auctions/Classifieds?
>>> To: solr-user@lucene.apache.org
>>> Date: Wednesday, February 3, 2010, 10:03 PM
>>>
>>> This field type allows you to have an external file that gives a float
>>> value for a field. You can only use functions on it.
>>>
>>> On Sat, Jan 30, 2010 at 7:05 AM, Jan Høydahl / Cominvent wrote:
>>>> A follow-up on the auction use case. How do you handle the need for
>>>> frequent updates of only one field, such as the last bid field (needed
>>>> for sort on price, facets or range)? For high traffic sites, the
>>>> document update rate becomes very high if you re-send the whole
>>>> document every time the bid price changes.
>>>>
>>>> --
>>>> Jan Høydahl - search architect
>>>> Cominvent AS - www.cominvent.com
>>>>
>>>> On 10. des. 2009, at 19.52, Grant Ingersoll wrote:
>>>>>
>>>>> On Dec 8, 2009, at 6:37 PM, regany wrote:
>>>>>>
>>>>>> hello!
>>>>>>
>>>>>> just wondering if anyone is using Solr as their search for an auction /
>>>>>> classified site, and if so how have you managed your setup in general?
>>>>>> ie. searching against listings that may have expired etc.
>>>>>
>>>>> I know several companies using Solr for classifieds/auctions. Some
>>>>> remove the old listings while others leave them in and filter them, or
>>>>> even allow users to see old stuff (but often for reasons other than users
>>>>> finding them, i.e. SEO). For those that remove, it's typically a batch
>>>>> operation that takes place at night.
>>>>>
>>>>> --
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>> using Solr/Lucene: http://www.lucidimagination.com/search
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>
>> --
>> Lance Norskog
>> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
"after flush: fdx size mismatch" on query during writes
We are using Solr 1.4 in a multi-core setup with replication.

Whenever we write to the master we get the following exception:

java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs 0 length in bytes of _gqg.fdx file exists?=false
        at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
        at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)

Has anyone had any success debugging this one?

thx.
--
View this message in context: http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Bigram term vectors and weights possible with Solr?
> I've been looking at the Solr TermVectorComponent
> (http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
> something similar to this, but it looks to me like this is a component
> that is processed at query time (?) and is limited to 1-gram terms.

If you use the shingle filter with outputUnigrams="false" it can give you info about 2-gram terms.

> Also, the tf/idf scores are a little different as they come
> back in integer values as separate components.

In the wiki example output, only tf and df values - which are integers - are displayed. You can calculate tf*idf (double) with these parameters:

&qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true
Bigram term vectors and weights possible with Solr?
Hello, One of the commercial search platforms I work with has the concept of 'document vectors', which are 1-gram and 2-gram phrases and their associated tf/idf weights on a 0-1 scale, i.e. ["banana pie", 0.99] means banana pie is very relevant for this document. During the ingest/indexing process you can configure the engine to store the top N vectors (those with the highest weights) from a document into a field that is indexed along with the original content and is returned in a result set. This is great for reporting and other statistical analysis, and even some basic result clustering at query time. I've been looking at the Solr TermVectorComponent (http://wiki.apache.org/solr/TermVectorComponent) and it seems to have something similar to this, but it looks to me like this is a component that is processed at query time (?) and is limited to 1-gram terms. Also, the tf/idf scores are a little different as they come back in integer values as separate components. Does anyone know if Solr/Lucene has anything like what the commercial platform has as I described above? Thanks, appreciate any responses. Michael Hughes Lightcrest LLC
How to add SpellCheckResponse to Solritas?
Hi,

I'm using the /itas requestHandler and would like to add spell-check suggestions to the output. I have spell-check configured and working in the XML response writer, but nothing is output in Velocity. Debugging the JSON $response object, I cannot find any representation of the spellcheck response in there. Where do I plug that in?

--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com
Re: Question on Tokenizing email address
Hi,

To match 1, 2, 3, 4 below you could use a fieldtype based on TextField with just a simple WordDelimiterFilterFactory. However, this would also match abc-def, def.alpha, xyz-com and a...@def, because all punctuation is treated the same. To avoid this, you could do some custom handling of "-", "." and "@". You will see that this splits "foo.bar@apache.org" into "foo DOT bar AT apache DOT org" on both the index and query side, and thus avoids false matches as above.

To support the "must match" case, you could use the "lowercase" fieldtype, which will give a case-insensitive match against the whole content of the field only.

--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 18.13, Abhishek Srivastava wrote:

> Hello Everyone,
>
> I have a field in my solr schema which stores emails. The way I want the
> emails to be tokenized is like this:
> if the email address is abc@alpha-xyz.com
> the user should be able to search on
>
> 1. abc@alpha-xyz.com (whole address)
> 2. abc
> 3. def
> 4. alpha-xyz
>
> Which tokenizer should I use?
>
> Also, is there a feature like "Must Match" in solr? In my schema there is
> a field called "from" which contains the email address of the person who
> sent an email. For this field, I don't want any tokenization. When the user
> issues a search, the user's email ID must exactly match the "from" column
> value for that document/record to be returned.
> How can I do this?
>
> Regards,
> Abhishek
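Jan's schema snippet was stripped by the list archiver. A hedged reconstruction of the idea, not his original config: the analyzer chain below is an illustrative sketch that uses Solr's PatternReplaceCharFilterFactory to expand "." and "@" into standalone DOT/AT tokens before whitespace tokenization, producing the split he describes:

```xml
<!-- Illustrative sketch, not the original config: expand '.' and '@'
     into standalone DOT/AT tokens, then split on whitespace, so that
     foo.bar@apache.org indexes as: foo DOT bar AT apache DOT org.
     Hyphens are left alone, so alpha-xyz stays one token. -->
<fieldType name="email" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\." replacement=" DOT "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="@" replacement=" AT "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because the same chain runs at index and query time, a query for abc-def cannot accidentally match the DOT/AT separators the way a plain word-delimiter split would.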
Solr/Drupal Integration - Query Question
I know this is not Drupal, but thought this question may be more around the Solr query.

For instance, I pulled down LucidImagination's Solr install, just like the apache solr install, ran the example solr and loaded the documents from the exampledocs.

I can go to:

http://localhost:8983/solr/admin/

and search for video and get responses.

But on my solr, if I go to the full interface and use the defaults, I get no results back because of search fields, etc.

http://localhost:8983/solr/admin/form.jsp

So my admin Solr search query looks like this when searching "video":

Feb 9, 2010 1:25:49 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={explainOther=&fl=&indent=on&start=0&q=video&hl.fl=&qt=&wt=&fq=&version=2.2&rows=10} hits=2 status=0 QTime=0

But if I go into Drupal and search "video", this is the query and no results come back:

Feb 9, 2010 1:27:33 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=video&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&bf=recip(rord(created),4,19,19)^200.0&f.changed.facet.date.gap=%2B1HOUR&hl.simple.post=&facet.field=uid&facet.field=type&facet.field=language&fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&hl.fragsize=&facet.mincount=1&qf=name^3.0&facet.date=changed&hl.fl=&json.nl=map&wt=json&f.changed.facet.date.end=2010-02-09T17:44:16Z%2B1HOUR/HOUR&rows=10&hl.snippets=&start=0&facet.sort=true&q=video} hits=0 status=0 QTime=0

Any thoughts on the search query that gets generated by the Drupal/Solr module?

Thanks...jay
--
View this message in context: http://old.nabble.com/Solr-Drupal-Integration---Query-Question-tp27522362p27522362.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
On Tue, Feb 9, 2010 at 2:56 PM, Tom Burton-West wrote:
> I'm not sure I understand. CheckIndex reported a negative number:
> -16777214.

Right, we are overflowing the positive ints, which wraps around to the smallest int (-2.1 billion), and then dividing by 128 = ~ -16777214.

Lucene has an array of the indexed (every 128th) terms, keyed by int, and it has an API to seek to any of those indexed terms. The problem is, in setting the position (a long) in the term enum, it multiplies 128 by the index term, but fails to do this as a long multiply, so it overflows.

I think your index isn't actually corrupt... it's just a limitation in Lucene that hopefully the patch will fix.

Mike
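The arithmetic Mike describes can be checked directly. A Python sketch (Java 32-bit int semantics emulated by hand; the actual seek code lives in Lucene's TermInfosReader and is not reproduced here):

```python
# Illustration of the 32-bit overflow described above (not actual Lucene code).
# Lucene keys its in-memory term index by int and multiplies the index
# position by the term index interval (default 128) to seek; done as an
# int multiply, the product wraps past 2**31 - 1 into negative territory.
TERM_INDEX_INTERVAL = 128

def java_int_mul(a: int, b: int) -> int:
    """Multiply like a Java 32-bit int: wrap around on overflow."""
    r = (a * b) & 0xFFFFFFFF
    return r - 0x100000000 if r >= 0x80000000 else r

# 16777214 indexed terms is still (barely) safe:
assert java_int_mul(16_777_214, TERM_INDEX_INTERVAL) == 2_147_483_392

# A few indexed terms later the multiply wraps negative -- dividing the
# wrapped value by 128 gives numbers like the -16777214 in the trace:
wrapped = java_int_mul(16_777_218, TERM_INDEX_INTERVAL)
assert wrapped < 0
assert wrapped // TERM_INDEX_INTERVAL == -16_777_214
```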
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
I attached a patch to the issue that may fix it. Maybe start by running CheckIndex first?

Mike

On Tue, Feb 9, 2010 at 2:56 PM, Tom Burton-West wrote:
>
> Thanks Michael,
>
> I'm not sure I understand. CheckIndex reported a negative number:
> -16777214.
>
> But in any case we can certainly try running CheckIndex from a patched
> lucene. We could also run a patched lucene on our dev server.
>
> Tom
>
>> Yes, the term count reported by CheckIndex is the total number of unique
>> terms.
>>
>> It indeed looks like you are exceeding the unique term count limit --
>> 16777214 * 128 (= the default term index interval) is 2147483392, which
>> is mighty close to the max/min 32-bit int value. This makes sense,
>> because CheckIndex steps through the terms in order, one by one. So
>> the first term just over the limit triggered the exception.
>>
>> Hmm -- can you try a patched Lucene in your area? I have one small
>> change to try that may increase the limit to termIndexInterval
>> (default 128) * 2.1 billion.
>>
>> Mike
>
> --
> View this message in context:
> http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p2752.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
Thanks Michael,

I'm not sure I understand. CheckIndex reported a negative number: -16777214.

But in any case we can certainly try running CheckIndex from a patched lucene. We could also run a patched lucene on our dev server.

Tom

> Yes, the term count reported by CheckIndex is the total number of unique
> terms.
>
> It indeed looks like you are exceeding the unique term count limit --
> 16777214 * 128 (= the default term index interval) is 2147483392, which
> is mighty close to the max/min 32-bit int value. This makes sense,
> because CheckIndex steps through the terms in order, one by one. So
> the first term just over the limit triggered the exception.
>
> Hmm -- can you try a patched Lucene in your area? I have one small
> change to try that may increase the limit to termIndexInterval
> (default 128) * 2.1 billion.
>
> Mike

--
View this message in context: http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p2752.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
I opened a Lucene issue w/ patch to try:

https://issues.apache.org/jira/browse/LUCENE-2257

Tom, let me know if you're able to test this... thanks!

Mike

On Tue, Feb 9, 2010 at 2:09 PM, Michael McCandless wrote:
> Yes, the term count reported by CheckIndex is the total number of unique
> terms.
>
> It indeed looks like you are exceeding the unique term count limit --
> 16777214 * 128 (= the default term index interval) is 2147483392, which
> is mighty close to the max/min 32-bit int value. This makes sense,
> because CheckIndex steps through the terms in order, one by one. So
> the first term just over the limit triggered the exception.
>
> Hmm -- can you try a patched Lucene in your area? I have one small
> change to try that may increase the limit to termIndexInterval
> (default 128) * 2.1 billion.
>
> Mike
>
> On Tue, Feb 9, 2010 at 12:23 PM, Tom Burton-West wrote:
>>
>> Thanks Lance and Michael,
>>
>> We are running Solr 1.3.0.2009.09.03.11.14.39 (complete version info
>> from the Solr admin panel appended below).
>>
>> I tried running CheckIndex (with the -ea: switch) on one of the shards.
>> CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
>> segment containing 500K+ documents. (Complete CheckIndex output appended
>> below.)
>>
>> Is it likely that all 10 shards are corrupted? Is it possible that we
>> have simply exceeded some Lucene limit?
>>
>> I'm wondering if we could have exceeded the Lucene limit of 2.1 billion
>> unique terms, as mentioned towards the end of the Lucene Index File
>> Formats document. If the small 731-document index has nine million unique
>> terms as reported by CheckIndex, then even though many terms are repeated,
>> it is conceivable that the 500,000-document index could have more than
>> 2.1 billion terms.
>>
>> Do you know if the number of terms reported by CheckIndex is the number
>> of unique terms?
>>
>> On the other hand, we previously optimized a 1 million document index
>> down to 1 segment and had no problems. That was with an earlier version
>> of Solr and did not include CommonGrams, which could conceivably increase
>> the number of terms in the index by 2 or 3 times.
>>
>> Tom
>> ---
>>
>> Solr Specification Version: 1.3.0.2009.09.03.11.14.39
>> Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
>> Lucene Specification Version: 2.9-dev
>> Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55
>>
>> [tburt...@slurm-4 ~]$ java -Xmx4096m -Xms4096m -cp
>> /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
>> -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
>> /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>>
>> Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>>
>> Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>  1 of 2: name=_29dn docCount=554799
>>    compound=false
>>    hasProx=true
>>    numFiles=9
>>    size (MB)=267,131.261
>>    diagnostics = {optimize=true, mergeFactor=2,
>> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
>> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
>> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>    has deletions [delFileName=_29dn_7.del]
>>    test: open reader.........OK [184 deleted docs]
>>    test: fields, norms.......OK [6 fields]
>>    test: terms, freq, prox...FAILED
>>    WARNING: fixIndex() would remove reference to this segment; full exception:
>> java.lang.ArrayIndexOutOfBoundsException: -16777214
>>        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
>>        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>>        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
>>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)
>>
>>  2 of 2: name=_29im docCount=731
>>    compound=false
>>    hasProx=true
>>    numFiles=8
>>    size (MB)=421.261
>>    diagnostics = {optimize=true, mergeFactor=3,
>> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
>> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
>> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>    no deletions
>>    test: open reader.........OK
>>    test: fields, norms.......OK [6 fields]
>>    test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs; 144869629 tokens]
>>    test: stored fields.......OK [3550 total field count; avg 4.856 fields per doc]
>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>
>> WARNING: 1 broken
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
Yes, the term count reported by CheckIndex is the total number of unique terms. It indeed looks like you are exceeding the unique term count limit -- 16777214 * 128 (= the default term index interval) is 2147483392 which is mighty close to max/min 32 bit int value. This makes sense, because CheckIndex steps through the terms in order, one by one. So the first term just over the limit triggered the exception. Hmm -- can you try a patched Lucene in your area? I have one small change to try that may increase the limit to termIndexInterval (default 128) * 2.1 billion. Mike On Tue, Feb 9, 2010 at 12:23 PM, Tom Burton-West wrote: > > Thanks Lance and Michael, > > > We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from > Solr admin panel appended below) > > I tried running CheckIndex (with the -ea: switch ) on one of the shards. > CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger > segment containing 500K+ documents. (Complete CheckIndex output appended > below) > > Is it likely that all 10 shards are corrupted? Is it possible that we have > simply exceeded some lucene limit? > > I'm wondering if we could have exceeded the lucene limit of unique terms of > 2.1 billion as mentioned towards the end of the Lucene Index File Formats > document. If the small 731 document index has nine million unique terms as > reported by check index, then even though many terms are repeated, it is > concievable that the 500,000 document index could have more than 2.1 billion > terms. > > Do you know if the number of terms reported by CheckIndex is the number of > unique terms? > > On the other hand, we previously optimized a 1 million document index down > to 1 segment and had no problems. That was with an earlier version of Solr > and did not include CommonGrams which could conceivably increase the number > of terms in the index by 2 or 3 times. 
> > > Tom > --- > > Solr Specification Version: 1.3.0.2009.09.03.11.14.39 > Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 > 11:14:39 > Lucene Specification Version: 2.9-dev > Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55 > > > [tburt...@slurm-4 ~]$ java -Xmx4096m -Xms4096m -cp > /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib > -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex > /l/solrs/1/.snapshot/serve-2010-02-07/data/index > > Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index > > Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene > 2.9] > 1 of 2: name=_29dn docCount=554799 > compound=false > hasProx=true > numFiles=9 > size (MB)=267,131.261 > diagnostics = {optimize=true, mergeFactor=2, > os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, > lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, > os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} > has deletions [delFileName=_29dn_7.del] > test: open reader.OK [184 deleted docs] > test: fields, norms...OK [6 fields] > test: terms, freq, prox...FAILED > WARNING: fixIndex() would remove reference to this segment; full > exception: > java.lang.ArrayIndexOutOfBoundsException: -16777214 > at > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246) > at > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218) > at > org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57) > at > org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474) > at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715) > > 2 of 2: name=_29im docCount=731 > compound=false > hasProx=true > numFiles=8 > size (MB)=421.261 > diagnostics = {optimize=true, mergeFactor=3, > os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, > 
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, > os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} > no deletions > test: open reader.OK > test: fields, norms...OK [6 fields] > test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs; > 144869629 tokens] > test: stored fields...OK [3550 total field count; avg 4.856 fields > per doc] > test: term vectorsOK [0 total vector count; avg 0 term/freq > vector fields per doc] > > WARNING: 1 broken segments (containing 554615 documents) detected > WARNING: would write new segments file, and 554615 documents would be lost, > if -fix were specified > > > [tburt...@slurm-4 ~]$ > > > The index is corrupted. In some places ArrayIndex and NPE are not > wrapped as CorruptIndexException. > > Try running your code with the Lucene assertions on. A
RE: HTTP caching and distributed search
I tried your suggestion, Hoss, but committing to the new coordinator core doesn't change the indexVersion and therefore the ETag value isn't changed. I opened a new JIRA issue for this http://issues.apache.org/jira/browse/SOLR-1765 Thanks, Charlie -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, February 04, 2010 2:16 PM To: solr-user@lucene.apache.org Subject: Re: HTTP caching and distributed search : > http://localhost:8080/solr/core1/select/?q=google&start=0&rows=10&shards : > =localhost:8080/solr/core1,localhost:8080/solr/core2 : You are right, etag is calculated using the searcher on core1 only and it : does not take other shards into account. Can you open a Jira issue? ...as a possible work arround i would suggest creating a seperate "coordinator" core that is neither core1 nor core2 ... it doesn't have to have any docs in it, it just has to have consistent schemas with the other two cores. That way you can use a distinct settings on the coordinator core (perhaps never304="true" but with an explicit setting? ... or lastModifiedFrom="openTime" and then you could send an explicit "commit" to the (empty) coordinator core anytime you modify one of the shards. -Hoss
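For context on why the workaround fails: the ETag is derived from the searcher's index version (plus a seed), so a commit that does not actually change the index cannot produce a new ETag. A purely illustrative sketch of that dependency -- this is not Solr's actual implementation:

```java
public class EtagSketch {
    // Illustrative only: if the ETag is a pure function of indexVersion,
    // an empty commit (indexVersion unchanged) yields the same ETag.
    static String etag(long indexVersion, String etagSeed) {
        return Long.toHexString(indexVersion + etagSeed.hashCode());
    }

    public static void main(String[] args) {
        System.out.println(etag(42L, "Solr")); // stable while the index is unchanged
        System.out.println(etag(43L, "Solr")); // changes only when indexVersion does
    }
}
```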
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
Thanks Lance and Michael, We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from Solr admin panel appended below) I tried running CheckIndex (with the -ea: switch ) on one of the shards. CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger segment containing 500K+ documents. (Complete CheckIndex output appended below) Is it likely that all 10 shards are corrupted? Is it possible that we have simply exceeded some lucene limit? I'm wondering if we could have exceeded the lucene limit of unique terms of 2.1 billion as mentioned towards the end of the Lucene Index File Formats document. If the small 731 document index has nine million unique terms as reported by check index, then even though many terms are repeated, it is concievable that the 500,000 document index could have more than 2.1 billion terms. Do you know if the number of terms reported by CheckIndex is the number of unique terms? On the other hand, we previously optimized a 1 million document index down to 1 segment and had no problems. That was with an earlier version of Solr and did not include CommonGrams which could conceivably increase the number of terms in the index by 2 or 3 times. Tom --- Solr Specification Version: 1.3.0.2009.09.03.11.14.39 Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39 Lucene Specification Version: 2.9-dev Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55 [tburt...@slurm-4 ~]$ java -Xmx4096m -Xms4096m -cp /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib -ea:org.apache.lucene... 
org.apache.lucene.index.CheckIndex /l/solrs/1/.snapshot/serve-2010-02-07/data/index Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene 2.9] 1 of 2: name=_29dn docCount=554799 compound=false hasProx=true numFiles=9 size (MB)=267,131.261 diagnostics = {optimize=true, mergeFactor=2, os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} has deletions [delFileName=_29dn_7.del] test: open reader.OK [184 deleted docs] test: fields, norms...OK [6 fields] test: terms, freq, prox...FAILED WARNING: fixIndex() would remove reference to this segment; full exception: java.lang.ArrayIndexOutOfBoundsException: -16777214 at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218) at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474) at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715) 2 of 2: name=_29im docCount=731 compound=false hasProx=true numFiles=8 size (MB)=421.261 diagnostics = {optimize=true, mergeFactor=3, os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true, lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} no deletions test: open reader.OK test: fields, norms...OK [6 fields] test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs; 144869629 tokens] test: stored fields...OK [3550 total field count; avg 4.856 fields per doc] test: term vectorsOK [0 total vector count; avg 0 term/freq vector fields per doc] WARNING: 1 broken segments (containing 554615 documents) detected WARNING: would write new segments file, and 554615 documents would be lost, 
if -fix were specified [tburt...@slurm-4 ~]$ The index is corrupted. In some places ArrayIndex and NPE are not wrapped as CorruptIndexException. Try running your code with the Lucene assertions on. Add this to the JVM arguments: -ea:org.apache.lucene... -- View this message in context: http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p27518800.html Sent from the Solr - User mailing list archive at Nabble.com.
Question on Tokenizing email address
Hello Everyone, I have a field in my solr schema which stores emails. The way I want the emails to be tokenized is like this: if the email address is abc@alpha-xyz.com, the user should be able to search on 1. abc@alpha-xyz.com (whole address) 2. abc 3. def 4. alpha-xyz Which tokenizer should I use? Also, is there a feature like "Must Match" in solr? In my schema there is a field called "from" which contains the email address of the person who sent an email. For this field, I don't want any tokenization. When the user issues a search, the user's email ID must exactly match the "from" column value for that document/record to be returned. How can I do this? Regards, Abhishek
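As a client-side illustration of the token set being asked for -- this is a hypothetical helper, not a Solr tokenizer; in Solr the equivalent would be a tokenizer/filter chain in the schema:

```java
import java.util.ArrayList;
import java.util.List;

public class EmailTokens {
    // Emit the whole address plus its parts split on '@' and '.',
    // roughly what a pattern-based tokenizer splitting on [@.] would produce.
    static List<String> tokenize(String email) {
        List<String> tokens = new ArrayList<>();
        tokens.add(email);                          // 1. whole address
        for (String part : email.split("[@.]")) {   // local part and domain labels
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc@alpha-xyz.com"));
        // [abc@alpha-xyz.com, abc, alpha-xyz, com]
    }
}
```

Note that "-" is deliberately not a delimiter here, so "alpha-xyz" stays whole.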
Re: unloading a solr core doesn't free any memory
Tim, The GC just automagically works right? :) There's been issues around thread local in Lucene. The main code for core management is CoreContainer, which I believe is fairly easy to digest. If there's an issue you may find it there. Jason 2010/2/9 Tim Terlegård : > If I unload the core and then click "Perform GC" in jconsole nothing > happens. The 8 GB RAM is still used. > > If I load the core again and then run the query with the sort fields, > then jconsole shows that the memory usage immediately drops to 1 GB > and then rises to 8 GB again as it caches the stuff. > > So my suspicion is that the sort cache still references all these > objects even after the core is unloaded. But somehow it knows that the > current sort cache is obsolete. After loading the core again and > executing the query with sort fields the sort cache references a new > object and the memory usage drops. > > Bug? I could check the source code, but don't know where to look. Any hints? > > /Tim > > 2010/2/9 Lance Norskog : >> The 'jconsole' program lets you monitor GC operation in real-time. >> >> http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html >> >> On Mon, Feb 8, 2010 at 8:44 AM, Simon Rosenthal >> wrote: >>> What Garbage Collection parameters is the JVM using ? the memory will not >>> always be freed immediately after an event like unloading a core or starting >>> a new searcher. >>> >>> 2010/2/8 Tim Terlegård >>> To me it doesn't look like unloading a Solr Core frees the memory that the core has used. Is this how it should be? I have a big index with 50 million documents. After loading a core it takes 300 MB RAM. After a query with a couple of sort fields Solr takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the core. The core is not shown in /solr/ anymore. Solr still takes 8 GB RAM. Creating new cores is super slow because I have hardly any memory left. Do I need to free the memory explicitly somehow? 
/Tim >>> >> >> >> >> -- >> Lance Norskog >> goks...@gmail.com >> >
Re: Replication and querying
Hi, Index replication in Solr makes an exact copy of the original index. Is it not possible to add the 6 extra fields to both instances? An alternative to replication is to feed two independent Solr instances -> full control :) Please elaborate on your specific use case if this is not a useful answer to you. -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 9. feb. 2010, at 13.21, Julian Hille wrote: > Hi, > > I'd like to know if it's possible to have a solr Server with a schema and let's > say 10 fields indexed. > I now want to replicate this whole index to another solr server which has a > slightly different schema. > There are 6 additional fields; these fields change the sort order for a > product whose base is our solr database. > > Is this kind of replication possible? > > Is there another way to interact with data in solr? We'd like to calculate > some fields when they are added. > I can't seem to find good documentation about the possible calls in the > query itself nor documentation about queries/calculations which should be done > on update. > > > so far, > Julian Hille > > > --- > NetImpact KG > Altonaer Straße 8 > 20357 Hamburg > > Tel: 040 / 6738363 2 > Mail: jul...@netimpact.de > > Geschäftsführer: Tarek Müller >
Re: joining two field for query
You may also want to play with other highlighting parameters to select how much text to do highlighting on, how many fragments etc. See http://wiki.apache.org/solr/HighlightingParameters -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 9. feb. 2010, at 13.08, Ahmet Arslan wrote: > >> I am searching by "nokia" and resulting (listing) 1,2,3 >> field with short >> description. >> There is link on search list(like google), by clicking on >> link performing >> new search (opening doc from index), for this search >> >> I want to join two fields: >> id:1 + queryString ("nokia samsung") to return only id:1 >> record and want to >> highlight the field "nokia samsung". >> something like : "q=id:1 + body:nokia samsung" >> >> basically I want to highlight the query string when >> clicking on link and >> opening the new windows (like google cache). > > When the user clicks document (id=1), you can use these parameters: > q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body > > > >
Re: Is it possible to exclude results from other languages?
Much more efficient to tag documents with language at index time. Look for language identification tools such as http://www.sematext.com/products/language-identifier/index.html or http://ngramj.sourceforge.net/ or http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 9. feb. 2010, at 05.19, Lance Norskog wrote: > There is > > On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch wrote: >> >> >> Yes, It's true that we could do it in index time if we had a way to know. I >> was thinking in some solution in search time, maybe measuring the % of >> stopwords of each document. Normally, a document of another language won't >> have any stopword of its main language. >> >> If you know some external software to detect the language of a source text, >> it would be useful too. >> >> Thanks, >> Raimon Bosch. >> >> >> >> Ahmet Arslan wrote: >>> >>> In our indexes, sometimes we have some documents written in other languages different to the most common index's language. Is there any way to give less boosting to this documents? >>> >>> If you are aware of those documents, at index time you can boost those >>> documents with a value less than 1.0: >>> >>> >>> >>> // document written in other languages >>> ... >>> ... >>> >>> >>> >>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22 >>> >>> >>> >>> >>> >> >> -- >> View this message in context: >> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > > -- > Lance Norskog > goks...@gmail.com
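Raimon's stopword-percentage idea from the quoted message can be sketched like this; the stopword set below is a tiny illustrative sample, not a real list:

```java
import java.util.Set;

public class LanguageGuess {
    // Crude heuristic from the thread: text in the index's main language
    // should contain a noticeable fraction of that language's stopwords;
    // text in another language usually contains almost none.
    static double stopwordRatio(String text, Set<String> stopwords) {
        String[] words = text.toLowerCase().split("\\W+");
        if (words.length == 0) return 0.0;
        int hits = 0;
        for (String w : words) {
            if (stopwords.contains(w)) hits++;
        }
        return (double) hits / words.length;
    }

    public static void main(String[] args) {
        Set<String> en = Set.of("the", "of", "and", "a", "to", "in", "is");
        System.out.println(stopwordRatio("the cat sat on the mat", en)); // 2 of 6 words
        System.out.println(stopwordRatio("el gato negro duerme", en));   // 0.0
    }
}
```

The dedicated language-identification tools listed above are far more robust; this only shows why the heuristic works at all.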
Re: DIH: delta-import not working
indeed that made it work. Looking back at the documentation, it's all there but one needs to read every single line with care :-) 2010/2/9 Noble Paul നോബിള് नोब्ळ् > try this > > deltaImportQuery="select id, bytes from attachment where application = > 'MYAPP' and id = '${dataimporter.delta.id}'" > > be aware that the names are case sensitive . if the id comes as 'ID' > this will not work > > > > On Tue, Feb 9, 2010 at 3:15 PM, Jorg Heymans > wrote: > > Hi, > > > > I am having problems getting the delta-import to work for my schema. > > Following what i have found in the list, jira and the wiki below > > configuration should just work but it doesn't. > > > > > > > url="jdbc:oracle:thin:@." user="" password=""/> > > > > > > > deltaImportQuery="select id, bytes from attachment where application > = > > 'MYAPP' and id = '${dataimporter.attachment.id}'" > > deltaQuery="select id from attachment where application = 'MYAPP' > and > > modified_on > to_date('${dataimporter.attachment.last_index_time}', > > '-mm-dd hh24:mi:ss')"> > > > > > url="bytes" dataField="attachment.bytes"> > > > > > > > > > > > > > > The sql generated in the deltaquery is correct, the timestamp is passed > > correctly. When i execute that query manually in the DB it returns the pk > of > > the rows that were added. However no documents are added to the index. > What > > am i missing here ?? I'm using a build snapshot from 03/02. > > > > > > Thanks > > Jorg > > > > > > -- > - > Noble Paul | Systems Architect| AOL | http://aol.com >
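For reference, Noble's fix applied to the entity from the original mail might look like the sketch below. The column names and the Oracle date format are reconstructed from the thread, whose XML markup was lost in archiving:

```xml
<!-- Sketch only: the key change is '${dataimporter.delta.id}' in
     deltaImportQuery instead of '${dataimporter.attachment.id}'. -->
<entity name="attachment" pk="ID"
        query="select id, bytes from attachment where application = 'MYAPP'"
        deltaImportQuery="select id, bytes from attachment
                          where application = 'MYAPP'
                          and id = '${dataimporter.delta.id}'"
        deltaQuery="select id from attachment
                    where application = 'MYAPP'
                    and modified_on > to_date('${dataimporter.attachment.last_index_time}',
                                              'yyyy-mm-dd hh24:mi:ss')">
  <!-- note: the variable name is case sensitive -- if the driver returns
       the column as 'ID', use ${dataimporter.delta.ID} -->
</entity>
```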
Re: Faceting
NOTE: Please start a new email thread for a new topic (See http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking) Your strategy could work. You might want to look into dedicated entity extraction frameworks like http://opennlp.sourceforge.net/ http://nlp.stanford.edu/software/CRF-NER.shtml http://incubator.apache.org/uima/index.html Or if that is too much work, look at http://issues.apache.org/jira/browse/SOLR-1725 for a way to plug in your entity extraction code into Solr itself using a scripting language. -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 5. feb. 2010, at 20.10, José Moreira wrote: > Hello, > > I'm planning to index a 'content' field for search and from that > fields text content i would like to facet (probably) according to if > the content has e-mails, urls and within urls, url's to pictures, > videos and others. > > As i'm a relatively new user to Solr, my plan was to regexp the > content in my application and add tags to a Solr field according to > the content, so for example the content "m...@email.com > http://www.site.com"; would have the tags "email, link". > > If i follow this path can i then facet on "email" and/or "link" ? For > example combining facet field with facet value params? > > Best > > -- > http://pt.linkedin.com/in/josemoreira > josemore...@irc.freenode.net > http://djangopeople.net/josemoreira/
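The regexp-then-tag plan from the quoted message could be sketched client-side like this; the patterns are deliberately simplistic placeholders, not production-grade extractors:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ContentTagger {
    // Scan content and emit tags for a multivalued Solr field,
    // which can then be faceted on with facet.field.
    private static final Pattern EMAIL = Pattern.compile("\\b[\\w.+-]+@[\\w.-]+\\.\\w+\\b");
    private static final Pattern URL   = Pattern.compile("https?://\\S+");
    private static final Pattern IMAGE =
        Pattern.compile("https?://\\S+\\.(jpe?g|png|gif)\\b", Pattern.CASE_INSENSITIVE);

    static List<String> tags(String content) {
        List<String> tags = new ArrayList<>();
        if (EMAIL.matcher(content).find()) tags.add("email");
        if (URL.matcher(content).find())   tags.add("link");
        if (IMAGE.matcher(content).find()) tags.add("image");
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tags("m...@email.com http://www.site.com"));
        // [email, link]
    }
}
```

Faceting on the resulting field then works as usual, e.g. facet=true&facet.field=tags, optionally combined with fq=tags:email.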
Re: Autosuggest and highlighting
On 2/9/2010 2:57 PM, Ahmet Arslan wrote: I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties in the world (mostly europe) and the searchbox is intended to be filled with a country-, region- or city name. To do this I've created a separate, simple core with one document per geographic location, for example the document for the country "France" contains several fields including the number of properties (so we can show the approximate amount of results in the autosuggest box) and the name of the country France in several languages and some other bookkeeping information. The name of the property is stored in two fields: "name" which simple contains the canonical name of the country, region or city and "names" which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis so the query "Fr" can match "France". This all seems to work, the autosuggest box gives appropriate suggestions. But when I turn on highlighting the results are less than desirable, for example the query "rho" using dismax (and hl.snippets=5) returns the following: Région Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Région Rhône-Alpes Département du Rhône Département du Rhône Rhône Département du Rhône Rhône Département du Rhône As you can see, no matter where the match is, the first 3 characters are highlighted. Obviously not correct for many of the fields. Is this because of the NGramFilterFactory or am I doing something wrong? I used https://issues.apache.org/jira/browse/SOLR-357 for this sometime ago. It was giving correct highlights. I just ran a test with the NGramFilter removed (and reindexing) which did give correct highlighting results but I had to query using the whole word. 
I'll try the PrefixingFilterFactory next although according to the comments it's nothing but a subset of the EdgeNGramFilterFactory so unless I'm configuring it wrong it should yield the same results... However we are now using http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It automatically makes bold matching characters without using solr highlighting. Using a pure javascript based solution isn't really an option for us as that wouldn't work for the diacritical marks without a lot of transliteration brouhaha. Regards, gwk
Re: How to send web pages(urls) to solr cell via solrj?
Hi, I did not try this, but could you not read the URL client side and pass it to SolrJ as a ContentStream? ContentStream urlStream = ContentStreamBase.URLStream("http://my.site/file.html";); req.addContentStream(urlStream); -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 4. feb. 2010, at 10.47, dhamu wrote: > > Hi, > I am newbie to solr and exploring solr last few days. > I am using solr cell with tika for parsing, indexing and searching > Posting the rich text documents via Solrj. > My actual requirement is instead of using local documents(pdf, doc & docx), > i want to use webpages(urls for eg..,(http://www.apache.org)). > > eg.., > req.addFile(new File("docs/mailing_lists.html")); > instead > req.url(new urlconnection("http://www.apache.org";) > anything like the above is there in solrj. > > Actually i am using curl for testing. it works fine > > curl > "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true"; > -F "stream.url=http://wiki.apache.org/solr/SolrConfigXml"; > > but i am in need to use otherthan curl. > Below code works fine for local document indexing and searching. But instead > i want to post urls. > > here is my code., > >String url = "http://localhost:8983/solr";; >SolrServer server = new CommonsHttpSolrServer(url); > ContentStreamUpdateRequest req = new ContentStreamUpdateRequest( > "/update/extract"); > req.addFile(new File("docs/mailing_lists.html")); > req.setParam("literal.id", "index1"); > req.setParam("uprefix", "attr_"); > req.setParam("fmap.content", "attr_content"); > req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); > NamedList result = server.request(req); > assertNotNull("Couldn't upload index.pdf", result); > QueryResponse rsp = server.query(new SolrQuery("*:*")); > Assert.assertEquals(1, rsp.getResults().getNumFound()); > > any suggestion or answer will be appreciated. 
> > > -- > View this message in context: > http://old.nabble.com/How-to-send-web-pages%28urls%29-to-solr-cell-via-solrj--tp27450083p27450083.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Autosuggest and highlighting
> I'm trying to improve the search box on our website by > adding an autosuggest field. The dataset is a set of > properties in the world (mostly europe) and the searchbox is > intended to be filled with a country-, region- or city name. > To do this I've created a separate, simple core with one > document per geographic location, for example the document > for the country "France" contains several fields including > the number of properties (so we can show the approximate > amount of results in the autosuggest box) and the name of > the country France in several languages and some other > bookkeeping information. The name of the property is stored > in two fields: "name" which simple contains the canonical > name of the country, region or city and "names" which is a > multivalued field containing the name in several different > languages. Both fields use an EdgeNGramFilter during > analysis so the query "Fr" can match "France". > > This all seems to work, the autosuggest box gives > appropriate suggestions. But when I turn on highlighting the > results are less than desirable, for example the query "rho" > using dismax (and hl.snippets=5) returns the > following: > > > > Région > Rhône-Alpes > Rhône-Alpes > Rhône-Alpes > Rhône-Alpes > Rhône-Alpes > > > Région > Rhône-Alpes > > > > > Département du > Rhône > Département du > Rhône > Rhône > Département du > Rhône > Rhône > > > Département du > Rhône > > > > As you can see, no matter where the match is, the first 3 > characters are highlighted. Obviously not correct for many > of the fields. Is this because of the NGramFilterFactory or > am I doing something wrong? I used https://issues.apache.org/jira/browse/SOLR-357 for this sometime ago. It was giving correct highlights. However we are now using http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It automatically makes bold matching characters without using solr highlighting.
Distributed search and haproxy and connection build up
I have been using distributed search with haproxy but noticed that I am suffering a little from tcp connections building up waiting for the OS level closing/time out: netstat -a ... tcp6 1 0 10.0.16.170%34654:53789 10.0.16.181%363574:8893 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43932 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:43190 10.0.16.181%363574:8895 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:53770 TIME_WAIT tcp6 1 0 10.0.16.170%34654:41782 10.0.16.181%363574: CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:52169 10.0.16.181%363574:8890 CLOSE_WAIT tcp6 1 0 10.0.16.170%34654:55947 10.0.16.181%363574:8887 CLOSE_WAIT tcp6 0 0 10.0.16.170%346547:8984 10.0.16.181%36357:54040 TIME_WAIT tcp6 1 0 10.0.16.170%34654:40030 10.0.16.160%363574:8984 CLOSE_WAIT ... Digging a little into the haproxy documentation, it seems that they do not support persistent connections. Does solr normally persist the connections between shards (would this problem happen even without haproxy)? Ian.
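To quantify the buildup, it helps to tally sockets per state rather than eyeball the netstat dump; a small sketch (sample lines adapted from the output above):

```java
import java.util.Map;
import java.util.TreeMap;

public class ConnStates {
    // Tally connection states from netstat-style lines (state is the last column).
    static Map<String, Integer> tally(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            counts.merge(cols[cols.length - 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {
            "tcp6 1 0 10.0.16.170:53789 10.0.16.181:8893 CLOSE_WAIT",
            "tcp6 0 0 10.0.16.170:8984  10.0.16.181:53770 TIME_WAIT",
            "tcp6 1 0 10.0.16.170:43932 10.0.16.181:8890 CLOSE_WAIT"
        };
        System.out.println(tally(sample)); // {CLOSE_WAIT=2, TIME_WAIT=1}
    }
}
```

A steadily growing CLOSE_WAIT count (as opposed to TIME_WAIT, which the OS clears on its own) suggests the local side is not closing sockets it was handed, which is the symptom described above.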
Re: Solr usage with Auctions/Classifieds?
With the new sort by function in 1.5 (http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function), will it now be possible to include the ExternalFileField value in the sort formula? If so, we could sort on last bid price or last bid time without updating the document itself. However, to display the result with the fresh values, we need to go to DB, or is there someone working on the possibility to return ExternalFileField values for result view? -- Jan Høydahl - search architect Cominvent AS - www.cominvent.com On 4. feb. 2010, at 06.25, Lance Norskog wrote: > Oops, forgot to add the link: > > http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4 > > On Wed, Feb 3, 2010 at 9:17 PM, Andy wrote: >> How do I set up and use this external file? >> >> Can I still use such a field in fq or boost? >> >> Can you point me to the right documentation? Thanks >> >> --- On Wed, 2/3/10, Lance Norskog wrote: >> >> From: Lance Norskog >> Subject: Re: Solr usage with Auctions/Classifieds? >> To: solr-user@lucene.apache.org >> Date: Wednesday, February 3, 2010, 10:03 PM >> >> This field type allows you to have an external file that gives a float >> value for a field. You can only use functions on it. >> >> On Sat, Jan 30, 2010 at 7:05 AM, Jan Høydahl / Cominvent >> wrote: >>> A follow-up on the auction use case. >>> >>> How do you handle the need for frequent updates of only one field, such as >>> the last bid field (needed for sort on price, facets or range)? >>> For high traffic sites, the document update rate becomes very high if you >>> re-send the whole document every time the bid price changes. >>> >>> -- >>> Jan Høydahl - search architect >>> Cominvent AS - www.cominvent.com >>> >>> On 10. des. 2009, at 19.52, Grant Ingersoll wrote: >>> On Dec 8, 2009, at 6:37 PM, regany wrote: > > hello! > > just wondering if anyone is using Solr as their search for an auction / > classified site, and if so how have you managed your setup in general? ie. 
> searching against listings that may have expired etc. I know several companies using Solr for classifieds/auctions. Some remove the old listings while others leave them in and filter them or even allow users to see old stuff (but often for reasons other than users finding them, i.e. SEO). For those that remove, it's typically a batch operation that takes place at night. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search >>> >>> >> >> >> >> -- >> Lance Norskog >> goks...@gmail.com >> >> >> >> > > > > -- > Lance Norskog > goks...@gmail.com
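For reference, a sketch of how an ExternalFileField is typically wired up in Solr 1.4 — field and file names here are illustrative, not from the thread:

```xml
<!-- schema.xml: hedged sketch; "last_bid" is a hypothetical field name -->
<fieldType name="extfile" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="last_bid" type="extfile" indexed="false" stored="false"/>
```

The values live in a plain text file named external_last_bid in the index data directory, one key=value pair per line (e.g. doc17=42.5), and are re-read when a new searcher is opened — so bid prices can be refreshed without re-sending documents. The intent in the thread would presumably then be a function sort such as sort=field(last_bid) desc once sort-by-function lands in 1.5.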
Autosuggest and highlighting
Hi, I'm trying to improve the search box on our website by adding an autosuggest field. The dataset is a set of properties in the world (mostly europe) and the searchbox is intended to be filled with a country-, region- or city name. To do this I've created a separate, simple core with one document per geographic location, for example the document for the country "France" contains several fields including the number of properties (so we can show the approximate amount of results in the autosuggest box) and the name of the country France in several languages and some other bookkeeping information. The name of the property is stored in two fields: "name" which simple contains the canonical name of the country, region or city and "names" which is a multivalued field containing the name in several different languages. Both fields use an EdgeNGramFilter during analysis so the query "Fr" can match "France". This all seems to work, the autosuggest box gives appropriate suggestions. But when I turn on highlighting the results are less than desirable, for example the query "rho" using dismax (and hl.snippets=5) returns the following: Région Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Rhône-Alpes Région Rhône-Alpes Département du Rhône Département du Rhône Rhône Département du Rhône Rhône Département du Rhône As you can see, no matter where the match is, the first 3 characters are highlighted. Obviously not correct for many of the fields. Is this because of the NGramFilterFactory or am I doing something wrong? The field definition for 'name' and 'names' is: generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> maxGramSize="20"/> ignoreCase="true" expand="true"/> words="stopwords.txt"/> generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> Regards, gwk
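The XML of the field definition above was stripped by the mail archive; from the surviving attributes it appears to chain a WordDelimiterFilter, an EdgeNGramFilter (maxGramSize=20), a SynonymFilter and a StopFilter. The "first 3 characters highlighted" symptom is consistent with the edge n-gram filter reporting offsets anchored at the start of the token rather than at the matched gram. A common workaround is to apply the n-gram filter in the index analyzer only, so query terms (and the highlighter) see whole tokens. A hedged sketch, with illustrative names:

```xml
<!-- apply edge n-grams at index time only; the query analyzer
     emits whole terms, so "Fr" still matches "France" -->
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```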
Re: joining two field for query (Solved)
Hi Ahmet, Thank you very much.. my problem solved.. with regards On Tue, Feb 9, 2010 at 5:38 PM, Ahmet Arslan wrote: > > > I am searching by "nokia" and resulting (listing) 1,2,3 > > field with short > > description. > > There is link on search list(like google), by clicking on > > link performing > > new search (opening doc from index), for this search > > > > I want to join two fields: > > id:1 + queryString ("nokia samsung") to return only id:1 > > record and want to > > highlight the field "nokia samsung". > > something like : "q=id:1 + body:nokia samsung" > > > > basically I want to highlight the query string when > > clicking on link and > > opening the new windows (like google cache). > > When the user clicks document (id=1), you can use these parameters: > q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body > > > > >
Replication and querying
Hi, I'd like to know if it's possible to have a Solr server with a schema and, let's say, 10 fields indexed. I now want to replicate this whole index to another Solr server which has a slightly different schema: there are 6 additional fields, and these fields change the sort order for a product whose base is our Solr database. Is this kind of replication possible? Is there another way to interact with data in Solr? We'd like to calculate some fields when they are added. I can't seem to find good documentation about the possible calls in the query itself, nor documentation about queries/calculations which should be done on update. So far, Julian Hille --- NetImpact KG Altonaer Straße 8 20357 Hamburg Tel: 040 / 6738363 2 Mail: jul...@netimpact.de Geschäftsführer: Tarek Müller
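On the second question (computing fields at update time): Solr's hook for that is an UpdateRequestProcessor registered in solrconfig.xml. A hedged sketch of the wiring, where com.example.SortScoreProcessorFactory is a hypothetical stand-in for your own factory that reads the incoming document and adds the calculated fields:

```xml
<!-- solrconfig.xml: run a custom processor before the normal update -->
<updateRequestProcessorChain name="computeFields">
  <processor class="com.example.SortScoreProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">computeFields</str>
  </lst>
</requestHandler>
```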
Re: joining two field for query
> I am searching by "nokia" and resulting (listing) 1,2,3 > field with short > description. > There is link on search list(like google), by clicking on > link performing > new search (opening doc from index), for this search > > I want to join two fields: > id:1 + queryString ("nokia samsung") to return only id:1 > record and want to > highlight the field "nokia samsung". > something like : "q=id:1 + body:nokia samsung" > > basically I want to highlight the query string when > clicking on link and > opening the new windows (like google cache). When the user clicks document (id=1), you can use these parameters: q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body
Re: Call URL, simply parse the results using SolrJ
you can also try URL urlo = new URL(url);// ensure that the url has wt=javabin in that NamedList namedList = new JavaBinCodec().unmarshal(urlo.openConnection().getInputStream()); QueryResponse response = new QueryResponse(namedList, null); On Mon, Feb 8, 2010 at 11:49 PM, Jason Rutherglen wrote: > Here's what I did to resolve this: > > XMLResponseParser parser = new XMLResponseParser(); > URL urlo = new URL(url); > InputStreamReader isr = new > InputStreamReader(urlo.openConnection().getInputStream()); > NamedList namedList = parser.processResponse(isr); > QueryResponse response = new QueryResponse(namedList, null); > > On Mon, Feb 8, 2010 at 10:03 AM, Jason Rutherglen > wrote: >> So here's what happens if I pass in a URL with parameters, SolrJ chokes: >> >> Exception in thread "main" java.lang.RuntimeException: Invalid base >> url for solrj. The base URL must not contain parameters: >> http://locahost:8080/solr/main/select?q=video&qt=dismax >> at >> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:205) >> at >> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:180) >> at >> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:152) >> at org.apache.solr.util.QueryTime.main(QueryTime.java:20) >> >> >> On Mon, Feb 8, 2010 at 9:32 AM, Jason Rutherglen >> wrote: >>> Sorry for the poorly worded title... For SOLR-1761 I want to pass in a >>> URL and parse the query response... However it's non-obvious to me how >>> to do this using the SolrJ API, hence asking the experts here. :) >>> >> > -- - Noble Paul | Systems Architect| AOL | http://aol.com
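CommonsHttpSolrServer rejects a base URL that contains parameters, so one alternative to parsing the raw response is to split the full URL into the bare core URL plus a parameter set and re-issue it through the normal SolrJ API. A minimal stdlib-only sketch of that split (no URL-decoding; the SolrJ calls themselves are only sketched in comments, since they need solrj on the classpath):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlSplit {
    // The SolrJ-acceptable base URL: everything before '?', minus the
    // trailing /select handler path (SolrJ appends the handler itself).
    public static String baseUrl(String url) {
        int q = url.indexOf('?');
        String path = q < 0 ? url : url.substring(0, q);
        return path.endsWith("/select")
                ? path.substring(0, path.length() - "/select".length())
                : path;
    }

    // The query-string parameters, to be copied onto a SolrQuery.
    public static Map<String, String> params(String url) {
        Map<String, String> out = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) return out;
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            out.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://localhost:8080/solr/main/select?q=video&qt=dismax";
        System.out.println(baseUrl(url)); // http://localhost:8080/solr/main
        System.out.println(params(url)); // {q=video, qt=dismax}
        // Hedged sketch of the SolrJ side:
        //   SolrServer server = new CommonsHttpSolrServer(baseUrl(url));
        //   SolrQuery sq = new SolrQuery();
        //   for (Map.Entry<String, String> e : params(url).entrySet())
        //       sq.set(e.getKey(), e.getValue());
        //   QueryResponse rsp = server.query(sq);
    }
}
```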
Re: DIH: delta-import not working
try this deltaImportQuery="select id, bytes from attachment where application = 'MYAPP' and id = '${dataimporter.delta.id}'" be aware that the names are case sensitive . if the id comes as 'ID' this will not work On Tue, Feb 9, 2010 at 3:15 PM, Jorg Heymans wrote: > Hi, > > I am having problems getting the delta-import to work for my schema. > Following what i have found in the list, jira and the wiki below > configuration should just work but it doesn't. > > > url="jdbc:oracle:thin:@." user="" password=""/> > > > deltaImportQuery="select id, bytes from attachment where application = > 'MYAPP' and id = '${dataimporter.attachment.id}'" > deltaQuery="select id from attachment where application = 'MYAPP' and > modified_on > to_date('${dataimporter.attachment.last_index_time}', > '-mm-dd hh24:mi:ss')"> > > url="bytes" dataField="attachment.bytes"> > > > > > > > The sql generated in the deltaquery is correct, the timestamp is passed > correctly. When i execute that query manually in the DB it returns the pk of > the rows that were added. However no documents are added to the index. What > am i missing here ?? I'm using a build snapshot from 03/02. > > > Thanks > Jorg > -- - Noble Paul | Systems Architect| AOL | http://aol.com
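Putting the two corrections together, a working entity definition might look roughly like this. Table and column names are taken from the thread; the uppercase pk/variable casing and the date mask are assumptions to verify against what your JDBC driver actually returns:

```xml
<!-- hedged sketch: note ${dataimporter.delta.ID}, not
     ${dataimporter.attachment.id}, and Oracle's uppercased column name -->
<entity name="attachment" pk="ID"
        query="select id, bytes from attachment where application = 'MYAPP'"
        deltaQuery="select id from attachment where application = 'MYAPP'
                    and modified_on > to_date('${dataimporter.attachment.last_index_time}',
                                              'yyyy-mm-dd hh24:mi:ss')"
        deltaImportQuery="select id, bytes from attachment
                          where application = 'MYAPP'
                          and id = '${dataimporter.delta.ID}'">
</entity>
```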
joining two field for query
Hi all, I need logic in Solr to join two fields in a query. I indexed two fields: id and body (text type). 5 rows are indexed: id=1 : text= nokia samsung id=2 : text= sony vaio nokia samsung id=3 : text= vaio nokia etc. Searching by "q=id:1" returns the result perfectly, returning "nokia samsung". Searching by "nokia" lists documents 1, 2, 3 with a short description. There is a link on the search list (like Google); clicking the link performs a new search (opening the doc from the index). For this search I want to join two fields: id:1 + the query string ("nokia samsung"), to return only the id:1 record and to highlight "nokia samsung". Something like: "q=id:1 + body:nokia samsung". Basically I want to highlight the query string when clicking on the link and opening the new window (like Google cache). Please help.. thanks
Re: TermInfosReader.get ArrayIndexOutOfBoundsException
Which version of Solr/Lucene are you using? Can you run Lucene's CheckIndex tool (java -ea:org.apache.lucene org.apache.lucene.index.CheckIndex /path/to/index) and then post the output? Have you altered any of IndexWriter's defaults (via solrconfig.xml)? Eg the termIndexInterval? Mike On Mon, Feb 8, 2010 at 4:02 PM, Burton-West, Tom wrote: > Hello all, > > After optimizing rather large indexes on 10 shards (each index holds about > 500,000 documents and is about 270-300 GB in size) we started getting > intermittent TermInfosReader.get() ArrayIndexOutOfBounds exceptions. The > exceptions sometimes seem to occur on all 10 shards at the same time and > sometimes on one shard but not the others. We also sometimes get an > "Internal Server Error" but that might be either a cause or an effect of the > array index out of bounds. Here is the top part of the message: > > > java.lang.ArrayIndexOutOfBoundsException: -14127432 > at > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246) > > Any suggestions for troubleshooting would be appreciated. > > Trace from tomcat logs appended below. 
> > Tom Burton-West > > --- > > Feb 5, 2010 8:09:02 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.ArrayIndexOutOfBoundsException: -14127432 > at > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246) > at > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218) > at > org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:943) > at > org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:308) > at > org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:144) > at org.apache.lucene.search.Similarity.idf(Similarity.java:481) > at > org.apache.lucene.search.TermQuery$TermWeight.(TermQuery.java:44) > at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146) > at > org.apache.lucene.search.BooleanQuery$BooleanWeight.(BooleanQuery.java:186) > at > org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:366) > at org.apache.lucene.search.Query.weight(Query.java:95) > at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230) > at org.apache.lucene.search.Searcher.search(Searcher.java:171) > at > org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651) > at > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545) > at > org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:581) > at > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903) > at > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884) > at > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341) > at > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:176) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) > at > 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172) > at > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) > at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) > at > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) > at > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) > at > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) >
Re: unloading a solr core doesn't free any memory
If I unload the core and then click "Perform GC" in jconsole nothing happens. The 8 GB RAM is still used. If I load the core again and then run the query with the sort fields, then jconsole shows that the memory usage immediately drops to 1 GB and then rises to 8 GB again as it caches the stuff. So my suspicion is that the sort cache still references all these objects even after the core is unloaded. But somehow it knows that the current sort cache is obsolete. After loading the core again and executing the query with sort fields the sort cache references a new object and the memory usage drops. Bug? I could check the source code, but don't know where to look. Any hints? /Tim 2010/2/9 Lance Norskog : > The 'jconsole' program lets you monitor GC operation in real-time. > > http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html > > On Mon, Feb 8, 2010 at 8:44 AM, Simon Rosenthal > wrote: >> What Garbage Collection parameters is the JVM using ? the memory will not >> always be freed immediately after an event like unloading a core or starting >> a new searcher. >> >> 2010/2/8 Tim Terlegård >> >>> To me it doesn't look like unloading a Solr Core frees the memory that >>> the core has used. Is this how it should be? >>> >>> I have a big index with 50 million documents. After loading a core it >>> takes 300 MB RAM. After a query with a couple of sort fields Solr >>> takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the >>> core. The core is not shown in /solr/ anymore. Solr still takes 8 GB >>> RAM. Creating new cores is super slow because I have hardly any memory >>> left. Do I need to free the memory explicitly somehow? >>> >>> /Tim >>> >> > > > > -- > Lance Norskog > goks...@gmail.com >
Re: Posting pdf file and posting from remote
Ok, I'm going ahead (maybe :)). I tried another curl command to send the file from remote: http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf and the behaviour changed: now I get an error in the Solr log file: HTTP Status 500 - files/attach-8514.pdf (No such file or directory) java.io.FileNotFoundException: files/attach-8514.pdf (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:106) at org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at etc etc... -- View this message in context: http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512955p27512952.html Sent from the Solr - User mailing list archive at Nabble.com.
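stream.file is opened by the Solr server itself, so a relative path like files/attach-8514.pdf is resolved against the servlet container's working directory, not against the directory curl runs from; an absolute server-side path avoids the FileNotFoundException. For truly remote files, remote streaming is enabled in solrconfig.xml (not in a tika-config.xml), roughly like this — host, port and file URL in the example are hypothetical:

```xml
<!-- solrconfig.xml: allow stream.url fetching by the server -->
<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
</requestDispatcher>

<!-- then e.g.:
curl "http://mysolr:8983/solr/update/extract?literal.id=8514&stream.url=http://fileserver/attach-8514.pdf"
-->
```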
Re: unloading a solr core doesn't free any memory
I don't use any garbage collection parameters. /Tim 2010/2/8 Simon Rosenthal : > What Garbage Collection parameters is the JVM using ? the memory will not > always be freed immediately after an event like unloading a core or starting > a new searcher. > > 2010/2/8 Tim Terlegård > >> To me it doesn't look like unloading a Solr Core frees the memory that >> the core has used. Is this how it should be? >> >> I have a big index with 50 million documents. After loading a core it >> takes 300 MB RAM. After a query with a couple of sort fields Solr >> takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the >> core. The core is not shown in /solr/ anymore. Solr still takes 8 GB >> RAM. Creating new cores is super slow because I have hardly any memory >> left. Do I need to free the memory explicitly somehow? >> >> /Tim >> >
DIH: delta-import not working
Hi, I am having problems getting the delta-import to work for my schema. Following what I have found on the list, in JIRA and on the wiki, the configuration below should just work, but it doesn't. The SQL generated in the delta query is correct and the timestamp is passed correctly. When I execute that query manually in the DB it returns the PKs of the rows that were added. However, no documents are added to the index. What am I missing here? I'm using a build snapshot from 03/02. Thanks Jorg
Re: Dynamic fields with more than 100 fields inside
Shalin Shekhar Mangar wrote: On Tue, Feb 9, 2010 at 2:43 PM, Xavier Schepler < xavier.schep...@sciences-po.fr> wrote: Shalin Shekhar Mangar wrote: On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler < xavier.schep...@sciences-po.fr> wrote: Hey, I'm thinking about using dynamic fields. I need one or more user specific field in my schema, for example, "concept_user_*", and I will have maybe more than 200 users using this feature. One user will send and retrieve values from its field. It will then be used to filter result. How would it impact query performance ? Can you give an example of such a query? Hi, it could be queries such as : allFr: état-unis AND concept_researcher_99 = 303 modalitiesFr: exactement AND questionFr: correspond AND concept_researcher_2 = 101 and facetting like this : q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex It doesn't impact query performance any more than filtering on other fields. Is there a performance problem or were you just asking generally? I was asking generally, thanks for your response.
Re: Dynamic fields with more than 100 fields inside
On Tue, Feb 9, 2010 at 2:43 PM, Xavier Schepler < xavier.schep...@sciences-po.fr> wrote: > Shalin Shekhar Mangar wrote: > > On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler < >> xavier.schep...@sciences-po.fr> wrote: >> >> >> >>> Hey, >>> >>> I'm thinking about using dynamic fields. >>> >>> I need one or more user specific field in my schema, for example, >>> "concept_user_*", and I will have maybe more than 200 users using this >>> feature. >>> One user will send and retrieve values from its field. It will then be >>> used >>> to filter result. >>> >>> How would it impact query performance ? >>> >>> >>> >>> >> Can you give an example of such a query? >> >> >> > Hi, > > it could be queries such as : > > allFr: état-unis AND concept_researcher_99 = 303 > > modalitiesFr: exactement AND questionFr: correspond AND > concept_researcher_2 = 101 > > and facetting like this : > > > q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex > > It doesn't impact query performance any more than filtering on other fields. Is there a performance problem or were you just asking generally? -- Regards, Shalin Shekhar Mangar.
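For reference, one dynamic field pattern covers all the per-user fields in this thread. A hedged sketch — the field name is taken from the queries above, while the type and attributes are assumptions; note also that actual Solr query syntax uses field:value rather than field = value:

```xml
<!-- schema.xml: matches concept_researcher_1, concept_researcher_2, ... -->
<dynamicField name="concept_researcher_*" type="string"
              indexed="true" stored="true" multiValued="true"/>

<!-- filtering then uses ordinary fq clauses, e.g.
     q=allFr:état-unis&fq=concept_researcher_99:303 -->
```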
Posting pdf file and posting from remote
I understand that Tika is able to index PDF content: is that true? I tried to post a PDF from local and I've seen another document in the solr/admin schema browser, but when I search, only the document id is available; the document's content doesn't seem indexed. Do I need other products to index PDF content? Moreover, I want to send a file from remote: it seems I must configure Tika with a tika-config.xml file, enabling remote streaming as in the following: but I'm not able to find a tika-config.xml example... thanks a lot Alessandra -- View this message in context: http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512455.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dynamic fields with more than 100 fields inside
Shalin Shekhar Mangar wrote: On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler < xavier.schep...@sciences-po.fr> wrote: Hey, I'm thinking about using dynamic fields. I need one or more user specific field in my schema, for example, "concept_user_*", and I will have maybe more than 200 users using this feature. One user will send and retrieve values from its field. It will then be used to filter result. How would it impact query performance ? Can you give an example of such a query? Hi, it could be queries such as : allFr: état-unis AND concept_researcher_99 = 303 modalitiesFr: exactement AND questionFr: correspond AND concept_researcher_2 = 101 and facetting like this : q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex Thanks in advance, Xavier S.
Unsubscribe from mailing list
Please unsubscribe me from Mailing list
RE: Indexing / querying multiple data types
Sven In my data-config.xml I have the following In my schema.xml I have And in my solrconfig.xml I have data-config.xml dismax explicit 0.01 name^1.5 description^1.0 dismax explicit 0.01 name^1.5 description^1.0 And the Has been untouched. So when I run http://localhost:7001/solr/select/?q=food&qt=name1 I was expecting to get results from the data that had been indexed by
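The XML elements in the message above were stripped by the mail archive. Judging from the surviving values (dismax, explicit, 0.01, name^1.5 description^1.0), each of the two handlers was probably declared roughly like the following — a hedged reconstruction; the handler name "name1" comes from the query URL:

```xml
<requestHandler name="name1" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^1.5 description^1.0</str>
  </lst>
</requestHandler>
```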