Re: indexing data from rich documents - Tika with solr3.1

2011-09-20 Thread scorpking
Hi all, thanks very much to everyone who helped me. I indexed from HTTP using DIH. -- View this message in context: http://lucene.472066.n3.nabble.com/indexing-data-from-rich-documents-Tika-with-solr3-1-tp3322555p3351278.html Sent from the Solr - User mailing list archive at Nabble.com.

Slow autocomplete(terms)

2011-09-20 Thread roySolr
Hello, I used the terms request for autocomplete. It works fine with 200.000 records, but with 2 million docs it's very slow. I use a regex to make autocomplete match in the middle of words, for example: chest - manchester. My call (PECL PHP Solr): $query = new SolrQuery(); $query->setTermsLimit(10);

Problemns querying for the keyword a

2011-09-20 Thread Héctor Trujillo
Hi all, I have found something curious while testing Solr and SolrJ. I don't know if this is normal, a reserved word, or possibly a bug. I can't explain it, so I am writing here to get a reasonable explanation, if one exists. I created an index and I inserted about ten documents. I

Re: Problemns querying for the keyword a

2011-09-20 Thread Gora Mohanty
2011/9/20 Héctor Trujillo hecto...@gmail.com: [...] I created an index and I inserted about ten documents. I defined a field named source, and I created many rows with the value “a” in this field, and then I started to make queries, and then I realized that all the queries that asked

Re: Problemns querying for the keyword a

2011-09-20 Thread Héctor Trujillo
Yes, exactly, this is the reason; I couldn't see the forest for the trees. Thanks for your perfect and fast response. 2011/9/20 Gora Mohanty g...@mimirtech.com 2011/9/20 Héctor Trujillo hecto...@gmail.com: [...] I created an index and I inserted about ten documents. I defined a field named

field value getting null with special char

2011-09-20 Thread Ranveer
Hi all, I am having a problem getting the value of a particular field from the Solr server. My environment is: Red Hat 5.3, Solr 3.3, JDK 1.6.24, Tomcat 6.2x, fetching values using SolrJ. When using *:* in the browser the values show, but when using SolrJ all values come back except a few fields that contain special characters. str

Re: Autocomplete(terms) performance problem

2011-09-20 Thread O. Klein
This issue has been discussed in http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-td3268994.html#a3338899

fuzzy search by default

2011-09-20 Thread elisabeth benoit
Hello, Does anyone know if it is possible to configure Solr to do fuzzy search by default on every query word? All the examples I've seen are one-off (i.e. the tilde operator follows one specific word in the q parameter). Best regards, Elisabeth

Re: Autocomplete(terms) performance problem

2011-09-20 Thread roySolr
Thanks Klein. If I understand correctly, there is no solution for this problem for now. The best solution for me is to limit the number of suggestions. I still want to use the regex, and with 100.000 docs it looks like it's no problem.

Re: java.io.CharConversionException While Indexing in Solr 3.4

2011-09-20 Thread Pranav Prakash
I managed to resolve this issue. Turns out that the issue was because of a faulty XML file being generated by ruby-solr gem. I had to install libxml-ruby, rsolr and I used rsolr gem instead of ruby-solr. Also, if you face this kind of issue, the test-utf8.sh file included in exampledocs is a good

XML injection interface in select servlet?

2011-09-20 Thread Jan Peter Stotz
Hi. At the moment I am playing around a bit with Apache Solr with a focus on security. I found one very strange feature that allows you to inject any text you want, including XML, into the response of a query. Running the example installation that comes with Solr you can test the following

Re: Stemming and other tokenizers

2011-09-20 Thread Pranav Prakash
I have a similar use case, but slightly more flexible and straight forward. In my case, I have a field language which stores 'en', 'es' or whatever the language of the document is. Then the field 'transcript' stores the actual content which is in the language as described in language field.

Re: getting answers starting with a requested string first

2011-09-20 Thread elisabeth benoit
Hello all, I'm answering my own post, hoping someone will comment. I thought about two possibilities to solve my problem: 1) giving NAME_ANALYZED a type where omitNorms=false: I thought this would give answers with shorter NAME_ANALYZED field a higher score. I've tested that solution, but it's

Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
I'm using the LocalParams syntax combined with the _query_ pseudo-field to build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm running into some syntax questions that don't seem to be addressed by the wiki page here: http://wiki.apache.org/solr/LocalParams 1.)

Re: How To perform SQL Like Join

2011-09-20 Thread darren
Hi, Maybe you are aware of this [1] already, but I'm not sure of its status. It seems to have been removed from Solr for now. Also, I'm not sure if it works cross-index, but thought you might look there. Darren [1] https://issues.apache.org/jira/browse/SOLR-2272 On Mon, 19 Sep 2011 22:21:17 -0700

Re: java.io.CharConversionException While Indexing in Solr 3.4

2011-09-20 Thread Erik Hatcher
And to further clarify, the issue isn't in solr-ruby, it's in REXML (a lame Ruby XML library). Both rsolr and solr-ruby will use libxml instead of REXML if it is present. Erik On Sep 20, 2011, at 03:46 , Pranav Prakash wrote: I managed to resolve this issue. Turns out that the issue

term frequencies in sharded environment

2011-09-20 Thread Massimo Schiavon
Seems that when I submit a query in a sharded environment the idf component of the scoring formula takes into consideration the local terms frequencies (local to the single shard index). The effect of that is that the calculation is correct only if the distribution terms in the shards is
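
The effect described here can be illustrated with a small worked example. This is a minimal sketch, assuming the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)); the shard names and document counts are hypothetical and chosen only to show a skewed term distribution.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical shards: (documents in the shard, documents containing the term).
shards = {"shard_a": (1000, 500), "shard_b": (1000, 5)}

# What each shard computes on its own, from local statistics only.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# What a single, unsharded index holding all documents would compute.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With an uneven distribution the two local values differ from each other and from the global one, so scores coming from different shards are not directly comparable. With an even distribution the three values would be close.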

Re: Questions about LocalParams syntax

2011-09-20 Thread Jonathan Rochkind
I don't have the complete answer. But I _think_ if you do one 'bq' param with multiple space-separated directives, it will work. And escaping is a pain. But can be made somewhat less of a pain if you realize that single quotes can sometimes be used instead of double-quotes. What I do:

Re: XML injection interface in select servlet?

2011-09-20 Thread Erik Hatcher
This doesn't seem like a vulnerability at all. What bad effect are you implying here? First of all, in those examples the XML that you injected is encoded in the response, it's not actually part of the XML DOM. And secondly, if you don't want the client to control hl.simple.pre/post you can

Re: term frequencies in sharded environment

2011-09-20 Thread darren
Please see [1] [1] https://issues.apache.org/jira/browse/SOLR-1632 On Tue, 20 Sep 2011 16:14:08 +0200, Massimo Schiavon mschia...@volunia.com wrote: Seems that when I submit a query in a sharded environment the idf component of the scoring formula takes into consideration the local terms

Re: How To perform SQL Like Join

2011-09-20 Thread stockii
http://wiki.apache.org/solr/Join

Re: XML injection interface in select servlet?

2011-09-20 Thread Jonathan Rochkind
On Sep 20, 2011, at 04:33 , Jan Peter Stotz wrote: I am now asking myself why would someone implement such a bloodcurdling vulnerability into a web service? Until now I haven't found an exploit using the parameters in a way an attacker would get an advantage. But the way those parameters are

RE: Questions about LocalParams syntax

2011-09-20 Thread Demian Katz
Space-separation works for the qf field, but not for bq. If I try a bq of format:Book^50 format:Journal^150, I get a strange result -- I would expect in the case of a failed bq that either a) I would get a syntax error of some sort or b) I would get normal search results with no boosting

autocomplete with popularity

2011-09-20 Thread Sentsov Eugeny
hello, Is there autocomplete which counts requests and sorts suggestions according to this count? Ie if users request redlands 50 times and reckless 20 times then suggestions for re should be redlands reckless

Re: autocomplete with popularity

2011-09-20 Thread O. Klein
From http://wiki.apache.org/solr/Suggester : spellcheck.onlyMorePopular=true - if this parameter is set to true then the suggestions will be sorted by weight (popularity) - the count parameter will effectively limit this to a top-N list of best suggestions. If this is set to false then

facet with pivot for docs in multiple hierarchy

2011-09-20 Thread abhayd
hi I have been following http://wiki.apache.org/solr/HierarchicalFaceting We have hierarchical data which we want to use as facets and some of the docs are part of multiple hierarchy. Reason we want to use this is because we want to get entire facet tree in one query. so for example Doc#1:

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-20 Thread Michael McCandless
Since you hit OOME during mmap, I think this is an OS issue not a JVM issue. Ie, the JVM isn't running out of memory. How many segments were in the unoptimized index? It's possible the OS rejected the mmap because of process limits. Run cat /proc/sys/vm/max_map_count to see how many mmaps are
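
The check suggested above can also be scripted. This is a minimal sketch, assuming a Linux host where the kernel exposes the limit at /proc/sys/vm/max_map_count and the current mappings of a process at /proc/<pid>/maps; the helper names are hypothetical.

```python
from pathlib import Path


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Return the per-process mmap limit stored in the given file."""
    return int(Path(path).read_text().strip())


def count_mappings(maps_path: str) -> int:
    """Count the non-empty lines (one per mapping) in a /proc/<pid>/maps file."""
    with open(maps_path) as handle:
        return sum(1 for line in handle if line.strip())


# Only meaningful on Linux; skipped silently elsewhere.
if Path("/proc/sys/vm/max_map_count").exists() and Path("/proc/self/maps").exists():
    print("limit:", read_max_map_count())
    print("mappings in this process:", count_mappings("/proc/self/maps"))
```

To inspect a running Solr JVM, point count_mappings at /proc/<pid>/maps for that process and compare the result with the limit.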

Issues with Solr Highlight

2011-09-20 Thread Krlin, Jiri
Our organization is adopting Solr to facilitate our search functionality. One of the features we are employing is Highlights so that we can give the user a list of search results with the context in which they appear. We are experiencing 2 issues with the snippets being returned. I have tried

Troubleshooting OOM in DIH w/ FileListEntityProcessor and XPathEntityProcessor

2011-09-20 Thread Pulkit Singhal
Hello Everyone, I need help in: (a) figuring out the causes of OutOfMemoryError (OOM) when I run Data Import Handler (DIH), (b) finding workarounds and fixes to get rid of the OOM issue per cause. The stacktrace is at the very bottom to avoid having your eyes glaze over and to prevent you from

dataimport.properties

2011-09-20 Thread Barry Harding
Hi I am currently using the DIH to connect to and import data from a MS SQL Server, and in general doing full, delta or deletes seems to work perfectly. The issue is that I spotted some errors being logged in the tomcat logs for SOLR which are : 19-Sep-2011 07:45:25

Facet count problem : Multi-Select Faceting After grouping results

2011-09-20 Thread Ramzi Alqrainy
Dear all, I am using Solr 4.0. Note that group.truncate=true calculates facet counts based on the most relevant document of each group matching the query. But when I used multi-select faceting (tagging and excluding filters), Solr can't calculate the facet

Is doc verboten?

2011-09-20 Thread chadsteele.com
It seems XML docs that use <doc> fail to be indexed properly, and I've recently discovered that the following fails on my installation: /solr/update?stream.body=<doc></doc> Thoughts? I need to allow content to have <doc> in the XML.

Re: fuzzy search by default

2011-09-20 Thread yongtao_liu
Do not do that. It will make queries very slow. If you really need that, you can either change the Solr code or write a wrapper.

Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-20 Thread Robert Muir
On Mon, Sep 19, 2011 at 9:57 AM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Robert, Removing set from setMaxMergedSegmentMB and using maxMergedSegmentMB fixed the problem. ( Sorry about the multiple posts.  Our mail server was being flaky and the client lied to me about whether the

Re: How to set up the schema to avoid NumberFormatException

2011-09-20 Thread Pulkit Singhal
Hi Hoss, Thanks for the input! Something rather strange happened. I fixed my regex such that instead of returning just 1,000 ... it would return 1,000.00 and voila, it worked! So parsing group separators is apparently already supported ... it's just that the format is also looking for a

how to perform joins with function queries?

2011-09-20 Thread Jason Toy
I had a join query that was originally written as : {!join from=self_id_i to=user_id_i}data_text:hello and that works fine. I later added an fq filter: {!frange l=0.05 }div(termfreq(data_text,'hello'),max_i) and the query doesn't work anymore. if I do the fq by itself without the join the query

Re: autocomplete with popularity

2011-09-20 Thread Markus Jelsma
No. The spellchecker and suggester only operate on the index (tf*idf) and do not account for user-generated input, which is what the user asks for. You need to parse the query logs periodically and index the query strings and their number of occurrences in the query logs as a float value (or use ExternalFileField) to
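
The log-parsing idea can be sketched as follows. This is a minimal illustration, not the setup described in the message: it keeps the counts in memory instead of indexing them into Solr, and the function names and log contents are hypothetical. The counts reuse the redlands/reckless example from the original question.

```python
from collections import Counter


def build_popularity(logged_queries):
    """Count how often each normalized query string appears in the log."""
    return Counter(q.strip().lower() for q in logged_queries if q.strip())


def suggest(popularity, prefix, limit=10):
    """Return logged queries starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in popularity.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically to keep ties stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


# Hypothetical query log: one entry per user request.
log = ["redlands"] * 50 + ["reckless"] * 20 + ["solr"] * 5
popularity = build_popularity(log)
print(suggest(popularity, "re"))  # ['redlands', 'reckless']
```

In a real deployment the counts would be written to the index (or to an external file) on each periodic run, as the message suggests, so that Solr can sort by them.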

How to skip fields when using DIH?

2011-09-20 Thread Pulkit Singhal
The data I'm running through the DIH looks like: <products> <product> <new>false</new> <active>false</active> <regularPrice>349.99</regularPrice> <salesRankShortTerm/> </product> </products> As you can see, in this particular instance of a product, there is no value for salesRankShortTerm which

Re: autocomplete with popularity

2011-09-20 Thread Markus Jelsma
At least, I assumed this is what the user asked for when I read "which counts requests and sorts suggestions according to this count". No. The spellchecker and suggester only operate on the index (tf*idf) and do not account for user generated input which is what the user asks for. You need to

Re: autocomplete with popularity

2011-09-20 Thread Walter Underwood
Ranking suggestions based on query count would be trivially easy to spam. Have a bot make my preferred queries over and over again, and boom they are the most-preferred. wunder On Sep 20, 2011, at 3:41 PM, Markus Jelsma wrote: At least, i assumed this is what the user asked for when i read

Re: How to skip fields when using DIH?

2011-09-20 Thread Pulkit Singhal
OMG, I'm so sorry, please ignore. It's so simple, just had to use: row.remove( 'salesRankShortTerm' ); because the script runs at the end after the entire entity has been processed (I suppose) rather than per field. Thanks! On Tue, Sep 20, 2011 at 5:42 PM, Pulkit Singhal pulkitsing...@gmail.com

Re: autocomplete with popularity

2011-09-20 Thread Markus Jelsma
A query log parser can be written to detect spam. At first you can use cookies (e.g. sessions) and IP-addresses to detect term spam. You can also limit a popularity spike to a reasonable mean size over a longer period. And you can limit rates using logarithms. There are many ways to deal with
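
Two of the measures mentioned here, counting each client only once per query and dampening counts with a logarithm, can be sketched like this. It is an illustration under stated assumptions: the client identifier stands in for a session cookie or IP address, and the function name and sample events are hypothetical.

```python
import math
from collections import defaultdict


def dampened_popularity(events):
    """Score each query as log(1 + number of distinct clients that issued it).

    events is an iterable of (client_id, query) pairs taken from a query log.
    """
    clients_per_query = defaultdict(set)
    for client_id, query in events:
        clients_per_query[query.strip().lower()].add(client_id)
    return {q: math.log1p(len(ids)) for q, ids in clients_per_query.items()}


# Hypothetical log: one bot repeating a query 1000 times, 50 real users once each.
events = [("bot", "cheap pills")] * 1000
events += [(f"user{i}", "redlands") for i in range(50)]
scores = dampened_popularity(events)
print(sorted(scores, key=scores.get, reverse=True))
```

A single client repeating a query no longer raises its score, and the logarithm keeps even a genuine spike from dominating the ranking.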

Re: autocomplete with popularity

2011-09-20 Thread Walter Underwood
Of course you can fight spam. And the spammers can fight back. I prefer algorithms that don't require an arms race with spammers. There are other problems with using query frequency. What about all the legitimate users that type google or facebook into the query box instead of into the

Re: autocomplete with popularity

2011-09-20 Thread Markus Jelsma
Of course you can fight spam. And the spammers can fight back. I prefer algorithms that don't require an arms race with spammers. There are other problems with using query frequency. What about all the legitimate users that type google or facebook into the query box instead of into the

Re: autocomplete with popularity

2011-09-20 Thread Walter Underwood
The original request was for suggestions ranked purely by request count. You have designed something more complicated that probably works better. When I built query completion at Netflix, I used the movie rental rates to rank suggestions. That was simple and very effective. We didn't need a

Re: autocomplete with popularity

2011-09-20 Thread Markus Jelsma
The original request was for suggestions ranked purely by request count. You have designed something more complicated that probably works better. When I built query completion at Netflix, I used the movie rental rates to rank suggestions. That was simple and very effective. We didn't need a

More Moderators for solr-user ?

2011-09-20 Thread Chris Hostetter
I haven't had a chance to read solr-user email in about 2 days, and when looking at this list today I discovered a few emails that were 2 days old awaiting moderation -- which suggests to me that we don't have enough moderators for the solr-user mailing list. If anyone is interested in

Re: Solr and internationalization

2011-09-20 Thread Chris Hostetter
: It seems that this is because my solr app cannot find a resource bundle while : writing the exception message. Lucene supports internationalization in query's : exception messages thanks to the NLS [2] class. ... : Creating the file with the current locale, i.e., :

Re: Lucene Grid question

2011-09-20 Thread Chris Hostetter
: E.g. say I have a chain of book-stores, in different countries, and I'm aiming for the following: : - Each country has its own index file, on its own machine (e.g. books from Japan are indexed on machine japan1) : - Most users search only within their own country (e.g. search only the japan1

Re: field value getting null with special char

2011-09-20 Thread Ranveer Kumar
Any help? I am unable to figure this out. On 20-Sep-2011 2:22 PM, Ranveer ranveer.s...@gmail.com wrote: Hi All, I am facing problem to get value from solr server for a particular field. My environment is : Red hat 5.3 Solr 3.3 Jdk 1.6.24 Tomcat 6.2x Fetching value using SolrJ When using

Re: a weird error of embedded server initiaizationl

2011-09-20 Thread Chris Hostetter
: However, My testing project is a pure Java project. I still don't : understand how it messed up with J2EE stuff. Actually, the same code is : working for CommonsHttpSolrServer without any error like embedded one. but that makes total sense: if you are using CommonsHttpSolrServer then you

Re: a weird error of embedded server initiaizationl

2011-09-20 Thread Xue-Feng Yang
I had solved this problem. Thanks. From: Chris Hostetter hossman_luc...@fucit.org To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Xue-Feng Yang just4l...@yahoo.com Sent: Tuesday, September 20, 2011 9:28:24 PM Subject: Re: a weird error of

cache invalidation in slaves

2011-09-20 Thread roz dev
Hi all, Solr has different types of caches such as filterCache, queryResultCache and documentCache. I know that if a commit is done then a new searcher is opened and new caches are built. And, this makes sense. What happens when commits are happening on master and slaves are pulling all the

Re: cache invalidation in slaves

2011-09-20 Thread Chris Hostetter
: What happens when commits are happening on master and slaves are pulling all : the delta updates. : : Do slaves trash their cache and rebuild them every time there is a new delta : index updates downloaded to slave? replication triggers a commit on slaves so a new searcher (with new caches)

Re: solr 1.4 facet.limit behaviour in merging from several shards

2011-09-20 Thread Chris Hostetter
: document in a shard has a field, which contains date in milliseconds which : is a result of subtraction of the original document's date from a very big : date in the future. In this way, if you issue a facet query against a shard : and use facet.method=index you get hits from the shard ordered

q and fq in solr 1.4.1

2011-09-20 Thread roz dev
Hi all, I am sure that the q vs fq question has been answered several times. But I still have a question which I would like to know the answers for: if we have a Solr query like this q=*&fq=field_1:XYZ&fq=field_2:ABC&sortBy=field_3+asc How does SolrIndexSearcher fire query in 1.4.1 Will it fire
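
For reference, the request in this question can be assembled with its parameter separators written out. This is a minimal sketch using only the Python standard library; the parameter names are copied from the post as written, including sortBy, although Solr's standard sort parameter is named sort.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs (not a dict) so that fq can be repeated.
params = [
    ("q", "*"),
    ("fq", "field_1:XYZ"),
    ("fq", "field_2:ABC"),
    ("sortBy", "field_3 asc"),  # name as written in the post
]

query_string = urlencode(params)
print(query_string)

# Decoding shows the two filter queries arriving as separate fq values.
decoded = parse_qs(query_string)
print(decoded["fq"])
```

Each fq value reaches Solr as its own filter, independent of the main q parameter.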

Re: solr 1.4 facet.limit behaviour in merging from several shards

2011-09-20 Thread Chris Hostetter
: with the setup you describe, there's no way I can imagine executing a : search that results in constraints being returned that come from multiple : shards with some constraints being missing from the middle of the list, : near the border of values for that field that signify a change in