Re: PolySearcher in Solr

2012-04-20 Thread Ramprakash Ramamoorthy
On Thu, Apr 19, 2012 at 9:21 PM, Jeevanandam Madanagopal je...@myjeeva.com wrote: Please have a look at http://wiki.apache.org/solr/DistributedSearch -Jeevanandam On Apr 19, 2012, at 9:14 PM, Ramprakash Ramamoorthy wrote: Dear all, I came across this while browsing through Lucy

Re: # open files with SolrCloud

2012-04-20 Thread Sami Siren
On Thu, Apr 19, 2012 at 3:12 PM, Sami Siren ssi...@gmail.com wrote: I have a simple solrcloud setup from trunk with default configs; 1 shard with one replica. As a few other people have reported, there seems to be some kind of leak somewhere that causes the number of open files to grow over time

Importing formats - Which works best with Solr?

2012-04-20 Thread Spadez
Hi, I am designing a custom scraping solution. I need to store my data, do some post-processing on it and then import it into SOLR. If I want to import data into SOLR in the quickest, easiest way possible, what format should I be saving my scraped data in? I get the impression that .XML would

Re: PolySearcher in Solr

2012-04-20 Thread Lance Norskog
In Solr/Lucene, a shard is one part of an index. There cannot be multiple indices in one shard. All of the shards in an index share the same schema, and no document is in two or more shards. Distributed search as implemented by Solr searches several shards in one index. On Thu, Apr 19, 2012 at

Re: PolySearcher in Solr

2012-04-20 Thread Lance Norskog
The PolySearcher in Lucy seems to do exactly what Distributed Search does in Solr. On Fri, Apr 20, 2012 at 2:58 AM, Lance Norskog goks...@gmail.com wrote: In Solr/Lucene, a shard is one part of an index. There cannot be multiple indices in one shard. All of the shards in an index share the same

Re: Importing formats - Which works best with Solr?

2012-04-20 Thread Dmitry Kan
James, You could create xml files of this format: <add><doc><field name="id">1</field><field name="Name"><![CDATA[James]]></field><field name="Surname"><![CDATA[Willson]]></field></doc><!-- more doc's here --></add> and then post them to SOLR using, for example, the post.sh utility from SOLR's binary distribution. HTH, Dmitry
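As a rough sketch, assuming the stock Solr example layout and the default local URL, posting such a file might look like this (file name is made up):

    # from example/exampledocs in the Solr distribution; post.sh also issues a commit
    ./post.sh mydocs.xml

    # or the equivalent with plain curl
    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @mydocs.xml
    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<commit/>'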

Re: Solr Cloud vs sharding vs grouping

2012-04-20 Thread Lance Norskog
The implementation of grouping in the trunk is completely different from SOLR-236. Grouping works across distributed search: https://issues.apache.org/jira/browse/SOLR-2066, committed last September. On Thu, Apr 19, 2012 at 6:04 PM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com

Re: Wrong categorization with DIH

2012-04-20 Thread Lance Norskog
Working with the DIH is a little easier if you make a database view and load from that. You can set all of the field names and see exactly what the DIH gets. On Thu, Apr 19, 2012 at 10:11 AM, Ramo Karahasan ramo.karaha...@googlemail.com wrote: Hi, yes, I use every one of them. Thanks for your

Re: Solr file size limit?

2012-04-20 Thread Lance Norskog
Good point! Do you store the large files in your documents, or just index them? Do you have a file size limit in your environment? Try this: ulimit -a. What is the file size? On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey s...@elyograg.org wrote: On 4/19/2012 7:49 AM, Bram Rongen wrote:

Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-20 Thread Boon Low
Thanks. My colleague also pointed out a previous thread and the solution: add a new update.chain for the data import/update handlers to bypass the distributed update processor. A simpler use-case example for SolrCloud newbies could be distributed search, to experience the features of the

Re: Solr with UIMA

2012-04-20 Thread dsy99
Hi Rahul, Thank you for the reply. I tried by modifying the updateRequestProcessorChain as follows: <updateRequestProcessorChain name="uima" default="true"> But still I am not able to see the UIMA fields in the result. I executed the following curl command to index a file named test.docx curl
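For reference, a hypothetical command of that shape against the extracting request handler (URL, literal.id and file path are assumptions, not the poster's actual command) could look like the following; since the chain above is marked default="true", it should apply without any extra chain parameter:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
         -F "myfile=@test.docx"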

Re: Date granularity

2012-04-20 Thread Erick Erickson
The only way to get more elegant would be to index the dates with the granularity you want, i.e. truncate to DAY at index time then truncate to DAY at query time as well. Why do you consider ranges inelegant? How else would you imagine it would be done? Best Erick On Thu, Apr 19, 2012 at 4:07
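For reference, Solr date math can express that truncation directly; a hypothetical field name timestamp is assumed here:

    # index the already-truncated value, e.g. 2012-04-20T00:00:00Z,
    # then query with day-rounded endpoints
    timestamp:[NOW/DAY-7DAYS TO NOW/DAY]

    # or match a single day with a term query against the truncated value
    timestamp:"2012-04-20T00:00:00Z"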

Re: Importing formats - Which works best with Solr?

2012-04-20 Thread Erick Erickson
CSV files can also be imported, which may be more compact. Best Erick On Fri, Apr 20, 2012 at 6:01 AM, Dmitry Kan dmitry@gmail.com wrote: James, You could create xml files of format: <add><doc><field name="id">1</field><field name="Name"><![CDATA[James]]></field><field
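A minimal sketch of the CSV route, assuming the default example URL and made-up field names (the /update/csv handler is part of the stock Solr 3.x example config):

    id,Name,Surname
    1,James,Willson

    curl "http://localhost:8983/solr/update/csv?commit=true" \
         --data-binary @mydocs.csv -H 'Content-type:text/plain; charset=utf-8'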

Re: Solr Cloud vs sharding vs grouping

2012-04-20 Thread Martijn v Groningen
Hi Jean-Sebastien, For some grouping features (like total group count and grouped faceting), the distributed grouping requires you to partition your documents into the right shard. Basically groups can't cross shards. Otherwise the group counts or grouped facet counts may not be correct. If you

SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Hi, I just wanted to make sure I understand how distributed indexing works in SolrCloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks

Re: Solr file size limit?

2012-04-20 Thread Bram Rongen
Yeah, I'm indexing some PDF documents.. I've extracted the text through Tika (pre-indexing), and the largest field in my DB is 20MB. That's quite extensive ;) My solution for the moment is to cut this text to the first 500KB; that should be enough for a decent index and search capabilities..

Re: Solr file size limit?

2012-04-20 Thread Bram Rongen
Hmm, reading your reply again I see that Solr only uses the first 10k tokens from each field, so field length should not be a problem per se. It could be that my documents contain very large, unorganized tokens; could this trip up Solr? On Fri, Apr 20, 2012 at 2:03 PM, Bram Rongen

Convert a SolrDocumentList to DocList

2012-04-20 Thread Ramprakash Ramamoorthy
Dear all, Is there any way I can convert a SolrDocumentList to a DocList and set it in the QueryResult object? Or, as a workaround, can I add a SolrDocumentList object to the QueryResult object directly? -- With Thanks and Regards, Ramprakash Ramamoorthy, Project Trainee, Zoho Corporation.

Re: Large Index and OutOfMemoryError: Map failed

2012-04-20 Thread Gopal Patwa
We cannot avoid auto soft commit, since we need the Lucene NRT feature. And I use StreamingUpdateSolrServer for adding/updating the index. On Thu, Apr 19, 2012 at 7:42 AM, Boon Low boon@brightsolid.com wrote: Hi, Also came across this error recently, while indexing with 10 DIH processes in

Re: Date granularity

2012-04-20 Thread vybe3142
... Inelegant as opposed to the possibility of using /DAY to specify day granularity on a single term query. In any case, if that's how SOLR works, that's fine. Any rough idea of the performance of range queries vs. truncated day queries? Otherwise, I might just write up a quick program to compare

Re: SolrCloud indexing question

2012-04-20 Thread Jamie Johnson
My understanding is that you can send your updates/deletes to any shard and they will be forwarded to the leader automatically. That being said, your leader will always be the place where the indexing happens, and the result is then distributed to the other replicas. On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni

Re: How can I use a function or fieldvalue as the default for query(subquery, default)?

2012-04-20 Thread jimtronic
I was able to use solr 3.1 functions to accomplish this logic: /solr/select?q=_val_:sum(query({!dismax qf=text v='solr rocks'}),product(map(query({!dismax qf=text v='solr rocks'},-1),0,100,0,1), product(this_field,that_field)))

Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread kuchenbrett
Hi, I want to build an index of quite a number of PDF and MS Word files using the Data Import Request Handler and the Tika Entity Processor. It works very well. Now I would like to use the MD5 digest of the binary (pdf/word) file as the unique key in the index. But I do not know how to

RE: Maximum Open Cursors using JdbcDataSource and cacheImpl

2012-04-20 Thread Keith Naas
I have removed most of the file to protect the innocent. As you can see, we have a high-level item that has a sub-entity called skus, and those skus contain sub-entities for size/width/etc. The database is configured for only 10 open cursors, and voila, when the 11th item is being processed
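For reference, a data-config.xml sketch of the cached sub-entity pattern (table and column names are invented): with the cache in place the child query runs once and rows are joined from memory rather than holding a JDBC cursor per parent row. The cacheImpl attribute named in the subject serves the same purpose on trunk/3.6.

    <entity name="item" query="select id, name from item">
      <!-- cached: the sku query is executed once and joined in memory -->
      <entity name="sku"
              query="select item_id, sku_code, size, width from sku"
              processor="CachedSqlEntityProcessor"
              where="item_id=item.id"/>
    </entity>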

Re: Abbreviations with KeywordTokenizerFactory

2012-04-20 Thread Erick Erickson
Yeah, this is a pretty ugly problem. You have two problems, neither of which is all that amenable to simple solutions. 1. Context at index time: St, in your example, is either Saint or Street. Solr has nothing built into it to distinguish these, so you need to do some processing

Re: Dismax request handler and Dismax query parser

2012-04-20 Thread Erick Erickson
Right, this is often a source of confusion and there's a discussion about this on the dev list (but the URL escapes me).. Anyway, qt and defType have pretty much completely different meanings. Saying defType=dismax means you're providing all the dismax parameters on the URL. Saying
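A pair of hypothetical request URLs may make the distinction concrete (the handler name and fields are made up, not from the thread):

    # defType: the dismax parameters travel with the request
    /select?defType=dismax&qf=title^2+body&q=solr+cloud

    # qt: the request is routed to a handler that already defines those
    # parameters in solrconfig.xml
    /select?qt=dismax_handler&q=solr+cloud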

Re: Further questions about behavior in ReversedWildcardFilterFactory

2012-04-20 Thread neosky
I have to discard this method at this time. Thank you all the same.

Re: String ordering appears different with sort vs range query

2012-04-20 Thread Erick Erickson
BTW, nice problem statement... Anyway, I see this too in 3.5. I do NOT see this in 3.6 or trunk, so it looks like a bug that got fixed in the 3.6 time-frame. Don't have the time right now to go back over the JIRA's to see... Best Erick On Thu, Apr 19, 2012 at 3:39 PM, Cat Bieber

Special characters in synonyms.txt on Solr 3.5

2012-04-20 Thread carl.nordenf...@bwinparty.com
Hi, I'm having issues with special characters in synonyms.txt on Solr 3.5. I'm running a multi-lingual index and need certain terms to give results across all languages no matter what language the user uses. I figured that this should be easily resolved by just adding the different words to

Re: Special characters in synonyms.txt on Solr 3.5

2012-04-20 Thread Robert Muir
On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com carl.nordenf...@bwinparty.com wrote: Directly injecting the letter ö into synonyms like so: island, ön island, ön renders the following exception on startup (both lines render the same error): java.lang.RuntimeException:

How can I get the top term in solr?

2012-04-20 Thread neosky
Actually I would like to know two meanings of the top term, at the document level and at the index-file level. 1. The top term at the document level means that I would like to know the top term frequency across all documents (counted only once per document). The Solr schema.jsp seems to provide the top 10 terms, but

Re: String ordering appears different with sort vs range query

2012-04-20 Thread Cat Bieber
Thanks for looking at this. I'll see if we can sneak an upgrade to 3.6 into the project to get this working. -Cat On 04/20/2012 12:03 PM, Erick Erickson wrote: BTW, nice problem statement... Anyway, I see this too in 3.5. I do NOT see this in 3.6 or trunk, so it looks like a bug that got

Re: SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Gotcha. Now does that mean if I have 5 threads all writing to a local shard, will that shard piggyback those index requests onto a SINGLE connection to the leader? Or will they spawn 5 connections from the shard to the leader? I really hope the former; the latter won't scale well. On Fri,

Crawling an SCM to update a Solr index

2012-04-20 Thread Van Tassell, Kristian
Hello everyone, I'm in the process of pulling together requirements for an SCM (source code manager) crawling mechanism for our Solr index. I probably don't need to argue the need for a crawler, but to be specific, we have an index which receives its updates from a custom-built application. I

Language Identification

2012-04-20 Thread Bai Shen
I'm working on using Shuyo's work to improve the language identification of our search. Apparently, it's been moved from Nutch to Solr. Is there a reason for this? http://code.google.com/p/language-detection/issues/detail?id=34 I would prefer to have the processing done in Nutch as that has

Re: How to escape “” character in regex in Solr schema.xml?

2012-04-20 Thread smooth almonds
Thanks Jeevanandam. I couldn't get any regex pattern to work except a basic one to look for sentence-ending punctuation followed by whitespace: [.!?](?=\s) However, this isn't good enough for my needs so I'm switching tactics at the moment and working on plugging in OpenNLP's SentenceDetector

Re: SolrCloud indexing question

2012-04-20 Thread Jamie Johnson
I believe the SolrJ code round-robins which server the request is sent to and as such probably wouldn't send to the same server in your case, but if you had an HttpSolrServer for instance and were pointing to only one particular instance, my guess would be that would be 5 separate requests from the

Re: Language Identification

2012-04-20 Thread Jan Høydahl
Hi, Solr just reuses Tika's language identifier. But you are of course free to do your language detection on the Nutch side if you choose and not invoke the one in Solr. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 20. apr.

Re: Opposite to MoreLikeThis?

2012-04-20 Thread Darren Govoni
You could run the MLT for the document in question, then gather all those doc ids in the MLT results and negate those in a subsequent query. Not sure how well that would work with very large result sets, but something to try. Another approach would be to gather the interesting terms from the
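A sketch of the negation step, with made-up document ids standing in for whatever the MLT pass returned:

    # exclude the docs found by the MLT pass from the follow-up query
    q=*:*&fq=-id:(101 102 103)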

null pointer error with solr deduplication

2012-04-20 Thread Peter Markey
Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But,

How to index pdf's content with SolrJ?

2012-04-20 Thread vasuj
I'm trying to index a few PDF documents using SolrJ as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample; below is the code: import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;

Re: Convert a SolrDocumentList to DocList

2012-04-20 Thread Erick Erickson
OK, this description really sounds like an XY problem. Why do you want to do this? What is the higher-level problem you're trying to solve? Best Erick On Fri, Apr 20, 2012 at 9:18 AM, Ramprakash Ramamoorthy youngestachie...@gmail.com wrote: Dear all,        Is there any way I can convert a

Re: How to index pdf's content with SolrJ?

2012-04-20 Thread Erick Erickson
This might help: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/ The bit here is you have to have Tika parse your file and then extract the content to send to Solr... Best Erick On Fri, Apr 20, 2012 at 7:36 PM, vasuj vasu.j...@live.in wrote:
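A minimal sketch of that approach, assuming SolrJ 3.6-style HttpSolrServer and Tika on the classpath; the file path, field names and Solr URL are placeholders, not anything from the thread:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {
      public static void main(String[] args) throws Exception {
        File pdf = new File("some.pdf");

        // let Tika detect the format and extract the plain text
        BodyContentHandler text = new BodyContentHandler(-1); // -1 = no size limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(pdf)) {
          new AutoDetectParser().parse(in, text, metadata);
        }

        // send the extracted text to Solr as an ordinary document
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getName());
        doc.addField("title", metadata.get("title"));
        doc.addField("text", text.toString());
        solr.add(doc);
        solr.commit();
      }
    }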

Re: Crawling an SCM to update a Solr index

2012-04-20 Thread Otis Gospodnetic
Kristian, For what it's worth, for http://search-lucene.com and http://search-hadoop.com we simply check out the source code from the SCM and index from the file system. It works reasonably well. The only issues that I can recall us having are with the source code organization under SCM -

Re: Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread Otis Gospodnetic
Hi Joe, You could write a custom URP - Update Request Processor. This URP would take the value from one SolrDocument field (say the one that has the full path to your PDF and is thus unique), compute the MD5 using the Java API for doing that, and would stick that MD5 value in some field that you've
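A rough sketch of such a processor; the class name, the source field file_path and the target field md5 are made up for illustration, and error handling is trimmed:

    import java.io.IOException;
    import java.security.MessageDigest;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class Md5SignatureProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object path = doc.getFieldValue("file_path");    // source field, assumed name
            if (path != null) {
              doc.setField("md5", md5Hex(path.toString()));  // target field, assumed name
            }
            super.processAdd(cmd);                           // pass the doc down the chain
          }
        };
      }

      private static String md5Hex(String value) {
        try {
          byte[] digest = MessageDigest.getInstance("MD5").digest(value.getBytes("UTF-8"));
          StringBuilder hex = new StringBuilder();
          for (byte b : digest) {
            hex.append(String.format("%02x", b));
          }
          return hex.toString();
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    }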

Question concerning date fields

2012-04-20 Thread Bill Bell
We are loading a long (number of seconds since 1970?) value into Solr using Java and SolrJ. What is the best way to convert this into the right Solr date fields? Sent from my Mobile device 720-256-8076

Re: Question concerning date fields

2012-04-20 Thread Gora Mohanty
On 21 April 2012 09:12, Bill Bell billnb...@gmail.com wrote: We are loading a long (number of seconds since 1970?) value into Solr using java and Solrj. What is the best way to convert this into the right Solr date fields? [...] There are various options, depending on the source of your
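One option, sketched below with an assumed field name, is to convert the epoch seconds to java.util.Date on the SolrJ side; SolrJ serializes Date values into Solr's UTC date format for you:

    import java.util.Date;
    import org.apache.solr.common.SolrInputDocument;

    // epoch seconds -> milliseconds -> java.util.Date
    long epochSeconds = 1334880000L;                 // 2012-04-20T00:00:00Z
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("created_at", new Date(epochSeconds * 1000L));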