Re: processing documents in solr

2013-07-27 Thread Joe Zhang
On a related note, inspired by what you said, Shawn, an auto-increment id seems perfect here. Yet I found there is no such support in Solr. The UUID only guarantees uniqueness. On Fri, Jul 26, 2013 at 10:50 PM, Joe Zhang smartag...@gmail.com wrote: Thanks for your kind reply, Shawn. On Fri, Jul
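Joe's distinction can be illustrated outside Solr (a plain-Python sketch, not Solr code): version-4 UUIDs are unique, but their sort order is unrelated to generation order, so they cannot serve as the monotonically increasing cursor an auto-increment id would provide.

```python
import uuid

# Generate a batch of random (version 4) UUIDs.
ids = [uuid.uuid4() for _ in range(1000)]

# Uniqueness holds, as Joe notes...
assert len(set(ids)) == len(ids)

# ...but generation order and sort order are unrelated, so a UUID
# field cannot stand in for an auto-increment id when a monotonically
# increasing "crawl cursor" is what you actually need.
generated_order = [str(i) for i in ids]
print(sorted(generated_order) != generated_order)  # True
```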

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/26/2013 11:50 PM, Joe Zhang wrote: == Essentially we are doing pagination here, right? If performance is not the concern, given that the index is dynamic, does the order of entries remain stable over time? Yes, it's pagination. Just like the other method that I've described in detail,

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
On Fri, Jul 26, 2013 at 11:18 PM, Shawn Heisey s...@elyograg.org wrote: On 7/26/2013 11:50 PM, Joe Zhang wrote: == Essentially we are doing pagination here, right? If performance is not the concern, given that the index is dynamic, does the order of entries remain stable over time?

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 12:30 AM, Joe Zhang wrote: == so a url field would work fine? As long as it's guaranteed unique on every document (especially if it is your uniqueKey) and goes into the index as a single token, that should work just fine for the range queries I've described. Thanks, Shawn
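The range-query walk Shawn describes can be sketched in plain Python (a simulation over a key set, not Solr code): repeatedly ask for the next batch of documents whose uniqueKey sorts after the last key seen, the equivalent of a Solr query like `q=id:{LAST TO *}&sort=id asc&rows=N`.

```python
# Simulate walking an index by uniqueKey range instead of start/rows
# offsets: each "query" only asks for keys greater than the last seen.
def walk_index(keys, rows=2):
    """Yield all keys in batches using only 'greater than last key' queries."""
    last = None
    while True:
        batch = sorted(k for k in keys if last is None or k > last)[:rows]
        if not batch:
            return
        yield batch
        last = batch[-1]  # cursor for the next range query

index = {"a", "b", "c", "d", "e"}
pages = list(walk_index(index, rows=2))
print(pages)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Unlike offset paging, each step is a cheap range query regardless of how deep into the index the walk has gotten.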

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks. On Fri, Jul 26, 2013 at 11:34 PM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 12:30 AM, Joe Zhang wrote: == so a url field would work fine? As long as it's guaranteed unique on every document (especially if it is your uniqueKey) and goes into the index as a single token,

Re: Synonym Phrase

2013-07-27 Thread Mikhail Khludnev
Hello, As far as I know http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ has some usage in the industry. On Fri, Jul 26, 2013 at 8:28 PM, Jack Krupansky j...@basetechnology.comwrote: Hmmm... Actually, I think there was also a solution where you could specify an alternate

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case for regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, when we need to select an endless

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
Dear list, I've written a special processor exactly for this kind of operation https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch This is how we use it http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch It is capable of

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me

RE: How to Make That Domains Should Be First?

2013-07-27 Thread Markus Jelsma
Hi - To make this work you'll need a homepage flag and some specific hostname analysis and function query boosting. I assume you're still using Nutch, so detecting homepages is easy using NUTCH-1325. To actually get the homepage flag into Solr you need to modify the indexer to ingest the

Re: problems about solr replication in 4.3

2013-07-27 Thread Erick Erickson
Well, a full import is going to re-import everything in the database, and the presumption is that each and every document would be replaced (because presumably your uniqueKey is the same). So every document will be deleted and re-added. So essentially you'll get a completely new index every time.

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Erick Erickson
What is your autocommit limit? Is it possible that your transaction logs are simply getting too large? tlogs are truncated whenever you do a hard commit (autocommit), with openSearcher either true or false; it doesn't matter. FWIW, Erick On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt

Re: Solr-4663 - Alternatives to use same data dir in different cores for optimal cache performance

2013-07-27 Thread Erick Erickson
You can certainly have multiple Solrs pointing to the same underlying physical index if (and only if) you absolutely guarantee that only one Solr will write to the index at a time. But I'm not sure whether this is premature optimization or not. Problem is that your multiple Solrs are eating up the same

Re: Querying a specific core in solr cloud

2013-07-27 Thread Erick Erickson
Not quite sure what's happening here. It would be interesting to see whether the requests are actually going to the right IP, by tailing out the logs. It _may_ be that the distrib=false isn't honored if there is no core on the target machine (I haven't looked at the code). To test that, go ahead

Re: Sending shard requests to all replicas

2013-07-27 Thread Erick Erickson
This has been suggested, but so far it's not been implemented as far as I know. I'm curious though, how many shards are you dealing with? I wonder if it would be a better idea to try to figure out _why_ you so often have a slow shard and whether the problem could be cured with, say, better

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
Thanks for sharing, Roman. I'll look into your code. One more thought on your suggestion, Shawn. In fact, for the id, we need more than unique and rangeable; we also need some sense of atomic values. Your approach might run into trouble with a text-based id field, say: the id/key has values 'a',

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:17 AM, Joe Zhang wrote: Thanks for sharing, Roman. I'll look into your code. One more thought on your suggestion, Shawn. In fact, for the id, we need more than unique and rangeable; we also need some sense of atomic values. Your approach might run into risk with a text-based
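Joe's concern can be made concrete with a small sketch (plain Python standing in for the range-query walk): with a text-based id field, a document that arrives mid-crawl can sort into a range the walk has already passed, and the "greater than last key" queries never see it.

```python
# Simulate q=id:{last TO *}&sort=id asc&rows=N against a key set.
def next_batch(keys, last, rows=2):
    return sorted(k for k in keys if last is None or k > last)[:rows]

index = {"a", "b", "c"}
seen, last = [], None

batch = next_batch(index, last)   # first page: ['a', 'b']
seen += batch
last = batch[-1]                  # cursor is now 'b'

index.add("ab")                   # new doc arrives; 'ab' sorts before 'b'

while batch := next_batch(index, last):
    seen += batch
    last = batch[-1]

print(sorted(set(index) - set(seen)))  # ['ab'] was silently skipped
```

The walk still terminates and never repeats a document; the cost is that documents inserted behind the cursor are missed until a later pass.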

Re: processing documents in solr

2013-07-27 Thread Joe Zhang
I have a constantly growing index, so not updating the index isn't practical... Going back to the beginning of this thread: when we use the vanilla *:*+pagination approach, would the ordering of documents remain stable? the index is dynamic: update/insertion only, no deletion. On Sat,

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman, Let me briefly explain the design: a special RequestParser stores the servlet output stream into the context https://github.com/m-khl/solr-patches/compare/streaming#L7R22 then a special component injects a special PostFilter/DelegatingCollector which writes right into the output

Re: processing documents in solr

2013-07-27 Thread Shawn Heisey
On 7/27/2013 11:38 AM, Joe Zhang wrote: I have a constantly growing index, so not updating the index can't be practical... Going back to the beginning of this thread: when we use the vanilla *:*+pagination approach, would the ordering of documents remain stable? the index is dynamic:

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Tim Vaillancourt
Thanks for the reply Erick, Hard Commit - 15000ms, openSearcher=false Soft Commit - 1000ms, openSearcher=true 15sec hard commit was sort of a guess, I could try a smaller number. When you say "getting too large", what limit do you think it would be hitting: a ulimit (nofiles), disk space, number
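Tim's intervals map onto solrconfig.xml roughly like this (a sketch using his stated values; everything else is standard Solr update-handler configuration, not taken from the thread):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to disk and truncates tlogs; tune maxTime
       to how quickly restarted nodes should be able to replay -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls visibility of new documents only -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```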

Early Access Release #4 for Solr 4.x Deep Dive book is now available for download on Lulu.com

2013-07-27 Thread Jack Krupansky
Okay, it’s hot off the e-presses: Solr 4.x Deep Dive, Early Access Release #4 is now available for purchase and download as an e-book for $9.99 on Lulu.com at: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html (That link says “1”, but

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail, I can see it is lazy-loading, but I can't judge how complex it becomes (presumably, the filter dispatching mechanism is also doing other things - it is there not only for streaming). Let me just explain better what I found when I dug inside solr: documents (results of the query)

Re: processing documents in solr

2013-07-27 Thread Roman Chyla
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey s...@elyograg.org wrote: On 7/27/2013 11:38 AM, Joe Zhang wrote: I have a constantly growing index, so not updating the index can't be practical... Going back to the beginning of this thread: when we use the vanilla *:*+pagination approach,

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Jack Krupansky
No hard numbers, but the general guidance is that you should set your hard commit interval to match your expectations for how quickly nodes should come up if they need to be restarted. Specifically, a hard commit assures that all changes have been committed to disk and are ready for immediate

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 4:30 PM, Roman Chyla roman.ch...@gmail.com wrote: Let me just explain better what I found when I dug inside solr: documents (results of the query) are loaded before they are passed into a writer - so the writers are expecting to encounter the solr documents, but these

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello, Please find my replies below. Let me just explain better what I found when I dug inside solr: documents (results of the query) are loaded before they are passed into a writer - so the writers are expecting to encounter the solr documents, but these documents were loaded by one of the components

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Erick Erickson
Tim: 15 seconds isn't unreasonable, I was mostly wondering if it was hours. Take a look at the size of the tlogs as you're indexing, you should see them truncate every 15 seconds or so. There'll be a varying number of tlogs kept around, although under heavy indexing I'd only expect 1 or 2

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 5:05 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: anyway, even if the writer pulls docs one by one, it doesn't allow streaming a billion of them. Solr writes out DocList, which is really problematic even in deep-paging scenarios. Which part is problematic... the

Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Hi Erick, thanks. I have about 40 shards. repFactor=2. The cause of slower shards is very interesting, and this is the main approach we took. Note that in every query, it is another shard which is the slowest. In 20% of the queries, the slowest shard takes about 4 times more than the average

Re: SolrCloud 4.3.1 - Failure to open existing log file (non fatal) errors under high load

2013-07-27 Thread Tim Vaillancourt
Thanks Jack/Erick, I don't know if this is true or not, but I've read there is a tlog per soft commit, which is then truncated by the hard commit. If this were true, a 15sec hard-commit with a 1sec soft-commit could generate around 15 tlogs, but I've never checked. I like Erick's scenario

Re: Sending shard requests to all replicas

2013-07-27 Thread Shawn Heisey
On 7/27/2013 3:33 PM, Isaac Hebsh wrote: I have about 40 shards. repFactor=2. The cause of slower shards is very interesting, and this is the main approach we took. Note that in every query, it is another shard which is the slowest. In 20% of the queries, the slowest shard takes about 4 times

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-27 Thread Shawn Heisey
On 7/26/2013 2:03 PM, Gustav wrote: The problem here is that in my client's application, the query being encoded in iso-8859-1 is a *must*. So, this is kind of a trouble here. I just don't get how this encoding could work on queries in version 3.5, but it doesn't in 4.3. I brought up the issue

Re: Sending shard requests to all replicas

2013-07-27 Thread Isaac Hebsh
Shawn, thank you for the tips. I know the significant cons of virtualization, but I don't want to move this thread into a virtualization pros/cons in the Solr(Cloud) case. I've just asked what minimal code change should be made, in order to examine whether this is a possible solution or

Searching in stopwords

2013-07-27 Thread Rohit Kumar
I have a company search which uses stopwords at query time. In my stopwords list I have entries like: HR Club India Pvt. Ltd. So if I search for companies like HR Club I get no results. Similarly, a search for India HR gives no results. How can I get results in query for the following

Re: Searching in stopwords

2013-07-27 Thread Jack Krupansky
Edismax should be able to handle a query consisting of only query-time stop words. What does your text field type analyzer look like? -- Jack Krupansky -Original Message- From: Rohit Kumar Sent: Saturday, July 27, 2013 9:59 PM To: solr-user@lucene.apache.org Subject: Searching in

Re: processing documents in solr

2013-07-27 Thread Maurizio Cucchiara
In both cases, for better performance, I'd first load just all the IDs; then, during processing, I'd load each document. As for the incremental requirement, it should not be difficult to write a hash function which maps a non-numerical id to a value. On Jul 27, 2013 7:03 AM, Joe Zhang
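Maurizio's hash-function idea can be sketched as follows (a plain-Python illustration; the function name and bit width are my assumptions). One caveat worth noting: a generic hash preserves uniqueness with high probability, but it does not preserve insertion order, so by itself it does not make the field incremental.

```python
import hashlib

# Map a non-numeric id (e.g. a URL) to a stable fixed-width integer,
# so numeric ranges can be computed over it.
def id_to_number(doc_id: str, bits: int = 64) -> int:
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")

a = id_to_number("http://example.com/page1")
b = id_to_number("http://example.com/page2")
print(a != b)                                   # distinct ids -> distinct numbers
print(id_to_number("x") == id_to_number("x"))   # deterministic across calls
```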

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley yo...@lucidworks.com wrote: Which part is problematic... the creation of the DocList (the search), Literally, DocList is a copy of TopDocs. Creating TopDocs is not a search but ranking. And the ranking cost is log(rows+start), besides numFound, which
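The cost Mikhail refers to can be sketched with a priority queue (a plain-Python illustration of the general top-k idea, not Lucene's actual collector): ranking keeps a queue of size start+rows, so each of the numFound hits costs O(log(start+rows)), which is why very large start offsets make deep paging expensive.

```python
import heapq
import random

# Collect one "page" of ranked results: keep the top start+rows scores,
# then discard the first `start`. The queue size -- and the per-hit
# log factor -- grows with the offset.
def top_docs(scores, start, rows):
    k = start + rows
    best = heapq.nlargest(k, scores)      # O(numFound * log k)
    return best[start:start + rows]

random.seed(0)
scores = [random.random() for _ in range(10_000)]
page = top_docs(scores, start=100, rows=10)
print(len(page))  # 10
```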