Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood
). And so on. As one of my coworkers said, trying to turn a PDF into structured text is like trying to turn hamburger back into a cow. PDF is where text goes to die. Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 17, 2015, at 2:48 AM, Charlie H

Re: TPS with Solr Cloud

2015-12-21 Thread Walter Underwood
cache. Test with production logs. Choose logs where the number of distinct queries is much larger than your cache sizes. If your caches hold 1024 entries, it would be good to have 100K distinct queries. That might mean a total log size of a few million queries. wunder Walter Underwood wun
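A quick way to check a candidate log against that advice, sketched in Python. The one-query-per-entry input, the light normalization, and the function name are assumptions for illustration, not anything from the thread.

```python
def distinct_query_ratio(queries, cache_size):
    """Return (distinct_count, distinct_count / cache_size) for a query log."""
    # Assumption: trimming and lower-casing is enough to treat two log
    # entries as the same query. A real cache key may differ.
    distinct = {q.strip().lower() for q in queries if q.strip()}
    return len(distinct), len(distinct) / cache_size


# Tiny made-up log; a real test would read millions of production queries.
log = ["blade runner", "Blade Runner ", "twilight", "hobbit", ""]
count, ratio = distinct_query_ratio(log, cache_size=2)
print(count, ratio)
```

A ratio close to 1 means the log mostly fits in the cache and the benchmark will flatter the system; 100K distinct queries against a 1024-entry cache gives a ratio near 100.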

Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-22 Thread Walter Underwood
questions. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore wrote: > > Hi All, > > my website is under pressure, there is a big number of concurrent searches. > When the connected

Re: Limit fields returned in solr based on content

2015-12-24 Thread Walter Underwood
I would do that in a middle tier. You can’t do every single thing in Solr. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 24, 2015, at 1:21 PM, Upayavira wrote: > > You could create a custom DocTransformer. They can enhance t

Re: Memory Usage increases by a lot during and after optimization .

2015-12-29 Thread Walter Underwood
ex is continually updated, clicking that is a complete waste of resources. Don’t do it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 29, 2015, at 6:35 PM, Zheng Lin Edwin Yeo wrote: > > Hi, > > I am facing a situation, when I d

Re: Memory Usage increases by a lot during and after optimization .

2015-12-29 Thread Walter Underwood
fter > optimization, the index size reduces. Do we still need to do that? > > Regards, > Edwin > > On 30 December 2015 at 10:45, Walter Underwood > wrote: > >> Do not “optimize”. >> >> It is a forced merge, not an optimization. It was a mistake to eve

Re: Solr index segment level merge

2015-12-29 Thread Walter Underwood
You probably do not NEED to merge your indexes. Have you tried not merging the indexes? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 29, 2015, at 7:31 PM, jeba earnest wrote: > > I have a scenario that I need to merge the sol

Re: Data migration from one collection to the other collection

2016-01-05 Thread Walter Underwood
You could send the documents to both and filter out the recent ones in the history collection. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 5, 2016, at 5:46 AM, vidya wrote: > > Hi > > I would like to maintain two cores f

Re: Solrcloud for Java 1.6

2016-01-07 Thread Walter Underwood
require Java 7 was made at some point in the 4.x development. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 7, 2016, at 7:26 PM, billnb...@gmail.com wrote: > > Run it on 2 separate boxes > > Bill Bell > Sent from mobile >

Re: solr score threashold

2016-01-20 Thread Walter Underwood
give worse results than a vector space model, but you can have thresholds. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 20, 2016, at 5:11 AM, Emir Arnautovic > wrote: > > Hi Sara, > You can use funct and frange to achive n

Re: Scaling SolrCloud

2016-01-21 Thread Walter Underwood
would still be up. If you are OK with that risk, run three nodes. If not, run five. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 21, 2016, at 9:27 AM, Erick Erickson wrote: > > NP. My usual question though is "how often do you ex

Re: Taking Solr to production

2016-01-22 Thread Walter Underwood
very unusual queries. Median response time was much better, about 50 milliseconds. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 22, 2016, at 2:45 PM, Toke Eskildsen wrote: > > Aswath Srinivasan (TMS) wrote: >> * To

Re: schemaless vs schema based core

2016-01-22 Thread Walter Underwood
Yo. That is the truth. You can get stuff indexed with an automatic schema, but if you want to make your customers happy, tune it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 22, 2016, at 6:22 PM, Erick Erickson wrote: > >

Re: Memory leak defect or misssuse of SolrJ API?

2016-01-30 Thread Walter Underwood
think they even point out what is thread safe. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 30, 2016, at 7:42 AM, Susheel Kumar wrote: > > Hi Steve, > > Can you please elaborate what error you are getting and i didn't un

Re: Memory leak defect or misssuse of SolrJ API?

2016-01-31 Thread Walter Underwood
Solr server, you have one object. There is no leak in HttpSolrClient, you are misusing the class, massively. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 31, 2016, at 2:10 PM, Steven White wrote: > > Thank you all for your

Re: Memory leak defect or misssuse of SolrJ API?

2016-01-31 Thread Walter Underwood
be a lot faster after you reuse the client class. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jan 31, 2016, at 3:46 PM, Steven White wrote: > > Thanks Walter. Yes, I saw your answer and fixed the issue per your > suggestion. >

Re: large number of fields

2016-02-05 Thread Walter Underwood
. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 5, 2016, at 8:13 AM, Jack Krupansky wrote: > > This doesn't sound like a great use case for Solr - or any other search > engine for that matter. I'm not sure what yo

Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Making two indexing calls, one to each, works until one system is not available. Then they are out of sync. You might want to put the updates into a persistent message queue, then have both systems indexed from that queue. wunder Walter Underwood wun...@wunderwood.org http

Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Updating two systems in parallel gets into two-phase commit, instantly. So you need a persistent pool of updates that both clusters pull from. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 9, 2016, at 4:15 PM, Shawn Heisey wrote: >

Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
I agree. If the system updates synchronously, then you are in two-phase commit land. If you have a persistent store that each index can track, then things are good. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 9, 2016, at 7:37 PM, Sh
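A minimal sketch of the "persistent store that each index can track" idea from the replies above. The class names are hypothetical, and an in-memory list stands in for a real persistent queue or log; this is not how any particular queue product works.

```python
class UpdateLog:
    """Append-only list of updates; stands in for a persistent queue or store."""

    def __init__(self):
        self._entries = []

    def append(self, update):
        self._entries.append(update)

    def read_from(self, offset):
        # Each follower asks for everything after the position it has reached.
        return self._entries[offset:]


class IndexFollower:
    """Hypothetical stand-in for one search cluster pulling from the log."""

    def __init__(self, name):
        self.name = name
        self.offset = 0        # how far into the log this site has applied
        self.docs = {}
        self.available = True

    def catch_up(self, log):
        if not self.available:
            return 0           # nothing is lost: the log still holds the updates
        pending = log.read_from(self.offset)
        for doc in pending:
            self.docs[doc["id"]] = doc
        self.offset += len(pending)
        return len(pending)


log = UpdateLog()
site_a, site_b = IndexFollower("a"), IndexFollower("b")

log.append({"id": "1", "title": "first"})
site_b.available = False               # second site goes down
log.append({"id": "2", "title": "second"})

site_a.catch_up(log)
site_b.catch_up(log)                   # no-op while the site is down
site_b.available = True
site_b.catch_up(log)                   # replays what it missed

print(site_a.docs == site_b.docs)
```

Because each site only advances its own offset, an outage at one site delays that site without putting the two out of sync.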

Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-11 Thread Walter Underwood
. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 11, 2016, at 10:06 AM, Erick Erickson wrote: > > Steven's solution is a very common one, complete to the > notion of re-chunking. Depending on the throughput requirements, >

Re: words with spaces within

2016-02-22 Thread Walter Underwood
This happens for fonts where Tika does not have font metrics. Open the document in Adobe Reader, then use document info to find the list of fonts. Then post this question to the Tika list. Fix it in Tika, don’t patch it in Solr. wunder Walter Underwood wun...@wunderwood.org http

Re: What search metrics are useful?

2016-02-24 Thread Walter Underwood
good introduction. http://rosenfeldmedia.com/books/search-analytics-for-your-site/ Sea Urchin is doing some good work in search metrics: https://seaurchin.io/ wunder Walter Underwood wun...@wu

Re: Query time de-boost

2016-02-25 Thread Walter Underwood
by that value. I haven’t tried any of these, of course. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 25, 2016, at 3:33 PM, Binoy Dalal wrote: > > According to the edismax documentation, negative boosts are supported, so >

Disable phrase search in edismax?

2016-02-26 Thread Walter Underwood
I’m creating a query from MLT terms, then sending it to edismax. The neighboring words in the query are not meaningful phrases. Is there a way to turn off phrase creation and search for one query? Or should I separate them all with “OR”? wunder Walter Underwood wun...@wunderwood.org http

Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Walter Underwood
ple of the need for shingle-type synonyms. wunder Walter Underwood Former GO.com/Infoseek search engineer wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Indexing books, chapters and pages

2016-03-01 Thread Walter Underwood
You could index both pages and chapters, with a type field. You could index by chapter with the page number as a payload for each token. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati wrote: > >

Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
If you need transactions, you should use a different system, like MarkLogic. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 3, 2016, at 8:46 PM, sangs8788 > wrote: > > Hi Emir, > > Right now we are having only inserts i

Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
So batch them. You get a response back from Solr saying whether the document was accepted. If that fails, there is a failure. What do you do then? After every 100 docs or one minute, do a commit. Then delete the documents from the input queue. What do you do when the commit fails? wunder Walter
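The "100 docs or one minute" rule can be sketched as below. Everything here is hypothetical: the class name, the commit callable (a stand-in for sending the batch to Solr and committing), and the choice to keep documents pending when a commit fails, which is only one possible answer to the question posed above.

```python
import time


class BatchCommitter:
    """Collect docs and commit after max_docs docs or max_seconds, whichever comes first."""

    def __init__(self, commit, max_docs=100, max_seconds=60.0, clock=time.monotonic):
        self._commit = commit          # hypothetical callable: returns True on success
        self._max_docs = max_docs
        self._max_seconds = max_seconds
        self._clock = clock
        self._pending = []
        self._batch_started = None

    def add(self, doc):
        if not self._pending:
            self._batch_started = self._clock()
        self._pending.append(doc)
        return self.maybe_commit()

    def maybe_commit(self):
        """Return the docs that are now safe to delete from the input queue."""
        if not self._pending:
            return []
        age = self._clock() - self._batch_started
        if len(self._pending) < self._max_docs and age < self._max_seconds:
            return []
        if not self._commit(list(self._pending)):
            # Commit failed: keep everything pending so nothing is deleted
            # from the input queue before it is safely committed.
            return []
        done, self._pending = self._pending, []
        return done
```

The caller deletes from its input queue only what add() or maybe_commit() returns, so a failed commit never causes a document to be dropped.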

Re: What is the best way to index 15 million documents of total size 425 GB?

2016-03-04 Thread Walter Underwood
> On Mar 3, 2016, at 9:54 AM, Aneesh Mon N wrote: > > To be noted that all the fields are stored so as to support the atomic > updates. Are you doing all of these updates as atomic? That could be slow. If you are supplying all the fields, then just do a regular add. wunder Walt

Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Walter Underwood
ing> wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 6, 2016, at 7:27 AM, Jack Krupansky wrote: > > Back to the original question... there are two answers: > > 1. Yes - for guru-level Solr experts. > 2. No - for anybody else

Re: Solr Cloud sharding strategy

2016-03-07 Thread Walter Underwood
queries. * 5000 queries is not nearly enough. That totally fits in cache. I usually start with 100K, though I’d like more. Benchmarking a cached system is one of the hardest things in devops. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 7, 2

Re: Why is multiplicative boost prefered over additive?

2016-03-18 Thread Walter Underwood
the popularity scale. I gave up and made it work for popular movies. Here at Chegg, multiplicative boost works fine. Don’t think so much about the absolute values of the scores. All we care about is ordering. Work with real user queries, not with theory. wunder Walter Underwood wun

Re: Why is multiplicative boost prefered over additive?

2016-03-18 Thread Walter Underwood
hundreds of views? People really will notice when the 1978 animated version shows up before the Peter Jackson films. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 18, 2016, at 8:18 AM, > wrote: > > On Friday, March 18, 2016

Re: Why is multiplicative boost prefered over additive?

2016-03-19 Thread Walter Underwood
one rented one million times and the one rented 800 thousand times (think about the Twilight movies at Netflix). But it also distinguishes between the one rented 100 times and the one rented 80 times. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) >

Re: Why is multiplicative boost prefered over additive?

2016-03-20 Thread Walter Underwood
” not be the first hit for that? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 18, 2016, at 8:48 AM, > wrote: > > On Friday, March 18, 2016 4:25 PM, wun...@wunderwood.org wrote: >> >> That works fine if you have a q

Re: Delete by query using JSON?

2016-03-22 Thread Walter Underwood
on this list is an “XY problem”, where the poster has problem X and has assumed solution Y, which is not the right solution. But they ask about Y. So we will tell people that their approach is wrong, because that is the most helpful thing we can do. wunder Walter Underwood wun...@wunderwood.org

Re: Can Solr recognize daylight savings time?

2016-03-25 Thread Walter Underwood
If possible, log in UTC. Daylight time causes amusing problems in logs, like one day with 23 hours and one day with 25. You can always convert to local time when you display it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 25, 2016, at 8
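For scripts that sit around Solr, logging in UTC can look like the sketch below. This shows Python's standard logging module only; Solr's own logging is configured separately and is not covered here. The logger name is an assumption.

```python
import logging
import time

formatter = logging.Formatter(
    fmt="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
)
# The default converter is time.localtime; gmtime makes every timestamp UTC,
# so there is no 23-hour or 25-hour day in the log.
formatter.converter = time.gmtime

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("utc_example")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("commit finished")
```

Conversion to local time can then happen at display time, as suggested above.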

Re: Solr slave is doing full replication (entire index) of index after master restart

2016-04-09 Thread Walter Underwood
I’m not sure this is a legal polling interval: 00:00:60 Try: 00:01:00 Also, polling every minute is very frequent. Try a longer period. Check the clocks on the two systems. If the clocks are not synchronized, that could cause problems. wunder Walter Underwood wun
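To avoid hand-writing an interval like 00:00:60, a number of seconds can be converted to HH:MM:SS mechanically. A small sketch; the function name is hypothetical, and whether Solr rejects 00:00:60 is not confirmed in the thread.

```python
def to_hh_mm_ss(total_seconds):
    """Format a non-negative number of seconds as HH:MM:SS with seconds below 60."""
    if total_seconds < 0:
        raise ValueError("interval must not be negative")
    hours, rem = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(to_hh_mm_ss(60))    # one minute
print(to_hh_mm_ss(300))   # a longer, five-minute period
```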

Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-14 Thread Walter Underwood
age+Analysis> 3. Learn the analysis tool in the Solr admin UI. That allows you to explore the behavior. 4. If you really need a high grade morphological analyzer, consider purchasing one from Basis Technology: http://www.rosette.com/solr/ wunder Walter U

Re: Referencing incoming search terms in searchHandler XML

2016-04-14 Thread Walter Underwood
e two. If your customer absolutely insists on having every single figo doc above non-figo docs, well, they deserve what they get. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-15 Thread Walter Underwood
connections open or pool them, because PHP doesn’t do that. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 15, 2016, at 8:39 AM, Sara Woodmansee wrote: > > Hi Shawn, > > No clue what PHP client they are using. > >

Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-17 Thread Walter Underwood
No, Zookeeper is used for managing the locations of replicas and the leader for indexing. Queries should still be distributed with a load balancer. Queries do NOT go through Zookeeper. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 17, 2

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Walter Underwood
http://dvd.netflix.com/Search?v1=blade+runner At Netflix (when I was there), those were shown in popularity order with a boost function. And for stemming, should the movie “Saw” match “see”? Maybe not. wunder Walter Underwood wun...@wund

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Walter Underwood
32 GB is a pretty big heap. If the working set is really smaller than that, the extra heap just makes a full GC take longer. How much heap is used after a full GC? Take the largest value you see there, then add a bit more, maybe 25% more or 2 GB more. wunder Walter Underwood wun
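The sizing rule above as arithmetic. The sample readings are made up, and the function name is hypothetical; the 25% and 2 GB margins are the ones mentioned in the message.

```python
def suggested_heap_gb(used_after_full_gc_gb):
    """Return (peak + 25%, peak + 2 GB) from heap-used-after-full-GC readings."""
    peak = max(used_after_full_gc_gb)
    return peak * 1.25, peak + 2.0


# Made-up readings, in GB, of heap in use right after several full GCs.
readings = [3.5, 4.0, 3.0]
print(suggested_heap_gb(readings))
```

With a 4 GB peak, either margin lands far below a 32 GB heap, which is the point of the advice.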

Absolute path name for external file field

2015-08-13 Thread Walter Underwood
that still possible? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Admin Login

2015-08-15 Thread Walter Underwood
No one runs a public-facing Solr server. Just like no one runs a public-facing MySQL server. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Aug 15, 2015, at 4:15 PM, Scott Derrick wrote: > I'm somewhat puzzled there is no built in secu

Re: Cache

2015-08-19 Thread Walter Underwood
Why? Do you evaluate Unix performance with and without file buffers? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Aug 19, 2015, at 5:00 PM, Nagasharath wrote: > Trying to evaluate the performance of queries with and without cache > >

Re: Multiple concurrent queries to Solr

2015-08-23 Thread Walter Underwood
, it can block. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Aug 23, 2015, at 8:49 AM, Shawn Heisey wrote: > On 8/23/2015 7:46 AM, Ashish Mukherjee wrote: >> I want to run few Solr queries in parallel, which are being done in a >> multi

Re: any easy way to find out when a core's index physical file has been last updated?

2015-09-03 Thread Walter Underwood
Instead of writing new code, you could configure an autocommit interval in Solr. That already does what you want, no more than one commit in the interval and no commits if there were no adds or deletes. Then the clients would never need to commit. wunder Walter Underwood wun...@wunderwood.org

Re: Strange interpretation of invalid ISO date strings

2015-09-07 Thread Walter Underwood
Yes, ISO 8601 gets pretty baroque in the far nooks and crannies of the spec. I use the “web profile” of ISO 8601, which is very simple. I’ve never seen any software mishandle dates using this subset of the spec. http://www.w3.org/TR/NOTE-datetime wunder Walter Underwood wun...@wunderwood.org
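Producing dates in that simple profile from Python might look like this. A sketch only: it emits the complete date-plus-seconds form in UTC with a trailing Z, which is one of the formats the W3C note allows; the function name is an assumption.

```python
from datetime import datetime, timedelta, timezone


def to_w3c_datetime(dt):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if dt.tzinfo is None:
        # Refuse naive datetimes instead of guessing their zone.
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


pacific_daylight = timezone(timedelta(hours=-7))
print(to_w3c_datetime(datetime(2015, 9, 7, 5, 30, 0, tzinfo=pacific_daylight)))
```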

Re: Solr facets implementation question

2015-09-08 Thread Walter Underwood
Every faceting implementation I’ve seen (not just Solr/Lucene) makes big in-memory lists. Lots of values means a bigger list. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Sep 8, 2015, at 8:33 AM, Shawn Heisey wrote: > On 9/8/2015 9:10 AM, adfe

Re: Detect term occurrences

2015-09-10 Thread Walter Underwood
Doing a query for each term should work well. Solr is fast for queries. Write a script. I assume you only need to do this once. Running all the queries will probably take less time than figuring out a different approach. wunder Walter Underwood wun...@wunderwood.org http

Re: Ideas

2015-09-21 Thread Walter Underwood
. That was using too much CPU. Right now, block the IPs. Those are hostile. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 21, 2015, at 10:31 AM, Paul Libbrecht wrote: > > Writing a query component would be pretty easy or? > It wo

Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-22 Thread Walter Underwood
Faceting on an author field is almost always a bad idea. Or at least a slow, expensive idea. Faceting makes big in-memory lists. More values, bigger lists. An author field usually has many, many values, so you will need a lot of memory. wunder Walter Underwood wun...@wunderwood.org http

Re: is there a way to remove deleted documents from index without optimize

2015-09-22 Thread Walter Underwood
Don’t do anything. Solr will automatically clean up the deleted documents for you. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 22, 2015, at 6:01 PM, CrazyDiamond wrote: > > my index is updating frequently and i need to remo

Re: solr get score of each doc in edis max search and more like this search result

2015-09-23 Thread Walter Underwood
limit will almost certainly not do what you want, because it doesn’t do anything useful. I recommend reading this document for more info: https://wiki.apache.org/lucene-java/ScoresAsPercentages wunder Walter Underwo

Re: firstSearcher cache warming with own QuerySenderListener

2015-09-25 Thread Walter Underwood
Right. I chose the twenty most frequent terms from our documents and use those for cache warming. The list of most frequent terms is pretty stable in most collections. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 25, 2015, at 8:38
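Picking the most frequent terms can be sketched as below. This counts terms in raw text with a naive tokenizer, which is an assumption; the message does not say how the terms were chosen, and reading term frequencies from the index itself would be the more direct route.

```python
import re
from collections import Counter


def top_terms(documents, n=20):
    """Return the n most frequent terms across the given document texts."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer: lower-case runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return [term for term, _ in counts.most_common(n)]


docs = ["solr cache warming", "solr cache", "solr"]
print(top_terms(docs, n=2))
```

The resulting list would then be turned into warming queries by hand or by a small script.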

Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
Sure. 1. Delete all the docs (no commit). 2. Add all the docs (no commit). 3. Commit. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 25, 2015, at 2:17 PM, Ravi Solr wrote: > > I have been trying to re-index the docs (about 1.5 mi

Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
them. No guarantee, but it is worth a try. Good luck. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 25, 2015, at 2:59 PM, Ravi Solr wrote: > > Walter, Not in a mood for banter right now Its 6:00pm on a friday and > Iam stuck

Re: Cost of having multiple search handlers?

2015-09-28 Thread Walter Underwood
We did the same thing, but reporting performance metrics to Graphite. But we won’t be able to add servlet filters in 6.x, because it won’t be a webapp. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 28, 2015, at 11:32 AM, Gili Nachum wr

Re: Cost of having multiple search handlers?

2015-09-28 Thread Walter Underwood
We built our own because there was no movement on that. Don’t hold your breath. Glad to contribute it. We’ve been running it in production for a year, but the config is pretty manual. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 28, 2

Re: Solr vs Lucene

2015-10-01 Thread Walter Underwood
If you want a spell checker, don’t use a search engine. Use a spell checker. Something like aspell (http://aspell.net/) will be faster and better than Solr. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 1, 2015

Re: How to disable the admin interface

2015-10-05 Thread Walter Underwood
You understand that disabling the admin API will leave you with an unmaintainable Solr installation, right? You might not even be able to diagnose the problem. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 5, 2015, at 11:34 AM, Siddhar

Re: Best Indexing Approaches - To max the throughput

2015-10-06 Thread Walter Underwood
It depends on the document. In an e-commerce search, you might want to fail immediately and be notified. That is what we do: fail, roll back, and notify. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 6, 2015, at 7:58 AM, Alessandro Benede

Re: Best Indexing Approaches - To max the throughput

2015-10-06 Thread Walter Underwood
get an accurate report of which document was rejected. I wrote that same thing back at Netflix, before SolrJ. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 6, 2015, at 9:49 AM, Alessandro Benedetti > wrote: > > Hi Walter, >

Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Walter Underwood
LDP/sag/html/buffer-cache.html> wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 7, 2015, at 3:40 AM, Toke Eskildsen wrote: > > On Wed, 2015-10-07 at 07:03 -0300, Eric Torti wrote: >> I'm sorry to diverge this thread a li

Re: EdgeNGramFilterFactory question

2015-10-07 Thread Walter Underwood
different analysis chains stored in separate fields. The exact example you list will work fine with stemming and phrase search. Check out the phrase search support in the edismax query parser. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oc

Re: How to show some documents ahead of others

2015-10-08 Thread Walter Underwood
items using the “boost” parameter in edismax. Adjust it to be a tiebreaker between documents with similar score. 2. Show two lists, one with the five most relevant paid, the next with the five most relevant unpaid. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my

Re: Exclude documents having same data in two fields

2015-10-09 Thread Walter Underwood
Please explain why you do not want to use an extra field. That is the only solution that will perform well on your large index. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 9, 2015, at 7:47 AM, Aman Tandon wrote: > > No Sushee

Re: Exclude documents having same data in two fields

2015-10-10 Thread Walter Underwood
After several days, we finally get the real requirement. It really does waste a lot of time and energy when people won’t tell us that. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 10, 2015, at 8:19 AM, Upayavira wrote: > > In w

Re: How to show some documents ahead of others - requirements

2015-10-10 Thread Walter Underwood
thing. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 10, 2015, at 9:31 AM, Erick Erickson wrote: > > Would result grouping work here? If the group key was "paid", then > you'd get two groups back, "paid"

Re: catchall fields or multiple fields

2015-10-12 Thread Walter Underwood
phonetic representation, then you can weight the lower case higher than the stemmed field, and stemmed higher than phonetic. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 12, 2015, at 6:12 AM, Ahmet Arslan wrote: > > Hi, > > Catc

Re: LIX readability index calculation by solr

2015-10-21 Thread Walter Underwood
Can you reload all the content? If so, I would calculate this in an update request processor and put the result in its own field. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 21, 2015, at 2:53 AM, Roland Szűcs wrote: > > Thank

Re: [newbie] Configuration for SolrCloud + DataImportHandler

2015-10-21 Thread Walter Underwood
Does the collection reload do a rolling reload of each node or does it do them all at once? We were planning on using the core reload on each system, one at a time. That would make sure the collection stays available. I read the documentation, it didn’t say anything about that. wunder Walter

Re: Best strategy for indexing multiple tables with multiple fields

2015-10-26 Thread Walter Underwood
with tens of thousands of fields. A thousand fields might be cumbersome, but it won’t break Solr. If the tables contain different kinds of things, you might have different collections (one per document), or one collection with a “type” field for each kind of document. wunder Walter Underwood

Re: restore quorum after majority of zk nodes down

2015-10-29 Thread Walter Underwood
igure the Solr cluster to talk to it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 29, 2015, at 10:08 AM, Matteo Grolla wrote: > > I'm designing a solr cloud installation where nodes from a single cluster > are distributed

Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
g fast. In only 21 lines of Python. http://norvig.com/spell-correct.html wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 30, 2015, at 11:37 AM, Robert Oschler wrote: > > Hello everyone,

Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
short article to learn more about spelling correction. http://norvig.com/spell-correct.html wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 30, 2015, at 4:45 PM, Robert Oschler wrote: > > H

Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Read the links I have sent. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 30, 2015, at 7:10 PM, Robert Oschler wrote: > > Thanks Walter. Are there any open source spell checkers that implement the > Peter Norvig or Damerau

Re: Solr getting irrelevant results when use block join

2015-10-31 Thread Walter Underwood
This will probably work better without child documents and joins. I would denormalize into actor documents and movie documents. At least, that’s what I did at Netflix. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Oct 31, 2015, at 1:17

Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
use the EdgeNgramFilter to index prefixes. That will make your index larger, but prefix searches will be very fast. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 2, 2015, at 5:17 AM, Toke Eskildsen wrote: > > On Mon, 2015-11-02 at 17

Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 2, 2015, at 9:39 PM, Modassar Ather wrote: > > Thanks Walter for your response, > > It is around 90GB of index (around 8 million documents) on one shard and > there are 12 such shards. As pe

Re: Boosting a document score when advertised! Please help!

2015-11-05 Thread Walter Underwood
approach is nice and clear. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 5, 2015, at 3:33 AM, Alessandro Benedetti > wrote: > > Hi Christian, > there are several ways : > > 1) Elevation query component - it should be

Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Walter Underwood
It is pretty handy, though. Great for expunging docs that are marked deleted or are expired. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 6, 2015, at 5:31 PM, Alexandre Rafalovitch wrote: > > Elasticsearch removed deleteByQuery

Re: Best way to track cumulative GC pauses in Solr

2015-11-13 Thread Walter Underwood
Also, what GC settings are you using? We may be able to make some suggestions. Cumulative GC pauses aren’t very interesting to me. I’m more interested in the longest ones, 90th percentile, 95th, etc. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) >
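Summarizing pauses by the longest ones rather than the total could look like this. The pause values are made up, and nearest-rank is just one of several percentile definitions; the function name is hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up GC pause durations in milliseconds.
pauses_ms = [12, 15, 9, 240, 18, 11, 14, 950, 13, 16]

summary = {
    "max": max(pauses_ms),
    "p95": percentile(pauses_ms, 95),
    "p90": percentile(pauses_ms, 90),
    "total": sum(pauses_ms),
}
print(summary)
```

In this made-up sample the total is dominated by two pauses, which the maximum and upper percentiles expose and the cumulative figure hides.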

Re: Solr logging in local time

2015-11-16 Thread Walter Underwood
I’m sure it is possible, but think twice before logging in local time. Do you really want one day with 23 hours and one day with 25 hours each year? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 16, 2015, at 8:04 AM, tedsolr wrote: >

Re: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Walter Underwood
That is the approach I’ve been using for years. Simple and effective. It probably makes the index bigger. Make sure that only one of the fields is stored, because the stored text will be exactly the same in both. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my

Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
those lists will fit in memory. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 19, 2015, at 3:46 PM, Steven White wrote: > > Hi everyone > > What is considered too many fields for qf and fq? On average I will have > 1500 field

Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
The implementation for fq has changed from 4.x to 5.x, so I’ll let someone else answer that in detail. In 4.x, the result of each filter query can be cached. After that, they are quite fast. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov

Re: Setting up Solr on multiple machines

2015-11-29 Thread Walter Underwood
operating. Specifying a list of all the zk nodes is robust. If one goes down, it tries another. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Nov 29, 2015, at 12:19 PM, Don Bosco Durai wrote: > > This should answer your question:

Re: Setting up Solr on multiple machines

2015-11-29 Thread Walter Underwood
e ensemble. > > Regards, > Salman > > On Mon, Nov 30, 2015 at 1:07 AM, Walter Underwood > wrote: > >> Why would that link answer the question? >> >> Each Solr connects to one Zookeeper node. If that node goes down, >> Zookeeper is still available, but

Re: fuzzy searches and EDISMAX

2015-12-08 Thread Walter Underwood
629> wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 8, 2015, at 9:56 AM, Felley, James wrote: > > I am trying to build an edismax search handler that will allow a fuzzy > search, using the "query fields" property (qf). >

Re: Long Running Data Import Handler - Notifications

2015-12-08 Thread Walter Underwood
grep '"status":"idle"' > /dev/null [ $? -ne 0 ] || break sleep 300 done echo Solr indexing is finished wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 8, 2015, at 5:37 PM, Brian Narsi wrote: > &
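
The visible part of the shell loop above polls until the status output contains "status":"idle", sleeping 300 seconds between checks. A rough Python equivalent might look like the sketch below. The beginning of the original script is cut off, so how the status is fetched is not shown here; the sketch takes a caller-supplied function for that, and all names in it are hypothetical.

```python
import time


def wait_until_idle(fetch_status, interval_s=300, sleep=time.sleep):
    """Poll until the status text contains "status":"idle"; return the number of polls.

    fetch_status is a caller-supplied function (an assumption of this sketch)
    that returns the body of the import handler's status response as a string.
    """
    polls = 0
    while True:
        polls += 1
        if '"status":"idle"' in fetch_status():
            return polls
        sleep(interval_s)  # same 300-second wait as the shell loop by default
```

Passing the fetch and sleep functions in keeps the loop testable without a running Solr. In real use, fetch_status would wrap an HTTP request to the status URL of the import handler.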

Re: Unstructured/Structured data for indexing

2015-12-09 Thread Walter Underwood
Often Solr documents are “semi-structured”. They have some structured fields and some free-text fields. E-mail messages are like that, with structured headers and an unstructured body. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Dec 9, 2

Re: Committed before 500

2015-02-20 Thread Walter Underwood
Since you are getting these failures, the 90 second timeout is not “good enough”. Try increasing it. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb 20, 2015, at 5:22 AM, NareshJakher wrote: > Hi Shawn, > > I do not want to increase t

Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
The HTTP protocol does not set a limit on GET URL size, but individual web servers usually do. You should get a response code of “414 Request-URI Too Long” when the URL is too long. This limit is usually configurable. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org

Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
, you may need to re-think your design. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb 21, 2015, at 4:45 PM, Shawn Heisey wrote: > On 2/21/2015 1:46 AM, steve wrote: >> Careful with the GETs! There is a real, hard limit on the length

Re: syntax for increasing java memory

2015-02-23 Thread Walter Underwood
That depends on the JVM you are using. For the Oracle JVMs, use this to get a list of extended options: java -X wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb 23, 2015, at 8:21 AM, Kevin Laurie wrote: > Hi Guys, > I am a newbie on Solr

Re: Basic Multilingual search capability

2015-02-23 Thread Walter Underwood
-insensitive approach. But it hits the wall pretty fast. One thing that does work pretty well is trademarked names (LaserJet, Coke, etc). Those are spelled the same in all languages and usually not inflected. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb
