Re: Issues when indexing PDF files

2015-12-17 Thread Walter Underwood
PDF isn’t really text. For example, it doesn’t have spaces; it just moves the 
next letter over farther. Letters might not be in reading order — two-column 
text could be printed as horizontal scans. Custom fonts might not use an 
encoding that matches Unicode, which effectively makes the text encrypted (badly). And so on.

As one of my coworkers said, trying to turn a PDF into structured text is like 
trying to turn hamburger back into a cow.

PDF is where text goes to die.
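
You can see what Solr will get by running the file through Tika yourself, either with the tika-app desktop JAR or with a few lines of code. A minimal sketch using Tika’s Java API (tika-core and tika-parsers on the classpath; the class name is made up):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaCheck {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                // Roughly the same extraction path Solr's extracting handler goes through
                parser.parse(in, handler, metadata);
            }
            for (String name : metadata.names()) {
                System.out.println(name + ": " + metadata.get(name));
            }
            System.out.println("---- extracted text ----");
            System.out.println(handler.toString());
        }
    }

If the text is already garbage at this point, no Solr configuration will fix it.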

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 17, 2015, at 2:48 AM, Charlie Hull  wrote:
> 
> On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> Hi Alexandre,
>> 
>> Thanks for your reply.
>> 
>> So the only way to solve this issue is to explore with PDF specific tools
>> and change the encoding of the file?
>> Is there any way to configure it in Solr?
> 
> Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
> in a way that Tika cannot easily extract the text, there's nothing you can do 
> in Solr that will help.
> 
> Unfortunately PDF isn't a content format but a presentation format - so 
> extracting plain text is fraught with difficulty. You may see a character on 
> a PDF page, but exactly how that character is generated (using a specific 
> encoding, font, or even by drawing a picture) is outside your control. There 
> are various businesses built on this premise - they charge for creating clean 
> extracted text from PDFs - and even they have trouble with some PDFs.
> 
> HTH
> 
> Charlie
> 
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
>> wrote:
>> 
>>> They could be using custom fonts and non-Unicode characters. That's
>>> probably something to explore with PDF specific tools.
>>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" 
>>> wrote:
>>> 
>>>> I've checked all the files which has problem with the content in the Solr
>>>> index using the Tika app. All of them shows the same issues as what I see
>>>> in the Solr index.
>>>> 
>>>> So does the issues lies with the encoding of the file? Are we able to
>>> check
>>>> the encoding of the file?
>>>> 
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> 
>>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
>>>> wrote:
>>>> 
>>>>> Hi Erik,
>>>>> 
>>>>> I've shared the file on dropbox, which you can access via the link
>>> here:
>>>>> 
>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>> 
>>>>> This is what I get from the Tika app after dropping the file in.
>>>>> 
>>>>> Content-Length: 75092
>>>>> Content-Type: application/pdf
>>>>> Type: COSName{Info}
>>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>>>>> X-TIKA:digest:SHA256:
>>>>> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>>>>> access_permission:assemble_document: true
>>>>> access_permission:can_modify: true
>>>>> access_permission:can_print: true
>>>>> access_permission:can_print_degraded: true
>>>>> access_permission:extract_content: true
>>>>> access_permission:extract_for_accessibility: true
>>>>> access_permission:fill_in_form: true
>>>>> access_permission:modify_annotations: true
>>>>> dc:format: application/pdf; version=1.3
>>>>> pdf:PDFVersion: 1.3
>>>>> pdf:encrypted: false
>>>>> producer: null
>>>>> resourceName: Desmophen+670+BAe.pdf
>>>>> xmpTPg:NPages: 3
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>>> 
>>>>> 
>>>>> On 17 December 2015 at 00:15, Erik Hatcher 
>>>> wrote:
>>>>> 
>>>>>> Edwin - Can you share one of those PDF files?
>>>>>> 
>>>>>> Also, drop the file into the Tika app and see what it sees directly -
>>>> get
>>>>>> the tika-app JAR and run that desktop application.
>>>>>> 
>>>>>> Could be an encoding issue?
>>>>>> 
>>>>>> Erik
>>>>>> 
>>>>>> —
>>>>>> Erik Hatcher, Senior Solutions Architect
>>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <
>>>> edwinye...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm using Solr 5.3.0
>>>>>>> 
>>>>>>> I'm indexing some PDF documents. However, for certain PDF files,
>>> there
>>>>>> are
>>>>>>> chinese text in the documents, but after indexing, what is indexed
>>> in
>>>>>> the
>>>>>>> content is either a series of "??" or an empty content.
>>>>>>> 
>>>>>>> I'm using the post.jar that comes together with Solr.
>>>>>>> 
>>>>>>> What could be the reason that causes this?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Edwin
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk



Re: TPS with Solr Cloud

2015-12-21 Thread Walter Underwood
How many documents do you have? How big is the index?

You can increase total throughput with replicas. Shards will make it slower, 
but allow more documents.

At 8000 queries/s, I assume you are using the same query over and over. If so, 
that is a terrible benchmark. Everything is served out of cache.

Test with production logs. Choose logs where the number of distinct queries is 
much larger than your cache sizes. If your caches are 1024, it would be good to 
have 100K distinct queries. That might mean a total log size of a few 
million queries.
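
A rough sketch of that kind of replay with SolrJ (single-threaded, so it exercises distinct-query behavior rather than peak throughput; the file name and URL are made up):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class LogReplay {
        public static void main(String[] args) throws Exception {
            // One query string per line, pulled from production logs
            List<String> queries = Files.readAllLines(Paths.get("production-queries.txt"));
            try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                long start = System.nanoTime();
                for (String q : queries) {
                    solr.query(new SolrQuery(q));
                }
                double seconds = (System.nanoTime() - start) / 1e9;
                System.out.printf("%d distinct queries, %.0f queries/s%n",
                        queries.size(), queries.size() / seconds);
            }
        }
    }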

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 21, 2015, at 9:47 AM, Upayavira  wrote:
> 
> 
> You add shards to reduce response times. If your responses are too slow
> for 1 shard, try it with three. Skip two for reasons stated above.
> 
> Upayavira
> 
> On Mon, Dec 21, 2015, at 04:27 PM, Erick Erickson wrote:
>> 8,000 TPS almost certainly means you're firing the same (or
>> same few) requests over and over and hitting the queryResultCache,
>> look in the adminUI>>core>>plugins/stats>>cache>>queryResultCache.
>> I bet you're seeing a hit ratio near 100%. This is what Toke means
>> when he says your tests are too lightweight.
>> 
>> 
>> As others have outlined, to increase TPS (after you straighten out
>> your test harness) you add _replicas_ rather than add _shards_.
>> Only add shards when your collections are too big to fit on a single
>> Solr instance.
>> 
>> Best,
>> Erick
>> 
>> On Mon, Dec 21, 2015 at 1:56 AM, Emir Arnautovic
>>  wrote:
>>> Hi Anshul,
>>> TPS depends on number of concurrent request you can run and request
>>> processing time. With sharding you reduce processing time with reducing
>>> amount of data single node process, but you have overhead of inter shard
>>> communication and merging results from different shards. If that overhead is
>>> smaller than time you get when processing half of index, you will see
>>> increase of TPS. If you are running same query in a loop, first request will
>>> be processed and others will likely be returned from cache, so response time
>>> will not vary with index size hence sharding overhead will cause TPS to go
>>> down.
>>> If you are happy with your response time, and want more TPS, you go with
>>> replications - that will increase number of concurrent requests you can run.
>>> 
>>> Also, make sure your tests are realistic in order to avoid having false
>>> estimates and have surprises when start running real load.
>>> 
>>> Regards,
>>> Emir
>>> 
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>> 
>>> 
>>> 
>>> 
>>> On 21.12.2015 08:18, Anshul Sharma wrote:
>>>> 
>>>> Hi,
>>>> I am trying to evaluate solr for one of my project for which i need to
>>>> check the scalability in terms of tps(transaction per second) for my
>>>> application.
>>>> I have configured solr on 1 AWS server as standalone application which is
>>>> giving me a tps of ~8000 for my query.
>>>> In order to test the scalability, i have done sharding of the same data
>>>> across two AWS servers with 2.5 milion records each .When i try to query
>>>> the cluster with the same query as before it gives me a tps of ~2500 .
>>>> My understanding is the tps should have been increased in a cluster as
>>>> these are two different machines which will perform separate I/O
>>>> operations.
>>>> I have not configured any seperate load balancer as the document says that
>>>> by default solr cloud will perform load balancing in a round robin
>>>> fashion.
>>>> Can you please help me in understanding the issue.
>>>> 
>>> 



Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-22 Thread Walter Underwood
We need to know a LOT more about your site. Number of documents, size of index, 
frequency of updates, length of queries, approximate size of server (CPUs, RAM, 
type of disk), version of Solr, version of Java, and features you are using 
(faceting, highlighting, etc.).

After that, we’ll have more questions.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore  wrote:
> 
> Hi All,
> 
> my website is under pressure, there is a big number of concurrent searches.
> When the connected users are too many, the searches becomes so slow that in
> some cases users have to wait many seconds.
> The queue of searches becomes so long that, in same cases, servers are
> blocked trying to serve all these requests.
> As far as I know because some searches are very expensive, and when many
> expensive searches clog the queue server becomes unresponsive.
> 
> In order to quickly workaround this herd effect, I have added a
> default timeAllowed to 15 seconds, and this seems help a lot.
> 
> But during stress tests but I'm unable to understand when and what requests
> are affected by timeAllowed parameter.
> 
> Just be clear, I have configure timeAllowed parameter in a SolrCloud
> environment, given that partial results may be returned (if there are any),
> how can I know when this happens? When the timeAllowed parameter trigger a
> partial answer?
> 
> Best regards,
> Vincenzo
> 
> 
> 
> -- 
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251



Re: Limit fields returned in solr based on content

2015-12-24 Thread Walter Underwood
I would do that in a middle tier. You can’t do every single thing in Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 24, 2015, at 1:21 PM, Upayavira  wrote:
> 
> You could create a custom DocTransformer. They can enhance the fields
> included in the search results. So, instead of fl=somefield you could
> have fl=[my-filter:somefield], and your MyFieldDocTransformer makes the
> decision as to whether or not to include somefield in the output.
> 
> This would of course, require some Java coding.
> 
> Upayavira
> 
> On Thu, Dec 24, 2015, at 09:17 PM, Jamie Johnson wrote:
>> Sorry hit send too early
>> 
>> Is there a mechanism in solr/lucene that allows customization of the
>> fields
>> returned that would have access to the field content and payload?
>> On Dec 24, 2015 4:15 PM, "Jamie Johnson"  wrote:
>> 
>>> I have what I believe is a unique requirement discussed here in the past
>>> to limit data sent to users based on some marking in the field.
>>> 



Re: Memory Usage increases by a lot during and after optimization .

2015-12-29 Thread Walter Underwood
Do not “optimize”.

It is a forced merge, not an optimization. It was a mistake to ever name it 
“optimize”. Solr automatically merges as needed. There are a few situations 
where a force merge might make a small difference. Maybe 10% or 20%; no one has 
bothered to measure it.

If your index is continually updated, clicking that is a complete waste of 
resources. Don’t do it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 29, 2015, at 6:35 PM, Zheng Lin Edwin Yeo  wrote:
> 
> Hi,
> 
> I am facing a situation, when I do an optimization by clicking on the
> "Optimized" button on the Solr Admin Overview UI, the memory usage of the
> server increases gradually, until it reaches near the maximum memory
> available. There is 64GB of memory available in the server.
> 
> Even after the optimized is completed, the memory usage stays near the 100%
> range, and could not be reduced until I stop Solr. Why could this be
> happening?
> 
> Also, I don't think the optimization is completed, as the admin page says
> the index is not optimized again after I go back to the Overview page, even
> though I did not do any updates to the index.
> 
> I am using Solr 5.3.0, with 1 shard and 2 replica. My index size is 183GB.
> 
> Regards,
> Edwin



Re: Memory Usage increases by a lot during and after optimization .

2015-12-29 Thread Walter Underwood
The only time that a force merge might be useful is when you reindex all 
content every night or every week, then do not make any changes until the next 
reindex. But even then, it probably does not matter.

Just let Solr do its thing. Solr is pretty smart.

A long time ago (1996-2006), I worked on an enterprise search engine with the 
same merging algorithm as Solr (Ultraseek Server). We always had customers 
asking about force-merge/optimize. It never made a useful difference. Even with 
twenty servers at irs.gov, it didn’t make a difference.

wunder
K6WRU
Walter Underwood
CM87wj
http://observer.wunderwood.org/ (my blog)

> On Dec 29, 2015, at 6:59 PM, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Walter,
> 
> Thanks for your reply.
> 
> Then how about optimization after indexing?
> Normally the index size is much larger after indexing, then after
> optimization, the index size reduces. Do we still need to do that?
> 
> Regards,
> Edwin
> 
> On 30 December 2015 at 10:45, Walter Underwood 
> wrote:
> 
>> Do not “optimize".
>> 
>> It is a forced merge, not an optimization. It was a mistake to ever name
>> it “optimize”. Solr automatically merges as needed. There are a few
>> situations where a force merge might make a small difference. Maybe 10% or
>> 20%, no one had bothered to measure it.
>> 
>> If your index is continually updated, clicking that is a complete waste of
>> resources. Don’t do it.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Dec 29, 2015, at 6:35 PM, Zheng Lin Edwin Yeo 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I am facing a situation, when I do an optimization by clicking on the
>>> "Optimized" button on the Solr Admin Overview UI, the memory usage of the
>>> server increases gradually, until it reaches near the maximum memory
>>> available. There is 64GB of memory available in the server.
>>> 
>>> Even after the optimized is completed, the memory usage stays near the
>> 100%
>>> range, and could not be reduced until I stop Solr. Why could this be
>>> happening?
>>> 
>>> Also, I don't think the optimization is completed, as the admin page says
>>> the index is not optimized again after I go back to the Overview page,
>> even
>>> though I did not do any updates to the index.
>>> 
>>> I am using Solr 5.3.0, with 1 shard and 2 replica. My index size is
>> 183GB.
>>> 
>>> Regards,
>>> Edwin
>> 
>> 



Re: Solr index segment level merge

2015-12-29 Thread Walter Underwood
You probably do not NEED to merge your indexes. Have you tried not merging the 
indexes?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 29, 2015, at 7:31 PM, jeba earnest  wrote:
> 
> I have a scenario that I need to merge the solr indexes online. I have a
> primary solr index of 100 Gb and it is serving the end users and it can't
> go offline for a moment. Everyday new lucene indexes(2 GB) are generated
> separately.
> 
> I have tried coreadmin
> https://cwiki.apache.org/confluence/display/solr/Merging+Indexes
> 
> And it will create a new core or new folder. which means it will copy 100Gb
> every time to a new folder.
> 
> Is there a way I can do a segment level merging?
> 
> Jeba



Re: Data migration from one collection to the other collection

2016-01-05 Thread Walter Underwood
You could send the documents to both and filter out the recent ones in the 
history collection.
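
A rough sketch of that idea with SolrJ (the collection names, timestamp field, and 30-day cutoff are all made up for illustration):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DualWrite {
        public static void main(String[] args) throws Exception {
            HttpSolrClient current = new HttpSolrClient("http://localhost:8983/solr/current");
            HttpSolrClient history = new HttpSolrClient("http://localhost:8983/solr/history");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("indexed_at", "NOW");   // hypothetical date field, set at index time
            current.add(doc);                    // every document goes to both collections
            history.add(doc);

            // When searching history, filter out what the current collection already covers.
            SolrQuery q = new SolrQuery("some query");
            q.addFilterQuery("indexed_at:[* TO NOW-30DAYS]");
            history.query(q);

            current.close();
            history.close();
        }
    }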

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 5, 2016, at 5:46 AM, vidya  wrote:
> 
> Hi
> 
> I would like to maintain two cores for history data and current data where
> hdfs is my datasource. My requirement is that data input should be given to
> only one collection and previous data should be moved to history collection.
> 1)Creating two cores and migrating data from current to history collection
> by data-config.xml using solrEntityProcessor. In data-config.xml, where
> should i represent two collections for migrating source collection to the
> other collection.And how to make sure that happens.Do I need to run a job or
> how to make sure that data migration occurs.
> https://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
> 
> 2)collection aliasing is a concept which creates new collection after a
> period of time.
> I read that concept but lagging in how to implement it.Like where do i need
> to make changes in my solrcloud.
> http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
> 
> Please help me on this.
> 
> Thanks in advance
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Data-migration-from-one-collection-to-the-other-collection-tp4248646.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solrcloud for Java 1.6

2016-01-07 Thread Walter Underwood
You can have multiple Java versions installed on the same system. Well, the 
same non-Windows system. Use the PATH environment variable to set the right 
Java for each application.

If you really, really must run Java 1.6 for everything, you will not be running 
Solr 5.x. I think the switch to require Java 7 was made at some point in the 
4.x development.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 7, 2016, at 7:26 PM, billnb...@gmail.com wrote:
> 
> Run it on 2 separate boxes
> 
> Bill Bell
> Sent from mobile
> 
> 
>> On Jan 7, 2016, at 3:11 PM, Aswath Srinivasan (TMS) 
>>  wrote:
>> 
>> Hi fellow developers,
>> 
>> I have a situation where the search front-end application is using java 1.6. 
>> Upgrading Java version is out of the question.
>> 
>> Planning to use Solrcloud 5.x version for the search implementation. The 
>> show stopper here is, solrj for solrcloud needs atleast java1.7
>> 
>> What best can be done to use the latest version of solrcloud and solrj for a 
>> portal that runs on java 1.6?
>> 
>> I was thinking, in solrj, instead of using zookeeper (which also acts as the 
>> load balancer) I can mention the exact replica's http://solr-cloud-HOST:PORT 
>> pairs using some kind of round-robin with some external load balancer.
>> 
>> Any suggestion is highly appreciated.
>> 
>> Aswath NS
>> 



Re: solr score threashold

2016-01-20 Thread Walter Underwood
The ScoresAsPercentages page is not really instructions for how to normalize 
scores. It is an explanation of why a score threshold does not do what you want.

Don’t use thresholds. If you want thresholds, you will need a search engine 
with a probabilistic model, like Verity K2. Those generally give worse results 
than a vector space model, but you can have thresholds.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 20, 2016, at 5:11 AM, Emir Arnautovic  
> wrote:
> 
> Hi Sara,
> You can use funct and frange to achive needed, but note that scores are not 
> normalized meaning score 8 does not mean it is good match - it is just best 
> match. There are examples online how to normalize score (e.g. 
> http://wiki.apache.org/lucene-java/ScoresAsPercentages).
> Other approach is to write custom component that will filter out docs below 
> some threshold.
> 
> Thanks,
> Emir
> 
> On 20.01.2016 13:58, sara hajili wrote:
>> hi all,
>> i wanna to know about solr search relevency scoreing threashold.
>> can i change it?
>> i mean immagine when i searching i get this result
>> doc1 score =8
>> doc2 score =6.4
>> doc3 score=6
>> doc8score=5.5
>> doc5 score=2
>> i wana to change solr score threashold .in this way i set threashold for
>> example >4
>> and then i didn't get doc5 as result.can i do this?if yes how?
>> and if not how i can modified search to don't get docs as a result that
>> these docs have a lot distance from doc with max score?
>> in other word i wanna to delete this gap between solr results
>> 
> 
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 



Re: Scaling SolrCloud

2016-01-21 Thread Walter Underwood
Alternatively, do you still want to be protected against a single failure 
during scheduled maintenance?

With a three node ensemble, when one Zookeeper node is being updated or moved 
to a new instance, one more failure means it does not have a quorum. With a 
five node ensemble, three nodes would still be up.

If you are OK with that risk, run three nodes. If not, run five.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 21, 2016, at 9:27 AM, Erick Erickson  wrote:
> 
> NP. My usual question though is "how often do you expect to lose a
> second ZK node before you can replace the first one that died?"
> 
> My tongue-in-cheek statement is often "If you're losing two nodes
> regularly, you have problems with your hardware that you're not really
> going to address by adding more ZK nodes" ;).
> 
> And do note that even if you lose quorum, SolrCloud will continue to
> serve _queries_, albeit the "picture" each individual Solr node has of
> the current state of all the Solr nodes will get stale. You won't be
> able to index though. That said, the internal Solr load balancers
> auto-distribute queries anyway to live nodes, so things can limp
> along.
> 
> As always, it's a tradeoff between expense/complexity and robustness
> though, and each and every situation is different in how much risk it
> can tolerate.
> 
> FWIW,
> Erick
> 
> On Thu, Jan 21, 2016 at 1:49 AM, Yago Riveiro  wrote:
>> Is not a typo. I was wrong, for zookeeper 2 nodes still count as majority.
>> It's not the desirable configuration but is tolerable.
>> 
>> 
>> 
>> Thanks Erick.
>> 
>> 
>> 
>> \--
>> 
>> /Yago Riveiro
>> 
>>> On Jan 21 2016, at 4:15 am, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>>> 
>> 
>>> bq: 3 are to risky, you lost one you lost quorum
>> 
>>> 
>> 
>>> Typo? You need to lose two.
>> 
>>> 
>> 
>>> On Wed, Jan 20, 2016 at 6:25 AM, Yago Riveiro <yago.rive...@gmail.com>
>> wrote:
>> > Our Zookeeper cluster is an ensemble of 5 machines, is a good starting
>> point,
>> > 3 are to risky, you lost one you lost quorum and with 7 sync cost
>> increase.
>> >
>> >
>> >
>> > ZK cluster is in machines without IO and rotative hdd (don't not use SDD
>> to
>> > gain IO performance, zookeeper is optimized to spinning disks).
>> >
>> >
>> >
>> > The ZK cluster behaves without problems, the first deploy of ZK was in
>> the
>> > same machines that the Solr Cluster (ZK log in its own hdd) and that
>> didn't
>> > wok very well, CPU and networking IO from Solr Cluster was too much.
>> >
>> >
>> >
>> > About schema modifications.
>> >
>> > Modify the schema to add new fields is relative simple with new API, in
>> the
>> > pass all the work was manually uploading the schema to ZK and reloading
>> all
>> > collections (indexing must be disable or timeouts and funny errors
>> happen).
>> >
>> > With the new Schema API this is more user friendly. Anyway, I stop
>> indexing
>> > and for reload the collections (I don't know if it's necessary 
>> nowadays).
>> >
>> > About Indexing data.
>> >
>> >
>> >
>> > We have self made data importer, it's not java and not performs batch
>> indexing
>> > (with 500 collections buffer data and build the batch is expensive and
>> > complicate for error handling).
>> >
>> >
>> >
>> > We use regular HTTP post in json. Our throughput is about 1000 docs/s
>> without
>> > any type of optimization. Some time we have issues with replication, the
>> slave
>> > can keep pace with leader insertion and a full sync is requested, this 
>> is
>> bad
>> > because sync the replica again implicates a lot of IO wait and CPU and
>> with
>> > replicas with 100G take an hour or more (normally when this happen, we
>> disable
>> > indexing to release IO and CPU and not kill the node with a load of 50 
>> or
>> 60).
>> >
>> > In this department my advice is "keep it simple" in the end is an HTTP
>> POST to
>> > a node of the cluster.
>> >
>> >
>> >
>> > \\--
>> >
>> > /Yago Riveiro
>> >
>> >> On Jan 20 2016, at 1:39 pm, Troy Edwards
>

Re: Taking Solr to production

2016-01-22 Thread Walter Underwood
I agree, sharding may hurt more than it helps. And estimate the text size after 
the documents are processed.

We all love Solr Cloud, but this could be a good application for traditional 
master/slave Solr. That means no Zookeeper nodes and it is really easy to add a 
new query slave, just clone the instance.

We run an index with homework questions which seems similar to yours.

* 7 million documents.
* 50 Gbyte index.
* Request rates of 5000 to 10,000 q/minute per server.
* No facets or highlighting (highlighting soon, we store term vectors).
* Amazon EC2 instances with 16 cores, 30 Gbytes RAM, index is in ephemeral SSD.
* Index updates once per day.
* Master/slave.
* Solr 4.10.4.

During peak traffic, the 95th percentile response time was about three seconds, 
but that is because the queries are entire homework questions, up to 1000 
words, pasted into the query box. Yes, we have very unusual queries. Median 
response time was much better, about 50 milliseconds.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 22, 2016, at 2:45 PM, Toke Eskildsen  wrote:
> 
> Aswath Srinivasan (TMS)  wrote:
>> * Totally about 2.5 million documents to  be indexed
>> * Documents average size is 512 KB - pdfs and htmls
> 
>> This being said I was thinking I would take the Solr to production with,
>> * 2 shards, 1 Leader & 3 Replicas
> 
>> Do you all think this set up will work? Will this server me 150 QPS?
> 
> It certainly helps that you are batch updating. What is missing in this 
> estimation is how large the documents are when indexed, as I guess the ½MB 
> average is for the raw files? If they are your everyday short PDFs with 
> images, meaning not a lot of text, handling 2M+ of them is easy. If they are 
> all full-length books, it is another matter.
> 
> Your document count is relatively low and if your index data end up being 
> not-too-big (let's say 100GB), then you ought to consider having just a 
> single shard with 4 replicas: There is a non-trivial overhead going from 1 
> shard to more than one, especially if you are doing faceting.
> 
> - Toke Eskildsen



Re: schemaless vs schema based core

2016-01-22 Thread Walter Underwood
Yo. That is the truth. You can get stuff indexed with an automatic schema, but 
if you want to make your customers happy, tune it.

wunder 
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 22, 2016, at 6:22 PM, Erick Erickson  wrote:
> 
> And, more generally, schemaless makes a series of assumptions, any
> of which may be wrong.
> 
> You _must_ hand-tweak your schema to squeeze all the performance out of Solr
> that you can. If your collection isn't big enough that you need to squeeze,
> don't bother
> 
> FWIW,
> Erick
> 
> On Fri, Jan 22, 2016 at 11:19 AM, Steve Rowe  wrote:
>> Yes, and also underflow in the case of double/float.
>> 
>> --
>> Steve
>> www.lucidworks.com
>> 
>>> On Jan 22, 2016, at 12:25 PM, Shyam R  wrote:
>>> 
>>> I think, schema-less mode might allocate double instead of float, long
>>> instead of int to guard against overflow, which increases index size. Is my
>>> assumption valid?
>>> 
>>> Thanks
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Jan 21, 2016 at 10:48 PM, Erick Erickson 
>>> wrote:
>>> 
>>>> I guess it's all about whether schemaless really supports
>>>> 1> all the docs you index.
>>>> 2> all the use-cases for search.
>>>> 3> the assumptions it makes scale to you needs.
>>>> 
>>>> If you've established rigorous tests and schemaless does all of the
>>>> above, I'm all for shortening the cycle by using schemaless.
>>>> 
>>>> But if it's just being sloppy and "success" is "I managed to index 50
>>>> docs and get some results back by searching", expect to find some
>>>> "interesting" issues down the road.
>>>> 
>>>> And finally, if it's "we use schemaless to quickly try things in the
>>>> UI and for the _real_ prod environment we need to be more rigorous
>>>> about the schema", well shortening development time is A Good Thing.
>>>> Part of moving to prod could be taking the schema generated by
>>>> schemaless and tweaking it for instance.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Thu, Jan 21, 2016 at 8:54 AM, Shawn Heisey  wrote:
>>>>> On 1/21/2016 2:22 AM, Prateek Jain J wrote:
>>>>>> Thanks Erick,
>>>>>> 
>>>>>> Yes, I took same approach as suggested by you. The issue is some
>>>> developers started with schemaless configuration and now they have started
>>>> liking it and avoiding restrictions (including increased time to deploy
>>>> application, in managed enterprise environment). I was more concerned about
>>>> pushing best practices around this in team, because allowing anyone to new
>>>> attributes will become overhead in terms of management, security and
>>>> maintainability. Regarding your concern about not storing documents on
>>>> separate disk; we are storing them in solr but not as backup copies. One
>>>> doubt still remains in mind w.r.t auto-detection of types in  solr:
>>>>>> 
>>>>>> Is there a performance benefit of using defined types (schema based)
>>>> vs un-defined types while adding documents? Does "solrj" ships this
>>>> meta-information like type of attributes to solr, because code looks
>>>> something like?
>>>>>> 
>>>>>> SolrInputDocument doc = new SolrInputDocument();
>>>>>> doc.addField("category", "book"); // String
>>>>>> doc.addField("id", 1234); //Long
>>>>>> doc.addField("name", "Trying solrj"); //String
>>>>>> 
>>>>>> In my opinion, any auto-detector code will have some overhead vs the
>>>> other; any thoughts around this?
>>>>> 
>>>>> Although the true reality may be more complex, you should consider that
>>>>> everything Solr receives from SolrJ will be text -- as if you had sent
>>>>> the JSON or XML indexing format manually, which has no type information.
>>>>> 
>>>>> When you are building a document with SolrInputDocument, SolrJ has no
>>>>> knowledge of the schema in Solr.  It doesn't know whether the target
>>>>> field is numeric, string, date, or something else.
>>>>> 
>>>>> Using different object types for input to SolrJ just gives you general
>>>>> Java benefits -- things like detecting certain programming errors at
>>>>> compile time.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ph: 9845704792
>> 



Re: Memory leak defect or misssuse of SolrJ API?

2016-01-30 Thread Walter Underwood
Create one HttpSolrClient object for each Solr server you are talking to. Reuse 
it for all requests to that Solr server.

It will manage a pool of connections and keep them alive for faster 
communication.

I took a look at the JavaDoc and the wiki doc, neither one explains this well. 
I don’t think they even point out what is thread safe.
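
For what it’s worth, a minimal sketch of the intended pattern (the URL and field are illustrative):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class Indexer {
        // One client per Solr server, created once and shared for the life of the program.
        private static final HttpSolrClient SOLR =
                new HttpSolrClient("http://localhost:8983/solr/core1");

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                SOLR.add(doc);   // reuses the client's pooled connections on every call
            }
            SOLR.commit();
            SOLR.close();        // close once, at shutdown, not once per request
        }
    }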

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 30, 2016, at 7:42 AM, Susheel Kumar  wrote:
> 
> Hi Steve,
> 
> Can you please elaborate what error you are getting and i didn't understand
> your code above, that why initiating Solr client object  is in loop.  In
> general  creating client instance should be outside the loop and a one time
> activity during the complete execution of program.
> 
> Thanks,
> Susheel
> 
> On Sat, Jan 30, 2016 at 8:15 AM, Steven White  wrote:
> 
>> Hi folks,
>> 
>> I'm getting memory leak in my code.  I narrowed the code to the following
>> minimal to cause the leak.
>> 
>>while (true) {
>>HttpSolrClient client = new HttpSolrClient("
>> http://192.168.202.129:8983/solr/core1";);
>>client.close();
>>}
>> 
>> Is this a defect or an issue in the way I'm using HttpSolrClient?
>> 
>> I'm on Solr 5.2.1
>> 
>> Thanks.
>> 
>> Steve
>> 



Re: Memory leak defect or misssuse of SolrJ API?

2016-01-31 Thread Walter Underwood
I already answered this.

Move the creation of the HttpSolrClient outside the loop. Your code will run 
much faster, because it will be able to reuse the connections.

Put another way, your program should have exactly as many HttpSolrClient 
objects as there are servers it talks to. If there is one Solr server, you have 
one object.

There is no leak in HttpSolrClient, you are misusing the class, massively.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 31, 2016, at 2:10 PM, Steven White  wrote:
> 
> Thank you all for your feedback.
> 
> This is code that I inherited and the example i gave is intended to
> demonstrate the memory leak which based on YourKit is
> on java/util/LinkedHashMap$Entry.  In short, I'm getting core dumps with
> "Detail "java/lang/OutOfMemoryError" "Java heap space" received "
> 
> Here is a more detailed layout of the code.  This is a crawler that runs
> 24x7 without any recycle logic in place:
> 
>init_data()
> 
>while (true)
>{
>HttpSolrClient client = new HttpSolrClient("
> http://localhost:8983/solr/core1 <http://192.168.202.129:8983/solr/core1>/");
> <<<< this is real code
> 
>see_if_we_have_new_data();
> 
>send_new_data_to_solr();
> 
>client.close();<<<< this is real code
> 
>sleep_for_a_bit(N);<<<< 'N' can be any positive int
>}
> 
> By default, our Java program is given 4gb of ram "-Xmx4g" and N is set for
> 5 min.  We had a customer set N to 10 second and we started seeing core
> dumps with OOM.  As I started to debug, I narrowed the OOM to
> HttpSolrClient per my original email.
> 
> The follow up answers I got suggest that I move the construction of
> HttpSolrClient object outside the while loop which I did (but I also had to
> move "client.close()" outside the loop) and the leak is gone.
> 
> Give this, is this how HttpSolrClient is suppose to be used?  If so, what's
> the point of HttpSolrClient.close()?
> 
> Another side question.  I noticed HttpSolrClient has a setBaseUrl().  Now,
> if I call it and give it "http://localhost:8983/solr/core1
> <http://192.168.202.129:8983/solr/core1>/" (ntoice the "/" at the end) next
> time I use HttpSolrClient to send Solr data, I get back 404. The fix is to
> remove the ending "/".  This is not how the constructor of HttpSolrClient
> behaves; HttpSolrClient will take the URL with or without "/".
> 
> In summary, it would be good if someone can confirm f we have a memory leak
> in HttpSolrClient if used per my example; if so this is a defect.  Also,
> can someone confirm the fix I used for this issue: move the constructor of
> HttpSolrClient outside the loop and reuse the existing object "client".
> 
> Again, thank you all for the quick response it is much appreciated.
> 
> Steve
> 
> 
> 
> On Sat, Jan 30, 2016 at 1:24 PM, Erick Erickson 
> wrote:
> 
>> Assuming you're not really using code like above and it's a test case
>> 
>> What's your evidence that memory consumption goes up? Are you sure
>> you're not just seeing uncollected garbage?
>> 
>> When I attached Java Mission Control to this program it looked pretty
>> scary at first, but the heap allocated after old generation garbage
>> collections leveled out to a steady state.
>> 
>> 
>> On Sat, Jan 30, 2016 at 9:29 AM, Walter Underwood 
>> wrote:
>>> Create one HttpSolrClient object for each Solr server you are talking
>> to. Reuse it for all requests to that Solr server.
>>> 
>>> It will manage a pool of connections and keep them alive for faster
>> communication.
>>> 
>>> I took a look at the JavaDoc and the wiki doc, neither one explains this
>> well. I don’t think they even point out what is thread safe.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jan 30, 2016, at 7:42 AM, Susheel Kumar 
>> wrote:
>>>> 
>>>> Hi Steve,
>>>> 
>>>> Can you please elaborate what error you are getting and i didn't
>> understand
>>>> your code above, that why initiating Solr client object  is in loop.  In
>>>> general  creating client instance should be outside the loop and a one
>> time
>>>> activity during the complete execution of program.
>>>> 
>>>> Thanks,
>>>> Susheel
>>>> 
>>>> On Sat, Jan 30, 2016 at 8:15 AM, Steven White 
>> wrote:
>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> I'm getting memory leak in my code.  I narrowed the code to the
>> following
>>>>> minimal to cause the leak.
>>>>> 
>>>>>   while (true) {
>>>>>   HttpSolrClient client = new HttpSolrClient("
>>>>> http://192.168.202.129:8983/solr/core1";);
>>>>>   client.close();
>>>>>   }
>>>>> 
>>>>> Is this a defect or an issue in the way I'm using HttpSolrClient?
>>>>> 
>>>>> I'm on Solr 5.2.1
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> Steve
>>>>> 
>>> 
>> 



Re: Memory leak defect or misssuse of SolrJ API?

2016-01-31 Thread Walter Underwood
The JavaDoc needs a lot more information. As I remember it, SolrJ started as a 
thin layer over Apache HttpClient, so the authors may have assumed that 
programmers were familiar with that library. HttpClient makes a shared object 
that manages a pool of connections to the target server. HttpClient is 
seriously awesome—I first used it in the late 1990’s when I hit the limitations 
of the URL classes written by Sun.

I looked at the JavaDoc and various examples and none of them make this clear. 
Not your fault, we need a serious upgrade on those docs.

On the plus side, your program should be a lot faster after you reuse the 
client class.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 31, 2016, at 3:46 PM, Steven White  wrote:
> 
> Thanks Walter.  Yes, I saw your answer and fixed the issue per your
> suggestion.
> 
> The JavaDoc need to make this clear.  The fact there is a close() on this
> class and the JavaDoc does not say "your program should have exactly as
> many HttpSolrClient objects as there are servers it talks to" is a prime
> candidate for missuses of the class.
> 
> Steve
> 
> 
> On Sun, Jan 31, 2016 at 5:20 PM, Walter Underwood 
> wrote:
> 
>> I already answered this.
>> 
>> Move the creation of the HttpSolrClient outside the loop. Your code will
>> run much fast, because it will be able to reuse the connections.
>> 
>> Put another way, your program should have exactly as many HttpSolrClient
>> objects as there are servers it talks to. If there is one Solr server, you
>> have one object.
>> 
>> There is no leak in HttpSolrClient, you are misusing the class, massively.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jan 31, 2016, at 2:10 PM, Steven White  wrote:
>>> 
>>> Thank you all for your feedback.
>>> 
>>> This is code that I inherited and the example i gave is intended to
>>> demonstrate the memory leak which based on YourKit is
>>> on java/util/LinkedHashMap$Entry.  In short, I'm getting core dumps with
>>> "Detail "java/lang/OutOfMemoryError" "Java heap space" received "
>>> 
>>> Here is a more detailed layout of the code.  This is a crawler that runs
>>> 24x7 without any recycle logic in place:
>>> 
>>>   init_data()
>>> 
>>>   while (true)
>>>   {
>>>   HttpSolrClient client = new HttpSolrClient("
>>> http://localhost:8983/solr/core1 <http://192.168.202.129:8983/solr/core1
>>> /");
>>> <<<< this is real code
>>> 
>>>   see_if_we_have_new_data();
>>> 
>>>   send_new_data_to_solr();
>>> 
>>>   client.close();<<<< this is real code
>>> 
>>>   sleep_for_a_bit(N);<<<< 'N' can be any positive int
>>>   }
>>> 
>>> By default, our Java program is given 4gb of ram "-Xmx4g" and N is set
>> for
>>> 5 min.  We had a customer set N to 10 second and we started seeing core
>>> dumps with OOM.  As I started to debug, I narrowed the OOM to
>>> HttpSolrClient per my original email.
>>> 
>>> The follow up answers I got suggest that I move the construction of
>>> HttpSolrClient object outside the while loop which I did (but I also had
>> to
>>> move "client.close()" outside the loop) and the leak is gone.
>>> 
>>> Give this, is this how HttpSolrClient is suppose to be used?  If so,
>> what's
>>> the point of HttpSolrClient.close()?
>>> 
>>> Another side question.  I noticed HttpSolrClient has a setBaseUrl().
>> Now,
>>> if I call it and give it "http://localhost:8983/solr/core1
>>> <http://192.168.202.129:8983/solr/core1>/" (ntoice the "/" at the end)
>> next
>>> time I use HttpSolrClient to send Solr data, I get back 404. The fix is
>> to
>>> remove the ending "/".  This is not how the constructor of HttpSolrClient
>>> behaves; HttpSolrClient will take the URL with or without "/".
>>> 
>>> In summary, it would be good if someone can confirm f we have a memory
>> leak
>>> in HttpSolrClient if used per my example; if so this is a defect.  Also,
>>> can someone confirm the fix I used for this issue: move the constructor
>> of
>>> HttpSolrClient outside the loop and reuse the existing object "client".
>

Re: large number of fields

2016-02-05 Thread Walter Underwood
I would add a multiValued field for buying_customers. Add the customer ID for 
each relevant customer to that field. Then use a boost query “bq”, to boost 
those.

Try that first before using the hit rate. Always try on/off control before 
going proportional. The simple approach will probably give you 80% of the 
benefit. Then you can declare victory and go on to the next idea.

If you do need hit rate, try quantizing that into high/medium/low, or deciles, 
or something. Then you have one multiValued field for each level and one bq for 
each level. The bq will include a weight: bq=customer_hi:1234^8. Logarithmic 
levels are probably your friend here.

You can get some unwanted idf scoring with bq. Customers that only buy a few 
things get a higher bq weight than customers that buy a lot of things. You can 
fix that with function queries, but I’d get it working with a boost query first.
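
A rough sketch of the quantized version (field names, customer ID, and weights are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrInputDocument;

    public class CustomerBoost {
        public static void main(String[] args) {
            // Index side: one multiValued field per hit-rate level, holding customer IDs.
            SolrInputDocument product = new SolrInputDocument();
            product.addField("id", "prod-42");
            product.addField("customer_hi", "cust-1234");   // customers who buy this a lot
            product.addField("customer_med", "cust-5678");
            product.addField("customer_lo", "cust-9012");

            // Query side: boost the current customer at each level, with logarithmic weights.
            SolrQuery q = new SolrQuery("paint brushes");
            q.set("defType", "edismax");
            q.set("bq", "customer_hi:cust-1234^8",
                        "customer_med:cust-1234^4",
                        "customer_lo:cust-1234^2");
            System.out.println(q);
        }
    }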

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 5, 2016, at 8:13 AM, Jack Krupansky  wrote:
> 
> This doesn't sound like a great use case for Solr - or any other search
> engine for that matter. I'm not sure what you are really trying to
> accomplish, but you are trying to put way too many balls in the air to
> juggle efficiently. You really need to re-conceptualize your problem so
> that it has far fewer moving parts. Sure, Solr can handle many millions or
> even billions of documents, but the focus for scaling Solr is on more
> documents and more nodes, not incredibly complex or large documents. The
> key to effective and efficient use of Solr is that queries are "quite
> short", definitely not "quite long."
> 
> That said, the starting point for any data modeling effort is to look at
> the full range of desired queries and that should drive the data model. So,
> give us more info on queries, in terms of plain English descriptions of
> what the user is trying to achieve.
> 
> 
> -- Jack Krupansky
> 
> On Fri, Feb 5, 2016 at 8:20 AM, Jan Verweij - Experts in search <
> j...@searchxperts.nl> wrote:
> 
>> Hi,
>> We store 50K products stored in Solr. We have 10K customers and each
>> customer buys up to 10K of these products. Now we want to influence the
>> results by adding a field for every customer.
>> So we end up with 10K fields to influence the results on the buying
>> behavior of
>> each customer (personal results). Don't think this is the way to go so I'm
>> looking for suggestions how to solve
>> this.
>> One other option would be to: 1. create one multivaluefield
>> 'company_hitrate'
>> 2. store for each company their [companyID]_[hitrate]
>> 
>> During search use boostfields [companyID]_50 …. [companyID]_100 So in this
>> case the query can become quit long (51 options) but the number of
>> fields is limited to 1. What kind of effect would this have on the search
>> performance
>> Any other suggestions?
>> Jan.



Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Making two indexing calls, one to each, works until one system is not 
available. Then they are out of sync.

You might want to put the updates into a persistent message queue, then have 
both systems index from that queue.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 9, 2016, at 1:49 PM, Upayavira  wrote:
> 
> There is a Cross Datacenter replication feature in the works - not sure
> of its status.
> 
> In lieu of that, I'd simply have two copies of your indexing code -
> index everything simultaneously into both clusters.
> 
> There is, of course risks that both get out of sync, so you might want
> to find some ways to identify/manage that.
> 
> Upayavira
> 
> On Tue, Feb 9, 2016, at 08:43 PM, tedsolr wrote:
>> I have a Solr Cloud cluster (v5.2.1) using a Zookeeper ensemble in my
>> primary
>> data center. I am now trying to plan for disaster recovery with an
>> available
>> warm site. I have read (many times) the disaster recovery section in the
>> Apache ref guide. I suppose I don't fully understand it.
>> 
>> What I'd like to know is the best way to sync up the existing data, and
>> the
>> best way to keep that data in sync. Assume that the warm site is an exact
>> copy (not at the network level) of the production cluster - so the same
>> servers with the same config. All servers are virtual. The use case is
>> the
>> active cluster goes down and cannot be repaired, so the warm site would
>> become the active site. This is a manual process that takes many hours to
>> accomplish (I just need to fit Solr into this existing process, I can't
>> change the process :).
>> 
>> I expect that rsync can be used initially to copy the collection data
>> folders and the zookeeper data and transaction log folders. So after
>> verifying Solr/ZK is functional after the install, shut it down and
>> perform
>> the copy. This may sound slow but my production index size is < 100GB. Is
>> this approach reasonable?
>> 
>> So now to keep the warm site in sync, I could use rsync on a scheduled
>> basis
>> but I assume there's a better way. The ref guide says to send all
>> indexing
>> requests to the second cluster at the same time they are sent to the
>> active
>> cluster. I use SolrJ for all requests. So would this entail using a
>> second
>> CloudSolrClient instance that only knows about the second cluster? Seems
>> reasonable but I don't want to lengthen the response time for the users.
>> Is
>> this just a software problem to work out (separate thread)? Or is there a
>> SolrJ solution (asyc calls)?
>> 
>> Thanks!!
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/replicate-indexing-to-second-site-tp4256240.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
Updating two systems in parallel gets into two-phase commit, instantly. So you 
need a persistent pool of updates that both clusters pull from.
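
A rough sketch of that shape: the UpdateSource interface below is a stand-in for whatever durable queue or change log you use, and each cluster runs its own consumer with its own persisted position.

    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    /** One consumer per cluster; each tracks its own position in the shared update stream. */
    public class ClusterIndexer implements Runnable {
        // Hypothetical durable source of updates (message queue, database change log, ...).
        interface UpdateSource {
            List<SolrInputDocument> fetchAfter(long position); // empty list when caught up
            long positionOf(List<SolrInputDocument> batch);    // position of the last doc returned
        }

        private final CloudSolrClient solr;
        private final UpdateSource source;
        private long position; // persisted per cluster in a real system

        ClusterIndexer(CloudSolrClient solr, UpdateSource source, long startPosition) {
            this.solr = solr;
            this.source = source;
            this.position = startPosition;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    List<SolrInputDocument> batch = source.fetchAfter(position);
                    if (batch.isEmpty()) {
                        Thread.sleep(1000);
                        continue;
                    }
                    solr.add(batch);
                    solr.commit();
                    position = source.positionOf(batch); // advance only after a successful add
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                } catch (Exception e) {
                    // Position is not advanced, so this cluster retries and catches up later.
                }
            }
        }
    }

If one cluster goes down, its consumer simply stops advancing; when it comes back it drains the backlog and ends up consistent with the other cluster, with no two-phase commit.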

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 9, 2016, at 4:15 PM, Shawn Heisey  wrote:
> 
> On 2/9/2016 1:43 PM, tedsolr wrote:
>> I expect that rsync can be used initially to copy the collection data
>> folders and the zookeeper data and transaction log folders. So after
>> verifying Solr/ZK is functional after the install, shut it down and perform
>> the copy. This may sound slow but my production index size is < 100GB. Is
>> this approach reasonable?
>> 
>> So now to keep the warm site in sync, I could use rsync on a scheduled basis
>> but I assume there's a better way. The ref guide says to send all indexing
>> requests to the second cluster at the same time they are sent to the active
>> cluster. I use SolrJ for all requests. So would this entail using a second
>> CloudSolrClient instance that only knows about the second cluster? Seems
>> reasonable but I don't want to lengthen the response time for the users. Is
>> this just a software problem to work out (separate thread)? Or is there a
>> SolrJ solution (asyc calls)?
> 
> The way I would personally handle keeping both systems in sync at the
> moment would be to modify my indexing system to update both systems in
> parallel.  That likely would involve a second CloudSolrClient instance.
> 
> There's a new feature called "Cross Data Center Replication" but as far
> as I know, it is only available in development versions, and has not
> been made available in any released version of Solr.
> 
> http://yonik.com/solr-cross-data-center-replication/
> 
> This new feature may become available in 6.0 or a later 6.x release.  I
> do not have any concrete information about the expected release date for
> 6.0.
> 
> Thanks,
> Shawn
> 



Re: replicate indexing to second site

2016-02-09 Thread Walter Underwood
I agree. If the system updates synchronously, then you are in two-phase commit 
land. If you have a persistent store that each index can track, then things are 
good.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 9, 2016, at 7:37 PM, Shawn Heisey  wrote:
> 
> On 2/9/2016 5:48 PM, Walter Underwood wrote:
>> Updating two systems in parallel gets into two-phase commit, instantly. So 
>> you need a persistent pool of updates that both clusters pull from.
> 
> My indexing system does exactly what I have suggested for tedsolr -- it
> updates multiple copies of my index in parallel.  My data source is MySQL.
> 
> For each copy, information about the last successful update is
> separately tracked, so if one of the index copies goes offline, the
> other stays current.  When the offline system comes back, it will be
> updated from the saved position, and will eventually have the same
> information as the system that did not go offline.
> 
> As far as two-phase commit goes, that would make it so that neither copy
> of the index would stay current if one of them went offline.  In most
> situations I can think of, that's not really very useful.
> 
> Thanks,
> Shawn
> 



Re: Knowing which doc failed to get added in solr during bulk addition in Solr 5.2

2016-02-11 Thread Walter Underwood
I first wrote the “fall back to one at a time” code for Solr 1.3.

It is pretty easy if you plan for it. Make the batch size variable. When a 
batch fails, retry with a batch size of 1 for that particular batch. Then keep 
going or fail; either way, you have good logging on which one failed.
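
A minimal sketch of that fallback (error handling is simplified and an "id" field is assumed):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FallbackIndexer {
        private final SolrClient client;

        public FallbackIndexer(SolrClient client) {
            this.client = client;
        }

        /** Send a batch; if it fails, resend the same docs one at a time to isolate the bad ones. */
        public void addWithFallback(List<SolrInputDocument> batch) {
            try {
                client.add(batch);
            } catch (Exception batchFailure) {
                for (SolrInputDocument doc : batch) {
                    try {
                        client.add(doc);
                    } catch (Exception docFailure) {
                        // Log exactly which document was rejected and why.
                        System.err.println("Rejected doc " + doc.getFieldValue("id")
                                + ": " + docFailure.getMessage());
                    }
                }
            }
        }
    }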

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 11, 2016, at 10:06 AM, Erick Erickson  wrote:
> 
> Steven's solution is a very common one, complete to the
> notion of re-chunking. Depending on the throughput requirements,
> simply resending the offending packet one at a time is often
> sufficient (but not _efficient). I can imagine fallback scenarios
> like "try chunking 100 at a time, for those chunks that fail
> do 10 at a time and for those do 1 at a time".
> 
> That said, in a lot of situations, the number of failures is low
> enough that just falling back to one at a time while not elegant
> is sufficient
> 
> It sure will be nice to have SOLR-445 done, if we can just keep
> Hoss from going crazy before he gets done.
> 
> Best,
> Erick
> 
> On Thu, Feb 11, 2016 at 7:39 AM, Steven White  wrote:
>> For my application, the solution I implemented is I log the chunk that
>> failed into a file.  This file is than post processed one record at a
>> time.  The ones that fail, are reported to the admin and never looked at
>> again until the admin takes action.  This is not the most efficient
>> solution right now but I intend to refactor this code so that the failed
>> chunk is itself re-processed in smaller chunks till the chunk with the
>> failed record(s) is down to 1 record "chunk" that will fail.
>> 
>> Like Debraj, I would love to hear from others how they handle such failures.
>> 
>> Steve
>> 
>> 
>> On Thu, Feb 11, 2016 at 2:29 AM, Debraj Manna 
>> wrote:
>> 
>>> Thanks Erik. How do people handle this scenario? Right now the only option
>>> I can think of is to replay the entire batch by doing add for every single
>>> doc. Then this will give me error for all the docs which got added from the
>>> batch.
>>> 
>>> On Tue, Feb 9, 2016 at 10:57 PM, Erick Erickson 
>>> wrote:
>>> 
>>>> This has been a long standing issue, Hoss is doing some current work on
>>> it
>>>> see:
>>>> https://issues.apache.org/jira/browse/SOLR-445
>>>> 
>>>> But the short form is "no, not yet".
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Tue, Feb 9, 2016 at 8:19 AM, Debraj Manna 
>>>> wrote:
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> 
>>>>> I have a Document Centric Versioning Constraints added in solr schema:-
>>>>> 
>>>>> 
>>>>>  false
>>>>>  doc_version
>>>>> 
>>>>> 
>>>>> I am adding multiple documents in solr in a single call using SolrJ
>>> 5.2.
>>>>> The code fragment looks something like below :-
>>>>> 
>>>>> 
>>>>> try {
>>>>>UpdateResponse resp = solrClient.add(docs.getDocCollection(),
>>>>>500);
>>>>>if (resp.getStatus() != 0) {
>>>>>throw new Exception(new StringBuilder(
>>>>>"Failed to add docs in solr ").append(resp.toString())
>>>>>.toString());
>>>>>}
>>>>>} catch (Exception e) {
>>>>>logError("Adding docs to solr failed", e);
>>>>>}
>>>>> 
>>>>> 
>>>>> If one of the document is violating the versioning constraints then
>>> Solr
>>>> is
>>>>> returning an exception with error message like "user version is not
>>> high
>>>>> enough: 1454587156" & the other documents are getting added perfectly.
>>> Is
>>>>> there a way I can know which document is violating the constraints
>>> either
>>>>> in Solr logs or from the Update response returned by Solr?
>>>>> 
>>>>> Thanks
>>>> 
>>> 



Re: words with spaces within

2016-02-22 Thread Walter Underwood
This happens for fonts where Tika does not have font metrics. Open the document 
in Adobe Reader, then use document info to find the list of fonts.

Then post this question to the Tika list.

Fix it in Tika, don’t patch it in Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 22, 2016, at 6:40 PM, Binoy Dalal  wrote:
> 
> Is there some set pattern to how these words occur or do they occur
> randomly in the text, i.e., somewhere it'll be "subtitle" and somewhere "s
> u b t i t l e"?
> 
> On Tue, 23 Feb 2016, 05:01 Francisco Andrés Fernández 
> wrote:
> 
>> Hi all,
>> I'm extracting some text from pdf. As result, some important words end with
>> spaces between characters. I know they are words but, don't know how to
>> make Solr detect and index them.
>> For example, I could have the word "Subtitle" that I want to detect,
>> written like "S u b t i t l e". If I would parse the text with a standard
>> tokenizer, the word will be lost.
>> How could I make Solr detect this type of word occurrence?
>> Many thanks,
>> 
>> Francisco
>> 
> -- 
> Regards,
> Binoy Dalal



Re: What search metrics are useful?

2016-02-24 Thread Walter Underwood
Click through rate (CTR) is fundamental. That is easy to understand and 
integrates well with other business metrics like conversion. CTR is at least 
one click anywhere in the result set (first page, second page, …). Count 
multiple clicks as a single success. The metric is, “at least one click”.

No hit rate is sort of useful, but you need to know which queries are getting 
no hits, so you can fix it.

For latency metrics, look at 90th percentile or 95th percentile. Average is 
useless because response time is a one-sided distribution, so it will be thrown 
off by outliers. Percentiles have a direct customer satisfaction 
interpretation. 90% of searches were under one second, for example. Median 
response time should be very, very fast because of caching in Solr. During busy 
periods, our median response time is about 1.5 ms.
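
As a tiny illustration of why percentiles and not averages, a nearest-rank sketch over raw response times (the sample numbers are made up):

    import java.util.Arrays;

    public class LatencyPercentiles {
        /** Nearest-rank percentile over a sample of response times in milliseconds. */
        static double percentile(double[] latenciesMs, double pct) {
            double[] sorted = latenciesMs.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            double[] sample = {1.2, 1.5, 2.0, 3.1, 4.8, 12.0, 950.0};
            System.out.printf("avg=%.1f ms, median=%.1f ms, p90=%.1f ms, p95=%.1f ms%n",
                    Arrays.stream(sample).average().orElse(0),
                    percentile(sample, 50), percentile(sample, 90), percentile(sample, 95));
        }
    }

One slow outlier drags the average way up while the median barely moves, which is exactly why the average tells you so little.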

Number of different queries per conversion is a good way to look at how query 
assistance is working. Things like autosuggest, fuzzy, etc.

About 10% of queries will be misspelled, so you do need to deal with that.

Finding underperforming queries is trickier. I really need to write an article 
on that.

“Search Analytics for Your Site” by Lou Rosenfeld is a good introduction.

http://rosenfeldmedia.com/books/search-analytics-for-your-site/ 

Sea Urchin is doing some good work in search metrics: https://seaurchin.io/ 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
Search Guy, Chegg

> On Feb 24, 2016, at 2:38 AM, Emir Arnautovic  
> wrote:
> 
> Hi Bill,
> You can take a look at Sematext's search analytics 
> (https://sematext.com/search-analytics). It provides some of the metrics you 
> mentioned, plus some additional (top queries, CTR, click stats, paging stats 
> etc.). In combination with Sematext's performance metrics 
> (https://sematext.com/spm) you can have a full picture of your search 
> infrastructure.
> 
> Regards,
> Emir
> 
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> On 24.02.2016 04:07, William Bell wrote:
>> How do others look at search metrics?
>> 
>> 1. Search conversion? Do you look at searches and if the user does not
>> click on a result, and reruns the search that would be a failure?
>> 
>> 2. How to measure auto complete success metrics?
>> 
>> 3. Facets/filters could be considered negative, since we did not find the
>> results that the user wanted, and now they are filtering - how to measure?
>> 
>> 4. One easy metric is searches with 0 results. We could auto expand the geo
>> distance or ask the user "did you mean" ?
>> 
>> 5. Another easy one would be tech performance: "time it takes in seconds to
>> get a result".
>> 
>> 6. How to measure fuzzy? How do you know you need more synonyms? How to
>> measure?
>> 
>> 7. How many searches it takes before the user clicks on a result?
>> 
>> Other ideas? Is there a video or presentation on search metrics that would
>> be useful?
>> 
> 



Re: Query time de-boost

2016-02-25 Thread Walter Underwood
Another approach is to boost everything but that content.

This bq should work:

*:* -ContentGroup:"Developer's Documentation"

Or a function query in the boost parameter, with an if statement.

Or make ContentGroup an enum with different values for each group, and use a 
function query to boost by that value. 

I haven’t tried any of these, of course.
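
A minimal SolrJ sketch of the first suggestion (SolrJ 5.x-style client; the
collection name, qf fields, and weights are illustrative, and the extra bq
clauses mirror the ones quoted below). edismax accepts bq as a multivalued
parameter, so the clauses can be sent separately:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DeBoostExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/techdocs");
            SolrQuery q = new SolrQuery("getting started");
            q.set("defType", "edismax");
            q.set("qf", "title^2 body");
            // Boost everything that is NOT in the de-boosted content group,
            // alongside the other bq clauses from the thread.
            q.set("bq", "Source:simplecontent^10", "Source:Help^20",
                    "(*:* -ContentGroup:\"Developer's Documentation\")^99");
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getResults().getNumFound() + " hits");
            solr.close();
        }
    }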

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 25, 2016, at 3:33 PM, Binoy Dalal  wrote:
> 
> According to the edismax documentation, negative boosts are supported, so
> you should certainly give it a try.
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> 
> On Fri, 26 Feb 2016, 03:45 shamik  wrote:
> 
>> Emir, I don't think Solr supports a negative boosting *^-99* syntax like this. I
>> can certainly do something like:
>> 
>> bq=(*:* -ContetGroup:"Developer's Documentation")^99 , but then I can't
>> have
>> my other bq parameters.
>> 
>> This doesn't work --> bq=Source:simplecontent^10 Source:Help^20 (*:*
>> -ContetGroup:"Developer's Documentation")^99
>> 
>> Are you sure something like *bq=ContenGroup-local:Developer^-99* worked for
>> you?
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259879.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 
> -- 
> Regards,
> Binoy Dalal



Disable phrase search in edismax?

2016-02-26 Thread Walter Underwood
I’m creating a query from MLT terms, then sending it to edismax. The 
neighboring words in the query are not meaningful phrases.

Is there a way to turn off phrase creation and search for one query? Or should 
I separate them all with “OR”?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Walter Underwood

> On Feb 28, 2016, at 2:40 PM,  
>  wrote:
> 
> But Disney Land and Disney World are actually really good examples of places 
> where the magic stuff is suitable, ...


As a former Disney employee, those are properly “Disneyland” and “Walt Disney 
World”, which makes a good example of the need for shingle-type synonyms.

wunder
Walter Underwood
Former GO.com/Infoseek search engineer
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Indexing books, chapters and pages

2016-03-01 Thread Walter Underwood
You could index both pages and chapters, with a type field.

You could index by chapter with the page number as a payload for each token.
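
A minimal sketch of the first option (field names invented for illustration):
pages and chapters as separate documents that share book/chapter ids and carry
a type discriminator, so results can later be grouped or filtered by type:

    import java.util.Arrays;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BookIndexer {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/books");

            SolrInputDocument chapter = new SolrInputDocument();
            chapter.addField("id", "book42_ch3");
            chapter.addField("type", "chapter");
            chapter.addField("book_id", "book42");
            chapter.addField("chapter_no", 3);
            chapter.addField("text", "full chapter text, so phrases across page breaks still match");

            SolrInputDocument page = new SolrInputDocument();
            page.addField("id", "book42_ch3_p57");
            page.addField("type", "page");
            page.addField("book_id", "book42");
            page.addField("chapter_no", 3);
            page.addField("page_no", 57);
            page.addField("text", "text of page 57 only");

            solr.add(Arrays.asList(chapter, page));
            solr.commit();
            solr.close();
        }
    }

Queries can then filter with fq=type:chapter or fq=type:page and group on
book_id, along the lines discussed in the quoted thread below.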

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati  wrote:
> 
> Thank you, Jack for your answer.
> There are 2 reasons:
> 1. the requirement is to show in the result list both books and chapters
> grouped, so I would have to execute the query grouping by book, retrieve
> first, let's say, 10 books (sorted by relevance) and then for each book
> repeat the query grouping by chapter (always ordering by relevance) in
> order to obtain what we need (unfortunately it is not up to me to define the
> requirements... but it does make sense). Unless there exists some SOLR
> feature to do this in only one call (and that would be great!).
> 2. searching on pages will not match phrases that span across 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> "sentence", searching for "broken sentence" will not match)
> However, if we do not find a better solution, I think your proposal is
> not so bad... I hope that reason #2 is negligible and that #1
> performs quite fast even though we are multiplying queries.
> 
> Il giorno mar 1 mar 2016 alle ore 14:28 Jack Krupansky <
> jack.krupan...@gmail.com> ha scritto:
> 
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You can
>> then use grouping to group results by book (title text) or even chapter
>> (title text and/or number). Maybe initially group by book and then if the
>> user selects a book group you can re-query with the specific book and then
>> group by chapter.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
>> wrote:
>> 
>>> Original data is quite well structured: it comes in XML with chapters and
>>> tags to mark the original page breaks on the paper version. In this way
>> we
>>> have the possibility to restructure it almost as we want before creating
>>> SOLR index.
>>> 
>>> Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
>>> jack.krupan...@gmail.com> ha scritto:
>>> 
>>>> To start, what is the form of your input data - is it already divided
>>> into
>>>> chapters and pages? Or... are you starting with raw PDF files?
>>>> 
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> I'm searching for ideas on how to define schema and how to perform
>>>> queries
>>>>> in this use case: we have to index books, each book is split into
>>>> chapters
>>>>> and chapters are split into pages (pages represent original page
>>> cutting
>>>> in
>>>>> printed version). We should show the result grouped by books and
>>> chapters
>>>>> (for the same book) and pages (for the same chapter). As far as I
>> know,
>>>> we
>>>>> have 2 options:
>>>>> 
>>>>> 1. index pages as SOLR documents. In this way we could theoretically
>>>>> retrieve chapters (and books?)  using grouping but
>>>>>a. we will miss matches across two contiguous pages (page cutting
>>> is
>>>>> only due to typographical needs so concepts could be split... as in
>>>> printed
>>>>> books)
>>>>>b. I don't know if it is possible in SOLR to group results on two
>>>>> different levels (books and chapters)
>>>>> 
>>>>> 2. index chapters as SOLR documents. In this case we will have the
>>> right
>>>>> matches but how to obtain the matching pages? (we need pages because
>>> the
>>>>> client can only display pages)
>>>>> 
>>>>> we have been struggling on this problem for a lot of time and we're
>>> not
>>>>> able to find a suitable solution so I'm looking if someone has ideas
>> or
>>>> has
>>>>> already solved a similar issue.
>>>>> Thanks
>>>>> 
>>>> 
>>> 
>> 



Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
If you need transactions, you should use a different system, like MarkLogic.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 3, 2016, at 8:46 PM, sangs8788  
> wrote:
> 
> Hi Emir,
> 
> Right now we are having only inserts into SOLR. The main reason for having
> commit after each document is to get a guarantee that the document has got
> indexed in solr. Until the commit status is received back the document will
> not be deleted from MQ. So that even if there is a commit failure the
> document can be resent from MQ.
> 
> Thanks
> Sangeetha
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261575.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
So batch them. You get a response back from Solr whether the document was 
accepted. If that fails, there is a failure. What do you do then?

After every 100 docs or one minute, do a commit. Then delete the documents from 
the input queue. What do you do when the commit fails?
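
A rough SolrJ sketch of that batching pattern; the message-queue calls are
hypothetical placeholders, and the thresholds are the 100-docs / one-minute
values mentioned above. Documents are acknowledged on the queue only after a
successful commit, so a failed commit just means redelivery:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedIndexer {
        private static final int BATCH_SIZE = 100;
        private static final long MAX_WAIT_MS = 60_000;

        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
            List<SolrInputDocument> pending = new ArrayList<>();
            long lastCommit = System.currentTimeMillis();

            while (true) {
                SolrInputDocument doc = pollFromQueue();   // hypothetical MQ call
                if (doc != null) {
                    solr.add(doc);                         // the add itself reports per-document failures
                    pending.add(doc);
                }
                boolean timeUp = System.currentTimeMillis() - lastCommit > MAX_WAIT_MS;
                if (pending.size() >= BATCH_SIZE || (timeUp && !pending.isEmpty())) {
                    try {
                        solr.commit();
                        acknowledgeOnQueue(pending);       // safe to delete from the MQ now
                    } catch (Exception e) {
                        // commit failed: leave the messages on the queue for redelivery
                    }
                    pending.clear();
                    lastCommit = System.currentTimeMillis();
                }
            }
        }

        private static SolrInputDocument pollFromQueue() { return null; }        // placeholder
        private static void acknowledgeOnQueue(List<SolrInputDocument> docs) { } // placeholder
    }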

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 3, 2016, at 8:46 PM, sangs8788  
> wrote:
> 
> Hi Emir,
> 
> Right now we are having only inserts into SOLR. The main reason for having
> commit after each document is to get a guarantee that the document has got
> indexed in solr. Until the commit status is received back the document will
> not be deleted from MQ. So that even if there is a commit failure the
> document can be resent from MQ.
> 
> Thanks
> Sangeetha
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261575.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: What is the best way to index 15 million documents of total size 425 GB?

2016-03-04 Thread Walter Underwood

> On Mar 3, 2016, at 9:54 AM, Aneesh Mon N  wrote:
> 
> To be noted that all the fields are stored so as to support the atomic
> updates.

Are you doing all of these updates as atomic? That could be slow. If you are 
supplying all the fields, then just do a regular add.
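
A small sketch of the distinction (illustrative field names). The atomic form
sends only the id plus modifier maps and makes Solr re-read the stored fields;
the regular add just sends the whole document:

    import java.util.Collections;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicVsFullAdd {
        public static void main(String[] args) {
            // Atomic update: a map value like {"set": ...} marks a partial update,
            // which is why every field has to be stored.
            SolrInputDocument atomic = new SolrInputDocument();
            atomic.addField("id", "doc-1");
            atomic.addField("price", Collections.singletonMap("set", 19.99));

            // Regular add: the indexing job already has every field, so it sends
            // the complete document and skips the read-modify-write.
            SolrInputDocument full = new SolrInputDocument();
            full.addField("id", "doc-1");
            full.addField("title", "Widget");
            full.addField("price", 19.99);

            System.out.println(atomic);
            System.out.println(full);
        }
    }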

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Walter Underwood
This is a very good presentation on using entity extraction in query 
understanding. As you’ll see from the preso, it is not easy.

http://www.slideshare.net/dtunkelang/better-search-through-query-understanding 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 6, 2016, at 7:27 AM, Jack Krupansky  wrote:
> 
> Back to the original question... there are two answers:
> 
> 1. Yes - for guru-level Solr experts.
> 2. No - for anybody else.
> 
> For starters, (as always), you would need to do a lot more upfront work on
> mapping out the forms of query which will be supported. For example, is
> your focus on precision or recall. And, are you looking to analyze all
> matching tweets or just a sample. And, the load, throughput, and latency
> requirements. And, any spatial search requirements. And, any entity search
> requirements. Without a clear view of the query requirements it simply
> isn't possible to even begin defining a data model. And without a data
> model, indexing is a fool's errand. In short, no focus, no progress.
> 
> -- Jack Krupansky
> 
> On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar  wrote:
> 
>> Entity Recognition means you may want to recognize different entities
>> name/person, email, location/city/state/country etc. in your
>> tweets/messages with the goal of providing more relevant results to users.
>> NER can be used at query or indexing (data enrichment) time.
>> 
>> Thanks,
>> Susheel
>> 
>> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>> 
>>> Thank you all very much for all the responses so far.  I've enjoyed
>> reading
>>> them!  We have noticed that storing data inside of Solr results in
>>> significantly worse performance (particularly faceting); so we store the
>>> values of all the fields elsewhere, but index all the data with Solr
>>> Cloud.  I think the suggestion about splitting the data up into blocks of
>>> date/time is where we would be headed.  Having two Solr-Cloud clusters -
>>> one to handle ~30 days of data, and one to handle historical.  Another
>>> option is to use a single Solr Cloud cluster, but use multiple
>>> cores/collections.  Either way you'd need a job to come through and clean
>>> up old data. The historical cluster would have much worse performance,
>>> particularly for clustering and faceting the data, but that may be
>>> acceptable.
>>> I don't know what you mean by 'entity recognition in the queries' - could
>>> you elaborate?
>>> 
>>> We would want to index and potentially facet on any of the fields - for
>>> example entities_media_url, username, even background color, but we do
>> not
>>> know a-priori what fields will be important to users.
>>> As to why we would want to make the data searchable; well - I don't make
>>> the rules!  Tweets is not the only data source, but it's certainly the
>>> largest that we are currently looking at handling.
>>> 
>>> I will read up on the Berlin Buzzwords - thank you for the info!
>>> 
>>> -Joe
>>> 
>>> 
>>> 
>>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky >> 
>>> wrote:
>>> 
>>>> As always, the initial question is how you intend to query the data -
>>> query
>>>> drives data modeling. How real-time do you need queries to be? How fast
>>> do
>>>> you need archive queries to be? How many fields do you need to query
>> on?
>>>> How much entity recognition do you need in queries?
>>>> 
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull 
>> wrote:
>>>> 
>>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
>>>>> 
>>>>>> Joseph Obernberger  wrote:
>>>>>> 
>>>>>>> Hi All - would it be reasonable to index the Twitter 'firehose'
>>>>>>> with Solr Cloud - roughly 500-600 million docs per day indexing
>>>>>>> each of the fields (about 180)?
>>>>>>> 
>>>>>> 
>>>>>> Possible, yes. Reasonable? It is not going to be cheap.
>>>>>> 
>>>>>> Twitter index the tweets themselves and have been quite open about
>>>>>> how they do it. I would suggest lo

Re: Solr Cloud sharding strategy

2016-03-07 Thread Walter Underwood
Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU intensive and generates lots of disk IO. Faster CPUs and 
faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile 
latency and that is dominated by rare and malformed queries.
* 5000 queries is not nearly enough. That totally fits in cache. I usually 
start with 100K, though I’d like more. Benchmarking a cached system is one of 
the hardest things in devops.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 7, 2016, at 4:27 PM, Erick Erickson  wrote:
> 
> Still, 50M is not excessive for a single shard although it's getting
> into the range that I'd like proof that my hardware etc. is adequate
> before committing to it. I've seen up to 300M docs on a single
> machine, admittedly they were tweets. YMMV based on hardware and index
> complexity of course. Here's a long blog about sizing:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> In this case I'd be pretty comfortable by creating a test harness
> (using jMeter or the like) and faking the extra 30M documents by
> re-indexing the current corpus but assigning new IDs ( Keep doing this until your target machine breaks (i.e. either blows up
> by exhausting memory or the response slows unacceptably) and that'll
> give you a good upper bound. Note that you should plan on a couple of
> rounds of tuning/testing when you start to have problems.
> 
> I'll warn you up front, though, that unless you have an existing app
> to mine for _real_ user queries, generating say 5,000 "typical"
> queries is more of a challenge than you might expect ;)...
> 
> Now, all that said all is not lost if you do go with a single shard.
> Let's say that 6 months down the road your requirements change. Or the
> initial estimate was off. Or
> 
> There are a couple of options:
> 1> create a new collection with more shards and re-index from scratch
> 2> use the SPLITSHARD Collections API all to, well, split the shard.
> 
> 
> In this latter case, a shard is split into two pieces of roughly equal
> size, which does mean that you can only grow your shard count by
> powers of 2.
> 
> And even if you do have a single shard, using SolrCloud is still a
> good thing as the failover is automagically handled assuming you have
> more than one replica...
> 
> Best,
> Erick
> 
> On Mon, Mar 7, 2016 at 4:05 PM, shamik  wrote:
>> Thanks a lot, Erick. You are right, it's a tad small with around 20 million
>> documents, but the growth projection around 50 million in next 6-8 months.
>> It'll continue to grow, but maybe not at the same rate. From the index size
>> point of view, the size can grow up to half a TB from its current state.
>> Honestly, my perception of "big" index is still vague :-) . All I'm trying
>> to make sure is that decision I take is scalable in the long term and will
>> be able to sustain the growth without compromising the performance.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262304.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Why is multiplicative boost prefered over additive?

2016-03-18 Thread Walter Underwood
Popularity has a very wide range. Try my example, scale 1 million and 100 into 
the same 1.0-0.0 range. Even with log popularity.

As another poster pointed out, text relevance scores also have a wide range.

In practice, I never could get additive boost to work right at Netflix at both 
ends of the popularity scale. I gave up and made it work for popular movies. 
Here at Chegg, multiplicative boost works fine.

Don’t think so much about the absolute values of the scores. All we care about 
is ordering. Work with real user queries, not with theory.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 18, 2016, at 5:34 AM,  
>  wrote:
> 
> On Thursday, March 17, 2016 7:58 PM, wun...@wunderwood.org wrote:
>> 
>> Think about using popularity as a boost. If one movie has a million rentals 
>> and one has a hundred rentals, there is no additive formula that balances 
>> that with text relevance. Even with log(popularity), it doesn't work.
> 
> I'm not sure I follow your logic now. If one can express the popularity as a 
> value between 0.0 and 1.0, why can't one use that, together with a weight 
> (indicating how much the popularity should influence the score, in general) 
> and add that to the text relevance score? And how, exactly, would I achieve 
> that using any multiplicative formula?
> 
> The logic of the weight, in this case, is that I want to be able to tweak how 
> much influence the popularity has on the final score (and thus the sort order 
> of the documents), where a weight of 0.0 would have the same effect as if the 
> popularity wasn't included in the boost logic at all, and a high enough 
> weight would have the same effect as if one sorted the documents solely on 
> popularity.
> 
> /Jimi



Re: Why is multiplicative boost prefered over additive?

2016-03-18 Thread Walter Underwood
That works fine if you have a query that matches things with a wide range of 
popularities. But that is the easy case.

What about the query “twilight”, which matches all the Twilight movies, all of 
which are popular (millions of views). Or “Lord of the Rings” which only 
matches movies with hundreds of views? People really will notice when the 1978 
animated version shows up before the Peter Jackson films.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 18, 2016, at 8:18 AM,  
>  wrote:
> 
> On Friday, March 18, 2016 3:53 PM, wun...@wunderwood.org wrote:
>> 
>> Popularity has a very wide range. Try my example, scale 1 million and 100 
>> into the same 1.0-0.0 range. Even with log popularity.
> 
>> Well, in our case, we don't really care to differentiate between documents 
> with low popularity. And if we know roughly what the popularity distribution 
> is it is not hard to normalize it to a value between 0.0 and 1.0. The most 
> simple approach is to simply focus on the maximum value, and mapping that 
> value to 1.0, so basically the normalization function is: 
> normalizedValue=value/maxValue. But knowing the mean and median, or other 
> statistical information, one could of course use a more advanced calculation.
> 
> In essence, if one can answer the question "How popular is this 
> document/movie/item?", using "extremely popular", "very popular", "quite 
> popular", "average", "not very popular" and "very unpopular" (ie popularity 
> normalized down to 6 possible values), it should not be that hard to 
> normalize the popularity to a value between 0.0 and 1.0.
> 
> /Jimi



Re: Why is multiplicative boost prefered over additive?

2016-03-19 Thread Walter Underwood
Think about using popularity as a boost. If one movie has a million rentals and 
one has a hundred rentals, there is no additive formula that balances that with 
text relevance. Even with log(popularity), it doesn’t work.

With multiplicative boost, we only care about the difference between the one 
rented one million times and the one rented 800 thousand times (think about the 
Twilight movies at Netflix). But it also distinguishes between the one rented 
100 times and the one rented 80 times.
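
A tiny worked example with made-up numbers (text score fixed at 5.0,
popularity scaled into 0.0-1.0 for the additive case, log popularity as the
multiplier) shows the point at both ends of the scale:

    public class BoostComparison {
        public static void main(String[] args) {
            double text = 5.0;            // same text relevance for all four titles
            double maxPop = 1_000_000;
            double[] pops = {1_000_000, 800_000, 100, 80};
            double w = 2.0;               // additive weight, illustrative

            for (double pop : pops) {
                double additive = text + w * (pop / maxPop);      // popularity scaled to 0..1
                double multiplicative = text * Math.log10(pop);   // log popularity as a multiplier
                System.out.printf("pop=%,10.0f  additive=%6.3f  multiplicative=%6.2f%n",
                        pop, additive, multiplicative);
            }
            // additive:       7.000, 6.600, 5.000, 5.000  -> 100 vs 80 is indistinguishable
            // multiplicative: 30.00, 29.52, 10.00,  9.52  -> both pairs stay separated
        }
    }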

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 17, 2016, at 11:29 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
> 
> Hi,
> 
> After reading a bit on various sites, and especially the blog post "Comparing 
> boost methods in Solr", it seems that the preferred boosting type is the 
> multiplicative one, over the additive one. But I can't really get my head 
> around *why* that is so, since in most boosting problems I can think of, it 
> seems that an additive boost would suit better.
> 
> For example, in our project we want to boost documents depending on various 
> factors, but in essence they can be summarized as:
> 
> - Regular edismax logic, like qf=title^2 mainText^1
> - Multiple custom document fields, with weights specified at query time
> 
> So, first of, the custom fields... It became obvious to me quite quickly that 
> multiplicative logic here would totally ruin the purpose of the weights, 
> since something like "(f1 *  w1) * (f2 * w2)" is the same as "(f1 *  w2) * 
> (f2 * w1)". So, I ended up using additive boost here.
> 
> Then we have the combination of the edismax boost, and my custom boost. As 
> far as I understand it, when using the boost field with edismax, this 
> combination is always performed using multiplicative logic. But the same 
> problem exists here as it did with my custom fields. Because if I boost the 
> aggregated result of the custom fields using some weight, it doesn't affect 
> the order of the documents because that weight influences the edismax boost 
> just as much. What I want is to have the weight only influence my custom 
> boost value, so that I can control how much (or little) the final score 
> should be effected by the custom boost.
> 
> So, in both cases I find myself wanting to use the additive boost. But surely 
> I must be missing something, right? Am I thinking backwards or something?
> 
> I don't use any out-of-the-box example indexes, so I can provide you with a 
> working URL that shows exactly what I am doing. But in essence my query looks 
> like this:
> 
> - q=test
> - defType=edismax
> - qf=title^2&qf=mainText1^1
> - 
> totalRanking=div(sum(product(random1,1),product(random2,1.5),product(random3,2),product(random4,2.5),product(random5,3)),5)
> - weightedTotalRanking=product($totalRanking,1.5)
> - bf=$weightedTotalRanking
> - fl=*,score,[explain style=text],$weightedTotalRanking
> 
> random1 to random5 are document fields of type double, with random values 
> between 0.0 and 1.0.
> 
> With this setup, I can change the overall importance of my custom boosting 
> using the factor in weightedTotalRanking (1.5 above). But that is only 
> because bf is additive. If I switch to the boost parameter, I can no longer 
> influence the order of the documents using this factor, no matter how high a 
> value I choose.
> 
> Am I looking at the this the wrong way? Is there a much better approach to 
> achieve what I want?
> 
> Regards
> /Jimi



Re: Why is multiplicative boost prefered over additive?

2016-03-20 Thread Walter Underwood
I used a popularity score based on the DVD being in people’s queues and the 
streaming views. The Peter Jackson films were DVD only. They were in about 100 
subscriber queues. The first Twilight film was in 1.25 million queues.

Now think about the query “twilight zone”. How do you make “Twilight” not be 
the first hit for that?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 18, 2016, at 8:48 AM,  
>  wrote:
> 
> On Friday, March 18, 2016 4:25 PM, wun...@wunderwood.org wrote:
>> 
>> That works fine if you have a query that matches things with a wide range of 
>> popularities. But that is the easy case.
>> 
>> What about the query "twilight", which matches all the Twilight movies, all 
>> of which are popular (millions of views).
> 
> Well, like I said, I focused on our use case. And we deal with articles, not 
> movies. And the raw popularity value is basically just "the number of page 
> views the last N days". We want to boost documents that many people have 
> visited recently, but don't really care about the exact search result 
> position when comparing documents with roughly the same popularity. So if all 
> the matched documents have *roughly* the same popularity, then we basically 
> don't want the popularity to influence the score much at all.
> 
>> Or "Lord of the Rings" which only matches movies with hundreds of views? 
>> People really will notice when 
>> the 1978 animated version shows up before the Peter Jackson films.
> 
> Well, doesn't the Peter Jackson "Lord of the Rings" films have more than just 
> a few hundred views?
> 
> /Jimi



Re: Delete by query using JSON?

2016-03-22 Thread Walter Underwood
“Why do you care?” might not be the best way to say it, but it is essential to 
understand the difference between selection (filtering) and ranking.

As Solr params:

* q is ranking and filtering
* fq is filtering only
* bq is ranking only

When deleting documents, ordering does not matter, which is why we ask why you 
care about the ordering.
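
For the concrete case in the quoted thread below, a minimal SolrJ sketch: since
a delete has no ranking, the clause that lived in fq simply becomes the delete
query (core name and field are taken from the quoted example):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class DeleteByQueryExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/blacklight-core");
            solr.deleteByQuery("doctype:cres");  // everything that matched q=...&fq=doctype:cres
            solr.commit();
            solr.close();
        }
    }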

If the response is familiar to you, imagine how the questions sound to people 
who have been working in search for twenty years. But even when we are snippy, 
we still try to help.

Many, many times, the question is wrong. The most common difficulty on this 
list is an “XY problem”, where the poster has problem X and has assumed 
solution Y, which is not the right solution. But they ask about Y. So we will 
tell people that their approach is wrong, because that is the most helpful 
thing we can do.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 22, 2016, at 4:16 PM, Robert Brown  wrote:
> 
> "why do you care? just do this ..."
> 
> I see this a lot on mailing lists these days, it's usually a learning 
> curve/task/question.  I know I fall into these types of questions/tasks 
> regularly.
> 
> Which usually leads to "don't tell me my approach is wrong, just explain 
> what's going on, and why", or "just answer the straight-forward question I 
> asked in first place.".
> 
> Sorry for rambling, this just sounded familiar...
> 
> :)
> 
> 
> 
> On 22/03/16 22:50, Alexandre Rafalovitch wrote:
>> Why do you care?
>> 
>> The difference between Q and FQ is the scoring. For delete, you
>> delete all of them regardless of scoring and there is no difference.
>> Just chuck them all into Q.
>> 
>> Regards,
>>Alex.
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>> 
>> 
>> On 23 March 2016 at 06:07, Paul Hoffman  wrote:
>>> I've been struggling to find the right syntax for deleting by query
>>> using JSON, where the query includes an fq parameter.
>>> 
>>> I know how to delete *all* documents, but how would I delete only
>>> documents with field doctype = "cres"?  I have tried the following along
>>> with a number of variations, all to no avail:
>>> 
>>> $ curl -s -d @- 'http://localhost:8983/solr/blacklight-core/update?wt=json' <<EOS
>>> {
>>> "delete": { "query": "doctype:cres" }
>>> }
>>> EOS
>>> 
>>> I can identify the documents like this:
>>> 
>>> curl -s 
>>> 'http://localhost:8983/solr/blacklight-core/select?q=&fq=doctype%3Acres&wt=json&fl=id'
>>> 
>>> It seems like such a simple thing, but I haven't found any examples that
>>> use an fq.  Could someone post an example?
>>> 
>>> Thanks in advance,
>>> 
>>> Paul.
>>> 
>>> --
>>> Paul Hoffman 
>>> Systems Librarian
>>> Fenway Libraries Online
>>> c/o Wentworth Institute of Technology
>>> 550 Huntington Ave.
>>> Boston, MA 02115
>>> (617) 442-2384 (FLO main number)
> 



Re: Can Solr recognize daylight savings time?

2016-03-25 Thread Walter Underwood
If possible, log in UTC. Daylight time causes amusing problems in logs, like 
one day with 23 hours and one day with 25.

You can always convert to local time when you display it.
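
A small java.time sketch of that split (display zone is illustrative): log the
UTC instant, convert only when rendering for a person:

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;

    public class UtcLogging {
        public static void main(String[] args) {
            Instant now = Instant.now();
            // What goes into the log: an unambiguous UTC timestamp ("Z" suffix, ISO 8601).
            System.out.println("logged:    " + now);

            // What a human sees: the same instant rendered in a local, DST-aware zone.
            DateTimeFormatter local = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss zzz")
                    .withZone(ZoneId.of("America/Los_Angeles"));
            System.out.println("displayed: " + local.format(now));
        }
    }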

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 25, 2016, at 8:49 AM, Shawn Heisey  wrote:
> 
> On 3/25/2016 9:32 AM, Shawn Heisey wrote:
>> Apparently we *can* use java system properties in the log4j config, so
>> saying there's no generalized solution available was premature.
> 
> Second followup:
> 
> The info I looked at about using sysprops had no version number for
> log4j, and was talking about the XML config, which is the only way to
> configure log4j2.
> 
> So I'm not even sure that we can use system properties with our current
> version.  This is something I will need to look into when I have more time.
> 
> Thanks,
> Shawn
> 



Re: Solr slave is doing full replication (entire index) of index after master restart

2016-04-09 Thread Walter Underwood
I’m not sure this is a legal polling interval:

00:00:60

Try:

00:01:00

Also, polling every minute is very fast. Try a longer period.

Check the clocks on the two systems. If the clocks are not synchronized, that 
could cause problems.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 9, 2016, at 8:10 AM, Lior Sapir  wrote:
> 
> Anyone can tell me what was I doing wrong ? 
> Is that the expected behavior (the slave replicates the entire index if, on the
> previous replication attempt, the master was not available)?
> 
> 
>  
> 
> On Thu, Apr 7, 2016 at 9:12 PM, Lior Sapir  <mailto:lior.sa...@gmail.com>> wrote:
> Thanks for the reply.
> 
> I easily re produced it in my "sandbox" env.  Steps to re produce
> 1. Setup a master
> 2. Setup a slave in a different server
> 3. The slave replicated the master index
> 4. From now on not even a single document is added. No optimization or what 
> so ever is done on the master or slave
> 5. I stop the master
> 6. I start the master
> 7. I see the slave is replicating/copying the entire index
> 
> This is exactly what happened  in production when I restarted the master.
> 
> I attached the configurations files.
> 
> Replication section:
> 
> Master:
> 
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="replicateAfter">commit</str>
>   </lst>
> </requestHandler>
> 
> Slave:
> 
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="slave">
>     <str name="masterUrl">http://solr01-isrl01.flr.local:8983/solr/replication-master/replication</str>
>     <str name="pollInterval">00:00:60</str>
>   </lst>
> </requestHandler>
> 
> Best,
> Lior
> 
> On Thu, Apr 7, 2016 at 6:56 PM, Erick Erickson  <mailto:erickerick...@gmail.com>> wrote:
> What does your configuration file look like for the replication
> handler? Does this happen whenever you restart a slave even if
> _nothing_ has changed on the master?
> 
> And this will certainly happen if you're optimizing the master before
> you restart, although that doesn't sound likely.
> 
> Best,
> Erick
> 
> On Thu, Apr 7, 2016 at 6:54 AM, Lior Sapir  <mailto:lior.sa...@gmail.com>> wrote:
> > Solr slave is doing full replication (entire index) of index after master
> > restart
> > Using solr 5.3.1 not cloud (using maser slave architecture ) I see that
> > slave replicates entire index after master restart even though the index
> > version is the same
> >
> > This is bad for me since the slave, which does the serving, replicates 80gb
> > if I restart the server and our service is down
> >
> > I attached a file with some snippets of the slave log  before and after the
> > master restart.
> >
> > Is there some default configuration issue causing this problem?
> > Both indexes master and slave were not updated for sure before and after the
> > master restart.
> > The index version stayed exactly the same.
> >
> >
> >
> 
> 



Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-14 Thread Walter Underwood
Solr 3.6 is a VERY old release. You won’t see any fixes for that.

I would recommend starting with Solr 5.5 and keeping an eye on Solr 6.x, which 
has just started releases.

Removing -ing endings is pretty aggressive. That changes “tracking meeting” 
into “track meet”. Most of the time, you’ll be better off with an inflectional 
stemmer that just converts plurals to singulars and other similar changes.

The Porter stemmer does not produce dictionary words. It produces “stems”. 
Those are the same for the singular and plural forms of a word, but the stem 
might not be a word.

1. Start using Solr 5.5. That automatically gets you four years of bug fixes 
and performance improvements.
2. Look at the options for language analysis in the current release of Solr: 
https://cwiki.apache.org/confluence/display/solr/Language+Analysis 
3. Learn the analysis tool in the Solr admin UI. That allows you to explore the 
behavior.
4. If you really need a high grade morphological analyzer, consider purchasing 
one from Basis Technology: http://www.rosette.com/solr/ 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 14, 2016, at 10:17 AM, Sara Woodmansee  wrote:
> 
> Hello all,
> 
> I posted yesterday, however I never received my own post, so worried it did 
> not go through (?) Also, I am not a coder, so apologies if not appropriate to 
> post here. I honestly don't know where else to turn, and am determined to 
> find a solution, as search is essential to our site.
> 
> We are having a website built with a search engine based on SOLR v3.6. For 
> stemming, the developer uses EnglishMinimalStemFilterFactory. They were 
> previously using PorterStemFilterFactory which worked better with plural 
> forms, however PorterStemFilterFactory was not working correctly with –ing 
> endings. “icing” becoming "ic", for example.
> 
> Most search terms work fine, but we have inconsistent results (singular vs 
> plural) with terms that end in -ee, -oe, -ie, -ae,  and words that end in -s. 
>  In comparison, the following work fine: words that end with -oo, -ue, -e, -a.
> 
> The developers have been unable to find a solution ("Unfortunately we tried 
> to apply all the filters for stemming but this problem is not resolved"), but 
> this has to be a common issue (?) Someone surely has found a solution to this 
> problem?? 
> 
> Any suggestions greatly appreciated.
> 
> Many thanks!
> Sara 
> _
> 
> DO NOT WORK:  Plural terms that end in -ee, -oe, -ie, -ae,  and words that 
> end in -s.  
> 
> Examples: 
> 
> tree = 0 results
> trees = 21 results
> 
> dungaree = 0 results
> dungarees = 1 result
> 
> shoe = 0 results
> shoes = 1 result
> 
> toe = 1 result
> toes = 0 results
> 
> tie = 1 result
> ties = 0 results
> 
> Cree = 0 results
> Crees = 1 result
> 
> dais = 1 result
> daises = 0 results
> 
> bias = 1 result
> biases = 0 results
> 
> dress = 1 result
> dresses = 0 results
> _
> 
> WORKS:  Words that end with -oo, -ue, -e, -a
> 
> Examples: 
> 
> tide = 1 result
> tides = 1 results
> 
> hue = 2 results
> hues = 2 results
> 
> dakota = 1 result
> dakotas = 1 result
> 
> loo = 1 result
> loos = 1 result
> _
> 



Re: Referencing incoming search terms in searchHandler XML

2016-04-14 Thread Walter Underwood
> On Apr 14, 2016, at 12:18 PM, John Bickerstaff  
> wrote:
> 
> If a user types in "foobarbaz figo" I want all documents with "figo" in the
> contentType field boosted above every other document in the results.


This is a very common requirement that seems like a good idea, but has very bad 
corner cases. I always take this back to the customer and convert it to 
something that works for all queries.

Think about this query:

   vitamin a figo

Now, every document with the word “a” is ranked in front of documents with 
“vitamin a”. That is probably not what the customer wanted.

Instead, have a requirement that when two documents are equal matches for the 
query, the “figo” document is first.

Or, create an SRP with two sections, five figo matches with a “More …” link, 
then five general matches. But you might want to avoid dupes between the two.

If your customer absolutely insists on having every single figo doc above 
non-figo docs, well, they deserve what they get.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-15 Thread Walter Underwood
I looked at the PHP clients a couple of years ago and they didn’t seem to add 
much.

I wrote PHP code to make GET requests to Solr and parse the JSON response. It 
wasn’t much more code than doing it with a client library.

The client libraries don’t really do much for you. They can’t even keep 
connections open or pool them, because PHP doesn’t do that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 15, 2016, at 8:39 AM, Sara Woodmansee  wrote:
> 
> Hi Shawn,
> 
> No clue what PHP client they are using.
> 
> Thanks for the info!
> 
> Sara
> 
>> On Apr 15, 2016, at 10:35 AM, Shawn Heisey  wrote:
>> 
>> On 4/15/2016 8:15 AM, Sara Woodmansee wrote:
>>> When I suggested the developer consider upgrading to v5.5 or 6.0 (from 
>>> v3.6), this was their response.  It’s clear that upgrading is not going to 
>>> happen any time soon.
>>> 
>>> Developer response:  "But to use SOLR 5, there is a need to find a stable 
>>> and reliable php client. And until very recent time there were no release. 
>>> In other case we would have to write PHP client itself.  Then we would have 
>>> to rewrite integration API with a software, because API very likely has 
>>> changed. And then make changes to every single piece of code in backend and 
>>> frontend of our system that is tied up with search functionality in any 
>>> way. “
>>> 
>>> — I would still like to know (from you folks) if the “stable PHP client” 
>>> issue still holds true?  Perhaps that is not an easy question.
>> 
>> There should be PHP clients with Solr4 support.  Those should work well
>> with 5.x.  I don't know enough about 6 to comment on how compatible it
>> would be.
>> 
>> All PHP clients are third-party -- the project didn't write any of
>> them.  Which PHP client are you using now?
>> 
>> Thanks,
>> Shawn
>> 
> 



Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-17 Thread Walter Underwood
No, Zookeeper is used for managing the locations of replicas and the leader for 
indexing. Queries should still be distributed with a load balancer.

Queries do NOT go through Zookeeper.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 17, 2016, at 9:35 PM, John Bickerstaff  
> wrote:
> 
> My prior use of SOLR in production was pre SOLR cloud.  We put a
> round-robin  load balancer in front of replicas for searching.
> 
> Do I understand correctly that a load balancer is unnecessary with SOLR
> Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless of
> which replica's URL is getting hit?
> 
> Are there any caveats?
> 
> Thanks,



Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Walter Underwood
Sure, here are some real world examples from my time at Netflix.

Is this movie twice as much about “new york”?

* New York, New York

Which one of these is the best match for “blade runner”:

* Blade Runner: The Final Cut
* Blade Runner: Theatrical & Director’s Cut
* Blade Runner: Workprint

http://dvd.netflix.com/Search?v1=blade+runner 

At Netflix (when I was there), those were shown in popularity order with a 
boost function.

And for stemming, should the movie “Saw” match “see”? Maybe not.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 20, 2016, at 5:28 PM, Jack Krupansky  wrote:
> 
> Maybe it's a cultural difference, but I can't imagine why on a query for
> "John", any of those titles would be treated as anything other than equals
> - namely, that they are all about John. Maybe the issue is that this seems
> like a contrived example, and I'm asking for a realistic example. Or, maybe
> you have some rule of relevance that you haven't yet shared - and I mean
> rule that a user would comprehend and consider valuable, not simply a
> mechanical rule.
> 
> 
> 
> -- Jack Krupansky
> 
> On Wed, Apr 20, 2016 at 8:10 PM, 
> wrote:
> 
>> Ok sure, I can try and give some examples :)
>> 
>> Lets say that we have the following documents:
>> 
>> Id: 1
>> Title: John Doe
>> 
>> Id: 2
>> Title: John Doe Jr.
>> 
>> Id: 3
>> Title: John Lennon: The Life
>> 
>> Id: 4
>> Title: John Thompson's Modern Course for the Piano: First Grade Book
>> 
>> Id: 5
>> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
>> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
>> Mrs. Surratt
>> 
>> 
>> And in general, when a search word matches the title, I would like to have
>> the length of the title field influence the score, so that matching
>> documents with shorter title get a higher score than documents with longer
>> title, all else considered equal.
>> 
>> So, when a user searches for "John", I would like the results to be pretty
>> much in the order presented above. Though, it is not crucial that for
>> example document 1 comes before document 2. But I would surely want
>> document 1-3 to come before document 4 and 5.
>> 
>> In my mind, the fieldNorm is a perfect solution for this. At least in
>> theory. In practice, the encoding of the fieldNorm seems to make this
>> function much less useful for this use case. Unless I have missed something.
>> 
>> Is there another way to achive something like this? Note that I don't want
>> a general boost on documents with short titles, I only want to boost them
>> if the title field actually matched the query.
>> 
>> /Jimi
>> 
>> 
>> From: Jack Krupansky 
>> Sent: Thursday, April 21, 2016 1:28 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Is it possible to configure a minimum field length for the
>> fieldNorm value?
>> 
>> I'm not sure I fully follow what distinction you're trying to focus on. I
>> mean, traditionally length normalization has simply tried to distinguish a
>> title field (rarely more than a dozen words) from a full body of text, or
>> maybe an abstract, not things like exactly how many words were in a title.
>> Or, as another example, a short newswire article of a few paragraphs vs. a
>> feature-length article, paper, or even book. IOW, traditionally it was more
>> of a boolean than a broad range of values. Sure, yes, you absolutely can
>> define a custom similarity with a custom norm that supports a wide range of
>> lengths, but you'll have to decide what you really want  to achieve to tune
>> it.
>> 
>> Maybe you could give a couple examples of field values that you feel should
>> be scored differently based on length.
>> 
>> -- Jack Krupansky
>> 
>> On Wed, Apr 20, 2016 at 7:17 PM, 
>> wrote:
>> 
>>> I am talking about the title field. And for the title field, a sweetspot
>>> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
>>> value that differentiates between for example 2, 3, 4 and 5 terms in the
>>> title, but only very little.
>>> 
>>> The 20% number I got by simply calculating the difference in the title
>>> fieldNorm of two documents, where one title was one word longer than the
>>> other title. And one fieldNorm value was 20% larger then the oth

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Walter Underwood
32 GB is a pretty big heap. If the working set is really smaller than that, the 
extra heap just makes a full GC take longer.

How much heap is used after a full GC? Take the largest value you see there, 
then add a bit more, maybe 25% more or 2 GB more.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2016, at 8:50 AM, Nick Vasilyev  wrote:
> 
> mmfr_exact is a string field. key_phrases is a multivalued string field.
> 
> On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley  wrote:
> 
>> What about the field types though... are they single valued or multi
>> valued, string, text, numeric?
>> 
>> -Yonik
>> 
>> 
>> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
>>  wrote:
>>> Hi Yonik,
>>> 
>>> I forgot to mention that the index is approximately 50 million docs split
>>> across 4 shards (replication factor 2) on 2 solr replicas.
>>> 
>>> This particular script will filter items based on a category
>> (10-~1,000,000
>>> items in each) and run facets on top X terms for particular fields. Query
>>> looks like this:
>>> 
>>> {
>>>   q => "cat:$code",
>>>   rows => 0,
>>>   facet => 'true',
>>>   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
>>>   'f.key_phrases.facet.limit' => 100,
>>>   'f.mmfr_exact.facet.limit' => 20,
>>>   'facet.mincount' => 5,
>>>   distrib => 'false',
>>> }
>>> 
>>> I know it can be re-worked some, especially considering there are
>> thousands
>>> of similar requests going out. However we didn't have this issue before
>> and
>>> I am worried that it may be a symptom of a larger underlying problem.
>>> 
>>> On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley 
>> wrote:
>>> 
>>>> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
>>>>  wrote:
>>>>> Hello,
>>>>> 
>>>>> We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
>> long
>>>> GC
>>>>> pauses when running jobs that do some hairy faceting. The same jobs
>>>> worked
>>>>> fine with our previous 4.6 Solr.
>>>> 
>>>> What does a typical request look like, and what are the field types
>>>> that faceting is done on?
>>>> 
>>>> -Yonik
>>>> 
>>>> 
>>>>> The JVM is configured with 32GB heap with default GC settings, however
>>>> I've
>>>>> been tweaking the GC settings to no avail. The latest version had the
>>>>> following differences from the default config:
>>>>> 
>>>>> XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
>>>>> 
>>>>> XX:CMSInitiatingOccupancyFraction increased from 50 to 70
>>>>> 
>>>>> 
>>>>> Here is a sample output from the gc_log
>>>>> 
>>>>> 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which
>> application
>>>>> threads were stopped: 0.1667520 seconds, Stopping threads took:
>> 0.0171900
>>>>> seconds
>>>>> {Heap before GC invocations=2051 (full 59):
>>>>> par new generation   total 6990528K, used 2626705K
>> [0x2b16c000,
>>>>> 0x2b18c000, 0x2b18c000)
>>>>>  eden space 5592448K,  44% used [0x2b16c000,
>> 0x2b17571b9948,
>>>>> 0x2b181556)
>>>>>  from space 1398080K,  10% used [0x2b181556,
>> 0x2b181e8cac28,
>>>>> 0x2b186aab)
>>>>>  to   space 1398080K,   0% used [0x2b186aab,
>> 0x2b186aab,
>>>>> 0x2b18c000)
>>>>> concurrent mark-sweep generation total 25165824K, used 25122205K
>>>>> [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
>>>>> Metaspace   used 41840K, capacity 42284K, committed 42680K,
>> reserved
>>>>> 43008K
>>>>> 2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
>>>>> 2016-04-28T04:36:49.828-0400: 27908.124:
>>>> [CMS2016-04-28T04:36:49.912-0400:
>>>>> 27908.207: [CMS-concurr
>>>>> ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
>>>>> real=5.86 secs]
>>>>> (concurrent mod

Absolute path name for external file field

2015-08-13 Thread Walter Underwood
Is there a way to specify a different file location for the external file field 
file? I know that the data directory makes the most sense, but for deployment, 
it is going to be MUCH easier for us to put it in the config directory.

The original Jira mentioned an absolute path for the file. Is that still 
possible?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Admin Login

2015-08-15 Thread Walter Underwood
No one runs a public-facing Solr server. Just like no one runs a public-facing 
MySQL server.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 15, 2015, at 4:15 PM, Scott Derrick  wrote:

> I'm somewhat puzzled there is no built-in security. I can't imagine anybody is 
> running a public facing solr server with the admin page wide open?
> 
> I've searched and haven't found any solutions that work out of the box.
> 
> I've tried the solutions here to no avail. 
> https://wiki.apache.org/solr/SolrSecurity
> 
> and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms
> 
> The Solr security docs say to use the application server and if I could run 
> it on my tomcat server I would already be done.  But I'm told I can't do that?
> 
> What solutions are people using?
> 
> Scott
> 
> -- 
> Leave no stone unturned.
> Euripides



Re: Cache

2015-08-19 Thread Walter Underwood
Why? Do you evaluate Unix performance with and without file buffers?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 19, 2015, at 5:00 PM, Nagasharath  wrote:

> Trying to evaluate the performance of queries with and without cache
> 
> 
> 
>> On 18-Aug-2015, at 11:30 am, Yonik Seeley  wrote:
>> 
>> On Tue, Aug 18, 2015 at 12:23 PM, naga sharathrayapati
>>  wrote:
>>> Is it possible to clear the cache through query?
>>> 
>>> I need this for performance valuation.
>> 
>> No, but you can prevent a query from being cached:
>> q={!cache=false}my query
>> 
>> What are you trying to test the performance of exactly?
>> If you think queries will be highly unique, the best way of testing is
>> to make your test queries highly unique (for example, adding a random
>> number in the mix) so that the hit rate on the query cache won't be
>> unrealistically high.
>> 
>> -Yonik



Re: Multiple concurrent queries to Solr

2015-08-23 Thread Walter Underwood
The last time that I used the HTTPClient library, it was non-blocking. It 
doesn’t try to read from the socket until you ask for data from the response 
object. That allows parallel requests without threads.

Underneath, it has a pool of connections that can be reused. If the pool is 
exhausted, it can block.
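
A minimal sketch of the threaded approach (SolrJ 5.x-style client; URL and
queries are illustrative). SolrJ clients are thread-safe, as the quoted reply
below notes, so one shared client with its pooled connections can serve a small
executor issuing queries in parallel:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ParallelQueries {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<String> queries = Arrays.asList("title:solr", "title:lucene", "title:tika");

            for (String q : queries) {
                pool.submit(() -> {
                    try {
                        long hits = solr.query(new SolrQuery(q)).getResults().getNumFound();
                        System.out.println(q + " -> " + hits + " hits");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            // close the client once all work is done (omitted here for brevity)
        }
    }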

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 23, 2015, at 8:49 AM, Shawn Heisey  wrote:

> On 8/23/2015 7:46 AM, Ashish Mukherjee wrote:
>> I want to run few Solr queries in parallel, which are being done in a
>> multi-threaded model now. I was wondering if there are any client libraries
>> to query Solr  through a non-blocking I/O mechanism instead of a threaded
>> model. Has anyone attempted something like this?
> 
> The only client library that the Solr project makes is SolrJ -- the
> client for Java.  If you are not using the SolrJ client, then the Solr
> project did not write it, and you should contact the authors of the
> library directly.
> 
> SolrJ and Solr are both completely thread-safe, and multiple threads are
> recommended for highly concurrent usage.  SolrJ uses HttpClient for
> communication with Solr.
> 
> I was not able to determine whether the default httpclient settings will
> result in non-blocking I/O or not. As far as I am aware, nothing in
> SolrJ sets any explicit configuration for blocking or non-blocking I/O.
> You can create your own HttpClient object in a SolrJ program and have
> the SolrClient object use it.
> 
> HttpClient uses HttpCore.  Here is the main web page for these components:
> 
> https://hc.apache.org/
> 
> On this webpage, it says "HttpCore supports two I/O models: blocking I/O
> model based on the classic Java I/O and non-blocking, event driven I/O
> model based on Java NIO."  There is no information here about which
> model is chosen by default.
> 
> Thanks,
> Shawn
> 



Re: any easy way to find out when a core's index physical file has been last updated?

2015-09-03 Thread Walter Underwood
Instead of writing new code, you could configure an autocommit interval in 
Solr. That already does what you want, no more than one commit in the interval 
and no commits if there were no adds or deletes.

Then the clients would never need to commit.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 3, 2015, at 3:20 PM, Renee Sun  wrote:

> this make sense now. Thanks!
> 
> why I got on this idea is:
> 
> In our system we have a large customer base and lots of cores, and each customer
> may have multiple cores.
> 
> there are also a lot of processes running in our system processing the data
> for these customers, and once in a while they ask a central webapp that we
> wrote to commit on a core.
> 
> In this central webapp, which I deploy with solr in the same tomcat container,
> the task is mainly a wrapper around the local cores to manage monitoring of
> the core size, merge cores if needed, etc. I also have controls over the
> commit requests this webapp receives from time to time, trying to space the
> commits out. In the case where multiple processes ask for commits to the same
> core, my webapp guarantees that only one commit gets executed in any x-minute
> interval and drops the other commit requests.
> 
> Now I just discovered some of the processes send in large amount of commit
> requests on many cores which never had any changes in the last interval.
> This was due to a bug in those other processes but the programmers there are
> behind on fixing the issue. This triggered the idea of verifying the
> incoming commit requests by checking the physical index files to see if any
> updates really occurred in the last interval.
> 
> I was searching for any solr core admin RESTful api to get some meta data
> about the core such as 'last modified timestamp' ... but did not have any
> luck. 
> 
> I thought I could use 'index' folder timestamp to get accurate last modified
> time, but with what you just explained, it would not be the case. I will
> have to traverse through the files in the folder and figure out the last
> modified file.
> 
> any input will be appreciated. Thanks a lot!
> Renee
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/any-easy-way-to-find-out-when-a-core-s-index-physical-file-has-been-last-updated-tp4227044p4227084.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Strange interpretation of invalid ISO date strings

2015-09-07 Thread Walter Underwood
Yes, ISO 8601 gets pretty baroque in the far nooks and crannies of the spec.

I use the “web profile” of ISO 8601, which is very simple. I’ve never seen any 
software mishandle dates using this subset of the spec.

http://www.w3.org/TR/NOTE-datetime

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
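
A minimal Java sketch of producing timestamps in that subset, which is also the shape Solr itself expects (e.g. 2015-09-07T00:00:00Z):

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class WebProfileDate {
        /** Format a date as yyyy-MM-dd'T'HH:mm:ss'Z' in UTC. */
        public static String format(Date d) {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return fmt.format(d);
        }

        public static void main(String[] args) {
            System.out.println(format(new Date()));
        }
    }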


On Sep 6, 2015, at 10:28 PM, Paul Libbrecht  wrote:

> Just a word of warning: ISO 8601, the date format standard, is quite big, to
> say the least, and I thus expect very few implementations to be complete.
> 
> I survived one such interoperability issue with Safari on iOS 6. While they
> (and JS, I think) claim ISO 8601, the implementation was not complete, and
> fine-grained hunting led us to that discovery. Opening an issue at Apple was
> done, but changing our side was much faster. Overall, this cost us several
> months of development...
> 
> I wish there were a tinier standard.
> 
> Paul
> 
> 
> -- fat fingered on my z10 --
>   Original message  
> From: Shawn Heisey
> Sent: Monday, 7 September 2015 02:05
> To: solr-user@lucene.apache.org
> Reply-To: solr-user@lucene.apache.org
> Subject: Strange interpretation of invalid ISO date strings
> 
> Here's some debug info from a query our code was generating:
> 
> "querystring": "post_date:[2015-09-0124T00:00:00Z TO
> 2015-09-0224T00:00:00Z]",
> "parsedquery": "post_date:[145169280 TO 146033280]",
> 
> The "24" is from part of our code that interprets the hour, it was being
> incorrectly added. We have since fixed the problem, but are somewhat
> confused that we did not get an error.
> 
> When I decode the millisecond timestamps in the parsed query, I get
> these dates:
> 
> Sat, 02 Jan 2016 00:00:00 GMT
> Mon, 11 Apr 2016 00:00:00 GMT
> 
> Should this be considered a bug? I would have expected Solr to throw an
> exception related to an invalidly formatted date, not assume that we
> meant the 124th and 224th day of the month and calculate it
> accordingly. Would I be right in thinking that this problem is not
> actually in Solr code, that we are using code from either Java itself or
> a third party for ISO date parsing?
> 
> The index where this problem was noticed is Solr 4.9.1 running with
> Oracle JDK8u45 on Linux. I confirmed that the same thing happens if I
> use Solr 5.2.1 running with Oracle JDK 8u60 on Windows.
> 
> Thanks,
> Shawn
> 



Re: Solr facets implementation question

2015-09-08 Thread Walter Underwood
Every faceting implementation I’ve seen (not just Solr/Lucene) makes big 
in-memory lists. Lots of values means a bigger list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Sep 8, 2015, at 8:33 AM, Shawn Heisey  wrote:

> On 9/8/2015 9:10 AM, adfel70 wrote:
>> I am trying to understand why faceting on a field with lots of unique values
>> has a great impact on query performance. Since Googling for the Solr facet
>> algorithm did not yield anything, I looked at how facets are implemented in
>> Lucene. I found out that there are 2 methods - taxonomy-based and
>> SortedSetDocValues-based. Are Solr's facet capabilities based on one of
>> those methods? If so, I still can't understand why unique values impact
>> query performance...
> 
> Lucene's facet implementation is completely separate (and different)
> from Solr's implementation.  I am not familiar with the inner workings
> of either implementation.  Solr implemented faceting long before Lucene
> did.  I think *Solr* actually contains at least two different facet
> implementations, used for different kinds of facets.
> 
> Faceting on a field with many unique values uses a HUGE amount of heap
> memory, which is likely why query performance is impacted.
> 
> I have a dev system with all my indexes (each of which has dedicated
> hardware for production) on it.  Normally it requires 15GB of heap to
> operate properly.  Every now and then, I get asked to do a duplicate
> check on a field that *should* be unique, on an index with 250 million
> docs in it.  The query that I am asked to do for the facet matches about
> 100 million docs.  This facet query, on a field that DOES have
> docValues, will throw OOM if my heap is less than 27GB.  The dev machine
> only has 32GB of RAM, so as you might imagine, performance is really
> terrible when I do this query.  Thankfully it's a dev machine.  When I
> was doing these queries, it was running 4.9.1.  I have since upgraded it
> to 5.2.1, as a proof of concept for upgrading our production indexes ...
> but I have not attempted the facet query since the upgrade.
> 
> Thanks,
> Shawn
> 



Re: Detect term occurrences

2015-09-10 Thread Walter Underwood
Doing a query for each term should work well. Solr is fast for queries. Write a 
script.

I assume you only need to do this once. Running all the queries will probably 
take less time than figuring out a different approach.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
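
A minimal SolrJ sketch of such a script; the core URL, the "text" field, and the example terms are illustrative assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ThesaurusTermCounts {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs")) {
                String[] thesaurusTerms = {"asthma", "bronchitis", "emphysema"}; // stand-ins
                for (String term : thesaurusTerms) {
                    SolrQuery q = new SolrQuery("text:\"" + term + "\"");
                    q.setRows(0); // only the match count is needed
                    QueryResponse rsp = solr.query(q);
                    System.out.println(term + ": " + rsp.getResults().getNumFound() + " documents");
                }
            }
        }
    }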


On Sep 10, 2015, at 7:37 AM, Markus Jelsma  wrote:

> If you are interested in just the number of occurrences of an indexed term, 
> the TermsComponent will give that answer.
> Markus 
> 
> -Original message-
>> From:Francisco Andrés Fernández 
>> Sent: Thursday 10th September 2015 15:58
>> To: solr-user@lucene.apache.org
>> Subject: Detect term occurrences
>> 
>> Hi all, I'm new to Solr.
>> I want to detect all occurrences of terms from a thesaurus in 1 or
>> more documents.
>> What's the best strategy for doing this?
>> Doing a query for each term doesn't seem to be the best way.
>> Many thanks,
>> 
>> Francisco
>> 



Re: Ideas

2015-09-21 Thread Walter Underwood
I have put a limit in the front end at a couple of sites. Nobody gets more than 
50 pages of results. Show page 50 if they request beyond that.

First got hit by this at Netflix, years ago.

Solr 4 is much better about deep paging, but here at Chegg we got deep paging 
plus a stupid, long query. That was using too much CPU.

Right now, block the IPs. Those are hostile.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
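
A minimal sketch of that front-end cap; the page size and the 50-page limit are illustrative:

    public final class PagingLimit {
        private static final int PAGE_SIZE = 20;  // rows per page
        private static final int MAX_PAGES = 50;  // hard cap enforced before the query reaches Solr

        /** Clamp the requested page and convert it to a Solr start offset. */
        public static int startForPage(int requestedPage) {
            int page = Math.max(1, Math.min(requestedPage, MAX_PAGES));
            return (page - 1) * PAGE_SIZE;
        }
    }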


> On Sep 21, 2015, at 10:31 AM, Paul Libbrecht  wrote:
> 
> Writing a query component would be pretty easy, no?
> It would throw an exception if crazy numbers are requested...
> 
> I can provide a simple example of a maven project for a query component.
> 
> Paul
> 
> 
> William Bell wrote:
>> We have some Denial of service attacks on our web site. SOLR threads are
>> going crazy.
>> 
>> Basically someone is hitting start=15 + and rows=20. The start is crazy
>> large.
>> 
>> And then they jump around. start=15 then start=213030 etc.
>> 
>> Any ideas for how to stop this besides blocking these IPs?
>> 
>> Sometimes it is Google doing it even though these search results are set
>> with No-index and No-Follow on these pages.
>> 
>> Thoughts? Ideas?
> 



Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-22 Thread Walter Underwood
Faceting on an author field is almost always a bad idea. Or at least a slow, 
expensive idea.

Faceting makes big in-memory lists. More values, bigger lists. An author field 
usually has many, many values, so you will need a lot of memory.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 21, 2015, at 6:42 AM, Uwe Reh  wrote:
> 
> Am 21.09.2015 um 15:16 schrieb Shalin Shekhar Mangar:
>> Can you post your complete facet request as well as the schema
>> definition of the field on which you are faceting?
>> 
> 
> Query:
>> http://yxz/solr/hebis/select/?q=darwin&facet=true&facet.mincount=1&facet.limit=30&facet.field=material_access&facet.field=department_3&facet.field=rvk_facet&facet.field=author_facet&facet.field=material_brief&facet.field=language&facet.prefix=&facet.sort=count&echoParams=all&debugQuery=true
> 
> 
> 
> Schema (with docValue):
>> ...
>> > required="false" multiValued="true" docValues="true" />
>> > required="false" multiValued="true" docValues="true" />
>> ...
>> 
>> ...
> 
> 
> 
> Schema (w/o docValue):
>> ...
>> > required="false" multiValued="true" docValues="true" />
>> > required="false" multiValued="true" />
>> ...
>> 
>> ...
> 
> 
> 
> solrconfig:
>> ...
>> > showItems="48" />
>> ...
>> 
>>  
>> 10
>> allfields
>> none
>>  
>>  
>> query
>> facet
>> stats
>> debug
>> elevator
>>  
>>   
> 
> 



Re: is there a way to remove deleted documents from index without optimize

2015-09-22 Thread Walter Underwood
Don’t do anything. Solr will automatically clean up the deleted documents for 
you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 22, 2015, at 6:01 PM, CrazyDiamond  wrote:
> 
> My index is updated frequently and I need to remove unused documents from the
> index after update/reindex.
> Optimization is very expensive, so what should I do?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/is-there-a-way-to-remove-deleted-documents-from-index-without-optimize-tp4230691.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr get score of each doc in edis max search and more like this search result

2015-09-23 Thread Walter Underwood
You can request the “score” field in the “fl” parameter.

Why do you want to cut off at a particular score value?

Solr scores don’t work like that. They are not absolute relevance scores, they 
change with each query. There is no such thing as a 100% match or a 50% match.

Setting a lower score limit will almost certainly not do what you want. Because 
it doesn’t do anything useful.

I recommend reading this document for more info:

https://wiki.apache.org/lucene-java/ScoresAsPercentages

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
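
A minimal SolrJ sketch of requesting the score pseudo-field; the other field names are illustrative:

    import org.apache.solr.client.solrj.SolrQuery;

    public class ScoreInFieldList {
        public static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "edismax");
            q.setFields("id", "title", "score"); // "score" is returned per document alongside stored fields
            return q;
        }
    }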


> On Sep 23, 2015, at 6:53 AM, sara hajili  wrote:
> 
> hi all,
> I want to get each doc's score in the search result, and restrict the results
> to docs whose score is above a minimum that I specify (i.e. I set a minimum
> score for the search and only get docs scoring above it).
> I need this for a normal search with edismax, and for More Like This, in pysolr.
> I understand that I can set debug = true
> and from the search result I get
> 
> print(search_result.debug['explain'])
> 
> but this explains much more and I couldn't get each doc's score.
> Any help?!
> thanks



Re: firstSearcher cache warming with own QuerySenderListener

2015-09-25 Thread Walter Underwood
Right.

I chose the twenty most frequent terms from our documents and use those for 
cache warming. The list of most frequent terms is pretty stable in most 
collections.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 8:38 AM, Erick Erickson  wrote:
> 
> That's what the firstSearcher event in solrconfig.xml is for, exactly the
> case of autowarming Solr when it's just been started. The queries you put
> in that event are fired only when the server starts.
> 
> So I'd just put my queries there. And you do not have to put a zillion
> queries here. Start with one that mentions all the facets you intend to
> use, sorts by all the various sort fields you use, perhaps (if you have any
> _very_ common filter queries) put those in too.
> 
> Then analyze the queries that are still slow when issued the first time
> after startup and add what you suspect are the relevant bits to the
> firstSearcher query (or queries).
> 
> I suggest that this is a much easier thing to do, and that you focus efforts on
> why you are shutting down your Solr servers often enough that anyone notices.
> 
> Best,
> Erick
> 
> 
> 
> On Fri, Sep 25, 2015 at 8:31 AM, Christian Reuschling <
> christian.reuschl...@gmail.com> wrote:
> 
>> Hey all,
>> 
>> we want to avoid cold start performance issues when the caches are cleared
>> after a server restart.
>> 
>> For this, we have written a SearchComponent that saves least recently used
>> queries. These are
>> written to a file inside a closeHook of a SolrCoreAware at server shutdown.
>> 
>> The plan is to perform these queries at server startup to warm up the
>> caches. For this, we have
>> written a derivative of the QuerySenderListener and configured it as a
>> firstSearcher listener in
>> solrconfig.xml. The only difference from the original QuerySenderListener is
>> that it gets its queries
>> from the previously dumped LRU queries rather than getting them from the
>> config file.
>> 
>> It seems that everything is called correctly, and we have the impression
>> that the query response
>> times for the dumped queries are sometimes slightly better than without
>> this warming.
>> 
>> Nevertheless, there is still a huge difference against the times when we
>> manually perform the same
>> queries once, e.g. from a browser. If we do this, the second time we
>> perform these queries they
>> respond much faster (up to 10 times) than the response times after the
>> implemented warming.
>> 
>> It seems that not all caches are warmed up during our warming. And because
>> of these huge
>> differences, I doubt we missed something.
>> 
>> The index has about 25M documents, and is split into two shards in a
>> cloud configuration, both
>> shards are on the same server instance for now, for testing purposes.
>> 
>> Does anybody have an idea? I tried to disable lazy field loading as a
>> potential issue, but with no
>> success.
>> 
>> 
>> Cheers,
>> 
>> Christian
>> 
>> 



Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
Sure.

1. Delete all the docs (no commit).
2. Add all the docs (no commit).
3. Commit.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
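
A minimal SolrJ sketch of those three steps, assuming a batch size of 1000 for illustration:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FullReindex {
        public static void reindex(SolrClient solr, List<SolrInputDocument> allDocs) throws Exception {
            solr.deleteByQuery("*:*");                 // 1. delete all the docs (no commit)
            for (int i = 0; i < allDocs.size(); i += 1000) {
                int end = Math.min(i + 1000, allDocs.size());
                solr.add(allDocs.subList(i, end));     // 2. add all the docs, in batches (no commit)
            }
            solr.commit();                             // 3. a single commit at the end
        }
    }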


> On Sep 25, 2015, at 2:17 PM, Ravi Solr  wrote:
> 
> I have been trying to re-index the docs (about 1.5 million) because one of the
> fields needed part of its string value removed (accidentally introduced). I was
> issuing a query for 100 docs, getting 4 fields, and updating the docs (atomic
> update with "set") via the CloudSolrClient in batches. However, from time to
> time the query returns 0 results, which exits the re-indexing program.
> 
> I can't understand why the cloud returns 0 results when there are 1.4
> million docs which have the "accidental" string in them.
> 
> Is there another way to do bulk massive updates ?
> 
> Thanks
> 
> Ravi Kiran Bhaskar



Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
Sorry, I did not mean to be rude. The original question did not say that you 
don’t have the docs outside of Solr. Some people jump to the advanced features 
and miss the simple ones.

It might be faster to fetch all the docs from Solr and save them in files. Then 
modify them. Then reload all of them. No guarantee, but it is worth a try.

Good luck.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 2:59 PM, Ravi Solr  wrote:
> 
> Walter, not in a mood for banter right now. It's 6:00pm on a Friday and
> I am stuck here trying to figure out reindexing issues :-)
> I don't have the source of the docs, so I have to query Solr, modify, and put it
> back, and that is proving to be quite a task in 5.3.0. I reindexed several
> times with 4.7.2 in a master-slave env without any issue. Since then we
> have moved to cloud and it has been a pain all day.
> 
> Thanks
> 
> Ravi Kiran Bhaskar
> 
> On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood 
> wrote:
> 
>> Sure.
>> 
>> 1. Delete all the docs (no commit).
>> 2. Add all the docs (no commit).
>> 3. Commit.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Sep 25, 2015, at 2:17 PM, Ravi Solr  wrote:
>>> 
>>> I have been trying to re-index the docs (about 1.5 million) as one of the
>>> field needed part of string value removed (accidentally introduced). I
>> was
>>> issuing a query for 100 docs getting 4 fields and updating the doc
>> (atomic
>>> update with "set") via the CloudSolrClient in batches, However from time
>> to
>>> time the query returns 0 results, which exits the re-indexing program.
>>> 
>>> I cant understand as to why the cloud returns 0 results when there are
>> 1.4x
>>> million docs which have the "accidental" string in them.
>>> 
>>> Is there another way to do bulk massive updates ?
>>> 
>>> Thanks
>>> 
>>> Ravi Kiran Bhaskar
>> 
>> 



Re: Cost of having multiple search handlers?

2015-09-28 Thread Walter Underwood
We did the same thing, but reporting performance metrics to Graphite.

But we won’t be able to add servlet filters in 6.x, because it won’t be a 
webapp.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 28, 2015, at 11:32 AM, Gili Nachum  wrote:
> 
> A different solution to the same need: I'm measuring response times of
> different collections, tracking online and batch queries separately, using New
> Relic. I've added a servlet filter that analyses the request and makes this
> info available to New Relic via a request argument.
> 
> The built-in New Relic Solr plugin doesn't provide much.
> On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
> 
>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
>>> I want to register multiple but identical search handlers to have
>> multiple buckets to measure performance for our different APIs and
>> consumers (and to find out who is actually using Solr).
>>> 
>>> Are there costs associated with having multiple search
>> handlers? Are they negligible?
>> 
>> Unless you are creating hundreds or thousands of them, I doubt you'll
>> notice any significant increase in resource usage from additional
>> handlers.  Each handler definition creates an additional URL endpoint
>> within the servlet container, additional object creation within Solr,
>> and perhaps an additional thread pool and threads to go with it, so it's
>> not free, but I doubt that it's significant.  The resources required for
>> actually handling a request is likely to dwarf what's required for more
>> handlers.
>> 
>> Disclaimer: I have not delved into the code to figure out exactly what
>> gets created with a search handler config, so I don't know exactly what
>> happens.  I'm basing this on general knowledge about how Java programs
>> are constructed by expert developers, not specifics about Solr.
>> 
>> There are others on the list who have a much better idea than I do, so
>> if I'm wrong, I'm sure one of them will let me know.
>> 
>> Thanks,
>> Shawn
>> 
>> 



Re: Cost of having multiple search handlers?

2015-09-28 Thread Walter Underwood
We built our own because there was no movement on that. Don’t hold your breath.

Glad to contribute it. We’ve been running it in production for a year, but the 
config is pretty manual.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 28, 2015, at 4:41 PM, Jeff Wartes  wrote:
> 
> 
> One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
> be done by then. 
> 
> 
> On 9/28/15, 11:39 AM, "Walter Underwood"  wrote:
> 
>> We did the same thing, but reporting performance metrics to Graphite.
>> 
>> But we won’t be able to add servlet filters in 6.x, because it won’t be a
>> webapp.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Sep 28, 2015, at 11:32 AM, Gili Nachum  wrote:
>>> 
>>> A different solution to the same need: I'm measuring response times of
>>> different collections measuring  online/batch queries apart using New
>>> Relic. I've added a servlet filter that analyses the request and makes
>>> this
>>> info available to new relic over a request argument.
>>> 
>>> The built in new relic solr plug in doesn't provide much.
>>> On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
>>> 
>>>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
>>>>> I want to register multiple but identical search handler to have
>>>> multiple buckets to measure performance for our different apis and
>>>> consumers (and to find out who is actually using Solr).
>>>>> 
>>>>> What are there some costs associated with having multiple search
>>>> handlers? Are they neglible?
>>>> 
>>>> Unless you are creating hundreds or thousands of them, I doubt you'll
>>>> notice any significant increase in resource usage from additional
>>>> handlers.  Each handler definition creates an additional URL endpoint
>>>> within the servlet container, additional object creation within Solr,
>>>> and perhaps an additional thread pool and threads to go with it, so
>>>> it's
>>>> not free, but I doubt that it's significant.  The resources required
>>>> for
>>>> actually handling a request is likely to dwarf what's required for more
>>>> handlers.
>>>> 
>>>> Disclaimer: I have not delved into the code to figure out exactly what
>>>> gets created with a search handler config, so I don't know exactly what
>>>> happens.  I'm basing this on general knowledge about how Java programs
>>>> are constructed by expert developers, not specifics about Solr.
>>>> 
>>>> There are others on the list who have a much better idea than I do, so
>>>> if I'm wrong, I'm sure one of them will let me know.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> 
>> 
> 



Re: Solr vs Lucene

2015-10-01 Thread Walter Underwood
If you want a spell checker, don’t use a search engine. Use a spell checker. 
Something like aspell (http://aspell.net/) will be faster 
and better than Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 1, 2015, at 1:06 PM, Mark Fenbers  wrote:
> 
> This is with Solr.  The Lucene approach (assuming that is what is in my Java 
> code, shared previously) works flawlessly, albeit with fewer options, AFAIK.
> 
> I'm not sure what you mean by "business case"...  I'm wanting to spell-check 
> user-supplied text in my Java app.  The end-user then activates the 
> spell-checker on the entire text (presumably, a few paragraphs or less).  I 
> can use StyledText's capabilities to highlight the misspelled words, and when 
> the user clicks the highlighted word, a menu will appear where he can select 
> a suggested spelling.
> 
> But so far, I've had trouble:
> 
> * determining which words are misspelled (because Solr often returns
>   suggestions for correctly spelled words).
> * getting coherent suggestions (regardless if the query word is
>   misspelled or not).
> 
> It's been a bit puzzling (and frustrating)!!  it only took me 10 minutes to 
> get the Lucene spell checker working, but I agree that Solr would be the 
> better way to go, if I can ever get it configured properly...
> 
> Mark
> 
> 
> On 10/1/2015 12:50 PM, Alexandre Rafalovitch wrote:
>> Is that with Lucene or with Solr? Because Solr has several different
>> spell-checker modules you can configure.  I would recommend trying
>> them first.
>> 
>> And, frankly, I still don't know what your business case is.
>> 
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>> 
>> 
>> On 1 October 2015 at 12:38, Mark Fenbers  wrote:
>>> Yes, and I've spent numerous hours configuring and reconfiguring, and
>>> eventually even starting over, but still have not gotten it to work right.
>>> Even now, I'm getting bizarre results.  For example, I query   "NOTE: This
>>> is purely as an example."  and I get back really bizarre suggestions, like
>>> "n ot e" and "n o te" and "n o t e" for the first word which isn't even
>>> misspelled!  The same goes for "purely" and "example" also!  Moreover, I get
>>> extended results showing the frequencies of these suggestions being over
>>> 2600 occurrences, when I'm not even using an indexed spell checker.  I'm
>>> only using a file-based spell checker (/usr/shar/dict/words), and the
>>> wordbreak checker.
>>> 
>>> At this point, I can't even figure out how to narrow down my confusion so
>>> that I can post concise questions to the group.  But I'll get there
>>> eventually, starting with removing the wordbreak checker for the time-being.
>>> Your response was encouraging, at least.
>>> 
>>> Mark
>>> 
>>> 
>>> 
>>> On 10/1/2015 9:45 AM, Alexandre Rafalovitch wrote:
>>>> Hi Mark,
>>>> 
>>>> Have you gone through a Solr tutorial yet? If/when you do, you will
>>>> see you don't need to code any of this. It is configured as part of
>>>> the web-facing total offering which are tweaked by XML configuration
>>>> files (or REST API calls). And most of the standard pipelines are
>>>> already pre-configured, so you don't need to invent them from scratch.
>>>> 
>>>> On your specific question, it would be better to ask what _business_
>>>> level functionality you are trying to achieve and see if Solr can help
>>>> with that. Starting from Lucene code is less useful :-)
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>>>> http://www.solr-start.com/
>>>> 
>>>> 
>>>> On 1 October 2015 at 07:48, Mark Fenbers  wrote:
> 



Re: How to disable the admin interface

2015-10-05 Thread Walter Underwood
You understand that disabling the admin API will leave you with an 
unmaintainable Solr installation, right? You might not even be able to diagnose 
the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 5, 2015, at 11:34 AM, Siddhartha Singh Sandhu  
> wrote:
> 
> Help please?
> 
> On Sun, Oct 4, 2015 at 5:07 PM, Siddhartha Singh Sandhu <
> sandhus...@gmail.com> wrote:
> 
>> Hi Shawn and Andrew,
>> 
>> I am on the same page with you guys about the ssh authentication and
>> communicating with the APIs that Solr provides. I simply don't want the GUI;
>> as it is, nobody will be able to access it once I set the policy on my server,
>> except for servers in the same network. Also, now that we are on that
>> issue, do Solr URLs have checks to guard against penetration attacks, given
>> that the "prod setup" guide is so openly available?
>> 
>> Regards,
>> Sid.
>> 
>> On Sun, Oct 4, 2015 at 4:55 AM, Andrea Open Source <
>> andrearoggerone.o...@gmail.com> wrote:
>> 
>>> Hi,
>>> As Shawn is saying, disabling the admin interface is not the right way to
>>> go. If you just disable the admin interface, users could still run queries,
>>> and you don't want that. The solution that you're looking for is enabling
>>> ssh authentication so that only users with the right certificate can
>>> query Solr or reach the admin.
>>> 
>>> 
>>> Kind Regards,
>>> Andrea Roggerone
>>> 
>>>> On 04/ott/2015, at 08:11, Shawn Heisey  wrote:
>>>> 
>>>>> On 10/3/2015 9:17 PM, Siddhartha Singh Sandhu wrote:
>>>>> I want to disable the admin interface in SOLR. I understand that
>>>>> authentication is available in the solrcloud mode but until that
>>> happens I
>>>>> want to disable the admin interface in my prod environment.
>>>>> 
>>>>> How can I do this?
>>>> 
>>>> Why do you need to disable the admin interface?  The admin interface is
>>>> just a bunch of HTML, CSS, and Javascript.  It downloads code that runs
>>>> inside your browser and turns it into a tool that can manipulate Solr.
>>>> 
>>>> The parts of Solr that need protecting are the APIs that the admin
>>>> interface calls.  When authentication is enabled in the newest Solr
>>>> versions, it is not the admin interface that is protected, it is those
>>>> APIs called by the admin interface.  Anyone can use those APIs directly,
>>>> completely independent of the interface.
>>>> 
>>>> Thanks
>>>> Shawn
>>>> 
>>> 
>> 
>> 



Re: Best Indexing Approaches - To max the throughput

2015-10-06 Thread Walter Underwood
It depends on the document. In an e-commerce search, you might want to fail 
immediately and be notified. That is what we do: fail, rollback, and notify.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
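
A minimal SolrJ sketch of the fail/rollback/notify pattern; the notification hook is a hypothetical stub, and rollback simply discards updates received since the last commit:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FailFastIndexer {
        /** Add a batch; on any error, roll back uncommitted updates and surface the failure. */
        public static void indexBatch(SolrClient solr, List<SolrInputDocument> batch) throws Exception {
            try {
                solr.add(batch);
                solr.commit();
            } catch (Exception e) {
                solr.rollback();   // discard updates received since the last commit
                notifySomeone(e);  // hypothetical hook: mail, pager, log aggregator...
                throw e;           // fail fast instead of silently dropping documents
            }
        }

        private static void notifySomeone(Exception e) {
            System.err.println("Indexing failed, index left at previous commit: " + e);
        }
    }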


> On Oct 6, 2015, at 7:58 AM, Alessandro Benedetti  
> wrote:
> 
> Hmm, one broken document in a batch should not break the entire batch,
> right (whatever approach is used)?
> Are you referring to the fact that you want to programmatically re-index
> the broken docs?
> 
> It would be interesting to return the ids of the broken docs along with the
> Solr update response!
> 
> Cheers
> 
> 
> On 6 October 2015 at 15:30, Bill Dueber  wrote:
> 
>> Just to add... my informal tests show that batching has way more effect
>> than SolrJ vs JSON.
>> 
>> I haven't looked at CUSC (ConcurrentUpdateSolrClient) in a while; last time I
>> looked it was impossible to do anything smart about error handling, so check
>> that out before you get too deeply into it. We use a strategy of sending a
>> batch of JSON documents, and if it returns an error, sending each record one
>> at a time until we find the bad one and can log something useful.
>> 
>> 
>> 
>> On Mon, Oct 5, 2015 at 12:07 PM, Alessandro Benedetti <
>> benedetti.ale...@gmail.com> wrote:
>> 
>>> Thanks Erick,
>>> you confirmed my impressions!
>>> Thank you very much for the insights, an other opinion is welcome :)
>>> 
>>> Cheers
>>> 
>>> 2015-10-05 14:55 GMT+01:00 Erick Erickson :
>>> 
>>>> SolrJ tends to be faster for several reasons, not the least of which
>>>> is that it sends packets to Solr in a more efficient binary format.
>>>> 
>>>> Batching is critical. I did some rough tests using SolrJ and sending
>>>> docs one at a time gave a throughput of < 400 docs/second.
>>>> Sending 10 gave 2,300 or so. Sending 100 at a time gave
>>>> over 5,300 docs/second. Curiously, 1,000 at a time gave only
>>>> marginal improvement over 100. This was with a single thread.
>>>> YMMV of course.
>>>> 
>>>> CloudSolrClient is definitely the better way to go with SolrCloud,
>>>> it routes the docs to the correct leader instead of having the
>>>> node you send the docs to do the routing.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Mon, Oct 5, 2015 at 4:57 AM, Alessandro Benedetti
>>>>  wrote:
>>>>> I was doing some studies and analysis, just wondering in your opinion
>>>> which
>>>>> one is the best approach to use to index in Solr to reach the best
>>>>> throughput possible.
>>>>> I know that a lot of factors affect indexing time, so let's
>> only
>>>>> focus on the feeding approach.
>>>>> Let's isolate different scenarios :
>>>>> 
>>>>> *Single Solr Infrastructure*
>>>>> 
>>>>> 1) Xml/Json batch request to /update IndexHandler (xml/json)
>>>>> 
>>>>> 2) SolrJ ConcurrentUpdateSolrClient ( javabin)
>>>>> I was thinking this to be the fastest approach for a multi threaded
>>>>> indexing application.
>>>>> Posting batch of docs if possible per request.
>>>>> 
>>>>> *Solr Cloud*
>>>>> 
>>>>> 1) Xml/Json batch request to /update IndexHandler(xml/json)
>>>>> 
>>>>> 2) SolrJ ConcurrentUpdateSolrClient ( javabin)
>>>>> 
>>>>> 3) CloudSolrClient ( javabin)
>>>>> it seems the best approach according to these improvements [1]
>>>>> 
>>>>> What are your opinions ?
>>>>> 
>>>>> A bonus observation should be for using some Map/Reduce big data
>>> indexer,
>>>>> but let's assume we don't have a big cluster of cpus, but the average
>>>>> Indexer server.
>>>>> 
>>>>> 
>>>>> [1]
>>>>> 
>>>> 
>>> 
>> https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>>>>> 
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> 
>>>>> --
>>>>> --
>>>>> 
>>>>> Benedetti Alessandro
>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>> 
>>>>> "Tyger, tyger burning bright
>>>>> In the forests of the night,
>>>>> What immortal hand or eye
>>>>> Could frame thy fearful symmetry?"
>>>>> 
>>>>> William Blake - Songs of Experience -1794 England
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> 
>>> Benedetti Alessandro
>>> Visiting card - http://about.me/alessandro_benedetti
>>> Blog - http://alexbenedetti.blogspot.co.uk
>>> 
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>> 
>>> William Blake - Songs of Experience -1794 England
>>> 
>> 
>> 
>> 
>> --
>> Bill Dueber
>> Library Systems Programmer
>> University of Michigan Library
>> 
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England



Re: Best Indexing Approaches - To max the throughput

2015-10-06 Thread Walter Underwood
This is at Chegg. One of our indexes is textbooks. These are expensive and 
don’t change very often. It is better to keep yesterday’s index than to drop a 
few important books.

We have occasionally had an error that happens with every book, like a new 
field that is not in the Solr schema. If we ignored errors with that, we’d have 
an empty index: delete all, add all (failing), commit.

With the fail fast and rollback, we can catch problems before they mess up the 
index.

Also, to pinpoint isolated problems, if there is an error in the batch, it 
re-submits that batch one at a time, so we get an accurate report of which 
document was rejected. I wrote that same thing back at Netflix, before SolrJ.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
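
A minimal SolrJ sketch of that resubmit-one-at-a-time reporting; the "id" field name is an assumption:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchWithPinpointing {
        /** Send a batch; if it fails, retry each document alone to report exactly which one was rejected. */
        public static void addBatch(SolrClient solr, List<SolrInputDocument> batch) throws Exception {
            try {
                solr.add(batch);
            } catch (Exception batchError) {
                for (SolrInputDocument doc : batch) {
                    try {
                        solr.add(doc);
                    } catch (Exception docError) {
                        System.err.println("Rejected " + doc.getFieldValue("id") + ": " + docError.getMessage());
                    }
                }
            }
        }
    }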


> On Oct 6, 2015, at 9:49 AM, Alessandro Benedetti  
> wrote:
> 
> Hi Walter,
> can you explain your use case in more detail?
> You index a batch of e-commerce products (Solr documents); if one fails,
> you want to stop and invalidate the entire batch (using the almost never
> used Solr rollback, or manual deletion?)
> And then log the exception and the indexing size.
> And then re-index the whole batch of docs?
> 
> In this scenario, wouldn't the ConcurrentUpdateSolrClient be less than ideal?
> Just curiosity.
> 
> Cheers
> 
> On 6 October 2015 at 17:29, Walter Underwood  wrote:
> 
>> It depends on the document. In a e-commerce search, you might want to fail
>> immediately and be notified. That is what we do, fail, rollback, and notify.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 6, 2015, at 7:58 AM, Alessandro Benedetti <
>> benedetti.ale...@gmail.com> wrote:
>>> 
>>> mm one broken document in a batch should not break the entire batch ,
>>> right ( whatever approach used) ?
>>> Are you referring to the fact that you want to programmatically re-index
>>> the broken docs ?
>>> 
>>> Would be interesting to return the id of the broken docs along with the
>>> solr update response!
>>> 
>>> Cheers
>>> 
>>> 
>>> On 6 October 2015 at 15:30, Bill Dueber  wrote:
>>> 
>>>> Just to add...my informal tests show that batching has way more
>> effect
>>>> than solrj vs json.
>>>> 
>>>> I haven't look at CUSC in a while, last time I looked it was impossible
>> to
>>>> do anything smart about error handling, so check that out before you get
>>>> too deeply into it. We use a strategy of sending a batch of json
>> documents,
>>>> and if it returns an error sending each record one at a time until we
>> find
>>>> the bad one and can log something useful.
>>>> 
>>>> 
>>>> 
>>>> On Mon, Oct 5, 2015 at 12:07 PM, Alessandro Benedetti <
>>>> benedetti.ale...@gmail.com> wrote:
>>>> 
>>>>> Thanks Erick,
>>>>> you confirmed my impressions!
>>>>> Thank you very much for the insights, an other opinion is welcome :)
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> 2015-10-05 14:55 GMT+01:00 Erick Erickson :
>>>>> 
>>>>>> SolrJ tends to be faster for several reasons, not the least of which
>>>>>> is that it sends packets to Solr in a more efficient binary format.
>>>>>> 
>>>>>> Batching is critical. I did some rough tests using SolrJ and sending
>>>>>> docs one at a time gave a throughput of < 400 docs/second.
>>>>>> Sending 10 gave 2,300 or so. Sending 100 at a time gave
>>>>>> over 5,300 docs/second. Curiously, 1,000 at a time gave only
>>>>>> marginal improvement over 100. This was with a single thread.
>>>>>> YMMV of course.
>>>>>> 
>>>>>> CloudSolrClient is definitely the better way to go with SolrCloud,
>>>>>> it routes the docs to the correct leader instead of having the
>>>>>> node you send the docs to do the routing.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> On Mon, Oct 5, 2015 at 4:57 AM, Alessandro Benedetti
>>>>>>  wrote:
>>>>>>> I was doing some studies and analysis, just wondering in your opinion
>>>>>> which
>>>>>>> one is the best approach to use to index in Solr to reach the best
>>>>>>> throug

Re: Pressed optimize and now SOLR is not indexing while optimize is going on

2015-10-07 Thread Walter Underwood
Unix has a “buffer cache”, often called a file cache. This chapter discusses 
the Linux buffer cache, which is very similar to other Unix implementations. 
Essentially, all unused RAM is used to make disk access faster.

http://www.tldp.org/LDP/sag/html/buffer-cache.html

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 7, 2015, at 3:40 AM, Toke Eskildsen  wrote:
> 
> On Wed, 2015-10-07 at 07:03 -0300, Eric Torti wrote:
>> I'm sorry to diverge this thread a little bit, but could you please point me to
>> resources that explain in depth how the OS uses the non-Java
>> memory to cache index data?
> 
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> Shawn Heisey:
>>> Whatever RAM is left over after you give 12GB to Java for Solr will be
>>> used automatically by the operating system to cache index data on the
>>> disk.  Solr is completely reliant on that caching for good performance.
>> 
>> I'm puzzled as to why the physical memory of solr's host machine is always
>> used up and I think some resources on that would help me understand it.
> 
> It is not used up as such: Add "Disk cache" and "Free space" (or
> whatever your monitoring tool calls them) and you will have the amount
> of memory available for new processes. If you start a new and
> memory-hungry process, it will take the memory from the free pool first,
> then from the disk cache.
> 
> 
> - Toke Eskildsen, State and University Library, Denmark
> 
> 



Re: EdgeNGramFilterFactory question

2015-10-07 Thread Walter Underwood
You would need an analyzer or char filter factory that removed all spaces. But 
then you would only get one “edge”. That would make “to be or not to be” into 
the single token “tobeornottobe”. I don’t think that fixes anything.

Stemming and prefix matching do very different things. Use them in different 
analysis chains stored in separate fields.

The exact example you list will work fine with stemming and phrase search. 
Check out the phrase search support in the edismax query parser.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2015, at 1:47 PM, vit  wrote:
> 
> I have Solr 4.2
> 
> 1) Is it possible to somehow use EdgeNGramFilterFactory ignoring white
> spaces in n-grams?
> 
> 2) Is it possible to use EdgeNGramFilterFactory in combination with stemming
> ?
>Say applying this to "look for close hotel" instead of "looking for
> closest hotels"
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/EdgeNGramFilterFactory-question-tp4233034.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to show some documents ahead of others

2015-10-08 Thread Walter Underwood
Sorting all paid above all unpaid will give bad results when there are many 
matches. It will show 1000 paid items, including all the barely relevant ones, 
before it shows the first highly relevant unpaid recipe. What if that was the 
only correct result?

Two approaches that work:

1. Boost paid items using the “boost” parameter in edismax. Adjust it to be a 
tiebreaker between documents with similar score.

2. Show two lists, one with the five most relevant paid, the next with the five 
most relevant unpaid.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 8, 2015, at 7:39 AM, Alessandro Benedetti  
> wrote:
> 
> Is it possible to explain this better: "as it doesn't
> allow any meaningful customization"?
> 
> Cheers
> 
> On 8 October 2015 at 15:27, Andrea Roggerone > wrote:
> 
>> Hi guys,
>> I don't think that sorting is a good solution in this case, as it doesn't
>> allow any meaningful customization. I believe that the advised
>> QueryElevationComponent is one of the viable alternatives. Another one would
>> be to boost a particular field at query time, for instance paid. That
>> would allow you to assign different boosts to different values using a
>> function.
>> 
>> On Thu, Oct 8, 2015 at 1:48 PM, Upayavira  wrote:
>> 
>>> Or just have a field in your index -
>>> 
>>> paid: true/false
>>> 
>>> Then sort=paid desc, score desc
>>> 
>>> (you may need to sort paid asc, not sure which way a boolean would sort)
>>> 
>>> Question is whether you want to show ALL paid posts, or just a set of
>>> them. For the latter you could use result grouping on the paid field.
>>> 
>>> Upayavira
>>> 
>>> On Thu, Oct 8, 2015, at 01:34 PM, NutchDev wrote:
>>>> Hi Christian,
>>>> 
>>>> You can take a look at Solr's  QueryElevationComponent
>>>> <https://wiki.apache.org/solr/QueryElevationComponent>  .
>>>> 
>>>> It will allow you to configure the top results for a given query
>>>> regardless
>>>> of the normal Lucene scoring. Also you can specify an exclude document
>> list
>>>> to
>>>> exclude certain results for a particular query.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context:
>>>> 
>>> 
>> http://lucene.472066.n3.nabble.com/How-to-show-some-documents-ahead-of-others-tp4233481p4233490.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>> 
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England



Re: Exclude documents having same data in two fields

2015-10-09 Thread Walter Underwood
Please explain why you do not want to use an extra field. That is the only 
solution that will perform well on your large index.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 9, 2015, at 7:47 AM, Aman Tandon  wrote:
> 
> No Susheel, As our index size is 62 GB so it seems hard to find those
> records.
> 
> With Regards
> Aman Tandon
> 
> On Fri, Oct 9, 2015 at 7:30 PM, Susheel Kumar  wrote:
> 
>> Hi Aman,  Did the problem resolved or still having some errors.
>> 
>> Thnx
>> 
>> On Fri, Oct 9, 2015 at 8:28 AM, Aman Tandon 
>> wrote:
>> 
>>> okay Thanks
>>> 
>>> With Regards
>>> Aman Tandon
>>> 
>>> On Fri, Oct 9, 2015 at 4:25 PM, Upayavira  wrote:
>>> 
>>>> Just beware of performance here. This is fine for smaller indexes, but
>>>> for larger ones won't work so well. It will need to do this calculation
>>>> for every document in your index, thereby undoing all benefits of
>> having
>>>> an inverted index.
>>>> 
>>>> If your index (or resultset) is small enough, it can work, but might
>>>> catch you out later.
>>>> 
>>>> Upayavira
>>>> 
>>>> On Fri, Oct 9, 2015, at 10:59 AM, Aman Tandon wrote:
>>>>> Hi,
>>>>> 
>>>>> I tried to use the same as mentioned in the url
>>>>> <
>>>> 
>>> 
>> http://stackoverflow.com/questions/16258605/query-for-document-that-two-fields-are-equal
>>>>> 
>>>>> .
>>>>> 
>>>>> And I used the description field to check because mapping field
>>>>> is multivalued.
>>>>> 
>>>>> So I add the fq={!frange%20l=0%20u=1}strdist(title,description,edit)
>> in
>>>>> my
>>>>> url, but I am getting this error. As mentioned below. Please take a
>>> look.
>>>>> 
>>>>> *Solr Version 4.8.1*
>>>>> 
>>>>> *Url is*
>>>>> 
>>>> 
>>> 
>> http://localhost:8150/solr/core1/select?q.alt=*:*&fl=big*,title,catid&fq={!frange%20l=0%20u=1}strdist(title,description,edit)&defType=edismax
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 500
>>>>>> 8
>>>>>> 
>>>>>> *:*
>>>>>> edismax
>>>>>> big*,title,catid
>>>>>> {!frange l=0
>> u=1}strdist(title,description,edit)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> java.lang.RuntimeException at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.search.ExtendedDismaxQParser$ExtendedDismaxConfiguration.<init>(ExtendedDismaxQParser.java:1455)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.search.ExtendedDismaxQParser.createConfiguration(ExtendedDismaxQParser.java:239)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.search.ExtendedDismaxQParser.<init>(ExtendedDismaxQParser.java:108)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.search.ExtendedDismaxQParserPlugin.createParser(ExtendedDismaxQParserPlugin.java:37)
>>>>>> at org.apache.solr.search.QParser.getParser(QParser.java:315) at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:144)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1952) at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:774)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>>> at
>>>>>> 
>>>> 
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>>> at
>>>>>> 

Re: Exclude documents having same data in two fields

2015-10-10 Thread Walter Underwood
After several days, we finally get the real requirement. It really does waste a 
lot of time and energy when people won’t tell us that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 10, 2015, at 8:19 AM, Upayavira  wrote:
> 
> In which case you'd be happy to wait for 30s for it to complete, in
> which case the func or frange function query should be fine.
> 
> Upayavira
> 
> On Fri, Oct 9, 2015, at 05:55 PM, Aman Tandon wrote:
>> Thanks Mikhail for the suggestion. I will try that on Monday and will let you
>> know.
>> 
>> @Walter This was just a random requirement to find those docs where the fields
>> are not the same and then reindex only those. I can do a full index, but I was
>> wondering if there might be some function or something.
>> 
>> With Regards
>> Aman Tandon
>> 
>> On Fri, Oct 9, 2015 at 9:05 PM, Mikhail Khludnev
>> >> wrote:
>> 
>>> Aman,
>>> 
>>> You can invoke the Terms Component for the field M, let it return terms:
>>> {a,c,d,f}
>>> then you invoke it for field T, let it return {b,c,f,e},
>>> then you intersect both lists (it's quite romantic if they are kept
>>> ordered), you've got {c,f}
>>> and then you apply the filter:
>>> fq=-((+M:c +T:c) (+M:f +T:f))
>>> etc
>>> 
>>> 
>>> On Thu, Oct 8, 2015 at 8:29 AM, Aman Tandon 
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Is there a way in solr to remove all those documents from the search
>>>> results in which two of the fields, *mapping* and  *title* is the exactly
>>>> same.
>>>> 
>>>> With Regards
>>>> Aman Tandon
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Principal Engineer,
>>> Grid Dynamics
>>> 
>>> <http://www.griddynamics.com>
>>> 
>>> 



Re: How to show some documents ahead of others - requirements

2015-10-10 Thread Walter Underwood
By far the easiest solution is to do two queries from the front end.
One requesting three paid results, and one requesting nine unpaid results.
If all the results are in one collection, use “fq” to select paid/unpaid.

That is going to be fast and there is zero doubt that it will do the right 
thing. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
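
A minimal SolrJ sketch of the two-query approach; the boolean "paid" field and the 3/9 split come from the requirement above and are otherwise assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrDocumentList;

    public class PaidUnpaidSearch {
        public static void search(SolrClient solr, String userQuery) throws Exception {
            SolrQuery paid = new SolrQuery(userQuery);
            paid.addFilterQuery("paid:true");
            paid.setRows(3);

            SolrQuery unpaid = new SolrQuery(userQuery);
            unpaid.addFilterQuery("paid:false");
            unpaid.setRows(9);

            SolrDocumentList paidHits = solr.query(paid).getResults();
            SolrDocumentList unpaidHits = solr.query(unpaid).getResults();
            // The front end stacks the lists: up to 3 paid results on top, then 9 unpaid.
        }
    }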


> On Oct 10, 2015, at 9:31 AM, Erick Erickson  wrote:
> 
> Would result grouping work here? If the group key was "paid", then
> you'd get two groups back, "paid" an "unpaid". Within each group you'd
> have results ordered by rank. This would work for a page or two, but
> eventually you'd be in a spot where you'd have to over sample, i.e.
> return pages*X in each group to be able to page very deeply.
> 
> Or you could just fire two queries and have the app assemble the final list.
> 
> Best,
> Erick
> 
> On Sat, Oct 10, 2015 at 8:13 AM, Upayavira  wrote:
>> I've seen a similar requirement to this recently.
>> 
>> Basically, a sorting requirement that is close to impossible to
>> implement as a scoring/boosting formula, because the *position* of the
>> result features in the score, and that's not something I believe can be
>> done right now.
>> 
>> The way we solved the issue in the similar case I referred to above was
>> by using a RerankQuery. That query class has a getTopDocsCollector()
>> function, which you can override, providing your own Collector.
>> 
>> If you then refer to your query(actually your query parser) with the
>> rerank query param in Solr: rq={!myRerankQuery} then it will trigger
>> your new collector, which will be given its topDocs() method is called,
>> will call topDocs on its parent query, get a list of documents, then
>> order them in some way such as you require, and return them in a
>> non-score order.
>> 
>> Not sure I've made that very clear, but hope it helps a little.
>> 
>> Upayavira
>> 
>> On Sat, Oct 10, 2015, at 03:13 PM, liviuchrist...@yahoo.com.INVALID
>> wrote:
>>> Hi Upayavira & Walter & everyone else
>>> 
>>> About the requirements:
>>> 1. I need to return no more than 3 paid results on a page of 12 results.
>>> 2. Paid results should be sorted like this: let's say a user is searching
>>> for "chocolate almonds cake". Now, let's say that 2000 results match the
>>> query and there are about 10 of these that are "paid results". I need to
>>> list the first 3 (1-2-3) of the paid results (in decreasing ranking order)
>>> on the first page (maybe by improving the ranking of the 20 paid results
>>> over the non-paid ones and listing the first 3 of them), and then list 9
>>> non-paid results on the page in decreasing ranking order.
>>> 
>>> Then, on the second page, I want to list first the next 3 paid results
>>> (4-5-6) and so on.
>>> 
>>> Kind regards,
>>> Christian
>>> Christian Fotache Tel: 0728.297.207
>>> 
>>>  From: Upayavira 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, October 8, 2015 7:03 PM
>>> Subject: Re: How to show some documents ahead of others
>>> 
>>> Hence the suggestion to group by the paid field - would give you two
>>> lists of the number you ask for.
>>> 
>>> What I'm trying to say is that the QueryElevationComponent might do it,
>>> but it is also relatively clunky, so a pure search solution might do it.
>>> 
>>> However, the thing we lack right now is a full take on the requirements,
>>> e.g. how should paid results be sorted, how many paid results do you
>>> show, etc, etc. Without these details we're all guessing.
>>> 
>>> Upayavira
>>> 
>>> 
>>> On Thu, Oct 8, 2015, at 04:45 PM, Walter Underwood wrote:
>>>> Sorting all paid above all unpaid will give bad results when there are
>>>> many matches. It will show 1000 paid items, include all the barely
>>>> relevant ones, before it shows the first highly relevant unpaid recipe.
>>>> What if that was the only correct result?
>>>> 
>>>> Two approaches that work:
>>>> 
>>>> 1. Boost paid items using the “boost” parameter in edismax. Adjust it to
>>>> be a tiebreaker between documents with similar score.
>>>> 
>>>> 2. Show two lists, one with the five most relevant paid, the next with
>>>> the five most relevant unpaid.
>>>> 
>>>>

Re: catchall fields or multiple fields

2015-10-12 Thread Walter Underwood
Why get rid of idf? Most often, idf is a big help in relevance.

I’ve used different weights for different parts of the document, like weighting 
the title 8X the body.

I’ve used different weights for different analysis chains. If we have three 
fields, one lowercased, one stemmed, and one a phonetic representation, then 
you can weight the lower case higher than the stemmed field, and stemmed higher 
than phonetic.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
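
A minimal SolrJ sketch of that kind of weighting with edismax; the field names and weights are illustrative, with each field holding a different analysis of the same text:

    import org.apache.solr.client.solrj.SolrQuery;

    public class WeightedFields {
        public static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "edismax");
            // Title weighted above body; exact > stemmed > phonetic within each.
            q.set("qf", "title_exact^8 title_stem^4 title_phon^2 body_exact^2 body_stem^1 body_phon^0.5");
            return q;
        }
    }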


> On Oct 12, 2015, at 6:12 AM, Ahmet Arslan  wrote:
> 
> Hi,
> 
> Catch-all field: No need to worry about how to aggregate scores coming from 
> different fields.
> But you cannot utilize different analysers for different fields.
> 
> Multiple-fields: You can play with edismax's parameters on-the-fly, without 
> having to re-index.
> It is flexible in that you can include/exclude fields from the search.
> 
> Ahmet
> 
> 
> 
> On Monday, October 12, 2015 3:39 PM, elisabeth benoit 
>  wrote:
> Hello,
> 
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
> 
> Best regards,
> Elisabeth



Re: LIX readability index calculation by solr

2015-10-21 Thread Walter Underwood
Can you reload all the content?

If so, I would calculate this in an update request processor and put the result 
in its own field.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
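
A minimal sketch of such an update request processor; the source field "body", the target field "lix", and the word/sentence regexes are assumptions, and it would still need to be registered in an update chain:

    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class LixUpdateProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object body = doc.getFieldValue("body");       // assumed source field
                    if (body != null) {
                        doc.setField("lix", lix(body.toString())); // assumed target field
                    }
                    super.processAdd(cmd);
                }
            };
        }

        /** LIX = words/sentences + 100 * longWords/words (long word = more than 6 letters). */
        static double lix(String text) {
            String[] words = text.trim().split("\\s+");
            long longWords = 0;
            for (String w : words) {
                if (w.replaceAll("\\W", "").length() > 6) longWords++;
            }
            int sentences = text.split("[.:!?]+").length;
            if (words.length == 0 || sentences == 0) return 0.0;
            return (double) words.length / sentences + 100.0 * longWords / words.length;
        }
    }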


> On Oct 21, 2015, at 2:53 AM, Roland Szűcs  wrote:
> 
> Thanks, Toke, for your quick response. All your suggestions seem to be very good 
> ideas. I also found the capital letters strange, because of names and places, 
> so I will skip this part, as I do not need an absolute measure, just a ranked 
> order among my documents.
> 
> cheers,
> Roland
> 
> 
> 
> 2015. okt. 21. dátummal, 11:25 időpontban Toke Eskildsen 
>  írta:
> 
>> Roland Szűcs  wrote:
>>> My use case is that I have to calculate the LIX readability index for my
>>> documents.
>> [...]
>>> *B* = Number of periods (defined by period, colon or capital first letter)
>> [...]
>>> Does anybody have idea how to get the number of "periods"?
>> 
>> As the positions do not matter, you could make a copyField containing only 
>> punctuation. And maybe extend it with a replace filter so that you have dot, 
>> comma, colon, bang, question etc. instead of .,:!?
>> 
>> The capital first letter seems a bit strange to me - what about names? But 
>> anyway, you could do it with a PatternReplaceCharFilter, matching on 
>> something like 
>> ([^.,:!?]\p{Space}*\p{Upper})|(^\p{Upper})
>> and replacing with 'capital' (the regexp above probably fails - it was just 
>> from memory).
>> 
>> - Toke Eskildsen



Re: [newbie] Configuration for SolrCloud + DataImportHandler

2015-10-21 Thread Walter Underwood
Does the collection reload do a rolling reload of each node or does it do them 
all at once? We were planning on using the core reload on each system, one at a 
time. That would make sure the collection stays available.

I read the documentation, it didn’t say anything about that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 21, 2015, at 8:36 AM, Erick Erickson  wrote:
> 
> Please be very careful using the core admin UI for anything related to
> SolrCloud. In fact, I try to avoid using it at all.
> 
> The reason is that it is very low-level, and it is very easy to use it
> incorrectly. For instance, reloading a core in a multi-replica setup
> (doesn't matter whether it's several shards or just a single shard with
> multiple replicas) will reload _only_ that core, leaving the other
> replicas in your collection with the old configs.
> 
> Always use the collections API if at all possible, see:
> https://cwiki.apache.org/confluence/display/solr/Collections+API
> 
> Best,
> Erick
> 
> On Wed, Oct 21, 2015 at 1:02 AM, Hangu Choi  wrote:
>> Mikhail,
>> I solved the problem; I put the file to the wrong path. /synonyms.txt should be
>> /configs/gettingstarted/synonyms.txt .
>> 
>> 
>> Regards,
>> Hangu
>> 
>> On Wed, Oct 21, 2015 at 4:17 PM, Hangu Choi  wrote:
>> 
>>> Mikhail,
>>> 
>>> I didn't understand that that's what I need to do. Thank you.
>>> 
>>> But at the moment, I am not doing well..
>>> I am testing changing the configuration in SolrCloud, through this command
>>> 
>>> ./zkcli.sh -zkhost localhost:9983 -cmd putfile /synonyms.txt
>>> /usr/local/solr-5.3.1-test/server/scripts/cloud-scripts/synonyms.txt
>>> and no error message occurred.
>>> 
>>> and then I reloaded Solr in the coreAdmin at localhost:8983.
>>> Then I checked the synonyms.txt file at localhost:8983/solr/#/~cloud?view=tree
>>> but nothing happened. What's wrong?
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> Hangu
>>> 
>>> On Tue, Oct 20, 2015 at 9:18 PM, Mikhail Khludnev <
>>> mkhlud...@griddynamics.com> wrote:
>>> 
>>>> did you try something like
>>>> $> zkcli.sh -zkhost localhost:2181 -cmd putfile /solr.xml
>>>> /path/to/solr.xml
>>>> ?
>>>> 
>>>> On Mon, Oct 19, 2015 at 11:15 PM, hangu choi  wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am trying to start SolrCloud with embedded ZooKeeper.
>>>>> 
>>>>> I know how to config solrconfig.xml and schema.xml, and other things for
>>>>> data import handler.
>>>>> but when I trying to config it with solrCloud, I don't know where to
>>>> start.
>>>>> 
>>>>> I know there is no conf directory in SolrCloud because conf directory
>>>> are
>>>>> stored in ZooKeeper.
>>>>> Then, how can I config that? I read this (
>>>>> 
>>>>> 
>>>> https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files
>>>>> )
>>>>> but I failed to understand.
>>>>> 
>>>>> I need to config solrconfig.xml and schema.xml for my custom schema.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Hangu
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Principal Engineer,
>>>> Grid Dynamics
>>>> 
>>>> <http://www.griddynamics.com>
>>>> 
>>>> 
>>> 
>>> 



Re: Best strategy for indexing multiple tables with multiple fields

2015-10-26 Thread Walter Underwood
Most of the time, the best approach is to denormalize everything into one big 
virtual table. Think about making a view, where each row is one document in 
Solr. That row needs everything that will be searched and everything that will 
be displayed, but nothing else.
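
As a rough sketch, one denormalized row posted to Solr might look like this 
(field names are invented for illustration):

  <add>
    <doc>
      <field name="id">order-1001-line-1</field>
      <field name="type">order_line</field>
      <field name="customer_name">Acme Corp</field>
      <field name="product_title">Widget, 10-pack</field>
      <field name="order_date">2015-10-01T00:00:00Z</field>
    </doc>
  </add>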

I’ve heard of installations with tens of thousands of fields. A thousand fields 
might be cumbersome, but it won’t break Solr.

If the tables contain different kinds of things, you might have different 
collections (one per kind of thing), or one collection with a “type” field for each 
kind of document. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 26, 2015, at 4:08 PM, Daniel Valdivia  wrote:
> 
> Hi, I’m new to the solr world, I’m in need of some experienced advice as I 
> see I can do a lot of cool stuff with Solr, but I’m not sure which path to 
> take so I don’t shoot myself in the foot with all this power :P
> 
> I have several tables (225) in my application, which I’d like to add into a 
> single index (multiple type of documents in the same index with unique id) 
> however, each table has a different number of columns, from 5 to 30 columns, 
> do you recomend indexing each column separately or joining all columns into a 
> single “big document”?
> 
> I’m trying to provide my users with a simple experience where they type their 
> search query in a simple search box and I list all the possible documents 
> across different tables that match their query, not sure if that strategy is 
> the best, or perhaps a core per table?
> 
> So far these are my considered strategies:
> 
> unique_id , table , megafield: All of the columns in the record get mixed 
> into a single megafield and indexes (cons: no faceting?)
> a core per table: Each table gets a core, all the fields get indexed (except 
> numbers and foreign keys), I’m not sure if having 200 cores will play nice 
> with Solr
> Single core, all fields get indexed ( possible 1,000’s of columns), this 
> sounds expensive and not so efficient to me
> 
> My application has around 2M records
> 
> Thanks in advance for any advice.
> 
> Cheers



Re: restore quorum after majority of zk nodes down

2015-10-29 Thread Walter Underwood
You can't. Zookeeper needs a majority. One node is not a majority of a three 
node ensemble.

There is no way to split a Solr Cloud cluster across two datacenters and have 
high availability. You can do that with three datacenters.

You can probably bring up a new Zookeeper ensemble and configure the Solr 
cluster to talk to it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 29, 2015, at 10:08 AM, Matteo Grolla  wrote:
> 
> I'm designing a solr cloud installation where nodes from a single cluster
> are distributed on 2 datacenters which are close and very well connected.
> let's say that zk nodes zk1, zk2 are on DC1 and zk2 is on DC2 and let's say
> that DC1 goes down and the cluster is left with zk3.
> how can I restore a zk quorum from this situation?
> 
> thanks



Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Is there some reason that you don’t want to use aspell with a custom 
dictionary? Lucene and Solr are pretty weak compared to purpose-built spelling 
checkers.

http://aspell.net/ <http://aspell.net/>

Also, consider the Peter Norvig spell corrector approach. With a fixed list, it 
is blazing fast. In only 21 lines of Python.

http://norvig.com/spell-correct.html <http://norvig.com/spell-correct.html>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 30, 2015, at 11:37 AM, Robert Oschler  wrote:
> 
> Hello everyone,
> 
> I have a gigantic list of industry terms that I want to import into a
> Solr/Lucene instance running on an AWS box.  What is the fastest way to
> import the list into my Solr/Lucene instance?  I have admin/sudo privileges
> on the box.
> 
> Also, is there a document that shows me how to set up my Solr/Lucene config
> file to be optimized for fast searches on single word entries using fuzzy
> search?  I intend to use this Solr/Lucene instance to do spell checking on
> the big industry word list I mentioned above.  Each data record will be a
> single word from the file.  I'll want to take a single word query and do a
> fuzzy search on the word against the index (Levenshtein, max distance 2 as
> per Solr/Lucene's fuzzy search feature).  So what parameters will configure
> Solr/Lucene to be optimized for such a search?  Also, if a document shows
> the best index/read parameters to support single word fuzzy searching then
> that would be a big help too.  Note, the contents of the index will change
> very infrequently if that affects the optimal parameter mix.
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/



Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Dedicated spell-checkers have better algorithms than Solr. They usually handle 
transposed characters as well as inserted, deleted, or substituted characters. 
This is an enhanced version of Levenshtein distance. It is called 
Damerau-Levenshtein and is too expensive to use in Solr search. Spell 
correctors can also use a bigger distance than 2, unlike Solr.
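
For example, fixing the transposition in “teh” to get “the” is a single edit 
under Damerau-Levenshtein, but plain Levenshtein needs two substitutions, so a 
distance budget of 2 is already spent on one swapped pair.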

The Peter Norvig corrector also handles words that have been run together. The 
Norvig corrector has been translated to many different computer languages.

The Norvig corrector is an interesting approach. It is well worth reading this 
short article to learn more about spelling correction. 

http://norvig.com/spell-correct.html <http://norvig.com/spell-correct.html>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 30, 2015, at 4:45 PM, Robert Oschler  wrote:
> 
> Hello Walter and Mikhail,
> 
> Thank you for your answers.  Do those spell checkers have the same or
> better fuzzy matching capability that SOLR/Lucene has (Levenshtein, max
> distance 2)?  That's a critical requirement for my application.  I take it
> by your suggestion of these spell checker apps they can easily be extended
> with a user defined, supplementary dictionary, yes?
> 
> Thanks.
> 
> On Fri, Oct 30, 2015 at 3:07 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
> 
>> Perhaps
>> FileBasedSpellChecker
>> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>> 
>> On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler 
>> wrote:
>> 
>>> Hello everyone,
>>> 
>>> I have a gigantic list of industry terms that I want to import into a
>>> Solr/Lucene instance running on an AWS box.  What is the fastest way to
>>> import the list into my Solr/Lucene instance?  I have admin/sudo
>> privileges
>>> on the box.
>>> 
>>> Also, is there a document that shows me how to set up my Solr/Lucene
>> config
>>> file to be optimized for fast searches on single word entries using fuzzy
>>> search?  I intend to use this Solr/Lucene instance to do spell checking
>> on
>>> the big industry word list I mentioned above.  Each data record will be a
>>> single word from the file.  I'll want to take a single word query and do
>> a
>>> fuzzy search on the word against the index (Levenshtein, max distance 2
>> as
>>> per Solr/Lucene's fuzzy search feature).  So what parameters will
>> configure
>>> Solr/Lucene to be optimized for such a search?  Also, if a document shows
>>> the best index/read parameters to support single word fuzzy searching
>> then
>>> that would be a big help too.  Note, the contents of the index will
>> change
>>> very infrequently if that affects the optimal parameter mix.
>>> 
>>> 
>>> --
>>> Thanks,
>>> Robert Oschler
>>> Twitter -> http://twitter.com/roschler
>>> http://www.RobotsRule.com/
>>> http://www.Robodance.com/
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>> 
>> <http://www.griddynamics.com>
>> 
>> 
> 
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/



Re: Fastest way to import a giant word list into Solr/Lucene?

2015-10-30 Thread Walter Underwood
Read the links I have sent.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 30, 2015, at 7:10 PM, Robert Oschler  wrote:
> 
> Thanks Walter.  Are there any open source spell checkers that implement the
> Peter Norvig or Damerau-Levenshtein algorithms?  I'm short on time so I
> have to keep the custom coding down to a minimum.
> 
> 
> On Fri, Oct 30, 2015 at 8:02 PM, Walter Underwood 
> wrote:
> 
>> Dedicated spell-checkers have better algorithms than Solr. They usually
>> handle transposed characters as well as inserted, deleted, or substituted
>> characters. This is an enhanced version of Levenshtein distance. It is
>> called Damerau-Levenshtein and is too expensive to use in Solr search.
>> Spell correctors can also use a bigger distance than 2, unlike Solr.
>> 
>> The Peter Norvig corrector also handles words that have been run together.
>> The Norvig corrector has been translated to many different computer
>> languages.
>> 
>> The Norvig corrector is an interesting approach. It is well worth reading
>> this short article to learn more about spelling correction.
>> 
>> http://norvig.com/spell-correct.html <http://norvig.com/spell-correct.html
>>> 
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Oct 30, 2015, at 4:45 PM, Robert Oschler 
>> wrote:
>>> 
>>> Hello Walter and Mikhail,
>>> 
>>> Thank you for your answers.  Do those spell checkers have the same or
>>> better fuzzy matching capability that SOLR/Lucene has (Levenshtein, max
>>> distance 2)?  That's a critical requirement for my application.  I take
>> it
>>> by your suggestion of these spell checker apps they can easily be
>> extended
>>> with a user defined, supplementary dictionary, yes?
>>> 
>>> Thanks.
>>> 
>>> On Fri, Oct 30, 2015 at 3:07 PM, Mikhail Khludnev <
>>> mkhlud...@griddynamics.com> wrote:
>>> 
>>>> Perhaps
>>>> FileBasedSpellChecker
>>>> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>>>> 
>>>> On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler <
>> robert.osch...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> I have a gigantic list of industry terms that I want to import into a
>>>>> Solr/Lucene instance running on an AWS box.  What is the fastest way to
>>>>> import the list into my Solr/Lucene instance?  I have admin/sudo
>>>> privileges
>>>>> on the box.
>>>>> 
>>>>> Also, is there a document that shows me how to set up my Solr/Lucene
>>>> config
>>>>> file to be optimized for fast searches on single word entries using
>> fuzzy
>>>>> search?  I intend to use this Solr/Lucene instance to do spell checking
>>>> on
>>>>> the big industry word list I mentioned above.  Each data record will
>> be a
>>>>> single word from the file.  I'll want to take a single word query and
>> do
>>>> a
>>>>> fuzzy search on the word against the index (Levenshtein, max distance
>> 2
>>>> as
>>>>> per Solr/Lucene's fuzzy search feature).  So what parameters will
>>>> configure
>>>>> Solr/Lucene to be optimized for such a search?  Also, if a document
>> shows
>>>>> the best index/read parameters to support single word fuzzy searching
>>>> then
>>>>> that would be a big help too.  Note, the contents of the index will
>>>> change
>>>>> very infrequently if that affects the optimal parameter mix.
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks,
>>>>> Robert Oschler
>>>>> Twitter -> http://twitter.com/roschler
>>>>> http://www.RobotsRule.com/
>>>>> http://www.Robodance.com/
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Principal Engineer,
>>>> Grid Dynamics
>>>> 
>>>> <http://www.griddynamics.com>
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks,
>>> Robert Oschler
>>> Twitter -> http://twitter.com/roschler
>>> http://www.RobotsRule.com/
>>> http://www.Robodance.com/
>> 
>> 
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/



Re: Solr getting irrelevant results when use block join

2015-10-31 Thread Walter Underwood
This will probably work better without child documents and joins.

I would denormalize into actor documents and movie documents. At least, that’s 
what I did at Netflix.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 31, 2015, at 1:17 PM, Yangrui Guo  wrote:
> 
> Hi I'm using solr to search imdb database. I set the parent entity to
> include the name for each actor/actress and child entity for his movies.
> Because user might either enter a movie or a person I did not specify which
> entity solr should return. When I just search q=Kate AND Winslet without
> block join solr returned me the correct result. However, when I search
> {!parent which="type:parent"}+(Kate AND Winslet) solr seemed to have
> returned all document containing just term "Kate". I tried quoting the
> terms but the order needs to be exactly "Kate Winslet". Is there any method
> I can boost higher the score of the document which includes the terms in
> the same field?
> 
> Yangrui



Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
To back up a bit, how many documents are in this 90GB index? You might not need 
to shard at all.

Why are you sending a query with a trailing wildcard? Are you matching the 
prefix of words, for query completion? If so, look at the suggester, which is 
designed to solve exactly that. Or you can use the EdgeNgramFilter to index 
prefixes. That will make your index larger, but prefix searches will be very 
fast.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 2, 2015, at 5:17 AM, Toke Eskildsen  wrote:
> 
> On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
> 
>> The query q=network se* is quick enough in our system too. It takes
>> around 3-4 seconds for around 8 million records.
>> 
>> The problem is with the same query as phrase. q="network se*".
> 
> I misunderstood your query then. I tried replicating it with
> q="der se*"
> 
> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
> 22&wt=json&indent=true&facet=false&group=true&group.field=domain
> 
> gets expanded to
> 
> parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
> svane* | description:\"kan svane\")) ())/no_coord"
> 
> The result was 1,043,258,271 hits in 15,211 ms
> 
> 
> Interestingly enough, a search for 
> q="kan svane*"
> resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
> billion+ documents. On that note,
> q=se*
> resulted in -951812427 hits in 194,276 ms.
> 
> Now this is interesting. The negative number seems to be caused by
> grouping, but I finally got the response time up in the minutes. Still
> no memory problems though. Hits without grouping were 3,343,154,869.
> 
> For comparison,
> q=http
> resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
> count was 7,062,516,538. Twice the hits of 'se*' in half the time.
> 
>> I changed my SolrCloud setup from 12 shard to 8 shard and given each
>> shard 30 GB of RAM on the same machine with same index size
>> (re-indexed) but could not see the significant improvement for the
>> query given.
> 
> Strange. I would have expected the extra free memory for disk space to
> help performance.
> 
>> Also can you please share your experiences with respect to RAM, GC,
>> solr cache setup etc as it seems by your comment that the SolrCloud
>> environment you have is kind of similar to the one I work on?
>> 
> There is a short write up at
> https://sbdevel.wordpress.com/net-archive-search/
> 
> - Toke Eskildsen, State and University Library, Denmark
> 
> 
> 



Re: Very high memory and CPU utilization.

2015-11-02 Thread Walter Underwood
One rule of thumb for Solr is to shard after you reach 100 million documents. 
With large documents, you might want to shard sooner.

We are running an unsharded index of 7 million documents (55GB) without 
problems.

The EdgeNgramFilter generates a set of prefix terms for each term in the 
document. For the term “secondary”, it would generate:

s
se
sec
seco
secon
second
seconda
secondar
secondary

Obviously, this makes the index larger. But it makes prefix match a simple 
lookup, without needing wildcards.
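
A field type sketch for that, assuming the ngrams are only wanted at index time 
(the min/max gram sizes are just example values):

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>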

Again, we can help you more if you describe what you are trying to do.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 2, 2015, at 9:39 PM, Modassar Ather  wrote:
> 
> Thanks Walter for your response,
> 
> It is around 90GB of index (around 8 million documents) on one shard and
> there are 12 such shards. As per my understanding the sharding is required
> for this case. Please help me understand if it is not required.
> 
> We have requirements where we need full wild card support to be provided to
> our users.
> I will try using EdgeNgramFilter. Can you please help me understand if
> EdgeNgramFilter can be a replacement of wild cards?
> There are situations where the words may be extended with some special
> characters e.g. For se* there can be a match secondary-school which also
> needs to be considered.
> 
> Regards,
> Modassar
> 
> 
> 
> On Mon, Nov 2, 2015 at 10:17 PM, Walter Underwood 
> wrote:
> 
>> To back up a bit, how many documents are in this 90GB index? You might not
>> need to shard at all.
>> 
>> Why are you sending a query with a trailing wildcard? Are you matching the
>> prefix of words, for query completion? If so, look at the suggester, which
>> is designed to solve exactly that. Or you can use the EdgeNgramFilter to
>> index prefixes. That will make your index larger, but prefix searches will
>> be very fast.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 2, 2015, at 5:17 AM, Toke Eskildsen 
>> wrote:
>>> 
>>> On Mon, 2015-11-02 at 17:27 +0530, Modassar Ather wrote:
>>> 
>>>> The query q=network se* is quick enough in our system too. It takes
>>>> around 3-4 seconds for around 8 million records.
>>>> 
>>>> The problem is with the same query as phrase. q="network se*".
>>> 
>>> I misunderstood your query then. I tried replicating it with
>>> q="der se*"
>>> 
>>> http://rosalind:52300/solr/collection1/select?q=%22der+se*%
>>> 22&wt=json&indent=true&facet=false&group=true&group.field=domain
>>> 
>>> gets expanded to
>>> 
>>> parsedquery": "(+DisjunctionMaxQuery((content_text:\"kan svane\" |
>>> author:kan svane* | text:\"kan svane\" | title:\"kan svane\" | url:kan
>>> svane* | description:\"kan svane\")) ())/no_coord"
>>> 
>>> The result was 1,043,258,271 hits in 15,211 ms
>>> 
>>> 
>>> Interestingly enough, a search for
>>> q="kan svane*"
>>> resulted in 711 hits in 12,470 ms. Maybe because 'kan' alone matches 1
>>> billion+ documents. On that note,
>>> q=se*
>>> resulted in -951812427 hits in 194,276 ms.
>>> 
>>> Now this is interesting. The negative number seems to be caused by
>>> grouping, but I finally got the response time up in the minutes. Still
>>> no memory problems though. Hits without grouping were 3,343,154,869.
>>> 
>>> For comparison,
>>> q=http
>>> resulted in -1527418054 hits in 87,464 ms. Without grouping the hit
>>> count was 7,062,516,538. Twice the hits of 'se*' in half the time.
>>> 
>>>> I changed my SolrCloud setup from 12 shard to 8 shard and given each
>>>> shard 30 GB of RAM on the same machine with same index size
>>>> (re-indexed) but could not see the significant improvement for the
>>>> query given.
>>> 
>>> Strange. I would have expected the extra free memory for disk space to
>>> help performance.
>>> 
>>>> Also can you please share your experiences with respect to RAM, GC,
>>>> solr cache setup etc as it seems by your comment that the SolrCloud
>>>> environment you have is kind of similar to the one I work on?
>>>> 
>>> There is a short write up at
>>> https://sbdevel.wordpress.com/net-archive-search/
>>> 
>>> - Toke Eskildsen, State and University Library, Denmark
>>> 
>>> 
>>> 
>> 
>> 



Re: Boosting a document score when advertised! Please help!

2015-11-05 Thread Walter Underwood
The elevation component will be a ton of manual work. Instead, use edismax and 
the boost parameter.

Add a field that is true for paid documents, then boost for paid:true. It might 
be easier to use a boost query (bq) to do this. The extra boost will be a 
tiebreaker for documents that would have the same score.

Use this in your solrconfig.xml:

<str name="bq">paid:true</str>

You can add weight to that if it isn’t boosting the paid content enough. Like 
this:

<str name="bq">paid:true^8</str>

It is slightly better to do this with the boost parameter and a function query, 
because that bypasses idf, but I think this approach is nice and clear.
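
For trying it out without editing solrconfig.xml, the same boost can go straight 
on the request (collection and field names here are invented):

  /solr/recipes/select?q=chocolate+cake+with+hazelnuts&defType=edismax&qf=name+ingredients&bq=paid:true^8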

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 5, 2015, at 3:33 AM, Alessandro Benedetti  
> wrote:
> 
> Hi Christian,
> there are several ways :
> 
> 1) Elevation query component - it should be your winner :
> https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component
> 
> 2) Play with boosting according to your requirements
> 
> Cheers
> 
> On 5 November 2015 at 10:52,  wrote:
> 
>> Hi everyone,I'm building a food recipe search engine based on solr.
>> 
>> I need to boost documents score for the recipes that their authors paid
>> for in order to have them returned first when somebody searches for
>> "chocolate cake with hazelnuts". So those recipes that match the query
>> terms and their authors paid to be listed first need to be returned first,
>> ahead of the unpaid ones that match the query.
>> 
>> How do I do that in Solr?
>> PLEASE HELP!
>> Regards,
>> Christian
>> 
>> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England



Re: Is it impossible to update an index that is undergoing an optimize?

2015-11-06 Thread Walter Underwood
It is pretty handy, though. Great for expunging docs that are marked deleted or 
are expired.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2015, at 5:31 PM, Alexandre Rafalovitch  wrote:
> 
> Elasticsearch removed deleteByQuery from the core all together.
> Definitely an outlier :-)
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
> 
> 
> On 6 November 2015 at 20:18, Yonik Seeley  wrote:
>> On Wed, Nov 4, 2015 at 3:36 PM, Shawn Heisey  wrote:
>>> The specific index update that fails during the optimize is the SolrJ
>>> deleteByQuery call.
>> 
>> deleteByQuery may be the outlier here... we have to jump through extra
>> hoops internally because we don't know which documents it will affect.
>> Normal adds and deletes should proceed in parallel though.
>> 
>> -Yonik



Re: Best way to track cumulative GC pauses in Solr

2015-11-13 Thread Walter Underwood
Also, what GC settings are you using? We may be able to make some suggestions.

Cumulative GC pauses aren’t very interesting to me. I’m more interested in the 
longest ones, 90th percentile, 95th, etc.
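
If GC logging is not already turned on, these Oracle JVM flags (Java 7/8 era; 
the log path is only an example) produce a log that records each pause, which 
makes it easy to pull out the longest ones and the percentiles:

  -Xloggc:/var/log/solr/gc.log
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime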

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 13, 2015, at 8:32 AM, Shawn Heisey  wrote:
> 
> On 11/13/2015 8:00 AM, Tom Evans wrote:
>> We have some issues with our Solr servers spending too much time
>> paused doing GC. From turning on gc debug, and extracting numbers from
>> the GC log, we're getting an idea of just how much of a problem.
> 
> Try loading your gc log into gcviewer.
> 
> https://github.com/chewiebug/GCViewer/releases
> 
> Here's a screenshot of this in action with a gc log from Solr loaded:
> 
> https://www.dropbox.com/s/orwt0fcmii5691l/solr-gc-gcviewer-1.35-snapshot.png?dl=0
> 
> This screenshot is from a snapshot build including a feature request
> that I made:
> 
> https://github.com/chewiebug/GCViewer/issues/139
> 
> If you use the 1.34.1 version, you will not see some of the numbers
> shown in my screenshot, but the info you asked for, accumulated GC
> pauses, IS included in that version.
> 
> Thanks,
> Shawn
> 



Re: Solr logging in local time

2015-11-16 Thread Walter Underwood
I’m sure it is possible, but think twice before logging in local time. Do you 
really want one day with 23 hours and one day with 25 hours each year?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 16, 2015, at 8:04 AM, tedsolr  wrote:
> 
> Is it possible to define a timezone for Solr so that logging occurs in local
> time? My logs appear to be in UTC. Due to daylight savings, I don't think
> defining a GMT offset in the log4j.properties files will work.
> 
> thanks! Ted
> v. 5.2.1
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-logging-in-local-time-tp4240369.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Boost non stemmed keywords (KStem filter)

2015-11-19 Thread Walter Underwood
That is the approach I’ve been using for years. Simple and effective.

It probably makes the index bigger. Make sure that only one of the fields is 
stored, because the stored text will be exactly the same in both.
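
A minimal schema sketch of that, reusing the field names from your mail (the 
type names are invented; plug in whatever stemmed and unstemmed analysis chains 
you already have), with only the stemmed field stored:

  <field name="text_stem" type="text_stemmed" indexed="true" stored="true"/>
  <field name="text_no_stem" type="text_unstemmed" indexed="true" stored="false"/>
  <copyField source="text_stem" dest="text_no_stem"/>

Then list both in qf with the unstemmed field weighted higher, as in your example.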

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 1:47 PM, Ahmet Arslan  wrote:
> 
> Hi,
> 
> I wonder about using two fields (text_stem and text_no_stem) and applying 
> query time boost
> text_stem^0.3 text_no_stem^0.6
> 
> What is the advantage of keyword repeat/paylad approach compared with this 
> one?
> 
> Ahmet
> 
> 
> On Thursday, November 19, 2015 10:24 PM, Markus Jelsma 
>  wrote:
> Hello Jan - i have no code i can show but we are using it to power our search 
> servers. You are correct, you need to deal with payloads at query time as 
> well. This means you need a custom similarity but also customize your query 
> parser to rewrite queries to payload supported types. This is also not very 
> hard, some ancient examples can still be found on the web. But you also need 
> to copy over existing TokenFilters to emit payloads whenever you want. 
> Overriding TokenFilters is usually impossible due to crazy private members (i 
> still cannot figure out why so many parts are private..)
> 
> It can be very powerful, especially if you do not use payloads to contain 
> just a score. But instead to carry a WORD_TYPE, such as stemmed, unstemmed 
> but also stopwords, acronyms, compound and subwords, headings or normal text 
> but also NER types (which we don't have yet). For this to work you just need 
> to treat the payload as a bitset for different types so you can have really 
> tuneable scoring at query time via your similarity. Unfortunately, payloads 
> can only carry a relative small amount of bits :)
> 
> M.
> 
> -Original message-
>> From:Jan Høydahl 
>> Sent: Thursday 19th November 2015 14:30
>> To: solr-user@lucene.apache.org
>> Subject: Re: Boost non stemmed keywords (KStem filter)
>> 
>> Do you have a concept code for this? Don’t you also have to hack your query 
>> parser, e.g. dismax, to use other Query objects supporting payloads?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> 18. nov. 2015 kl. 22.24 skrev Markus Jelsma :
>>> 
>>> Hi - easiest approach is to use KeywordRepeatFilter and 
>>> RemoveDuplicatesTokenFilter. This creates a slightly higher IDF for 
>>> unstemmed words which might be just enough in your case. We found it not to 
>>> be enough, so we also attach payloads to signify stemmed words amongst 
>>> others. This allows you to decrease score for stemmed words at query time 
>>> via your similarity impl.
>>> 
>>> M.
>>> 
>>> 
>>> 
>>> -Original message-
>>>> From:bbarani 
>>>> Sent: Wednesday 18th November 2015 22:07
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Boost non stemmed keywords (KStem filter)
>>>> 
>>>> Hi,
>>>> 
>>>> I am using KStem factory for stemming. This stemmer converts 'france to
>>>> french', 'chinese to china' etc.. I am good with this stemming but I am
>>>> trying to boost the results that contain the original term compared to the
>>>> stemmed terms. Is this possible?
>>>> 
>>>> Thanks,
>>>> Learner
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context: 
>>>> http://lucene.472066.n3.nabble.com/Boost-non-stemmed-keywords-KStem-filter-tp4240880.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>> 
>> 
>> 



Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
With one field in qf for a single-term query, Solr is fetching one posting 
list. With 1500 fields, it is fetching 1500 posting lists. It could easily be 
1500 times slower.

It might be even slower than that, because we can’t guarantee that a) every 
algorithm in Solr is linear, or b) all those lists will fit in memory.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 3:46 PM, Steven White  wrote:
> 
> Hi everyone
> 
> What is considered too many fields for qf and fq?  On average I will have
> 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
> (I have to check with the design) for qf, if I cut it down to 1 field, will
> I see noticeable performance improvement?  It will take a lot of effort to
> test this which is why I'm asking first.
> 
> As is, I'm seeing 2-5 sec response time for searches on an index of 1
> million records with total index size (on disk) of 4 GB.  I gave Solr 2 GB
> of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1 GB.
> 
> Thanks in advance
> 
> Steve



Re: Number of fields in qf & fq

2015-11-19 Thread Walter Underwood
The implementation for fq has changed from 4.x to 5.x, so I’ll let someone else 
answer that in detail.

In 4.x, the result of each filter query can be cached. After that, they are 
quite fast.
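
The cache in question is the filterCache in solrconfig.xml; each distinct fq 
gets one entry. Something like the stock settings:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>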

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 19, 2015, at 3:59 PM, Steven White  wrote:
> 
> Thanks Walter.  I see your point.  Does this apply to fq as will?
> 
> Also, how does one go about debugging performance issues in Solr to find
> out where time is mostly spent?
> 
> Steve
> 
> On Thu, Nov 19, 2015 at 6:54 PM, Walter Underwood 
> wrote:
> 
>> With one field in qf for a single-term query, Solr is fetching one posting
>> list. With 1500 fields, it is fetching 1500 posting lists. It could easily
>> be 1500 times slower.
>> 
>> It might be even slower than that, because we can’t guarantee that: a)
>> every algorithm in Solr is linear, b) that all those lists will fit in
>> memory.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 19, 2015, at 3:46 PM, Steven White  wrote:
>>> 
>>> Hi everyone
>>> 
>>> What is considered too many fields for qf and fq?  On average I will have
>>> 1500 fields in qf and 100 in fq (all of which are OR'ed).  Assuming I can
>>> (I have to check with the design) for qf, if I cut it down to 1 field,
>> will
>>> I see noticeable performance improvement?  It will take a lot of effort
>> to
>>> test this which is why I'm asking first.
>>> 
>>> As is, I'm seeing 2-5 sec response time for searches on an index of 1
>>> million records with total index size (on disk) of 4 GB.  I gave Solr 2
>> GB
>>> of RAM (also tested at 4 GB) in both cases Solr didn't use more then 1
>> GB.
>>> 
>>> Thanks in advanced
>>> 
>>> Steve
>> 
>> 



Re: Setting up Solr on multiple machines

2015-11-29 Thread Walter Underwood
Why would that link answer the question?

Each Solr node connects to one Zookeeper node at a time. If that node goes down, 
Zookeeper is still available, but the Solr node will need to connect to a new 
Zookeeper node.

Specifying only one zk node is a single point of failure. If that node goes 
down, Solr cannot continue operating. 

Specifying a list of all the zk nodes is robust. If one goes down, it tries 
another.
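
For example (hostnames invented), pointing Solr at the whole ensemble instead 
of a single node:

  bin/solr start -c -z zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181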

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 29, 2015, at 12:19 PM, Don Bosco Durai  wrote:
> 
> This should answer your question: 
> https://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html#sc_designGoals
> 
> On 11/29/15, 12:04 PM, "Salman Ansari"  wrote:
> 
>> my point is that what is the exact difference between the whole list and
>> one zookeeper? Moreover, I think this issue is related to Windows command
>> as mentioned here
>> http://stackoverflow.com/questions/28837827/solr-5-0-unable-to-start-solr-with-zookeeper-ensemble
>> 
>> 
>> On Sun, Nov 29, 2015 at 10:55 PM, Don Bosco Durai  wrote:
>> 
>>> It is highly recommended to list all, but for testing, you might be able
>>> to get away giving only one.
>>> 
>>> If the list doesn’t work, then you might even want to look into zookeeper
>>> and see whether they are setup properly.
>>> 
>>> Bosco
>>> 
>>> On 11/29/15, 11:51 AM, "Salman Ansari"  wrote:
>>> 
>>>> but the point is: do I really need to list all the zookeepers in the
>>>> ensemble when starting solr or I can just specify one of them?
>>>> 
>>>> On Sun, Nov 29, 2015 at 10:45 PM, Don Bosco Durai 
>>> wrote:
>>>> 
>>>>> You might want to check the logs for why solr is not starting up.
>>>>> 
>>>>> 
>>>>> Bosco
>>>>> 
>>>>> 
>>>>> On 11/29/15, 11:30 AM, "Salman Ansari"  wrote:
>>>>> 
>>>>>> Thanks for your reply.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Actually I am following the official guide to start solr using (on
>>> Windows
>>>>>> machines)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> bin/solr start -e cloud -z zk1:2181,zk2:2182,zk3:2183
>>>>>> 
>>>>>> (it is listed here
>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
>>>>>> )
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> However, I am facing 2 issues
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 1) If I specify the full list of ensemble (even with quotes around -z
>>>>>> "zk1:2181,zk2:2182,zk3:2183") it does not start Solr on port 8983
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 2) Then I tried the workaround, which is specifying "localhost" on each
>>>>>> Solr server to consult its local Zookeeper instance that is part of the
>>>>>> ensemble, which worked as follows
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> bin/solr start -e cloud -z localhost:2181(on each machine that has
>>>>>> zookeeper as well)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I followed the wizard (on each machine) to create 2 shards on 2 ports
>>> and
>>>>> 2
>>>>>> replicas. For the first machine I created "test" collection, but for
>>> the
>>>>>> second one I just reused the same collection. Now, Solr works on both
>>>>>> machines but the issue is that when I see Solr admin page, it shows all
>>>>> the
>>>>>> shards and replicas of the collection on ONE MACHINE.
>>>>>> 
>>>>>> 
>>>>>> Any ideas why I am facing these issues?
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Salman
>>>>>> 
>>>>>> On Sun, Nov 29, 2015 at 10:07 PM, Erick Erickson <
>>> erickerick...@gmail.com
>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> 1> 

Re: Setting up Solr on multiple machines

2015-11-29 Thread Walter Underwood
Connecting to one Zookeeper node is fine. Until that node fails. Then what does 
Solr do for cluster information?

The entire point of Zookeeper is to share that information in a reliable, 
fault-tolerant way. Solr can talk to any Zookeeper node and get the same 
information.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 29, 2015, at 2:36 PM, Salman Ansari  wrote:
> 
> Correct me if I am wrong but my understanding is that even connecting to
> one zookeeper should be enough as internally that zookeeper will sync Solr
> server info to other zookeepers in the ensemble (as long as that zookeeper
> belongs to an ensemble). Having said that, if that particular zookeeper
> goes down, another one from the ensemble should be able to serve the Solr
> instance.
> 
> What made me even more leaning towards this understanding is that I tried
> connecting 2 different solr instances to 2 different zookeepers (but both
> belong to the same ensemble) and I realized both Solr servers can see each
> other. I guess that does explain somehow that zookeepers are sharing solr
> servers information among the ensemble.
> 
> Regards,
> Salman
> 
> On Mon, Nov 30, 2015 at 1:07 AM, Walter Underwood 
> wrote:
> 
>> Why would that link answer the question?
>> 
>> Each Solr connects to one Zookeeper node. If that node goes down,
>> Zookeeper is still available, but the node will need to connect to a new
>> node.
>> 
>> Specifying only one zk node is a single point of failure. If that node
>> goes down, Solr cannot continue operating.
>> 
>> Specifying a list of all the zk nodes is robust. If one goes down, it
>> tries another.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 29, 2015, at 12:19 PM, Don Bosco Durai  wrote:
>>> 
>>> This should answer your question:
>> https://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html#sc_designGoals
>>> 
>>> On 11/29/15, 12:04 PM, "Salman Ansari"  wrote:
>>> 
>>>> my point is that what is the exact difference between the whole list and
>>>> one zookeeper? Moreover, I think this issue is related to Windows
>> command
>>>> as mentioned here
>>>> 
>> http://stackoverflow.com/questions/28837827/solr-5-0-unable-to-start-solr-with-zookeeper-ensemble
>>>> 
>>>> 
>>>> On Sun, Nov 29, 2015 at 10:55 PM, Don Bosco Durai 
>> wrote:
>>>> 
>>>>> It is highly recommended to list all, but for testing, you might be
>> able
>>>>> to get away giving only one.
>>>>> 
>>>>> If the list doesn’t work, then you might even want to look into
>> zookeeper
>>>>> and see whether they are setup properly.
>>>>> 
>>>>> Bosco
>>>>> 
>>>>> On 11/29/15, 11:51 AM, "Salman Ansari" 
>> wrote:
>>>>> 
>>>>>> but the point is: do I really need to list all the zookeepers in the
>>>>>> ensemble when starting solr or I can just specify one of them?
>>>>>> 
>>>>>> On Sun, Nov 29, 2015 at 10:45 PM, Don Bosco Durai 
>>>>> wrote:
>>>>>> 
>>>>>>> You might want to check the logs for why solr is not starting up.
>>>>>>> 
>>>>>>> 
>>>>>>> Bosco
>>>>>>> 
>>>>>>> 
>>>>>>> On 11/29/15, 11:30 AM, "Salman Ansari" 
>> wrote:
>>>>>>> 
>>>>>>>> Thanks for your reply.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Actually I am following the official guide to start solr using (on
>>>>> Windows
>>>>>>>> machines)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> bin/solr start -e cloud -z zk1:2181,zk2:2182,zk3:2183
>>>>>>>> 
>>>>>>>> (it is listed here
>>>>>>>> 
>>>>>>> 
>>>>> 
>> https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
>>>>>>>> )
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> However, I am facing 2 issues
>>>>>>

Re: fuzzy searches and EDISMAX

2015-12-08 Thread Walter Underwood
You probably want to apply the patch for SOLR-629. We have this in production 
at Chegg. I’ve been trying to get this feature added to Solr for seven years. 
Not sure why it never gets approved.

https://issues.apache.org/jira/browse/SOLR-629 
<https://issues.apache.org/jira/browse/SOLR-629>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 8, 2015, at 9:56 AM, Felley, James  wrote:
> 
> I am trying to build an edismax search handler that will allow a fuzzy 
> search, using the "query fields" property (qf).
> 
> I have two instances of SOLR 4.8.1, one of which has edismax "qf" configured 
> with no fuzzy search
> ...
> ns_name^3.0  i_topic^3.0  i_object_type^3.0
> 
> ...
> And the other with a fuzzy search for ns_name (non-stemmed name)
> ns_name~1^3.0  i_topic^3.0  i_object_type^3.0
> 
> ...
> 
> The index of both includes a record with an ns_name of 'Johnson'
> 
> I get no return in either instance with the query
> q=Johnso
> 
> I get the Johnson record returned in both instances with a query of
> q=Johnso~1
> 
> The SOLR documentation seems silent on incorporating fuzzy searches in the 
> query fields.  I have seen various posts on Google that suggest that 'qf' 
> will accept fuzzy search declarations, other posts suggest only the query 
> itself will allow fuzzy searches (as seems to be the case for me).
> 
> Any guidance will be much appreciated
> 
> Jim
> 
> Jim Felley
> OCIO
> Smithsonian Institution
> fell...@si.edu
> 
> 
> 
> 



Re: Long Running Data Import Handler - Notifications

2015-12-08 Thread Walter Underwood
Not that I know of. I wrote a script to check the status and sleep until done. 
Like this:

# Status URL of the dataimport handler for the core being indexed
SOLRURL='http://solr-master.prod2.cloud.cheggnet.com:6090/solr/textbooks/dataimport'

# Poll every five minutes until the handler reports "status":"idle"
while : ; do
    echo `date` checking whether Solr indexing is finished
    curl -s "${SOLRURL}" | fgrep '"status":"idle"' > /dev/null
    [ $? -ne 0 ] || break
    sleep 300
done

echo Solr indexing is finished

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 8, 2015, at 5:37 PM, Brian Narsi  wrote:
> 
> Is there a way to receive notifications when a Data Import Handler finishes
> up and whether it succeeded or failed. (typically runs about an hour)
> 
> Thanks



Re: Unstructured/Structured data for indexing

2015-12-09 Thread Walter Underwood
Often Solr documents are “semi-structured”. They have some structured fields 
and some free-text fields. e-mail messages are like that, with structured 
headers and an unstructured body.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 9, 2015, at 4:13 AM, Alexandre Rafalovitch  wrote:
> 
> Don't think about indexing so much, think about searching.
> 
> Say you are searching a video? What does that mean? Do you want to
> match random sequence of binary values that represent inter-frame
> change? Probably not. When you answer what you want to actually search
> (title? length? subscripts?), you will discover that structure. What
> do you want to return? A whole video, a segment, a description with a
> link?
> 
> So, you pre-process/index your data to give you the things you want to
> search for and in the form you want them to receive.
> 
> Regards,
>   Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
> 
> 
> On 9 December 2015 at 03:09, subinalex  wrote:
>> Hi,
>> 
>> I am a solr newbie,just got a quick question.
>> 
>> SOLR is designed for querying unstructured data, but then why do we have to send
>> it in a structured form (JSON, XML) for indexing?
>> 
>> Thanks & Regards,S
>> Subin
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Unstructured-Structured-data-for-indexing-tp4244406.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Committed before 500

2015-02-20 Thread Walter Underwood
Since you are getting these failures, the 90 second timeout is not “good 
enough”. Try increasing it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 20, 2015, at 5:22 AM, NareshJakher  wrote:

> Hi Shawn,
> 
> I do not want to increase timeout as these errors are very few. Also current 
> timeout of 90 seconds is good enough.  Is there a way to find why Solr is 
> getting timed-out ( at times ), could it be that Solr is busy doing other 
> activities like re-indexing, commits etc.
> 
> Additionally I also found that some of non-leader node move to recovering or 
> recovery failed after these time out errors. I am just wondering if these are 
> related to performance issue and Solr commits needs to be controlled.
> 
> Regards,
> Naresh Jakher
> 
> From: Shawn Heisey-2 [via Lucene] 
> [mailto:ml-node+s472066n4187382...@n3.nabble.com]
> Sent: Thursday, February 19, 2015 8:12 PM
> To: Jakher, Naresh
> Subject: Re: Committed before 500
> 
> On 2/19/2015 6:30 AM, NareshJakher wrote:
> 
>> I am using Solr cloud with 3 nodes, at times following error is observed in
>> logs during delete operation. Is it a performance issue ? What can be done
>> to resolve this issue
>> 
>> "Committed before 500 {msg=Software caused connection abort: socket write
>> error,trace=org.eclipse.jetty.io.EofException"
>> 
>> I did search on old topics but couldn't find anything concrete related to
>> Solr cloud. Would appreciate any help on the issues as I am relatively new
>> to Solr.
> 
> A jetty EofException indicates that one specific thing is happening:
> 
> The TCP connection from the client was severed before Solr responded to
> the request.  Usually this happens because the client has been
> configured with an absolute timeout or an inactivity timeout, and the
> timeout was reached.
> 
> Configuring timeouts so that you can be sure clients don't get stuck is
> a reasonable idea, but any configured timeouts should be VERY long.
> You'd want to use a value like five minutes, rather than 10, 30, or 60
> seconds.
> 
> The timeouts MIGHT be in the HttpShardHandler config that Solr and
> SolrCloud use for distributed searches, and they also might be in
> operating-system-level config.
> 
> https://wiki.apache.org/solr/SolrConfigXml?highlight=%28HttpShardHandler%29#Configuration_of_Shard_Handlers_for_Distributed_searches
> 
> Thanks,
> Shawn
> 
> 
> 
> If you reply to this email, your message will be added to the discussion 
> below:
> http://lucene.472066.n3.nabble.com/Committed-before-500-tp4187361p4187382.html
> To unsubscribe from Committed before 500, click 
> here<http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4187361&code=bmFyZXNoLmpha2hlckBjYXBnZW1pbmkuY29tfDQxODczNjF8NzQ0MTczNzc0>.
> NAML<http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> This message contains information that may be privileged or confidential and 
> is the property of the Capgemini Group. It is intended only for the person to 
> whom it is addressed. If you are not the intended recipient, you are not 
> authorized to read, print, retain, copy, disseminate, distribute, or use this 
> message or any part thereof. If you receive this message in error, please 
> notify the sender immediately and delete all copies of this message.
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Committed-before-500-tp4187361p4187601.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
The HTTP protocol does not set a limit on GET URL size, but individual web 
servers usually do. You should get a response code of “414 Request-URI Too 
Long” when the URL is too long.

This limit is usually configurable.
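
In the Jetty that ships with Solr, for example, the knob is requestHeaderSize 
on the connector configuration in jetty.xml (value in bytes; 8192 is a common 
default):

  <Set name="requestHeaderSize">65536</Set>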

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 21, 2015, at 12:46 AM, steve  wrote:

> Careful with the GETs! There is a real, hard limit on the length of a GET url 
> (in the low hundreds of characters). That's why a POST is so much better for 
> complex queries; the limit is in the hundreds of MegaBytes.
> 
>> Date: Sat, 21 Feb 2015 01:42:03 -0700
>> From: osta...@gmail.com
>> To: solr-user@lucene.apache.org
>> Subject: Re: Performing DIH on predefined list of IDS
>> 
>> Yes,  you right,  I am not using a DB. 
>> SolrEntityProcessor is using a GET method,  so I will need to send
>> relatively big URL ( something like a hundreds of ids ) hope it will be
>> possible. 
>> 
>> Any way I think it is the only method to perform reindex if I want to
>> control it and be able to continue from any point in case of failure.  
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589p4187835.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Performing DIH on predefined list of IDS

2015-02-21 Thread Walter Underwood
Am I an expert? Not sure, but I worked on an enterprise search spider and search 
engine for about a decade (Ultraseek Server) and I’ve done customer-facing 
search for another 6+ years.

Let the server reject URLs it cannot handle. Great servers will return a 414, 
good servers will return a 400, broken servers will return a 500, and crapulous 
servers will hang. In nearly all cases, you’ll get a fast fail which won’t hurt 
other users of the site.

Manage your site for zero errors, so you can fix the queries that are too long.

At Chegg, we have people paste entire homework problems into the search for 
homework solutions, and, yes, we have a few queries longer than 8K. But we deal 
with it gracefully.

Never do POST for a read-only request. Never. That only guarantees that you 
cannot reproduce the problem by looking at the logs.

If your design requires extremely long GET requests, you may need to re-think 
your design.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 21, 2015, at 4:45 PM, Shawn Heisey  wrote:

> On 2/21/2015 1:46 AM, steve wrote:
>> Careful with the GETs! There is a real, hard limit on the length of a GET 
>> url (in the low hundreds of characters). That's why a POST is so much better 
>> for complex queries; the limit is in the hundreds of MegaBytes.
> 
> The limit on a GET command (including the GET itself and the protocol
> specifier (usually HTTP/1.1) is normally 8K, or 8192 bytes.  That's the
> default value in Jetty, at least.
> 
> A question for the experts:  Would it be a good idea to force a POST
> request in SolrEntityProcessor?  It may be dealing with parameters that
> have been sent via POST and may exceed the header size limit.
> 
> Thanks,
> Shawn
> 



Re: syntax for increasing java memory

2015-02-23 Thread Walter Underwood
That depends on the JVM you are using. For the Oracle JVMs, use this to get a 
list of extended options:

java -X
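
If the goal is simply a bigger heap, the usual patterns look like these (the 2g 
figures are only an example; the second form needs the Solr 4.10+/5.x bin/solr 
script):

  java -Xms2g -Xmx2g -jar start.jar
  bin/solr start -m 2g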

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 23, 2015, at 8:21 AM, Kevin Laurie  wrote:

> Hi Guys,
> I am a newbie on Solr and I am just using it for dovecot sake.
> Could you help advise the correct syntax to increase java heap size using
> the  -xmx option(or advise some easy-to-read literature for configuring) ?
> Much appreciate if you could help. I just need this to sort out the problem
> with my Dovecot FTS.
> Thanks
> Kevin



Re: Basic Multilingual search capability

2015-02-23 Thread Walter Underwood
It isn’t just complicated, it can be impossible.

Do you have content in Chinese or Japanese? Those languages (and some others) 
do not separate words with spaces. You cannot even do word search without a 
language-specific, dictionary-based parser.

German is space separated, except many noun compounds are not space-separated.

Do you have Finnish content? Entire prepositional phrases turn into word 
endings.

Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily 
inflected, you can kind of do OK with a language-insensitive approach. But it 
hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, 
etc). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran  wrote:

> Hi Alex,
> 
> There is no specific language list.  
> For example: the documents that needs to be indexed are emails or any 
> messages for a global customer base. The messages back and forth could be in 
> any language or mix of languages.
> 
> I understand relevancy, stemming etc becomes extremely complicated with 
> multilingual support, but our first goal is to be able to tokenize and 
> provide basic search capability for any language. Ex: When the document 
> contains hello or здравствуйте, the analyzer creates tokens and provides 
> exact match search results.
> 
> Now it would be great if it had capability to tokenize email addresses 
> (ex:he...@aol.com- i think standardTokenizer already does this),  filenames 
> (здравствуйте.pdf), but maybe we can use filters to accomplish that. 
> 
> Thanks,
> Rishi.
> 
> -Original Message-
> From: Alexandre Rafalovitch 
> To: solr-user 
> Sent: Mon, Feb 23, 2015 5:49 pm
> Subject: Re: Basic Multilingual search capability
> 
> 
> Which languages are you expecting to deal with? Multilingual support
> is a complex issue. Even if you think you don't need much, it is
> usually a lot more complex than expected, especially around relevancy.
> 
> Regards,
>   Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
> 
> 
> On 23 February 2015 at 16:19, Rishi Easwaran  wrote:
>> Hi All,
>> 
>> For our use case we don't really need to do a lot of manipulation of 
>> incoming 
> text during index time. At most removal of common stop words, tokenize 
> emails/ 
> filenames etc if possible. We get text documents from our end users, which 
> can 
> be in any language (sometimes combination) and we cannot determine the 
> language 
> of the incoming text. Language detection at index time is not necessary.
>> 
>> Which analyzer is recommended to achive basic multilingual search capability 
> for a use case like this.
>> I have read a bunch of posts about using a combination standardtokenizer or 
> ICUtokenizer, lowercasefilter and reverwildcardfilter factory, but looking 
> for 
> ideas, suggestions, best practices.
>> 
>> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
>> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
>> https://issues.apache.org/jira/browse/SOLR-6492
>> 
>> 
>> Thanks,
>> Rishi.
>> 
> 
> 


