Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-01 Thread Daniel Angelov
Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud? For example, on host1 we have, coll1 shard1 replica1 and coll2 shard1 replica1, on host2 we have, coll1 shard2 replica2 and coll2 shard2 replica2. Does this mean,
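(For reference: each replica is a separate Solr core with its own searcher and caches, so the filter cache exists per replica; it is sized per collection in that collection's solrconfig.xml. The values below are just the stock example settings.)

```xml
<query>
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
</query>
```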

Re: Performance Issue in Streaming Expressions

2017-06-01 Thread Susmit Shukla
Hi, Which version of Solr are you on? Increasing memory may not be useful, as the streaming API does not keep stuff in memory (except maybe hash joins). Increasing replicas (not sharding) and pushing the join computation onto a worker Solr cluster with #workers > 1 would definitely make things faster.
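A worker-parallelized join looks roughly like the streaming expression below; the collection, field, and worker-collection names are made up for illustration, and qt="/export" assumes the sorted/joined fields have docValues:

```
parallel(workerColl,
  innerJoin(
    search(coll1, q="*:*", fl="id,joinKey", sort="joinKey asc",
           qt="/export", partitionKeys="joinKey"),
    search(coll2, q="*:*", fl="joinKey,price", sort="joinKey asc",
           qt="/export", partitionKeys="joinKey"),
    on="joinKey"),
  workers="4", sort="joinKey asc")
```

The partitionKeys parameter is what lets each of the 4 workers pull a disjoint slice of both streams, so the join work is spread across the worker collection.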

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Nutch was built for that, but it is a pain to use. I’m still sad that I couldn’t get Mike Lynch to open source Ultraseek. So easy and much more powerful than Nutch. Ignoring robots.txt is often a bad idea. You may get into a REST API or into a calendar that generates an unending number of

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Mike Drob
Isn't this exactly what Apache Nutch was built for? On Thu, Jun 1, 2017 at 6:56 PM, David Choi wrote: > In any case after digging further I have found where it checks for > robots.txt. Thanks! > > On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood >

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In any case after digging further I have found where it checks for robots.txt. Thanks! On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood wrote: > Which was exactly what I suggested. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Doug Turnbull
Scrapy is fantastic, and I use it to scrape search results pages for clients to take quality snapshots for relevance work. Ignoring robots.txt sometimes legitimately comes up, because a staging site might be telling Google not to crawl but doesn't care about a developer crawling for internal purposes. Doug On

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Which was exactly what I suggested. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Jun 1, 2017, at 3:31 PM, David Choi wrote: > > In the mean time I have found a better solution at the moment is to test on > a site that

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime I have found that a better solution, at the moment, is to test on a site that allows users to crawl it. On Thu, Jun 1, 2017 at 5:26 PM David Choi wrote: > I think you misunderstand the argument was about stealing content. Sorry > but I think you need to

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
I think you misunderstand; the argument was about stealing content. Sorry, but I think you need to read what people write before making bold statements. On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood wrote: > Let’s not get snarky right away, especially when you are wrong.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Let’s not get snarky right away, especially when you are wrong. Corporations do not generally ignore robots.txt. I worked on a commercial web spider for ten years. Occasionally, our customers did need to bypass portions of robots.txt. That was usually because of a poorly-maintained web server,

Performance Issue in Streaming Expressions

2017-06-01 Thread thiaga rajan
We are working on a proposal and feel that the streaming API along with the export handler will best fit our use cases. We already have a structure in Solr in which we use graph queries to produce a hierarchical structure. Now from that structure we need to join a couple more collections.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Oh well, I guess it's ok if a corporation does it but not someone wanting to learn more about the field. I actually have written a crawler before, as well as, you know, the inverted index of how Solr works, but I just thought its architecture was better suited for scaling. On Thu, Jun 1, 2017 at 4:47

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed. > On Jun 1, 2017, at 5:31 PM, David Choi wrote: > > Hello, > > I was wondering if anyone could guide me on how to crawl

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Vivek Pathak
I can help. We can chat in some freenode chatroom in an hour or so. Let me know where you hang out. Thanks Vivek On 6/1/17 5:45 PM, Dave wrote: If you are not capable of even writing your own indexing code, let alone crawler, I would prefer that you just stop now. No one is going to

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
If you are not capable of even writing your own indexing code, let alone crawler, I would prefer that you just stop now. No one is going to help you with this request, at least I'd hope not. > On Jun 1, 2017, at 5:31 PM, David Choi wrote: > > Hello, > > I was

Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello, I was wondering if anyone could guide me on how to crawl the web and ignore robots.txt, since I cannot index some big sites. Or if someone could point out how to get around it. I read somewhere about a protocol.plugin.check.robots, but that was for Nutch. The way I index is bin/post -c

Re: Solr query with more than one field

2017-06-01 Thread Chris Hostetter
: I could have sworn I was paraphrasing _your_ presentation Hoss. I : guess I did not learn my lesson well enough. : : Thank you for the correction. Trust but verify! ... we're both wrong. Boolean functions (like lt(), gt(), etc...) behave just like sum() -- they "exist" for a document if and
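Either way, for the original question (documents where value > cost) the usual recipe is a function range query rather than a bare {!func} filter:

```
fq={!frange l=0 incl=false}sub(value,cost)
```

This keeps only documents where sub(value,cost) is strictly positive, i.e. value > cost; with incl=false the lower bound 0 itself is excluded.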

Replace a solr node which is using a block storage

2017-06-01 Thread Minu Theresa Thomas
Hi, I am new to Solr. I have a use case to add a new node when an existing node goes down. The new node, with a new IP, should contain all the replicas that the previous node had. So I am using network storage (Cinder block storage) in which the data directory (where the solr.xml and the core

Re: Solr query with more than one field

2017-06-01 Thread Alexandre Rafalovitch
Bother, I could have sworn I was paraphrasing _your_ presentation Hoss. I guess I did not learn my lesson well enough. Thank you for the correction. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 1 June 2017 at 15:26, Chris Hostetter

Re: Solr query with more than one field

2017-06-01 Thread Chris Hostetter
: Because the value of the function will be treated as a relevance value : and relevance value of 0 (and less?) will cause the record to be : filtered out. I don't believe that's true? ... IIRC 'fq' doesn't care what the scores are as long as the query is a "match" and a 'func' query will match

Re: Solr query with more than one field

2017-06-01 Thread Alexandre Rafalovitch
Function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries The function would be sub(). Then you want its result mapped to an fq; it could probably be as simple as fq={!func}sub(value,cost). Because the value of the function will be treated as a relevance value and relevance

Solr query with more than one field

2017-06-01 Thread Mikhail Ibraheem
Hi, I have 2 fields, "cost" and "value", in my records. I want to get all documents that have "value" greater than "cost". Something like q=value:[cost TO *] Please advise. Thanks

DateUtil in SOLR-6

2017-06-01 Thread SOLR4189
In SOLR-4.10.1 I use DateUtil.parse in my UpdateProcessor for different datetime formats. When indexing a document the datetime format is *yyyy-MM-dd'T'HH:mm:ss'Z'*, and when reindexing a document the datetime format is *EEE MMM d hh:mm:ss z yyyy*. And it works fine. But what can I do in SOLR-6? I don't
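In Solr 6, where DateUtil is no longer part of the core API, the same multi-format parsing can be done with java.time directly. A minimal sketch, assuming the two formats from the question (the class name and pattern list are illustrative, not a drop-in DateUtil replacement):

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MultiFormatDateParser {
    // Candidate patterns tried in order; extend the list for more formats.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC),          // literal 'Z'; zone supplied by withZone
            DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss zzz yyyy", Locale.ENGLISH));

    public static ZonedDateTime parse(String value) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return ZonedDateTime.parse(value, format);
            } catch (DateTimeParseException e) {
                // not this format; fall through and try the next pattern
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + value);
    }
}
```

An UpdateProcessor would call parse() on the incoming field value and write back DateTimeFormatter.ISO_INSTANT output, which is the form Solr date fields expect.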

Re: _version_ / Versioning using timespan

2017-06-01 Thread Susheel Kumar
Which version of Solr are you using? I tested in 6.0, and if I supply the same version it overwrites/updates the document exactly as per the wiki documentation. Thanks, Susheel On Thu, Jun 1, 2017 at 7:57 AM, marotosg wrote: > Thanks a lot Susheel. > I see this is actually what I

Re: Error with polygon search

2017-06-01 Thread BenCall
Thanks! This helped me out as well.

Re: Configuration of parallel indexing threads

2017-06-01 Thread Susheel Kumar
How are you indexing currently? Are you using DIH, or SolrJ/Java? And are you indexing with multiple threads/machines simultaneously, or just one thread/machine? Thnx Susheel On Thu, Jun 1, 2017 at 11:45 AM, Erick Erickson wrote: > That's been removed in

Re: Solr Document Routing

2017-06-01 Thread Erick Erickson
Can you check if those IDs are on shard8? You can do this by pointing the URL at the core and specifying distrib=false... Best, Erick On Thu, Jun 1, 2017 at 1:42 AM, Amrit Sarkar wrote: > Sorry, The confluence link: >
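Such a per-core check looks like the request below; host, port, core name, and document id are placeholders:

```
http://localhost:8983/solr/testcollection_shard8_replica1/select?q=id:SOME_DOC_ID&distrib=false
```

With distrib=false the query is answered by that one core only, instead of being fanned out across the collection, so a hit confirms the document physically lives on that shard.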

Re: Configuration of parallel indexing threads

2017-06-01 Thread Erick Erickson
That's been removed in LUCENE-6659. I regularly max out my CPUs by having multiple _clients_ send updates simultaneously rather than trying to up the number of threads the indexing process takes. But Mike McCandless can answer authoritatively... Best, Erick On Thu, Jun 1, 2017 at 4:16 AM,

Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Erick Erickson
Well, personally I like to use SolrJ rather than DIH for both debugging ease and the reasons outlined here: https://lucidworks.com/2012/02/14/indexing-with-solrj/ FWIW Erick On Thu, Jun 1, 2017 at 7:59 AM, Josh Lincoln wrote: > I had the same issue as Vrinda and found a

Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Josh Lincoln
I had the same issue as Vrinda and found a hacky way to limit the number of times deltaImportQuery was executed. As designed, solr executes *deltaQuery* to get a list of ids that need to be indexed. For each of those it executes *deltaImportQuery*, which is typically very similar to the full
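The standard shape of that configuration, with invented table and column names, is:

```xml
<entity name="item" pk="ID"
        query="SELECT * FROM item"
        deltaQuery="SELECT ID FROM item
                    WHERE last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE ID = '${dih.delta.ID}'"/>
```

Solr runs deltaQuery once to collect the changed ids, then runs deltaImportQuery once per returned id, which is where the request count blows up when the delta is large.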

Aw: Re: Re: Facet ranges and stats

2017-06-01 Thread Per Newgro
Thank you for your offer. But I think I need to rethink the concept as a whole. I need to configure the limit in the database and use it in all appropriate places. I already have a clue how to do it - but lack the time :-) > Sent: Thursday, 01 June 2017 at 15:00 > From: "Susheel Kumar"

Re: Re: Facet ranges and stats

2017-06-01 Thread Susheel Kumar
Great that it worked out. If you want to share where and in what code you have the 90 configured, we can brainstorm whether we can simplify it to only one place. On Thu, Jun 1, 2017 at 3:16 AM, Per Newgro wrote: > Thanks for your support. > > Because the null handling is one of the

Re: _version_ / Versioning using timespan

2017-06-01 Thread marotosg
Thanks a lot Susheel. I see this is actually what I need. I have been testing it and noticed that the value of the field always has to be greater for a new document to get indexed; if you send the same version number it doesn't work. Is it possible somehow to overwrite documents with the same
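If the feature in play here is DocBasedVersionConstraintsProcessorFactory, the chain is configured in solrconfig.xml roughly as below (the version field name is illustrative; it must be a stored field in the schema):

```xml
<updateRequestProcessorChain name="external-version">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">my_version_l</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The processor compares the incoming document's version field against the stored one and rejects updates that do not pass the comparison, which matches the behavior being described in this thread.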

Configuration of parallel indexing threads

2017-06-01 Thread gigo314
During performance testing a question was raised whether Solr indexing performance could be improved by adding more concurrent index writer threads. I discovered traces of such functionality here, but I am not sure how to use it in Solr 6.2.

Re: Solr Analyzer for Vietnamese

2017-06-01 Thread Eirik Hungnes
Thanks Erick, Dat: Do you have more info about the subject? 2017-05-22 17:08 GMT+02:00 Erick Erickson : > Eirik: > > That code is 4 years old and for Lucene 4. I doubt it applies cleanly > to the current code base, but feel free to give it a try but it's not >

Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Amrit Sarkar
Erick, Thanks for the pointer. Straying from what Vrinda is looking for (sorry about that): what if there are no sub-entities, and no deltaImportQuery is passed either? I looked into the code and determined that it calculates the deltaImportQuery itself; see SQLEntityProcessor:getDeltaImportQuery(..)::126.

Re: Solr Document Routing

2017-06-01 Thread Amrit Sarkar
Sorry, The confluence link: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Amrit Sarkar Search Engineer Lucidworks, Inc. 415-589-9269 www.lucidworks.com Twitter http://twitter.com/lucidworks LinkedIn: https://www.linkedin.com/in/sarkaramrit2 On Thu, Jun 1,

Re: Solr Document Routing

2017-06-01 Thread Amrit Sarkar
Sathyam, It seems your interpretation is wrong, as CloudSolrClient calculates (hashes the document id and determines the range it belongs to) which shard the incoming document belongs to. As you have 10 shards, the document will belong to one of them; that is what is being calculated, and eventually

Solr Document Routing

2017-06-01 Thread Sathyam
Hi, I am indexing documents to a 10-shard collection (testcollection, having no replicas) in a Solr 6 cluster using CloudSolrClient. I saw that there is a lot of peer-to-peer document distribution going on when I looked at the Solr logs. An example log statement is as follows: 2017-06-01

Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread vrindavda
Thanks Erick. But how do I solve this? I tried creating a stored proc instead of a plain query, but there was no change in performance. For the delta import it is processing more documents than the total number of documents. In this case delta import is not helping at all, and I cannot switch to full import each time. This

Aw: Re: Facet ranges and stats

2017-06-01 Thread Per Newgro
Thanks for your support. Because the null handling is one of the important things, I decided to use another way. I added a script in my data import handler that decides if the object was audited: function auditComplete(row) { var total = row.get('TOTAL'); if (total == null || total < 90) {
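Spelled out as a DIH ScriptTransformer, the truncated script above would look something like this; everything after the cut (the field written and the return) is a guess:

```xml
<script><![CDATA[
  function auditComplete(row) {
    var total = row.get('TOTAL');
    // guessed continuation: flag rows below the (currently hard-coded) limit
    if (total == null || total < 90) {
      row.put('AUDIT_COMPLETE', 'false');
    } else {
      row.put('AUDIT_COMPLETE', 'true');
    }
    return row;
  }
]]></script>
```

The transformer runs per row during import; moving the 90 into the database, as planned, would mean selecting the limit in the entity query instead of hard-coding it here.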