Implement relevancy
Hi, How do I get the most relevant items at the top of the search results using Solr search? -- Rashmi Be the change that you want to see in this world!
Solr Nutch
Hi,
Question 1: Solr can parse HTML and documents such as Word, Excel, and PDF files, so why do we need Nutch to parse HTML files? What is the difference?
Question 2: When do we use multiple cores in Solr? Is there a practical business case that needs multiple cores?
Question 3: When do we go for the cloud? What does implementing SolrCloud mean?
-- Rashmi Be the change that you want to see in this world! www.minnal.zor.org disha.resolve.at www.artofliving.org
Synonyms and spellings
Hi,
Question 1: Why do we use the spellings file under the Solr core conf folder? What spellings do we enter in it?
Question 2: Implementing all synonyms is tough. Where can I get a list with as many synonyms as we see in Google search?
-- Rashmi
Re: Fwd: Search Engine Framework decision
Thanks saurish.
My office intranet is a SharePoint website. When I crawl it using Nutch, I get an Unauthorized access (404) error. The website uses an NTLM realm. I saw on a Nutch JIRA link that SharePoint can be accessed using Nutch.
Nutch has the following properties in nutch-default.xml:
http.proxy.host (should it be the intranet site path?)
http.proxy.port
http.proxy.username (should this contain the domain too?)
http.proxy.password
http.proxy.realm (should it be the domain of my desktop machine, the one I log in with? Using the same domain/username I can access the intranet from a browser)
Also, Nutch has an httpclient-auth.xml file for giving the credentials for authentication. What do I provide for these properties in nutch-site.xml? And what should the values in the httpclient-auth.xml file be?
Regards, Rashmi
On Mon, Jan 27, 2014 at 3:57 PM, saurish srinivas.oruga...@gmail.com wrote:
Hi, It looks like there is support for SharePoint as well as Windows shares in ManifoldCF. Yes, you can crawl folders with Nutch (at least I have done so on a Windows PC with a local file folder). Nutch 1.7 and Solr 4.5.1 have worked for me. Regards,
-- View this message in context: http://lucene.472066.n3.nabble.com/Fwd-Search-Engine-Framework-decision-tp4113584p4113677.html Sent from the Solr - User mailing list archive at Nabble.com.
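A note on the credentials question above, offered as a sketch and not a verified setup. The http.proxy.* properties describe a proxy server and not the site being crawled, so site credentials would normally go in conf/httpclient-auth.xml. The snippet below generates a file in the auth-configuration / credentials / authscope layout described in the Nutch documentation on HTTP authentication schemes. It assumes the protocol-httpclient plugin is enabled in plugin.includes; the host, port, user name, and password are placeholders, and whether the user name needs a domain prefix is not settled here.

```python
import xml.etree.ElementTree as ET


def build_httpclient_auth(username, password, host, port, scheme="NTLM"):
    """Return the text of a minimal httpclient-auth.xml style document.

    All argument values are placeholders supplied by the caller; the element
    and attribute names follow the layout assumed in the note above.
    """
    root = ET.Element("auth-configuration")
    credentials = ET.SubElement(
        root, "credentials", username=username, password=password
    )
    # One authscope entry per protected host; scheme selects NTLM here.
    ET.SubElement(
        credentials, "authscope", host=host, port=str(port), scheme=scheme
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for illustration only.
xml_text = build_httpclient_auth("crawler_user", "secret", "myintranet", 80)
print(xml_text)
```

The generated text would be saved as conf/httpclient-auth.xml; it should be checked against the documentation for the Nutch version in use before relying on it.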
Re: Solr Nutch
Thanks all for the quick response.
Today I crawled a web page using Nutch. The page has many links, but every anchor tag has href=# and JavaScript on its onClick event that opens a new page. So the crawler didn't crawl any of the links that open through the onClick event and have # as the href value. How can these links be crawled using Nutch?
On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko ale...@martchenko.com.br wrote:
1) Plus, those files are binaries, sometimes with metadata, and specific crawlers need to understand them. HTML is plain text.
2) Yes, different data schemas. Sometimes I replicate the same core and run some A/B tests with different weights, filters, etc. Some people like to create CoreA and CoreB with the same schema and hammer CoreA with updates, commits, and optimizes; then they make it available for searches while hammering CoreB. Then they swap again. This produces faster searches.
alexei martchenko Facebook http://www.facebook.com/alexeiramone | Linkedin http://br.linkedin.com/in/alexeimartchenko | Steam http://steamcommunity.com/id/alexeiramone/ | 4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone | Github https://github.com/alexeiramone | (11) 9 7613.0966 |
2014-01-28 Jack Krupansky j...@basetechnology.com
1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages.
2. Think of a core as an SQL table: each table/core has a different type of data.
3. SolrCloud is all about scaling and availability: multiple shards for larger collections, and multiple replicas for both scaling of query response and availability if nodes go down.
-- Jack Krupansky
-----Original Message-----
From: rashmi maheshwari
Sent: Tuesday, January 28, 2014 11:36 AM
To: solr-user@lucene.apache.org
Subject: Solr Nutch
Hi,
Question 1: Solr can parse HTML and documents such as Word, Excel, and PDF files, so why do we need Nutch to parse HTML files? What is the difference?
Question 2: When do we use multiple cores in Solr? Is there a practical business case that needs multiple cores?
Question 3: When do we go for the cloud? What does implementing SolrCloud mean?
-- Rashmi
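The CoreA/CoreB swap and the shard/replica layout described in the replies above map onto two Solr admin endpoints, the CoreAdmin SWAP action and the Collections API CREATE action. The sketch below only builds the request URLs and does not send them. The base URL, core names, collection name, and shard/replica counts are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def core_swap_url(base_url, live_core, standby_core):
    """Build a CoreAdmin SWAP URL that exchanges the two named cores."""
    params = urlencode(
        {"action": "SWAP", "core": live_core, "other": standby_core}
    )
    return f"{base_url.rstrip('/')}/admin/cores?{params}"


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL for a SolrCloud collection."""
    params = urlencode(
        {
            "action": "CREATE",
            "name": name,
            "numShards": num_shards,
            "replicationFactor": replication_factor,
        }
    )
    return f"{base_url.rstrip('/')}/admin/collections?{params}"


# Hypothetical host and names, matching the CoreA/CoreB wording above.
url = core_swap_url("http://localhost:8983/solr", "CoreA", "CoreB")
collection_url = create_collection_url(
    "http://localhost:8983/solr", "intranet", 2, 2
)
print(url)
print(collection_url)
```

In the swap pattern, updates go to the standby core and the swap request is issued once indexing finishes, so searches always hit a core that is not being written to.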
Re: Synonyms and spellings
Thanks for the quick response, Alexei. I will check this link to prepare a synonym list.
On Tue, Jan 28, 2014 at 11:00 PM, Alexei Martchenko ale...@martchenko.com.br wrote:
2) There are some synonym lists on the web. They aren't always complete, but I keep analyzing fields and tokens in order to polish my synonyms. And I like to use tools like http://www.visualthesaurus.com/ to aid me. Hope this helps :-)
alexei martchenko
2014-01-28 rashmi maheshwari maheshwari.ras...@gmail.com
Hi,
Question 1: Why do we use the spellings file under the Solr core conf folder? What spellings do we enter in it?
Question 2: Implementing all synonyms is tough. Where can I get a list with as many synonyms as we see in Google search?
-- Rashmi
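A hand-polished list of the kind Alexei describes typically lives in a synonyms file read by Solr's synonym filter. That file uses comma-separated groups for equivalent terms, `=>` for one-way mappings, and `#` for comments. The sketch below is a simplified reader for that syntax, useful for inspecting a list while curating it. It is not Solr's own parser and ignores escaping; the sample entries are made up.

```python
def parse_synonyms(text):
    """Map each source term to the list of terms it would be rewritten to."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by the right.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
        else:
            # Equivalent group: every term expands to the whole group.
            sources = targets = [
                t.strip().lower() for t in line.split(",") if t.strip()
            ]
        for source in sources:
            bucket = mapping.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


# Hypothetical entries for illustration.
SAMPLE = """
# equivalent terms
tv, television, telly
# one-way mapping
i-pod, i pod => ipod
"""

rules = parse_synonyms(SAMPLE)
print(rules["tv"])
print(rules["i pod"])
```

Printing the parsed rules next to the tokens seen on the Solr analysis screen is one way to spot gaps in the list.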
Re: Solr Nutch
Thanks Markus and Alexei.
On Wed, Jan 29, 2014 at 12:08 AM, Alexei Martchenko ale...@martchenko.com.br wrote:
Well, not even Google parses those. I'm not sure about Nutch, but in some crawlers (jSoup, I believe) there's an option to try to get full URLs from plain text, so you can capture some URLs in the form of someClickFunction('http://www.someurl.com/whatever'), or even when they are in the middle of some paragraph. Sometimes it works beautifully; sometimes it misleads you into parsing URLs shortened with an ellipsis in the middle.
alexei martchenko
2014-01-28 rashmi maheshwari maheshwari.ras...@gmail.com
Thanks all for the quick response. Today I crawled a web page using Nutch. The page has many links, but every anchor tag has href=# and JavaScript on its onClick event that opens a new page. So the crawler didn't crawl any of the links that open through the onClick event and have # as the href value. How can these links be crawled using Nutch?
-- Rashmi
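The idea in Alexei's reply, pulling full URLs out of the text of a click handler, can be sketched with the Python standard library. This is an illustration of the technique only, not how Nutch or jSoup implement it. The sample markup is invented, reusing the someClickFunction pattern from the reply.

```python
import re
from html.parser import HTMLParser

# Absolute http(s) URLs, stopping at whitespace, quotes, angle brackets, or ")".
URL_RE = re.compile(r"""https?://[^\s'"<>)]+""")


class OnClickUrlExtractor(HTMLParser):
    """Collect absolute URLs found inside onclick attributes of <a> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # HTMLParser lowercases attribute names, so onClick arrives as onclick.
            if name == "onclick" and value:
                self.urls.extend(URL_RE.findall(value))


def extract_onclick_urls(html):
    parser = OnClickUrlExtractor()
    parser.feed(html)
    return parser.urls


# Hypothetical page fragment.
SAMPLE = '''
<a href="#" onClick="someClickFunction('http://www.someurl.com/whatever')">Open</a>
<a href="#" onClick="openPage(42)">No URL here</a>
'''

found = extract_onclick_urls(SAMPLE)
print(found)
```

Handlers that build the target from an ID, like the second link, yield nothing, and display text shortened with an ellipsis can produce bogus matches, which is the caveat raised in the reply.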
Fwd: Search Engine Framework decision
Hi,
I want to create a POC to search our intranet along with the documents uploaded to it. Documents (PDF, Excel, Word documents, text files, images, videos) also exist on SharePoint. SharePoint has authentication at the module (folder) level. My intranet website is http://myintranet/ http://sparsh/ and the SharePoint URL is different. Documents also exist in file folders.
I have the following queries:
A) Which crawler framework should I use along with Solr for this POC, Nutch or Apache ManifoldCF?
B) Is it possible to crawl SharePoint documents using Nutch? If yes, would a configuration-level change alone make this possible, or do I have to write code to parse the documents and send them to Solr?
C) Which versions of Solr, Nutch, and MCF should be used, given that the Nutch version depends on the Solr version? Would Nutch 1.7 work properly with Solr 4.6.0?
-- Rashmi