implement relevancy

2014-01-28 Thread rashmi maheshwari
Hi,

How do I get the most relevant items at the top of search results using Solr?
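One common approach (an editor's sketch, not from this thread: the handler name and the field names title/description are placeholders for your own schema) is to weight matches per field with the edismax query parser in solrconfig.xml:

```xml
<!-- Illustrative /select handler fragment for solrconfig.xml -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- rank matches in title higher than matches in description -->
    <str name="qf">title^10 description^2</str>
    <!-- boost documents where the query terms appear as a phrase -->
    <str name="pf">title^20</str>
  </lst>
</requestHandler>
```

The same qf/pf parameters can also be passed per request instead of being fixed in the handler defaults.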

-- 
Rashmi
Be the change that you want to see in this world!


Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Hi,

Question 1: When Solr can parse HTML and documents like Word, Excel, PDF,
etc., why do we need Nutch to parse HTML files? What is the difference?

Question 2: When do we use multiple cores in Solr? Is there any practical
business case where we need multiple cores?

Question 3: When do we go for the cloud? What does implementing SolrCloud
mean?


-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


Synonyms and spellings

2014-01-28 Thread rashmi maheshwari
Hi,

Question 1) Why do we use the spellings file under the Solr core conf
folder? What spellings do we enter in it?

Question 2) Implementing all synonyms is a tough job. Where could I get a
synonym list as comprehensive as what we see in Google search?
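For context (an editor's aside, not from this thread): the spellings file is typically a plain word list, one correctly spelled term per line, consumed by solr.FileBasedSpellChecker. The synonyms.txt format used by Solr's synonym filter looks like this; the entries below are purely illustrative samples, not a recommended list:

```
# comma-separated groups are treated as equivalent terms
TV, television, televisions
# "=>" maps the left-hand variants to the right-hand replacement
i-pod, i pod => ipod
```

Published synonym lists (e.g. WordNet-derived ones) are a common starting point, but they usually need pruning per domain.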






Re: Fwd: Search Engine Framework decision

2014-01-28 Thread rashmi maheshwari
Thanks, Saurish.


My office *intranet* is a SharePoint website. When I crawl it using Nutch,
I get an Unauthorized access (404) error. An NTLM realm is used on this
website.

I saw on a Nutch JIRA link that SharePoint can be accessed using Nutch.
Nutch has the properties below in nutch-default.xml:

http.proxy.host (should it be the intranet site path?)
http.proxy.port
http.proxy.username (should this contain the domain too?)
http.proxy.password
http.proxy.realm (should it be my desktop machine's domain, which I use to
log in to my machine? Using the same domain/username I can access the
intranet from a browser.)


Also, Nutch has an httpclient-auth XML file for giving credentials for
authentication.

What do I provide for these properties in nutch-site.xml?

And what should the values be in the httpclient-auth.xml file?
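For reference, the documented shape of conf/httpclient-auth.xml for the protocol-httpclient plugin is roughly the following. This is a sketch from memory of the Nutch 1.x documentation, so verify it against your version; every attribute value is a placeholder for your own host, domain, and account:

```xml
<auth-configuration>
  <!-- placeholder credentials; scheme="NTLM" marks the scope as NTLM,
       and realm carries the NTLM domain for that scope -->
  <credentials username="your_user" password="your_pass">
    <authscope host="myintranet" port="80" scheme="NTLM" realm="YOURDOMAIN"/>
  </credentials>
</auth-configuration>
```

If I recall the docs correctly, the file is picked up via the http.auth.file property when the protocol-httpclient plugin is enabled in plugin.includes.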



Regards,
Rashmi


On Mon, Jan 27, 2014 at 3:57 PM, saurish <srinivas.oruga...@gmail.com> wrote:

 Hi,

 Looks like there is support for SharePoint as well as Windows Share in
 ManifoldCF.

 Yes, you can crawl folders with Nutch (at least I have worked on a
 Windows PC with a local file folder).

 Nutch 1.7 and Solr 4.5.1 have worked for me.

 Regards,



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Fwd-Search-Engine-Framework-decision-tp4113584p4113677.html
 Sent from the Solr - User mailing list archive at Nabble.com.






Re: Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Thanks, all, for the quick response.

Today I crawled a webpage using Nutch. This page has many links, but all
anchor tags have href="#", and JavaScript on the onClick event of each
anchor tag opens a new page.

So the crawler didn't crawl any of those links that open via the onClick
event and have a # href value.

How can these links be crawled using Nutch?




On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko 
ale...@martchenko.com.br wrote:

 1) Plus, those files are sometimes binaries with metadata; specific
 crawlers are needed to understand them. HTML is plain text.

 2) Yes, different data schemas. Sometimes I replicate the same core and
 run A/B tests with different weights, filters, etc., and some people like
 to create CoreA and CoreB with the same schema, hammer CoreA with updates,
 commits, and optimizes, then make it available for searches while
 hammering CoreB. Then swap again. This produces faster searches.
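The swap step described above goes through Solr's CoreAdmin API (action=SWAP). A minimal sketch that only builds the request URL; the base URL and core names are placeholders for your deployment, and actually sending the request is left out:

```python
from urllib.parse import urlencode

def swap_url(solr_base: str, core: str, other: str) -> str:
    """Build the CoreAdmin SWAP request that exchanges two cores' names."""
    params = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{solr_base}/admin/cores?{params}"

print(swap_url("http://localhost:8983/solr", "CoreA", "CoreB"))
# http://localhost:8983/solr/admin/cores?action=SWAP&core=CoreA&other=CoreB
```

After the swap, queries hitting the "live" core name transparently see the freshly built index.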


 alexei martchenko
 Facebook http://www.facebook.com/alexeiramone |
 LinkedIn http://br.linkedin.com/in/alexeimartchenko |
 Steam http://steamcommunity.com/id/alexeiramone/ |
 4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone |
 Github https://github.com/alexeiramone | (11) 9 7613.0966 |


 2014-01-28 Jack Krupansky j...@basetechnology.com

  1. Nutch follows the links within HTML web pages to crawl the full graph
  of a web of pages.
 
  2. Think of a core as an SQL table - each table/core has a different type
  of data.
 
  3. SolrCloud is all about scaling and availability - multiple shards for
  larger collections and multiple replicas for both scaling of query
  response and availability if nodes go down.
 
  -- Jack Krupansky
 






Re: Synonyms and spellings

2014-01-28 Thread rashmi maheshwari
Thanks for the quick response, Alexei.

I will check this link to prepare synonym list.


On Tue, Jan 28, 2014 at 11:00 PM, Alexei Martchenko 
ale...@martchenko.com.br wrote:

 2) There are some synonym lists on the web; they aren't always complete,
 but I keep analyzing fields and tokens in order to polish my synonyms. And
 I like to use tools like http://www.visualthesaurus.com/ to aid me.

 Hope this helps :-)










Re: Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Thanks, Markus and Alexei.


On Wed, Jan 29, 2014 at 12:08 AM, Alexei Martchenko 
ale...@martchenko.com.br wrote:

 Well, not even Google parses those. I'm not sure about Nutch, but in some
 crawlers (jSoup, I believe) there's an option to try to get full URLs from
 plain text, so you can capture some URLs in the form of
 someClickFunction('http://www.someurl.com/whatever') or even if they are
 in the middle of a paragraph. Sometimes it works beautifully; sometimes it
 misleads you into parsing URLs shortened with an ellipsis in the middle.
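The plain-text URL capture described above can be sketched with a simple regular expression. This is illustrative only, not Nutch or jSoup code; a real crawler plugin would be more careful about trailing punctuation and relative URLs:

```python
import re

# Pull absolute http(s) URLs out of markup or JavaScript that a normal
# href-based link extractor would miss (e.g. onclick handlers).
URL_RE = re.compile(r"""https?://[^\s'"()<>]+""")

html = """
<a href="#" onclick="someClickFunction('http://www.someurl.com/whatever')">go</a>
<p>see http://example.com/page for details</p>
"""

print(URL_RE.findall(html))
# ['http://www.someurl.com/whatever', 'http://example.com/page']
```

The character class stops at quotes and parentheses, which is exactly what lets it recover the URL inside someClickFunction('...').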











Fwd: Search Engine Framework decision

2014-01-26 Thread rashmi maheshwari
Hi,

I want to create a POC to search the INTRANET along with documents
uploaded on the intranet. Documents (PDF, Excel, Word documents, text
files, images, videos) also exist on SHAREPOINT. SharePoint has
authentication access at the module (folder) level.

My intranet website is http://myintranet/ . The SharePoint URL is
different. Documents also exist in file folders.

I have the following queries:

A) Which crawler framework should I use along with Solr for this POC:
Nutch or Apache ManifoldCF?

B) Is it possible to crawl SharePoint documents using Nutch? If yes, would
a configuration-level change alone make this possible, or do I have to
write code to parse the documents and send them to Solr?

C) Which versions of Solr + Nutch + MCF should be used? The Nutch version
has a dependency on the Solr version; would Nutch 1.7 work properly with
Solr 4.6.0?



