Re: No Such File or directory problem

2010-11-24 Thread Gora Mohanty
On Thu, Nov 25, 2010 at 6:40 AM, Chris Woolum cwoo...@moonvalley.com wrote: Hello everyone, I am new to nutch and am having a problem with my initial deployment of it. It does not seem that nutch is properly parsing the SEGMENT string and is trying to search invalid folders. I am using

Re: subscribe to the Nutch user mailing list

2010-12-08 Thread Gora Mohanty
On Wed, Dec 8, 2010 at 6:33 PM, shi wang wangshi.t...@gmail.com wrote: I want to subscribe to the Nutch user mailing list. Please see http://nutch.apache.org/mailing_lists.html . Presumably, you want to subscribe to the users list, so sending mail to user-subscr...@nutch.apache.org will work.

Re: Please subscribe to mailing list.

2010-12-24 Thread Gora Mohanty
On Fri, Dec 24, 2010 at 11:33 AM, Luis Taveras ltavera...@yahoo.com wrote: Please subscribe to mailing list. You should send mail to user-subscr...@nutch.apache.org in order to be subscribed to the list. Please see http://nutch.apache.org/mailing_lists.html Regards, Gora

Re: unnecessary results in search

2011-01-05 Thread Gora Mohanty
On Tue, Jan 4, 2011 at 11:36 PM, alx...@aim.com wrote: Hello, Thank you for your response. Let me give you more detail on the issue that I have. First, definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com. Next, call subdomain the following

Re: unnecessary results in search

2011-01-05 Thread Gora Mohanty
On Wed, Jan 5, 2011 at 11:25 PM, alx...@aim.com wrote: I do search directly in Nutch version 1-2. I think google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword. That is possible, though I am not sure why the situation is different with

Re: DNS questions

2011-01-14 Thread Gora Mohanty
On Fri, Jan 14, 2011 at 11:13 PM, Asier Martínez axi...@gmail.com wrote: Hi again, I'm having performance issues due to my DNS server configuration. I'm now using public DNS servers (like Google, etc.) and there seems to be a certain limit on simultaneous query responses. I'm reading about

Re: Archiving Audio and Video

2011-01-25 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would

Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:17 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: Thanks Gora! I am interested in searching through the text from these audio and video streams. An example would be a 911 dispatch call and maybe even all the recorded official chatter about it. That is just

Re: Archiving Audio and Video

2011-01-26 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Another example would be the content embedded in this flash movie. http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/ [...] ffmpeg can pull out audio from video streams, and a working speech-to-text

Re: License conditions of Nutch

2011-02-12 Thread Gora Mohanty
On Sat, Feb 12, 2011 at 2:57 PM, Amna Waqar amna.waqar...@gmail.com wrote: Hi all, I want to know whether the ASF license of Nutch allows us to modify its code, make a new search engine, and then start earning revenue from it. [...] Yes, it does. This might help:

Re: Stupid Question

2011-02-12 Thread Gora Mohanty
On Sat, Feb 12, 2011 at 9:01 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: The disc failed on my PC, so I will have to test out the patch on the Mac ;-). Is this the version that is still reliant on Gora or have the two been mashed together? I haven't looked at nightly builds in over

Re: how to change the value of a field in index

2011-03-13 Thread Gora Mohanty
Hi, If you mean changing just one field of a document, one cannot do that: Solr is not an RDBMS. However, one does not have to delete a document and then reindex it. Simply indexing a document with the same ID, with all fields including the changed one, updates it in the index. Regards, Gora
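
The replace-by-ID behaviour described above can be sketched with a toy in-memory index. This is only an illustration of the semantics, not Solr's API: `ToyIndex` and `with_changed_field` are hypothetical names, and the unique key is assumed to be a field called `id`.

```python
from typing import Dict

Document = Dict[str, str]


class ToyIndex:
    """Hypothetical stand-in for an index keyed on a unique ID field."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def add(self, doc: Document) -> None:
        # Adding a document whose ID already exists replaces the whole
        # stored document; there is no per-field update.
        self._docs[doc["id"]] = dict(doc)

    def get(self, doc_id: str) -> Document:
        return dict(self._docs[doc_id])


def with_changed_field(existing: Document, field: str, value: str) -> Document:
    """Build the full document to resend: every old field plus the change."""
    updated = dict(existing)
    updated[field] = value
    return updated


index = ToyIndex()
index.add({"id": "doc1", "title": "Old title", "content": "Body text"})
# Resend the complete document with the same ID and one changed field.
index.add(with_changed_field(index.get("doc1"), "title", "New title"))
print(index.get("doc1"))
```

Sending only the changed field with the same ID would, in this model, drop the other fields, which is why the full document has to be resent.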

Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

2011-03-23 Thread Gora Mohanty
On Wed, Mar 23, 2011 at 3:26 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: $ cat wikipedia_links_simple.nt | grep http://simple.wiki; | awk  '{print $1}' | sort -u | sed -E 's/|//g' I have lost track of what you were trying to do, but it really should not be that difficult. Taking the

Re: How to re-fetch all the modified page?

2011-05-24 Thread Gora Mohanty
Hi, Been a while since I have personally used Nutch in a production environment, but if you are using some kind of a CMS/framework that provides hooks for page creation/modification, your best option might be to use such a hook to trigger a recrawl of the page. At least that is the solution

Re: I want to be subribed

2011-07-13 Thread Gora Mohanty
Hi, Please see http://nutch.apache.org/mailing_lists.html for how to subscribe to various mailing lists. Regards, Gora

Re: How to avoid splitting strings when indexing to solr

2011-08-05 Thread Gora Mohanty
Hi, Not too familiar these days with Nutch, but my guess is that a Solr analyser is getting applied. To have a field exactly as is, use the String fieldtype in Solr's schema.xml rather than the text fieldtype. Regards, Gora On 05-Aug-2011 6:35 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote:
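
As a rough illustration of the difference between an analysed text field and a string field, the sketch below contrasts the two. The function names are hypothetical, and the tokenisation shown (lowercase, split on non-word characters) is an assumption: the real behaviour depends on the analyser chain configured in schema.xml.

```python
import re
from typing import List


def analyze_as_text(value: str) -> List[str]:
    """Roughly mimic an analysed text field: lowercase, split into words."""
    return re.findall(r"\w+", value.lower())


def analyze_as_string(value: str) -> List[str]:
    """Mimic a string field: the whole value is kept as one verbatim term."""
    return [value]


value = "Uni Kassel"
print(analyze_as_text(value))    # several lowercased terms
print(analyze_as_string(value))  # one term, exactly as supplied
```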

Re: Weight servers differently

2011-08-31 Thread Gora Mohanty
On Wed, Aug 31, 2011 at 1:52 PM, Johan Svensson johan.svens...@euroling.se wrote: I want to put different weights to different domains, so that I can push up results from my main site. Say for example, I have www.example.com with a few but important pages, and blog.example.com with zillions of

Re: Weight servers differently

2011-08-31 Thread Gora Mohanty
On Wed, Aug 31, 2011 at 2:51 PM, Johan Svensson johan.svens...@euroling.se wrote: Thank you! This looks interesting. However, I wonder if it really can solve this problem. No part of the search query is necessarily part of the domain name. Let's say for example that we search for foobar.

Re: Spellcheck with Solr

2011-09-07 Thread Gora Mohanty
On Wed, Sep 7, 2011 at 1:16 PM, Danicela nutch danicela-nu...@mail.com wrote: [...]  The first time, I put a spellcheck.build=true in the request, the index was modified, but has only 20 bytes. (I think that's strange for 7000 indexed pages) This seems to indicate that something went wrong

Re: Specialized Nutch Crawling

2012-01-04 Thread Gora Mohanty
On Thu, Jan 5, 2012 at 4:42 AM, niviksha nivik...@gmail.com wrote: Hi all, this is my first post. I've used lucene extensively in the past, but am just getting my feet wet with Nutch. The problem I have is to use Nutch to crawl relational (sql) databases. Is this possible via the current plug

Re: JAVA_HOME is not set

2013-01-25 Thread Gora Mohanty
On 25 January 2013 16:05, peterbarretto peterbarrett...@gmail.com wrote: I still get the below error after setting the java home variable http://lucene.472066.n3.nabble.com/file/n4036204/nutch_java_home_error.png Not sure of how much experience you have had with Unix-style shell quoting, but

Re: JAVA_HOME is not set

2013-01-29 Thread Gora Mohanty
On 29 January 2013 16:20, peterbarretto peterbarrett...@gmail.com wrote: Tried escaping the whitespace but it still did not work, so I installed Java in another folder and now the installation works just fine [...] The message that I had referenced seems to say that one should *not* be escaping

Re: does nutch take care of any format change in the websites that is been crawled

2013-03-11 Thread Gora Mohanty
On 11 March 2013 15:04, Rohan Thakur rohan.i...@gmail.com wrote: Hi, I am new to Nutch. I wanted to know: does Nutch take care of any kind of format change in the URLs that we have set to crawl, and does it not require any manual changes for the kind of changes that have been applied to the URLs to

Re: Unable to crawl google search results

2013-06-04 Thread Gora Mohanty
On 5 June 2013 03:53, Julien Nioche lists.digitalpeb...@gmail.com wrote: Check your URL filters e.g. that you removed the lines below which are there by default # skip URLs containing certain characters as probable queries, etc. -[?*!@=] [...] Not directly related to your question, but I
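
To see why the default `-[?*!@=]` rule quoted above would reject search-result URLs, here is a minimal sketch. It assumes the rule excludes any URL in which the character class matches anywhere; `is_skipped` and the sample Google URL are illustrative only and are not part of Nutch.

```python
import re

# Character class from the default filter line; inside [...] the
# characters ?, *, !, @ and = are all literals.
SKIP_PATTERN = re.compile(r"[?*!@=]")


def is_skipped(url: str) -> bool:
    """Return True if the URL contains any of the 'probable query' characters."""
    return SKIP_PATTERN.search(url) is not None


for url in (
    "http://www.google.com/search?q=nutch",         # has '?' and '='
    "http://nutch.apache.org/mailing_lists.html",  # has none of them
):
    print(url, "skipped" if is_skipped(url) else "kept")
```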

Re: Freegen and Solr score

2014-03-25 Thread Gora Mohanty
On Mar 26, 2014 1:02 AM, John Lafitte jlafi...@brandextract.com wrote: I set up a script that uses freegen to manually index new/updated URLs. I thought it was working great, but now I'm just realizing that Solr returns a score of 0 for these new documents. I thought the score was calculated

Re: Please share your experience of using Nutch in production

2014-06-22 Thread Gora Mohanty
On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote: Hello Folks, I have noticed that Nutch resources and mailing lists are mostly geared towards the usage of Nutch in research-oriented projects. I would like to know from those of you who are using Nutch in production for large

Re: Please share your experience of using Nutch in production

2014-06-24 Thread Gora Mohanty
On 23 June 2014 01:44, Meraj A. Khan mera...@gmail.com wrote: Gora, Thanks for sharing your admin perspective. Rest assured I am not trying to circumvent any politeness requirements in any way. As I mentioned earlier, I am within the crawl-delay limits that are being set by the web

Re: http 501 error

2015-06-11 Thread Gora Mohanty
On 11 June 2015 at 15:30, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Thanks a lot for your response. Can Nutch handle POST requests? Don't think so. How would it know what POST data is expected by the page? Regards, Gora

Re: http 501 error

2015-06-11 Thread Gora Mohanty
Hi, An HTTP 501 error is a "not implemented" error, as you could have searched and found out. What that means is that the server you are trying to crawl does not implement GET for that URL. Regards, Gora On 11 June 2015 at 14:37, Deepa Jayaveer deepa.jayav...@tcs.com wrote: Hi All,
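
For reference, the standard reason phrase for a status code can be looked up with Python's standard library. This is just a small sketch; `describe_status` is a hypothetical helper name and has nothing to do with Nutch itself.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the numeric code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


print(describe_status(501))  # 501 Not Implemented
```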

Re: Crawling the entire web

2024-01-10 Thread Gora Mohanty
Hi, Would suggest starting out by looking at Common Crawl: https://commoncrawl.org/ Regards, Gora