Multi-Lingual Support in Nutch
Hello, I am using Nutch 0.9 and would like to enable multi-lingual support in our existing system. I read the article on Multi-Lingual Support in Nutch by Jérôme Charron, but it covers earlier versions of Nutch. I have included the analysis-es plugin in nutch-site.xml. What other steps are needed to enable multi-lingual support? Thanks and regards, Kunal
Out of Memory Error While Crawling
Hello Everyone, I encountered the following errors during the crawl process:
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
fetcher caught:java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Please help me solve this. Thanks and regards, Kunal Gosar
Crawl Problem
Hello, I have a webpage containing around 300 hyperlinks to other pages. When I run the crawl under Cygwin, it crawls only about 80 of the linked pages. How can I crawl the whole webpage, i.e., follow all of the hyperlinks? Thanks and regards, Kunal
Searching multiple meta fields in a single query
Hello Everyone, I have two meta tags in the HTML file, for example subject:english and professor:john. I have added two plugins for the respective metadata, subject and professor. If I query 'subject:english' in Nutch, it returns the pages whose metadata contains subject:english. If I query 'subject:english professor:john', it returns no results. How can I query multiple meta tags in a single query? Please help me solve this problem. Thanks and regards, Kunal
Ranking Technology
Hello Everyone, Can anyone tell me about the page-ranking technology used by Lucene and Nutch? I was not able to find any documentation on it. If you have any document describing the ranking algorithms used, please e-mail it to me. Thanks and regards, Kunal Gosar
Plugin for Metadata
Hello Everyone, I have one question. I have built a plugin for searching metadata, called recommended, following this webpage: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 When I search using Nutch, I see no difference between the normal search and the metadata search. A word that appears in the metadata should rank higher than one that appears only in the normal content, but I am not getting that result. Please help me solve this problem. Thanks and regards, Kunal Gosar
Problem: Compiling Plugin Using Ant
Hello, I worked on a plugin using the reference webpage: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 After setting everything up, I compiled with Ant 1.6.0, and it reported that the build was successful. But when I look in the build folder, there is no nutch-0.9 war file; instead there is a nutch-0.9 file of type 'Task Object'. Please help me solve this problem. Thanks and regards, Kunal
Re: Regarding Lucene Nutch
Hello Aditya, Thank you for your reply. I just read your e-mail and I will try implementing your idea. I think that with this idea, the search returns the files in which the required word appears in the content as well as in the metadata of the file. My requirement is that the search should return the files in which the required word appears in the metadata of the file, i.e., it should search only in the metadata. (The required word may appear in the content of the file too, but the search need not look in the content.) How can I achieve this? Thanks and regards, Kunal
aditya naga hemanth kumar wrote:
Hi, You can search a file in the metadata fields and the default fields that are indexed by the search engine. Say you have a set of files that belong to an operating systems course. You can add a metadata field, subject, with the value "operating systems" to all the files directly by using XMP. Then, when you are indexing with Lucene, you can add a separate field called subject for each document. When searching, you can boost the score if the query matches the value of the subject field, which brings that document to the top. Hope this helps. Cheers, Aditya V
On 9/7/07, Kunal Wku wrote:
Hello Everyone, I am using Lucene and Nutch in my project for searching content in webpages. For a webpage or any other document, Lucene takes all the words in the page, indexes them, and returns the result when searched. Let's say I have two webpages as shown below:
Webpage1
--
This is the course page of Computer Science Department
Subject: Operating System I
Professor: Qi Li
Details: The course Operating System I deals with the basics of the operating system. The three main topics covered are process management, storage management and memory management, etc.
--
Webpage2
--
This is the home page of Computer Science Department
The computer science department offers courses at undergraduate level and graduate level. The core courses for the graduate students are Mathematical Foundations of Computer Science, Compilers, Advanced Database, Analysis of Algorithms and Operating Systems, etc.
--
Now if I search for "operating system", the results show both webpages (webpage1 and webpage2), since the words appear in both. But my requirement is different. If I want to search for "Operating System" only where it appears in the subject field, as in webpage1, the result should show only webpage1. How can I achieve this result? Please help me in this regard. Thanks and regards, Kunal Gosar
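As a minimal sketch of the field-based idea in Aditya's reply, the toy Java class below keeps a separate term index per field, so a query can be restricted to the subject field alone and ignore the page content. It does not use the Lucene or Nutch APIs. The class name FieldSearchSketch, its methods, and the abbreviated page texts are all assumptions made for illustration, and there is no stemming or scoring.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: a per-field inverted index, not Lucene or Nutch code.
public class FieldSearchSketch {
    // field name -> term -> ids of documents containing that term in that field
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    // Index the text of one field of one document (lowercased, split on non-word characters).
    public void addField(String docId, String field, String text) {
        Map<String, Set<String>> terms = index.computeIfAbsent(field, f -> new HashMap<>());
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the documents whose given field contains every term of the query.
    public Set<String> search(String field, String query) {
        Map<String, Set<String>> terms = index.getOrDefault(field, Collections.emptyMap());
        Set<String> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> hits = terms.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Build the two example pages; the texts are abbreviated assumptions, not the real pages.
    static FieldSearchSketch demo() {
        FieldSearchSketch idx = new FieldSearchSketch();
        idx.addField("webpage1", "subject", "Operating System I");
        idx.addField("webpage1", "content",
                "The course Operating System I deals with the basics of the operating system.");
        // webpage2 has no subject metadata, only body text.
        idx.addField("webpage2", "content",
                "Core courses include Compilers and Operating System.");
        return idx;
    }

    public static void main(String[] args) {
        FieldSearchSketch idx = demo();
        System.out.println(idx.search("content", "operating system")); // [webpage1, webpage2]
        System.out.println(idx.search("subject", "operating system")); // [webpage1]
    }
}
```

Searching the content field matches both pages, while searching only the subject field matches webpage1 alone, which is the behaviour asked for above. In a real index the same effect would presumably come from indexing the metadata as its own field and querying that field by name, in the style of the subject:english query form.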
Regarding Lucene Nutch
Hello Everyone, I am using Lucene and Nutch in my project for searching content in webpages. For a webpage or any other document, Lucene takes all the words in the page, indexes them, and returns the result when searched. Let's say I have two webpages as shown below:
Webpage1
--
This is the course page of Computer Science Department
Subject: Operating System I
Professor: Qi Li
Details: The course Operating System I deals with the basics of the operating system. The three main topics covered are process management, storage management and memory management, etc.
--
Webpage2
--
This is the home page of Computer Science Department
The computer science department offers courses at undergraduate level and graduate level. The core courses for the graduate students are Mathematical Foundations of Computer Science, Compilers, Advanced Database, Analysis of Algorithms and Operating Systems, etc.
--
Now if I search for "operating system", the results show both webpages (webpage1 and webpage2), since the words appear in both. But my requirement is different. If I want to search for "Operating System" only where it appears in the subject field, as in webpage1, the result should show only webpage1. How can I achieve this result? Please help me in this regard. Thanks and regards, Kunal Gosar