Re: Separating nutch and hadoop configurations.
Hey, thanks. My problem was that I also wanted the Nutch conf out of the Nutch install dir. So, I did set the NUTCH_CONF_DIR variable in my .bashrc and couldn't understand why it was never picking it up. Well, as it happens, that was the one variable I forgot to export! Doh! So, it wasn't hard at all. Though, I needed to replace hadoop-12.whatever.jar with the latest one within the Nutch build. It seems to be working. Yay. Thanks. On 7/11/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: Briggs wrote: > I am currently trying to figure out how to deploy Nutch and Hadoop > separately. I want to configure Hadoop outside of Nutch and have > Nutch use that service, rather than configuring Hadoop within Nutch. > I would think all that Nutch should need to know is the URLs to > connect to Hadoop, but can't figure out how to get this to work. > > Is this possible? If so, is there some sort of document, or archive > of another list post for this? > > Sorry for the ignorance. If you have a clean Hadoop installation up and running (made e.g. from one of the official Hadoop builds), it should be enough to put the nutch*.job file in ${hadoop.dir}, and copy bin/nutch (possibly with some minor modifications - my memory is a little vague on this ...). -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com -- "Conscious decisions by conscious minds are what make reality real"
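The resulting setup can be sketched roughly as below; every path and version number is an illustrative assumption, and the "install" is simulated in a temp directory rather than a real Hadoop deployment:

```shell
# Rough sketch of the deployment described above, simulated in a
# scratch dir (all paths and names are assumptions, not a real layout).
WORK=$(mktemp -d)
HADOOP_HOME="$WORK/hadoop"
NUTCH_HOME="$WORK/nutch"
mkdir -p "$HADOOP_HOME/bin" "$NUTCH_HOME/bin"
touch "$NUTCH_HOME/nutch-0.9.job" "$NUTCH_HOME/bin/nutch"  # stand-ins for real build artifacts

# The gotcha from the post: the variable must be *exported*, not just
# set, or bin/nutch (a child process) will never see it.
export NUTCH_CONF_DIR="$WORK/conf"

# Andrzej's suggestion: drop the job file and launcher next to Hadoop.
cp "$NUTCH_HOME"/nutch-*.job "$HADOOP_HOME"/
cp "$NUTCH_HOME/bin/nutch" "$HADOOP_HOME/bin/"
ls "$HADOOP_HOME"
```

On a real cluster you would also check that the hadoop-*.jar bundled with the Nutch build matches the cluster's Hadoop version, as noted above.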
Separating nutch and hadoop configurations.
I am currently trying to figure out how to deploy Nutch and Hadoop separately. I want to configure Hadoop outside of Nutch and have Nutch use that service, rather than configuring Hadoop within Nutch. I would think all that Nutch should need to know is the URLs to connect to Hadoop, but I can't figure out how to get this to work. Is this possible? If so, is there some sort of document, or archive of another list post for this? Sorry for the ignorance. -- "Conscious decisions by conscious minds are what make reality real"
Re: NUTCH-479 "Support for OR queries" - what is this about
Thanks for the answer. That was helpful. I was sooo wrong. On 7/7/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: Briggs wrote: > Please keep this thread going as I am also curious to know why this > has been 'forked'. I am sure that most of this lies within the > original OPIC filter but I still can't understand why straight forward > lucene queries have not been used within the application. No, this has actually almost nothing to do with the scoring filters (which were added much later). The decision to use a different query syntax than the one from Lucene was motivated by a few reasons: * to avoid the need to support low-level index and searcher operations, which the Lucene API would require us to implement. * to keep the Nutch core largely independent of Lucene, so that it's possible to use Nutch with different back-end searcher implementations. This started to materialize only now, with the ongoing effort to use Solr as a possible backend. * to limit the query syntax to those queries that provide best tradeoff between functionality and performance, in a large-scale search engine. > On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote: >> Ok, so I guess what I don't understand is what is the "Nutch query >> syntax"? Query syntax is defined in an informal way on the Help page in nutch.war, or here: http://wiki.apache.org/nutch/Features Formal syntax definition can be gleaned from org.apache.nutch.analysis.NutchAnalysis.jj. >> >> The main discussion I found on nutch-user is this: >> http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html >> I was wondering why the query syntax is so limited. >> There are no OR queries, there are no fielded queries, >> or fuzzy, or approximate... Why? The underlying index >> supports all these operations. Actually, it's possible to configure Nutch to allow raw field queries - you need to add a raw field query plugin for this. 
Please see the RawFieldQueryFilter class, and existing plugins that use fielded queries: query-site and query-more. Query-more / DateQueryFilter is especially interesting, because it shows how to use raw token values from a parsed query to build complex Lucene queries. >> >> I notice by looking at the or.patch file >> (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) >> that one of the programs under consideration is: >> nutch/searcher/Query.java >> The code for this is distinct from >> lucene/search/Query.java See above - they are completely different classes, with completely different purposes. The use of the same class name is unfortunate and misleading. The Nutch Query class is intended to express queries entered by search engine users, in a tokenized and parsed way, so that the rest of Nutch may deal with Clauses, Terms and Phrases instead of plain Strings. On the other hand, the Lucene Query is intended to express arbitrarily complex Lucene queries - many of these queries would be prohibitively expensive for a large search engine (e.g. wildcard queries). >> >> It looks like this is an architecture issue that I don't understand. >> If nutch is an "extension" of lucene, why does it define a different >> Query class? Nutch is NOT an extension of Lucene. It's an application that uses Lucene as a library. >> Why don't we just use the Lucene code to query the >> indexes? Does this have something to do with the nutch webapp >> (nutch.war)? What is the historical genesis of this issue (or is that >> even relevant)? The Nutch webapp doesn't have anything to do with it. The limitations in the query syntax have different roots (see above). -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com -- "Conscious decisions by conscious minds are what make reality real"
Re: NUTCH-479 "Support for OR queries" - what is this about
Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter but I still can't understand why straightforward Lucene queries have not been used within the application. On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote: I've been reading up on NUTCH-479 "Support for OR queries" but I must be missing something obvious because I don't understand what the JIRA is about: https://issues.apache.org/jira/browse/NUTCH-479 Description: There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. Ok, so I guess what I don't understand is what is the "Nutch query syntax"? The main discussion I found on nutch-user is this: http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why? The underlying index supports all these operations. I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is: nutch/searcher/Query.java The code for this is distinct from lucene/search/Query.java It looks like this is an architecture issue that I don't understand. If nutch is an "extension" of lucene, why does it define a different Query class? Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)? -- "Conscious decisions by conscious minds are what make reality real"
Re: Reload index
Strange... Here is the quoted, unedited, partially incorrect post... ;-) "I would say that the best thing to do is to create a new nutch bean. I never cared much for the nutch bean containing logic to store itself in a servlet context. I do not believe that this is the place for such logic. It should be up to the user to place the nutch bean into the servlet context and not the bean. My implementation of a "nutch bean" has no knowledge of a servlet context and I believe this dependency should be removed. Why should nutch care about such details? Anyway, enough with my tiny rant. You could just create a 'reload.jsp' (or any servlet, or whatever you want that can get ahold of the servlet context) and do the work... The current way nutch finds an instance of the search bean is within the static method get(ServletContext, Configuration) within the NutchBean class. So, in your java class, jsp or whatever, just replace the instance with something like: servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration)); Hope that gets you on your way. You could always edit, or subclass the nutch bean with a 'reload/reinit' method too that could just do the same thing." On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: Thanks, Briggs. I will try to create a new NutchBean to see if that solves the reloading issue. By the way, your former mail does not seem to have reached the mailing list. I can't seem to find it anyway. -Ronny -Original message- From: Briggs [mailto:[EMAIL PROTECTED] Sent: 20 June 2007 01:22 To: nutch-user@lucene.apache.org Subject: Re: Reload index By the way, I was wrong about one thing, you can't override the 'get' method of nutch bean because it's static. Doh, that was a silly oversight. But again, if you are using nutch and you need to 'reload' the index, you need only to create a new NutchBean (that is if the NutchBean is what you are using).
On 6/19/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: > This will reload the application, isn't this correct? This is > something I do not want, as specified below. > > Is it possible to maybe manipulate the IndexReader part of the nutch > web client to read whenever I tell it to, or something like that? > > Or do I have to write my own client bottom up? > > Regards, > Ronny > > -Original message- > From: Susam Pal [mailto:[EMAIL PROTECTED] > Sent: 18 June 2007 17:33 > To: nutch-user@lucene.apache.org > Subject: Re: Reload index > > touch $CATALINA_HOME/webapps/ROOT/WEB-INF/web.xml > > $CATALINA_HOME is the top level directory of Tomcat. It works for most > cases. > > Regards, > Susam Pal > http://susam.in/ > > On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: > > > > Is there a way to reload index without restarting application server > > or reloading application? > > > > I have integrated Nutch into our app but we can not restart or > > reload the app every time we have created a new index. > > > > > > Regards, > > Ronny > > > > > > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: Reload index
By the way, I was wrong about one thing, you can't override the 'get' method of nutch bean because it's static. Doh, that was a silly oversight. But again, if you are using nutch and you need to 'reload' the index, you need only to create a new NutchBean (that is if the NutchBean is what you are using). On 6/19/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: This will reload the application, isn't this correct? This is something I do not want, as specified below. Is it possible to maybe manipulate the IndexReader part of the nutch web client to read whenever I tell it to, or something like that? Or do I have to write my own client bottom up? Regards, Ronny -Original message- From: Susam Pal [mailto:[EMAIL PROTECTED] Sent: 18 June 2007 17:33 To: nutch-user@lucene.apache.org Subject: Re: Reload index touch $CATALINA_HOME/webapps/ROOT/WEB-INF/web.xml $CATALINA_HOME is the top level directory of Tomcat. It works for most cases. Regards, Susam Pal http://susam.in/ On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: > > Is there a way to reload index without restarting application server > or reloading application? > > I have integrated Nutch into our app but we can not restart or reload > the app every time we have created a new index. > > > Regards, > Ronny > -- "Conscious decisions by conscious minds are what make reality real"
Re: Reload index
I would say that the best thing to do is to create a new nutch bean. I never cared much for the nutch bean containing logic to store itself in a servlet context. I do not believe that this is the place for such logic. It should be up to the user to place the nutch bean into the servlet context and not the bean. My implementation of a "nutch bean" has no knowledge of a servlet context and I believe this dependency should be removed. Why should nutch care about such details? Anyway, enough with my tiny rant. You could just create a 'reload.jsp' (or any servlet, or whatever you want that can get ahold of the servlet context) and do the work... The current way nutch finds an instance of the search bean is within the static method get(ServletContext, Configuration) within the NutchBean class. So, in your java class, jsp or whatever, just replace the instance with something like: servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration)); Hope that gets you on your way. You could always edit, or subclass the nutch bean with a 'reload/reinit' method too that could just do the same thing. On 6/18/07, Susam Pal <[EMAIL PROTECTED]> wrote: touch $CATALINA_HOME/webapps/ROOT/WEB-INF/web.xml $CATALINA_HOME is the top level directory of Tomcat. It works for most cases. Regards, Susam Pal http://susam.in/ On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: > > Is there a way to reload index without restarting application server or > reloading application? > > I have integrated Nutch into our app but we can not restart or reload > the app every time we have created a new index. > > > Regards, > Ronny > -- "Conscious decisions by conscious minds are what make reality real"
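The swap-the-instance idea above can be sketched generically. SearchHolder and the String stand-in for the searcher are illustrative only, not Nutch classes; in Nutch, the holder role is played by the servlet context attribute and the instance by NutchBean:

```java
import java.util.concurrent.atomic.AtomicReference;

// Generic sketch of "reload by replacing the instance": searches in
// flight keep using the old instance, while new searches see the
// fresh one as soon as reload() swaps the reference.
public class SearchHolder {
    // A String stands in for a real searcher object to keep this
    // self-contained; imagine a NutchBean-like instance here.
    private final AtomicReference<String> searcher = new AtomicReference<>();

    public SearchHolder(String initial) {
        searcher.set(initial);
    }

    public String current() {
        return searcher.get();
    }

    public void reload(String fresh) {
        searcher.set(fresh);
    }

    public static void main(String[] args) {
        SearchHolder holder = new SearchHolder("index-v1");
        // Analogous to the post's:
        //   servletContext.setAttribute("nutchBean", new NutchBean(conf));
        holder.reload("index-v2");
        System.out.println(holder.current()); // prints index-v2
    }
}
```

The old instance becomes garbage once the last in-flight search releases it, so no explicit shutdown of readers is modeled here; a real reload would also need to close the old index reader.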
Re: fetch failing while crawling
Oh and as for the web interface, take a look at the wiki page: http://wiki.apache.org/nutch/NutchTutorial The bottom of the page has a section on searching. On 6/15/07, Briggs <[EMAIL PROTECTED]> wrote: Yeah, you still don't have the agent configured. All your values for the agent (the ones that need a value) are blank. So, you need to at least configure an agent name. On 6/15/07, karan thakral <[EMAIL PROTECTED]> wrote: > I'm using crawl on cygwin while working on windows > > but the crawl output is not proper > > during fetch it says fetch: the document could not be fetched java runtime > exception agent not configured > > my nutch-site.xml is as follows > > > > > > > > > http.agent.name > > HTTP 'User-Agent' request header. MUST NOT be empty - > please set this to a single word uniquely related to your organization. > > NOTE: You should also check other related properties: > > http.robots.agents > http.agent.description > http.agent.url > http.agent.email > http.agent.version > > and set their values appropriately. > > > > > > http.agent.description > > Further description of our bot- this text is used in > the User-Agent header. It appears in parenthesis after the agent name. > > > > > http.agent.url > > A URL to advertise in the User-Agent header. This will >appear in parenthesis after the agent name. Custom dictates that this >should be a URL of a page explaining the purpose and behavior of this >crawler. > > > > > http.agent.email > > An email address to advertise in the HTTP 'From' request >header and User-Agent header. A good practice is to mangle this >address (e.g. 'info at example dot com') to avoid spamming. > > > > > but still there's an error > > also please throw some light on the searching of info through the web > interface after the crawl is made successful > -- > With Regards > Karan Thakral > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
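A minimal nutch-site.xml fragment that clears the "agent not configured" error would fill in at least the agent name; "MyTestCrawler" below is a placeholder, not a recommended value, and the other http.agent.* properties quoted above should be filled in the same way:

```xml
<!-- Sketch of a minimal nutch-site.xml agent setup.
     "MyTestCrawler" is a placeholder you must replace. -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
  <description>HTTP 'User-Agent' request header.</description>
</property>
```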
Re: fetch failing while crawling
Yeah, you still don't have the agent configured. All your values for the agent (the ones that need a value) are blank. So, you need to at least configure an agent name. On 6/15/07, karan thakral <[EMAIL PROTECTED]> wrote: I'm using crawl on cygwin while working on windows but the crawl output is not proper during fetch it says fetch: the document could not be fetched java runtime exception agent not configured my nutch-site.xml is as follows http.agent.name HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately. http.agent.description Further description of our bot- this text is used in the User-Agent header. It appears in parenthesis after the agent name. http.agent.url A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler. http.agent.email An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming. but still there's an error also please throw some light on the searching of info through the web interface after the crawl is made successful -- With Regards Karan Thakral -- "Conscious decisions by conscious minds are what make reality real"
Re: Explanation of topN
Well, the quick/simple explanation is: if you have 5 URLs with their associated Nutch scores: http://a.com/something1 = 5.0 http://b.com/something2 = 4.0 http://c.com/something3 = 3.0 http://d.com/something4 = 2.0 http://e.com/something5 = 1.0 and you then set Nutch to crawl with topN = 3, then a, b, c will be fetched and d and e will not. It just means "give me the 3 best ranking URLs" from the current crawl database. On 6/8/07, monkeynuts84 <[EMAIL PROTECTED]> wrote: Can someone give me an explanation of what topN does? I've seen various pieces of info but some of them seem to be conflicting. I've noticed in my crawls that certain sites are crawled more than others in each iteration of a fetch. Is this caused by topN? Thanks. -- View this message in context: http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11033441 Sent from the Nutch - User mailing list archive at Nabble.com. -- "Conscious decisions by conscious minds are what make reality real"
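The selection described above amounts to sorting candidates by score and keeping the N best. This is an illustration of that idea only, not Nutch's actual generator code; the class and method names are made up:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of what topN amounts to: rank candidate URLs by score
// (descending) and keep the N best. Not Nutch's generator code.
public class TopN {
    public static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Scores mirror the example in the post above.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("http://a.com/something1", 5.0);
        scores.put("http://b.com/something2", 4.0);
        scores.put("http://c.com/something3", 3.0);
        scores.put("http://d.com/something4", 2.0);
        scores.put("http://e.com/something5", 1.0);
        // With topN = 3, only a, b, c make the fetch list.
        System.out.println(topN(scores, 3));
    }
}
```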
Re: indexing only special documents
Ronny, your way is probably better. See, I was only dealing with the fetched properties. But, in your case, you don't fetch it, which gets rid of all that wasted bandwidth. For dealing with types that can be dealt with via the file extension, this would probably work better. On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote: Hi. Configure crawl-urlfilter.txt. Thus you want to add something like +\.pdf$ I guess another way would be to exclude all others. Try expanding the line below with html, doc, xls, ppt, etc. -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$ Or try including +\.pdf$ # -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$ Followed by -. Haven't tried it myself, but experiment some and I guess you figure it out pretty soon. Regards, Ronny -Original message- From: Martin Kammerlander [mailto:[EMAIL PROTECTED] Sent: 6 June 2007 20:30 To: nutch-user@lucene.apache.org Subject: indexing only special documents hi! I have a question. If I have for example the seed urls and do a crawl based on those seeds. If I want to index then only pages that contain for example pdf documents, how can I do that? cheers martin -- "Conscious decisions by conscious minds are what make reality real"
Re: indexing only special documents
All the plugins are in the nutch source distribution and are found in: /src/plugins There is nothing that really provides near real-time statistics other than the logging. I am planning on writing a few analysis plugins, perhaps just using aspects, to allow a JMX client to monitor the process (while trying not to be so invasive as to affect performance). I haven't done it yet, but I don't see plugin creation as "too difficult" (if you are comfortable with parsing). There are some processes that you could run that can dump metadata and other useful info for looking into your segments and url databases. Just run: /bin/nutch It will show you the options to run for reading the data. You can find out how many urls were successfully fetched, how many failed, total number of urls, etc. Look at the nutch 0.8 wiki entry http://wiki.apache.org/nutch/08CommandLineOptions . It just shows the shell output for the nutch options to run. It will give you an idea of what is available. For finding how many documents were fetched of specific types you would be better off just using the search bean and basically, using lucene to find out those things. Otherwise you would have to write your own implementation to read the data. I am learning more about nutch every day so, I can't claim everything I have said is 100% correct. On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote: Wow thx Briggs that's pretty cool and it looks easy :) great!! I will try this out right tomorrow..bit late now here. Another 2 additional questions: 1. Those "parse" plugins - where do I find them in the nutch source code? Is it possible and easy to write my own parser plugin...cause I think I'm gonna need some additional non standard parser plugin(s). 2. When I do a crawl, is it possible that I can activate or see some statistics in nutch for that? I mean that at the end of the indexing process it shows me how many urls nutch had parsed and how many of them contained e.g.
pdfs and additionally how long the crawling and indexing process took and so on? thx for support martin Quoting Briggs <[EMAIL PROTECTED]>: > You set that up in your nutch-site.xml file. Open the > nutch-default.xml file (located in the conf directory). Look > for this element: > > > plugin.includes > protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) > Regular expression naming plugin directory names to > include. Any plugin not matching this expression is excluded. > In any case you need at least include the nutch-extensionpoints plugin. By > default Nutch includes crawling just HTML and plain text via HTTP, > and basic indexing and search plugins. In order to use HTTPS please enable > protocol-httpclient, but be aware of possible intermittent problems with > the > underlying commons-httpclient library. > > > > > You'll notice the "parse" plugins that use the regex > "parse-(text|html|pdf|msword|rss)". You remove/add the available > parsers here. So, if you only wanted pdfs, you only use the pdf > parser, "parse-(pdf)" or just "parse-pdf". > > Don't edit the nutch-default file. Create a new nutch-site.xml file > for your customizations. So, basically copy the nutch-default.xml > file, remove everything you don't need to override, and there ya go. > > I believe that is the correct way. > > > On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> > wrote: > > > > > > hi! > > > > I have a question. If I have for example the seed urls and do a crawl based on > > those seeds. If I want to index then only pages that contain for example pdf > > documents, how can I do that? > > > > cheers > > martin > > > > > > > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- "Conscious decisions by conscious minds are what make reality real"
Re: indexing only special documents
You set that up in your nutch-site.xml file. Open the nutch-default.xml file (located in the conf directory). Look for this element: plugin.includes protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. You'll notice the "parse" plugins that use the regex "parse-(text|html|pdf|msword|rss)". You remove/add the available parsers here. So, if you only wanted pdfs, you only use the pdf parser, "parse-(pdf)" or just "parse-pdf". Don't edit the nutch-default file. Create a new nutch-site.xml file for your customizations. So, basically copy the nutch-default.xml file, remove everything you don't need to override, and there ya go. I believe that is the correct way. On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote: hi! I have a question. If I have for example the seed urls and do a crawl based on those seeds. If I want to index then only pages that contain for example pdf documents, how can I do that? cheers martin -- "Conscious decisions by conscious minds are what make reality real"
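Under the advice above, a nutch-site.xml override restricting the parsers might look like the following. The plugin list is abridged from the nutch-default value quoted in the post; adjust it to the plugins your build actually ships:

```xml
<!-- Sketch of a nutch-site.xml override that keeps only text, HTML
     and PDF parsing. List abridged from nutch-default.xml; keep the
     nutch-extensionpoints plugin in any case. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```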
Re: urls/nutch in local is invalid
I haven't heard of an IRC channel for it, but that would be cool. On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote: I see now what's causing the error. /urls/nutch is a file...but you have to give as input only the urls folder, not the file as I did ;) ps: is there an irc channel for nutch or 'only' mailing list? thx martin Quoting Briggs <[EMAIL PROTECTED]>: > is urls/nutch a file or directory? > > On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> > wrote: > > Hi > > > > I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. > > Unfortunately I get the following error: > > > > [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test -depth 10 > > crawl started in: crawl.test > > rootUrlDir = urls/nutch > > threads = 10 > > depth = 10 > > Injector: starting > > Injector: crawlDb: crawl.test/crawldb > > Injector: urlDir: urls/nutch > > Injector: Converting injected urls to crawl db entries. > > Exception in thread "main" java.io.IOException: Input directory > > /scratch/nutch-0.8.1/urls/nutch in local is invalid. > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274) > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327) > > at org.apache.nutch.crawl.Injector.inject(Injector.java:138) > > at org.apache.nutch.crawl.Crawl.main(Crawl.java:105) > > > > Any ideas what is causing that? > > > > regards > > martin > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- "Conscious decisions by conscious minds are what make reality real"
Re: urls/nutch in local is invalid
is urls/nutch a file or directory? On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote: Hi I wanted to start a crawl like it is done in the nutch 0.8.x tutorial. Unfortunately I get the following error: [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test -depth 10 crawl started in: crawl.test rootUrlDir = urls/nutch threads = 10 depth = 10 Injector: starting Injector: crawlDb: crawl.test/crawldb Injector: urlDir: urls/nutch Injector: Converting injected urls to crawl db entries. Exception in thread "main" java.io.IOException: Input directory /scratch/nutch-0.8.1/urls/nutch in local is invalid. at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327) at org.apache.nutch.crawl.Injector.inject(Injector.java:138) at org.apache.nutch.crawl.Crawl.main(Crawl.java:105) Any ideas what is causing that? regards martin -- "Conscious decisions by conscious minds are what make reality real"
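A minimal layout that avoids this error might look like the following sketch; the seed file name and URL are arbitrary, the point is only that the argument to the crawl command is the directory, not a file inside it:

```shell
# Sketch: the injector expects a *directory* of seed files.
# Names here are illustrative; done in a scratch dir.
WORK=$(mktemp -d)
cd "$WORK"
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/seeds.txt
# Then, from the Nutch install dir, you would run:
#   bin/nutch crawl urls -dir crawl.test -depth 10
ls urls
```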
Re: Loading mechnism of plugin classes and singleton objects
This is all I did (and from what I have read, double-checked locking works correctly in JDK 5 with a volatile field):

    private static volatile IndexingFilters INSTANCE;

    public static IndexingFilters getInstance(final Configuration configuration) {
        if (INSTANCE == null) {
            synchronized (IndexingFilters.class) {
                if (INSTANCE == null) {
                    INSTANCE = new IndexingFilters(configuration);
                }
            }
        }
        return INSTANCE;
    }

So, I just updated all the code that calls "new IndexingFilters(..)" to call IndexingFilters.getInstance(...). This works for me, perhaps not everyone. I think that the filter interface should be refitted to allow the configuration instance to be passed along to the filters too, or allow a way for the thread to obtain its current configuration, rather than instantiating these things over and over again. If a filter is designed to be thread-safe, there is no need for all this unnecessary object creation. On 6/6/07, Briggs <[EMAIL PROTECTED]> wrote: FYI, I ran into the same problem. I wanted my filters to be instantiated only once, and they not only get instantiated repeatedly, but the classloading is flawed in that it keeps reloading the classes. So, if you ever dump the stats from your app (use 'jmap -histo') you can see all the classes that have been loaded. You will notice, if you have been running nutch for a while, classes being loaded thousands of times and never unloaded. My quick fix was to just edit all the main plugin points (URLFilters.java, IndexFilters.java etc.) and make them all singletons. I haven't had time to look into the classloading facility. There is a bit of a bug in there (IMHO), but some people may not want singletons. But, there needs to be a way of just instantiating a new plugin, and not instantiating a new classloader every time a plugin is requested. These seem to never get garbage collected. Anyway, that's all I have to say at the moment. On 6/5/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: > > Hi, > > It seems that plugin-loading code is somehow broken.
There is some > discussion going on about this on > http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y . > > On 6/5/07, Enzo Michelangeli < [EMAIL PROTECTED]> wrote: > > I have a question about the loading mechanism of plugin classes. I'm > working > > with a custom URLFilter, and I need a singleton object loaded and > > initialized by the first instance of the URLFilter, and shared by > other > > instances (e.g., instantiated by other threads). I was assuming that > the > > URLFilter class was being loaded only once even when the filter is > used by > > multiple threads, so I tried to use a static member variable of my > URLFilter > > class to hold a reference to the object to be shared: but it appears > that > > the supposed singleton, actually, isn't, because the method > responsible for > > its instantiation finds the static field initialized to null. So: are > > URLFilter classes loaded multiple times by their classloader in Nutch? > The > > wiki page at > > > http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem > > seems to suggest otherwise: > > > > Until Nutch runtime, only one instance of such a plugin > > class is alive in the Java virtual machine. > > > > (By the way, what does "Until Nutch runtime" mean here? Before Nutch > > runtime, no class whatsoever is supposed to be alive in the JVM, is > it?) > > > > Enzo > > > > > > -- > Doğacan Güney > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
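As an aside on the singleton pattern discussed above: when no argument is needed at first use, the initialization-on-demand holder idiom gives the same lazy, thread-safe initialization without volatile or explicit locking, because the JVM guarantees a class is initialized at most once. Filters is an illustrative name, not a Nutch class; note this idiom cannot accept the Configuration parameter that the getInstance(...) in the thread takes, which is one reason to prefer the double-checked locking version there:

```java
// Initialization-on-demand holder idiom: Holder is only initialized
// on the first call to getInstance(), and class initialization is
// guaranteed by the JVM to happen exactly once, thread-safely.
public class Filters {
    private Filters() {}

    private static class Holder {
        static final Filters INSTANCE = new Filters();
    }

    public static Filters getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // Every caller sees the same instance.
        System.out.println(Filters.getInstance() == Filters.getInstance()); // prints true
    }
}
```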
Re: Loading mechnism of plugin classes and singleton objects
FYI, I ran into the same problem. I wanted my filters to be instantiated only once, but they not only get instantiated repeatedly, the classloading is also flawed in that it keeps reloading the classes. So, if you ever dump the stats from your app (use 'jmap -histo') you can see all the classes that have been loaded. You will notice, if you have been running nutch for a while, classes being loaded thousands of times and never unloaded. My quick fix was to just edit all the main plugin points (URLFilters.java, IndexingFilters.java, etc.) and make them all singletons. I haven't had time to look into the classloading facility. There is a bit of a bug in there (IMHO), but some people may not want singletons. Still, there needs to be a way of just instantiating a new plugin without instantiating a new classloader every time a plugin is requested; these seem never to get garbage collected. Anyway, that's all I have to say at the moment. On 6/5/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: Hi, It seems that plugin-loading code is somehow broken. There is some discussion going on about this on http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y . On 6/5/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote: > I have a question about the loading mechanism of plugin classes. I'm working > with a custom URLFilter, and I need a singleton object loaded and > initialized by the first instance of the URLFilter, and shared by other > instances (e.g., instantiated by other threads). I was assuming that the > URLFilter class was being loaded only once even when the filter is used by > multiple threads, so I tried to use a static member variable of my URLFilter > class to hold a reference to the object to be shared: but it appears that > the supposed singleton, actually, isn't, because the method responsible for > its instantiation finds the static field initialized to null. So: are > URLFilter classes loaded multiple times by their classloader in Nutch? 
The > wiki page at > http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem > seems to suggest otherwise: > > Until Nutch runtime, only one instance of such a plugin > class is alive in the Java virtual machine. > > (By the way, what does "Until Nutch runtime" mean here? Before Nutch > runtime, no class whatsoever is supposed to be alive in the JVM, is it?) > > Enzo > > -- Doğacan Güney -- "Conscious decisions by conscious minds are what make reality real"
Re: Content Type Not Resolved Correctly?
Doh! Again, I missed that. Thanks... Just wish it had a better explanation. On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: Hi, On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > Here is another example that keeps saying it can't parse it... > > SegmentReader: get ' > http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir' > Content:: > Version: 2 > url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir > base: > http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir > contentType: > metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5 > Content: > > These are the headers: > > HTTP/1.1 200 OK > Date: Fri, 01 Jun 2007 15:38:15 GMT > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > Window-Target: _top > X-Highwire-SessionId: nh2ukcdpv1.JS1 > Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/ > Transfer-Encoding: chunked > Content-Type: text/html > > > > So, that's it. any ideas? In both examples nutch wasn't able to fetch the page. When a url can't be fetched, fetcher creates an empty content for it. That's why you can't parse them, there is nothing to parse:). You can't fetch http://hea.sagepub.com/cgi/alerts and http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir because both hosts have robots.txt files that disallow access to your urls. > > > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > > > > > > So, here is one: > > > > http://hea.sagepub.com/cgi/alerts > > > > Segment Reader reports: > > > > Content:: > > Version: 2 > > url: http://hea.sagepub.com/cgi/alerts > > base: http://hea.sagepub.com/cgi/alerts > > contentType: > > metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168 > > Content: > > > > So, I notice when I try to crawl that url specifically, I get a job failed > > (array index out of bounds -1 exception). 
> > > > But if I use curl like: > > > > curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt > > > > I get content and the headers are: > > > > HTTP/1.1 200 OK > > Date: Fri, 01 Jun 2007 15:03:28 GMT > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > > Cache-Control: no-store > > X-Highwire-SessionId: xlz2cgcww1.JS1 > > Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/ > > Transfer-Encoding: chunked > > Content-Type: text/html > > > > So, I'm lost. > > > > > > On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: > > > > > > Hi, > > > > > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > > > > So, I have been having huge problems with parsing. It seems that many > > > > urls are being ignored because the parser plugins throw and exception > > > > saying there is no parser found for, what is reportedly, and > > > > unresolved contentType. So, if you look at the exception: > > > > > > > > org.apache.nutch.parse.ParseException: parser not found for > > > > contentType= url= > > > http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl > > > > > > > > You can see that it says the contentType is "". But, if you look at > > > > the headers for this request you can see that the Content-Type header > > > > is set at "text/html": > > > > > > > > HTTP/1.1 200 OK > > > > Date: Fri, 01 Jun 2007 13:54:19 GMT > > > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > > > > Cache-Control: no-store > > > > X-Highwire-SessionId: y1851mbb91.JS1 > > > > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/ > > > > Transfer-Encoding: chunked > > > > Content-Type: text/html > > > > > > > > Is there something that I have set up wrong? This happens on a LOT of > > > > > > > pages/sites. 
My current plugins are set at: > > > > > > > > > > > "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) > > > > > > > > > > > > > > > Here is another URL: > > > > > > > > http://www.bionews.org.uk/ > > > > > > > > > > > > Same issue with parsing (parrser not found for contentType= > > > &g
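The robots.txt diagnosis quoted above (both hosts disallow those urls, so the fetcher stores empty content and there is nothing to parse) can be illustrated with a toy Disallow matcher. This is a hypothetical sketch, far simpler than Nutch's real RobotRulesParser, which also handles user-agent sections, caching, and more:

```java
import java.util.ArrayList;
import java.util.List;

// Toy robots.txt Disallow check: a URL path is blocked when it starts with
// any Disallow prefix. (Illustrative only; not Nutch's actual implementation.)
class ToyRobotRules {
    private final List<String> disallowed = new ArrayList<String>();

    void addDisallow(String pathPrefix) {
        disallowed.add(pathPrefix);
    }

    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            // Empty Disallow lines mean "allow everything", so skip them.
            if (prefix.length() > 0 && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With a rule like "Disallow: /cgi/", a fetch of /cgi/alerts is skipped before any content is stored, which matches the empty Content records in the segment dumps above.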
Re: Content Type Not Resolved Correctly?
Here is another example that keeps saying it can't parse it... SegmentReader: get ' http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir' Content:: Version: 2 url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir base: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir contentType: metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5 Content: These are the headers: HTTP/1.1 200 OK Date: Fri, 01 Jun 2007 15:38:15 GMT Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 Window-Target: _top X-Highwire-SessionId: nh2ukcdpv1.JS1 Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/ Transfer-Encoding: chunked Content-Type: text/html So, that's it. any ideas? On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: So, here is one: http://hea.sagepub.com/cgi/alerts Segment Reader reports: Content:: Version: 2 url: http://hea.sagepub.com/cgi/alerts base: http://hea.sagepub.com/cgi/alerts contentType: metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168 Content: So, I notice when I try to crawl that url specifically, I get a job failed (array index out of bounds -1 exception). But if I use curl like: curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt I get content and the headers are: HTTP/1.1 200 OK Date: Fri, 01 Jun 2007 15:03:28 GMT Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 Cache-Control: no-store X-Highwire-SessionId: xlz2cgcww1.JS1 Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/ Transfer-Encoding: chunked Content-Type: text/html So, I'm lost. On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: > > Hi, > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > > So, I have been having huge problems with parsing. It seems that many > > urls are being ignored because the parser plugins throw and exception > > saying there is no parser found for, what is reportedly, and > > unresolved contentType. 
So, if you look at the exception: > > > > org.apache.nutch.parse.ParseException: parser not found for > > contentType= url= > http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl > > > > You can see that it says the contentType is "". But, if you look at > > the headers for this request you can see that the Content-Type header > > is set at "text/html": > > > > HTTP/1.1 200 OK > > Date: Fri, 01 Jun 2007 13:54:19 GMT > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > > Cache-Control: no-store > > X-Highwire-SessionId: y1851mbb91.JS1 > > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/ > > Transfer-Encoding: chunked > > Content-Type: text/html > > > > Is there something that I have set up wrong? This happens on a LOT of > > > pages/sites. My current plugins are set at: > > > > > "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) > > > > > > > Here is another URL: > > > > http://www.bionews.org.uk/ > > > > > > Same issue with parsing (parrser not found for contentType= > > url= http://www.bionews.org.uk/), but the header says: > > > > HTTP/1.0 200 OK > > Server: Lasso/3.6.5 ID/ACGI > > MIME-Version: 1.0 > > Content-type: text/html > > Content-length: 69417 > > > > > > Any clues? Does nutch look at the headers or not? > > Can you do a > bin/nutch readseg -get -noparse -noparsetext > -noparsedata -nofetch -nogenerate > > And send the result? This should show use what nutch fetched as content. > > > > > > > > -- > > "Conscious decisions by conscious minds are what make reality real" > > > > > -- > Doğacan Güney > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: Content Type Not Resolved Correctly?
So, here is one: http://hea.sagepub.com/cgi/alerts Segment Reader reports: Content:: Version: 2 url: http://hea.sagepub.com/cgi/alerts base: http://hea.sagepub.com/cgi/alerts contentType: metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168 Content: So, I notice when I try to crawl that url specifically, I get a job failed (array index out of bounds -1 exception). But if I use curl like: curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt I get content and the headers are: HTTP/1.1 200 OK Date: Fri, 01 Jun 2007 15:03:28 GMT Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 Cache-Control: no-store X-Highwire-SessionId: xlz2cgcww1.JS1 Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/ Transfer-Encoding: chunked Content-Type: text/html So, I'm lost. On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: Hi, On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > So, I have been having huge problems with parsing. It seems that many > urls are being ignored because the parser plugins throw and exception > saying there is no parser found for, what is reportedly, and > unresolved contentType. So, if you look at the exception: > > org.apache.nutch.parse.ParseException: parser not found for > contentType= url= http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl > > You can see that it says the contentType is "". But, if you look at > the headers for this request you can see that the Content-Type header > is set at "text/html": > > HTTP/1.1 200 OK > Date: Fri, 01 Jun 2007 13:54:19 GMT > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > Cache-Control: no-store > X-Highwire-SessionId: y1851mbb91.JS1 > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/ > Transfer-Encoding: chunked > Content-Type: text/html > > Is there something that I have set up wrong? This happens on a LOT of > pages/sites. 
My current plugins are set at: > > "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) > > > Here is another URL: > > http://www.bionews.org.uk/ > > > Same issue with parsing (parrser not found for contentType= > url=http://www.bionews.org.uk/), but the header says: > > HTTP/1.0 200 OK > Server: Lasso/3.6.5 ID/ACGI > MIME-Version: 1.0 > Content-type: text/html > Content-length: 69417 > > > Any clues? Does nutch look at the headers or not? Can you do a bin/nutch readseg -get -noparse -noparsetext -noparsedata -nofetch -nogenerate And send the result? This should show use what nutch fetched as content. > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- Doğacan Güney -- "Conscious decisions by conscious minds are what make reality real"
Re: Content Type Not Resolved Correctly?
Looking into the first URL. Don't look at the second; I screwed up on that one - it's a bad example, since it's disallowed by robots.txt. I'm working on finding the segment for the first. Thanks for your quick response; I'll be getting right back to you. On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote: Hi, On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote: > So, I have been having huge problems with parsing. It seems that many > urls are being ignored because the parser plugins throw and exception > saying there is no parser found for, what is reportedly, and > unresolved contentType. So, if you look at the exception: > > org.apache.nutch.parse.ParseException: parser not found for > contentType= url= http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl > > You can see that it says the contentType is "". But, if you look at > the headers for this request you can see that the Content-Type header > is set at "text/html": > > HTTP/1.1 200 OK > Date: Fri, 01 Jun 2007 13:54:19 GMT > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2 > Cache-Control: no-store > X-Highwire-SessionId: y1851mbb91.JS1 > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/ > Transfer-Encoding: chunked > Content-Type: text/html > > Is there something that I have set up wrong? This happens on a LOT of > pages/sites. My current plugins are set at: > > "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) > > > Here is another URL: > > http://www.bionews.org.uk/ > > > Same issue with parsing (parrser not found for contentType= > url=http://www.bionews.org.uk/), but the header says: > > HTTP/1.0 200 OK > Server: Lasso/3.6.5 ID/ACGI > MIME-Version: 1.0 > Content-type: text/html > Content-length: 69417 > > > Any clues? Does nutch look at the headers or not? 
Can you do a bin/nutch readseg -get <segment dir> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate and send the result? This should show us what nutch fetched as content. > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- Doğacan Güney -- "Conscious decisions by conscious minds are what make reality real"
Content Type Not Resolved Correctly?
So, I have been having huge problems with parsing. It seems that many urls are being ignored because the parser plugins throw an exception saying there is no parser found for what is, reportedly, an unresolved contentType. So, if you look at the exception:

org.apache.nutch.parse.ParseException: parser not found for contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

You can see that it says the contentType is "". But, if you look at the headers for this request, you can see that the Content-Type header is set to "text/html":

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

Is there something that I have set up wrong? This happens on a LOT of pages/sites. My current plugins are set to:

"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"

Here is another URL: http://www.bionews.org.uk/

Same issue with parsing (parser not found for contentType= url=http://www.bionews.org.uk/), but the header says:

HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417

Any clues? Does nutch look at the headers or not? -- "Conscious decisions by conscious minds are what make reality real"
Speed up indexing....
Anyone have any good configuration ideas for indexing/merging with 0.9 using hadoop on a local fs? Our segment merging is taking an extremely long time compared with nutch 0.7. Currently, I am trying to merge 300 segments, which amounts to about 1 GB of data. It has taken hours to merge, and it's still not done. This box has dual Xeon 2.8GHz processors with 4 GB of RAM. So, I figure there must be a better setup in the mapred-default.xml for a single machine. Do I increase the file size for I/O buffers, sort buffers, etc.? Do I reduce the number of tasks or increase them? I'm at a loss. Any advice would be greatly appreciated. -- "Conscious decisions by conscious minds are what make reality real"
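For a single machine, the knobs worth experimenting with live in hadoop-site.xml, which overrides mapred-default.xml. A sketch with illustrative values only; the property names are from the Hadoop 0.12.x line that Nutch 0.9 bundles, and the right numbers depend on the workload, so treat these as starting points for experiments rather than recommendations:

```xml
<!-- hadoop-site.xml: illustrative single-machine tuning, not tested values -->
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>   <!-- bigger I/O buffers; default is 4096 -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>     <!-- more memory per in-memory sort; default is 100 -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>     <!-- merge more streams per pass; default is 10 -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>       <!-- roughly one per core on a single box -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
```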
Re: Nutch on Windows. ssh: command not found
So, when in cygwin, if you type 'ssh' (without the quotes), do you get the same error? If so, then you need to go back into the cygwin setup and install ssh. On 5/30/07, Ilya Vishnevsky <[EMAIL PROTECTED]> wrote: Hello. I try to run the shell scripts that start Nutch. I use Windows XP, so I installed cygwin. When I execute bin/start-all.sh, I get the following messages: localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh: command not found Could you help me with this problem? -- "Conscious decisions by conscious minds are what make reality real"
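To check that directly, a quick test in the cygwin shell (a generic sketch; the package to select in cygwin's setup program is openssh):

```shell
# Report whether ssh is on the PATH before retrying bin/start-all.sh.
if command -v ssh >/dev/null 2>&1; then
    echo "ssh found at: $(command -v ssh)"
else
    echo "ssh not found - rerun cygwin setup and select the openssh package"
fi
```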
Re: Problem crawling in Nutch 0.9
Just curious, did you happen to limit the number of urls using the "topN" switch? On 5/14/07, Annona Keene <[EMAIL PROTECTED]> wrote: I recently upgraded to 0.9, and I've started encountering a problem. I began with a single url and crawled with a depth of 10, assuming I would get every page on my site. This same configuration worked for me in 0.8. However, I noticed a particular url that I was especially interested in was not in the index. So I added the url explicitly and crawled again. And it still was not in the index. So I checked the logs, and it is being fetched. So I tried a lower depth, and it worked. With a depth of 6, the url does appear in the index. Any ideas on what would be causing this? I'm very confused. Thanks, Ann Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/ -- "Conscious decisions by conscious minds are what make reality real"
Re: Nutch Indexer
Man, I should proofread this stuff before I send them. That is all I have to say. On 5/1/07, Briggs <[EMAIL PROTECTED]> wrote: I would assume that it need these for handling the indexing of the link scores. Lucene puts no scoring weight on things such as urls, page rank and such. Since lucene only indexes documents, and calculates its keyword/query relevancy based only on term vectors (or whatever) nutch needs to add the url scoring and such to the index. On 5/1/07, hzhong <[EMAIL PROTECTED]> wrote: > > Hello, > > In Indexer.java, index(Path indexDir, Path crawlDb, Path linkDb, Path[] > segments), can someone explain to me why crawlDB and linkDB is needed for > indexing? > > In Lucene, there's no crawlDB and linkDB for indexing. > > Thank you very much > > Hanna > -- > View this message in context: http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625 > Sent from the Nutch - User mailing list archive at Nabble.com. > > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: Nutch Indexer
I would assume that it needs these for handling the indexing of the link scores. Lucene puts no scoring weight on things such as urls, page rank and the like. Since lucene only indexes documents, and calculates its keyword/query relevancy based only on term vectors (or whatever), nutch needs to add the url scoring and such to the index. On 5/1/07, hzhong <[EMAIL PROTECTED]> wrote: Hello, In Indexer.java, index(Path indexDir, Path crawlDb, Path linkDb, Path[] segments), can someone explain to me why crawlDB and linkDB is needed for indexing? In Lucene, there's no crawlDB and linkDB for indexing. Thank you very much Hanna -- View this message in context: http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625 Sent from the Nutch - User mailing list archive at Nabble.com. -- "Conscious decisions by conscious minds are what make reality real"
Re: Nutch and running crawls within a container.
I'll look around the code to make sure I am creating only one instance of Configuration in my classes, and will play around with the maxpermgen settings. Any other input from people that have attempted this sort of setup would be appreciated. On 4/30/07, Briggs <[EMAIL PROTECTED]> wrote: Well, in nutch 0.7 it was all due to NGramEntry instances held within hashmaps that never get cleaned up. This code was in the language plugin, but it has been moved into the nutch codebase. That wasn't the only problem, but that was a big one. I though removing it would solve the problem, but then another creeped up. On 4/30/07, Sami Siren <[EMAIL PROTECTED]> wrote: > Briggs wrote: > > Version: Nutch 0.9 (but this applies to just about all versions) > > > > I'm really in a bind. > > > > Is anyone crawling from within a web application, or is everyone > > running Nutch using the shell scripts provided? I am trying to write > > a web application around the Nutch crawling facilities, but it seems > > that there is are huge memory issues when trying to do this. The > > container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K > > on the stack) runs out of memory in less that an hour. When profiling > > version 0.7.2 we can see that there is a constant pool of objects that > > grow, but never get garbage collected. So, even when the crawl is > > finished, these objects tend to just hang around forever, until we get > > the wonderful: java.lang.OutOfMemoryError: PermGen space. I updated > > the application to use Nutch 0.9 and the problem got about 80x worse > > Have you analyzed in any level of detail what is causing this memory > wasting? Have you tried tweaking jvms XX:MaxPermSize? > > I believe that all the classes required by plugins need to be loaded > multiple times (every time you execute a command where Configuration > object is created) because of the design of plugin system where every > plugin has it's own class loader (per configuration). 
> > > So, the current design is/was to have an event happen within the > > system, which would fire off a crawler (currently just calls > > org.apache.nutch.crawl.Crawl.main()). But, this has caused nothing > > but grief. We need to have several crawlers running concurrently. We > > You should perhaps use and call the classes directly and take control of > managing the Configuration object, this way PermGen size is not wasted > by loading same classes over and over again. > > -- > Sami Siren > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: Nutch and running crawls within a container.
Well, in nutch 0.7 it was all due to NGramEntry instances held within hashmaps that never get cleaned up. This code was in the language plugin, but it has been moved into the nutch codebase. That wasn't the only problem, but it was a big one. I thought removing it would solve the problem, but then another crept up. On 4/30/07, Sami Siren <[EMAIL PROTECTED]> wrote: Briggs wrote: > Version: Nutch 0.9 (but this applies to just about all versions) > > I'm really in a bind. > > Is anyone crawling from within a web application, or is everyone > running Nutch using the shell scripts provided? I am trying to write > a web application around the Nutch crawling facilities, but it seems > that there is are huge memory issues when trying to do this. The > container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K > on the stack) runs out of memory in less that an hour. When profiling > version 0.7.2 we can see that there is a constant pool of objects that > grow, but never get garbage collected. So, even when the crawl is > finished, these objects tend to just hang around forever, until we get > the wonderful: java.lang.OutOfMemoryError: PermGen space. I updated > the application to use Nutch 0.9 and the problem got about 80x worse Have you analyzed in any level of detail what is causing this memory wasting? Have you tried tweaking jvms XX:MaxPermSize? I believe that all the classes required by plugins need to be loaded multiple times (every time you execute a command where Configuration object is created) because of the design of plugin system where every plugin has it's own class loader (per configuration). > So, the current design is/was to have an event happen within the > system, which would fire off a crawler (currently just calls > org.apache.nutch.crawl.Crawl.main()). But, this has caused nothing > but grief. We need to have several crawlers running concurrently. 
We You should perhaps use and call the classes directly and take control of managing the Configuration object, this way PermGen size is not wasted by loading same classes over and over again. -- Sami Siren -- "Conscious decisions by conscious minds are what make reality real"
Nutch and running crawls within a container.
Version: Nutch 0.9 (but this applies to just about all versions) I'm really in a bind. Is anyone crawling from within a web application, or is everyone running Nutch using the shell scripts provided? I am trying to write a web application around the Nutch crawling facilities, but it seems that there are huge memory issues when trying to do this. The container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K on the stack) runs out of memory in less than an hour. When profiling version 0.7.2 we can see that there is a constant pool of objects that grows, but never gets garbage collected. So, even when the crawl is finished, these objects tend to just hang around forever, until we get the wonderful: java.lang.OutOfMemoryError: PermGen space. I updated the application to use Nutch 0.9 and the problem got about 80x worse (it used to run for about 16 hours; now it runs out of memory in 20 minutes). We were using 5 concurrent crawlers, meaning we have Crawl.main running 5 times within the application. So, the current design is/was to have an event happen within the system, which would fire off a crawler (currently just calls org.apache.nutch.crawl.Crawl.main()). But, this has caused nothing but grief. We need to have several crawlers running concurrently. We didn't want large 'batch' jobs. The requirement is to crawl a domain as it comes into the system, not wait for days or hours to run the job. Has anyone else attempted to run the crawl in this manner? Have you run into the same problems? Does controlling the fetcher and all the other instances needed for crawling solve this issue? There is nothing in the org.apache.nutch.crawl.Crawl instance, from what I have seen in the past, that would cause such a memory leak. This must be way down somewhere else in the code. Since Nutch handles so much of its threading, could this be causing the problem? I am not sure if I should x-post this to the dev group or not. Anyway, thanks. 
Briggs -- "Conscious decisions by conscious minds are what make reality real"
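Since the failure mode is PermGen exhaustion, one stopgap while hunting the leak is to enlarge the permanent generation and let the collector unload classes, as Sami suggests above. A hypothetical fragment for Tomcat's startup environment; the flags assume a Sun JDK 5 HotSpot VM, and the values are examples only, not recommendations:

```shell
# Example CATALINA_OPTS for Tomcat: bigger heap and PermGen, plus CMS
# class unloading so dead plugin classloaders can actually be collected.
CATALINA_OPTS="-Xmx1536m -XX:MaxPermSize=256m \
-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled"
export CATALINA_OPTS
```

This only postpones the OutOfMemoryError if classloaders are still being leaked; reusing a single Configuration object is the actual fix.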
Re: [Nutch-general] Removing pages from index immediately
Well, it looks like the link I sent you goes to the 0.9 version of the nutch api. There is a link error on the nutch project site because the 0.7.2 doc link points to the 0.9 docs. On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote: Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html You would then need to create a filter of 'pruned' urls to ignore if they are discovered again. This list can get quite large, but I really don't know how else to do it. It would be cool if we could hack the crawldb (or webdb I believe in your version) to include a flag of 'good/bad' or something. On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote: > Isn't this what you are looking for? > > org.apache.nutch.tools.PruneIndexTool. > > > > On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote: > > > > hi Enis, > > This is franklin ..currently i m using nutch 0.7.2 for my crawling and > > indexing for my search engine... > > i read from ur message that u can delete a particular index directly?if so > > how its possible..i m desperately searching for a clue to do this one... > > my requirement is to delete the porn site's index from my crawled data... > > ur help is highly needed > > > > expecting u to help me in this regards .. > > > > Thanks in advance.. > > Franklin.S > > > > > > ogjunk-nutch wrote: > > > > > > Hi Enis, > > > > > > Right, I can easily delete the page from the Lucene index, though I'd > > > prefer to follow the Nutch protocol and avoid messing something up by > > > touching the index directly. However, I don't want that page to re-appear > > > in one of the subsequent fetches. Well, it won't re-appear, because it > > > will remain missing, but it would be great to be able to tell Nutch to > > > "forget it" "from everywhere". Is that doable? > > > I could read and re-write the *Db Maps, but that's a lot of IO... just to > > > get a couple of URLs erased. 
I'd prefer a friendly persuasion where Nutch > > > flags a given page as "forget this page as soon as possible" and it just > > > happens later on. > > > > > > Thanks, > > > Otis > > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > > > Simpy -- http://www.simpy.com/ - Tag - Search - Share > > > > > > - Original Message > > > From: Enis Soztutar <[EMAIL PROTECTED]> > > > To: nutch-user@lucene.apache.org > > > Sent: Thursday, April 5, 2007 3:29:55 AM > > > Subject: Re: [Nutch-general] Removing pages from index immediately > > > > > > Since hadoop's map files are write once, it is not possible to delete > > > some urls from the crawldb and linkdb. The only thing you can do is to > > > create the map files once again without the deleted urls. But running > > > the crawl once more as you suggested seems more appropriate. Deleting > > > documents from the index is just lucene stuff. > > > > > > In your case it seems that every once in a while, you crawl the whole > > > site, and create the indexes and db's and then just throw the old one > > > out. And between two crawls you can delete the urls from the index. > > > > > > [EMAIL PROTECTED] wrote: > > >> Hi, > > >> > > >> I'd like to be able to immediately remove certain pages from Nutch > > >> (index, crawldb, linkdb...). > > >> The scenario is that I'm using Nutch to index a single site or a set of > > >> internal sites. Once in a while editors of the site remove a page from > > >> the site. When that happens, I want to update at least the index and > > >> ideally crawldb, linkdb, so that people searching the index don't get the > > >> missing page in results and end up going there, hitting the 404. > > >> > > >> I don't think there is a "direct" way to do this with Nutch, is there? 
> > >> If there really is no direct way to do this, I was thinking I'd just put > > >> the URL of the recently removed page into the first next fetchlist and > > >> then somehow get Nutch to immediately remove that page/URL once it hits a > > >> 404. How does that sound? > > >> > > >> Is there a way to configure Nutch to delete the page after it gets a 404 > > >> for it even just once? I thought I saw the setting for that somewhere
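The "filter of pruned urls" idea from the reply above can be sketched as a tiny stand-alone class (the class name is made up for illustration; a real implementation would be a Nutch URLFilter plugin, whose convention is to return the URL to accept it and null to reject it):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a URLFilter plugin: holds the set of
// "pruned" URLs and rejects them if they are ever discovered again.
// Following the URLFilter convention, filter() returns the URL to
// keep it, or null to drop it.
public class PrunedUrlFilter {
    private final Set<String> pruned = new HashSet<String>();

    // Record a URL that was deleted from the index.
    public void prune(String url) {
        pruned.add(url);
    }

    // Accept the URL unless it has been pruned.
    public String filter(String url) {
        return pruned.contains(url) ? null : url;
    }
}
```

As noted above, the pruned list can grow large, so a production version would want a persistent or more compact representation than an in-memory set.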
Re: [Nutch-general] Removing pages from index immediately
Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html You would then need to create a filter of 'pruned' urls to ignore if they are discovered again. This list can get quite large, but I really don't know how else to do it. It would be cool if we could hack the crawldb (or webdb I believe in your version) to include a flag of 'good/bad' or something. On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote: Isn't this what you are looking for? org.apache.nutch.tools.PruneIndexTool. On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote: > > hi Enis, > This is franklin ..currently i m using nutch 0.7.2 for my crawling and > indexing for my search engine... > i read from ur message that u can delete a particular index directly?if so > how its possible..i m desperately searching for a clue to do this one... > my requirement is to delete the porn site's index from my crawled data... > ur help is highly needed > > expecting u to help me in this regards .. > > Thanks in advance.. > Franklin.S > > > ogjunk-nutch wrote: > > > > Hi Enis, > > > > Right, I can easily delete the page from the Lucene index, though I'd > > prefer to follow the Nutch protocol and avoid messing something up by > > touching the index directly. However, I don't want that page to re-appear > > in one of the subsequent fetches. Well, it won't re-appear, because it > > will remain missing, but it would be great to be able to tell Nutch to > > "forget it" "from everywhere". Is that doable? > > I could read and re-write the *Db Maps, but that's a lot of IO... just to > > get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch > > flags a given page as "forget this page as soon as possible" and it just > > happens later on. > > > > Thanks, > > Otis > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
> > Simpy -- http://www.simpy.com/ - Tag - Search - Share > > > > - Original Message > > From: Enis Soztutar <[EMAIL PROTECTED]> > > To: nutch-user@lucene.apache.org > > Sent: Thursday, April 5, 2007 3:29:55 AM > > Subject: Re: [Nutch-general] Removing pages from index immediately > > > > Since hadoop's map files are write once, it is not possible to delete > > some urls from the crawldb and linkdb. The only thing you can do is to > > create the map files once again without the deleted urls. But running > > the crawl once more as you suggested seems more appropriate. Deleting > > documents from the index is just lucene stuff. > > > > In your case it seems that every once in a while, you crawl the whole > > site, and create the indexes and db's and then just throw the old one > > out. And between two crawls you can delete the urls from the index. > > > > [EMAIL PROTECTED] wrote: > >> Hi, > >> > >> I'd like to be able to immediately remove certain pages from Nutch > >> (index, crawldb, linkdb...). > >> The scenario is that I'm using Nutch to index a single site or a set of > >> internal sites. Once in a while editors of the site remove a page from > >> the site. When that happens, I want to update at least the index and > >> ideally crawldb, linkdb, so that people searching the index don't get the > >> missing page in results and end up going there, hitting the 404. > >> > >> I don't think there is a "direct" way to do this with Nutch, is there? > >> If there really is no direct way to do this, I was thinking I'd just put > >> the URL of the recently removed page into the first next fetchlist and > >> then somehow get Nutch to immediately remove that page/URL once it hits a > >> 404. How does that sound? > >> > >> Is there a way to configure Nutch to delete the page after it gets a 404 > >> for it even just once? I thought I saw the setting for that somewhere a > >> few weeks ago, but now I can't find it. > >> > >> Thanks, > >> Otis > >> . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . > >> Simpy -- http://www.simpy.com/ - Tag - Search - Share > >> > >> > >> > >> > > > > > > - > > Take Surveys. Earn Cash. Influence the Future of IT > > Join SourceForge.net's Techsay panel and you'll get the chance to share > > your > > opinions on IT & business topics through brief surveys-and earn cash > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > > ___ > > Nutch-general mailing list > > [EMAIL PROTECTED] > > https://lists.sourceforge.net/lists/listinfo/nutch-general > > > > > > > > > > > > -- > View this message in context: http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273 > Sent from the Nutch - User mailing list archive at Nabble.com. > > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: [Nutch-general] Removing pages from index immediately
Isn't this what you are looking for? org.apache.nutch.tools.PruneIndexTool. On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote: Hi Enis, This is Franklin. Currently I am using nutch 0.7.2 for the crawling and indexing for my search engine. I read from your message that you can delete a particular index directly? If so, how is it possible? I am desperately searching for a clue on how to do this. My requirement is to delete the porn sites' index from my crawled data. Your help is highly needed; expecting you to help me in this regard. Thanks in advance, Franklin.S ogjunk-nutch wrote: > > Hi Enis, > > Right, I can easily delete the page from the Lucene index, though I'd > prefer to follow the Nutch protocol and avoid messing something up by > touching the index directly. However, I don't want that page to re-appear > in one of the subsequent fetches. Well, it won't re-appear, because it > will remain missing, but it would be great to be able to tell Nutch to > "forget it" "from everywhere". Is that doable? > I could read and re-write the *Db Maps, but that's a lot of IO... just to > get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch > flags a given page as "forget this page as soon as possible" and it just > happens later on. > > Thanks, > Otis > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > Simpy -- http://www.simpy.com/ - Tag - Search - Share > > - Original Message > From: Enis Soztutar <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Thursday, April 5, 2007 3:29:55 AM > Subject: Re: [Nutch-general] Removing pages from index immediately > > Since hadoop's map files are write once, it is not possible to delete > some urls from the crawldb and linkdb. The only thing you can do is to > create the map files once again without the deleted urls. But running > the crawl once more as you suggested seems more appropriate. Deleting > documents from the index is just lucene stuff. 
> > In your case it seems that every once in a while, you crawl the whole > site, and create the indexes and db's and then just throw the old one > out. And between two crawls you can delete the urls from the index. > > [EMAIL PROTECTED] wrote: >> Hi, >> >> I'd like to be able to immediately remove certain pages from Nutch >> (index, crawldb, linkdb...). >> The scenario is that I'm using Nutch to index a single site or a set of >> internal sites. Once in a while editors of the site remove a page from >> the site. When that happens, I want to update at least the index and >> ideally crawldb, linkdb, so that people searching the index don't get the >> missing page in results and end up going there, hitting the 404. >> >> I don't think there is a "direct" way to do this with Nutch, is there? >> If there really is no direct way to do this, I was thinking I'd just put >> the URL of the recently removed page into the first next fetchlist and >> then somehow get Nutch to immediately remove that page/URL once it hits a >> 404. How does that sound? >> >> Is there a way to configure Nutch to delete the page after it gets a 404 >> for it even just once? I thought I saw the setting for that somewhere a >> few weeks ago, but now I can't find it. >> >> Thanks, >> Otis >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >> Simpy -- http://www.simpy.com/ - Tag - Search - Share >> >> >> >> > > > - > Take Surveys. Earn Cash. 
Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share > your > opinions on IT & business topics through brief surveys-and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > ___ > Nutch-general mailing list > [EMAIL PROTECTED] > https://lists.sourceforge.net/lists/listinfo/nutch-general > > > > > -- View this message in context: http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273 Sent from the Nutch - User mailing list archive at Nabble.com. -- "Conscious decisions by conscious minds are what make reality real"
Re: Case Sensitive
I am not 100% sure, but I am 99.99% sure that case does matter. In regard to the domain name, I would say it does not, but anything after it should be case-sensitive. If not, then there is a bug. On 4/26/07, karthik085 <[EMAIL PROTECTED]> wrote: Hi, Does URL case sensitivity matter? In my crawl-urlfilter.txt, I want to 'skip special urls' with -Test. Does that mean it will ignore URLs that contain Test or test? Thanks. -- View this message in context: http://www.nabble.com/Case-Sensitive-tf3654858.html#a10210667 Sent from the Nutch - User mailing list archive at Nabble.com. -- "Conscious decisions by conscious minds are what make reality real"
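The reason case matters is that crawl-urlfilter rules are ordinary Java regular expressions. A quick demonstration (plain Pattern calls, nothing Nutch-specific) shows that a rule body like Test matches only that exact case, while the (?i) embedded flag makes it case-insensitive:

```java
import java.util.regex.Pattern;

// Demonstrates case sensitivity of regex-based URL filter rules.
// A rule like "-Test" skips URLs the regex matches; by default the
// match is case-sensitive, and "(?i)" turns that off.
public class CaseDemo {
    public static boolean matches(String rule, String url) {
        return Pattern.compile(rule).matcher(url).find();
    }
}
```

So `-Test` would skip URLs containing "Test" but not "test"; to skip both you would write the rule as `-(?i)test`.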
Re: Using nutch just for the crawler/fetcher
If you are just looking to have a seed list of domains, and would like to mirror their content for indexing, why not just use the unix tool 'wget'? It will mirror the site on your system and then you can just index that. On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote: Hello, I am hoping to crawl about 3000 domains using the nutch crawler + PrefixURLFilter; however, I have no need to actually index the html. Ideally, I would just like each domain's raw html pages saved into separate directories. We already have a parser that converts the HTML into indexes for our particular application. Is there a clean way to accomplish this? My current idea is to create a python script (similar to the one already on the wiki) that essentially loops through the fetch, update cycles until depth is reached, and then simply never actually does the real lucene indexing and merging. Now, here's the "there must be a better way" part ... I would then simply execute the "bin/nutch readseg -dump" tool via python to extract all the html and headers (for each segment) and then, via a regex, save each html output back into an html file, and store it in a directory according to the domain it came from. How stupid/slow is this? Any better ideas? I saw someone previously mention something like what I want to do, and someone responded that it was better to just roll your own crawler or something? I doubt that for some reason. Also, in the future we'd like to take advantage of the word/pdf downloading/parsing as well. Thanks for what appears to be a great crawler! Sincerely, John -- "Conscious decisions by conscious minds are what make reality real"
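For the "store it in a directory according to the domain it came from" step, a helper along these lines (hypothetical name, just a sketch) could derive the target directory from each fetched URL:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for the "save each page under its domain" idea:
// derive an output directory from the URL's host, falling back to a
// catch-all directory when the URL cannot be parsed.
public class DomainDirs {
    public static String dirFor(String url, String baseDir) {
        try {
            String host = new URL(url).getHost().toLowerCase();
            return baseDir + "/" + host;
        } catch (MalformedURLException e) {
            return baseDir + "/_unparsed";
        }
    }
}
```

Using java.net.URL to pull out the host avoids the brittleness of a hand-rolled regex over the raw dump output.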
Re: Index
Perhaps someone else can chime in on this. I am not sure of exactly what you are asking. The indexing is based on Lucene. So, if you need to understand how the indexing works, you will need to look into the Lucene documentation. If you are only looking to add custom fields and such to the index, you could look into the indexing filters of Nutch. There are examples on the wiki for that too. On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote: Thanks for your help, but I think there is a misunderstanding. I was talking about creating a new index class in java based on specific parameters that I will define. Do you know if there is any web page which can give me more information in order to implement this index in Java? E > On the nutch wiki there is this tutorial: > > http://wiki.apache.org/nutch/NutchHadoopTutorial > > There is also (it is for version 0.8, but can still work with 0.9): > > http://lucene.apache.org/nutch/tutorial8.html > > > On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote: >> Hi Guys, >> >> I would like to create a new custom index. >> Do you know if there is any tutorial, document or web page which can >> help me >> ? >> >> Thanks, >> E >> > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- "Conscious decisions by conscious minds are what make reality real"
Re: Index
On the nutch wiki there is this tutorial: http://wiki.apache.org/nutch/NutchHadoopTutorial There is also (it is for version 0.8, but can still work with 0.9): http://lucene.apache.org/nutch/tutorial8.html On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote: Hi Guys, I would like to create a new custom index. Do you know if there is any tutorial, document or web page which can help me ? Thanks, E -- "Conscious decisions by conscious minds are what make reality real"
Re: How to dump all the valid links which has been crawled?
That one is a bit more complicated because it has to do with the complexities of the underlying scoring algorithm(s). But, basically, that means "give me the top 35 links within the crawl db and put them in the file called 'test'". Top links are ranked by how many other links, from other pages/sites, point to them. Basically, when the crawler crawls, it stores all discovered links within the db. If the crawler finds the same link from multiple resources (other pages) then that link's score goes up. That is just a simple explanation, but I think it is close enough. You may want to look more into the OPIC filter and how that algorithm works, if you really want to get into the grit of the code. You can see how scoring is calculated by running the nutch example web application and clicking on the 'explain' link on a result. On 4/19/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: Can you please tell me what is the meaning of this command? What are the top 35 links? How does nutch rank the top 35 links? "bin/nutch readdb crawl/crawldb -topN 35 test" On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote: > Those links are links that were discovered. It does not mean that they > were fetched, they weren't. > > On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: > > I think I found out the answer to my previous question by doing this: > > > > bin/nutch readlinkdb crawl/linkdb/ -dump test > > > > > > But my next question is why the result shows URLs with 'gif', 'js', etc. > > > > I have this line in my crawl-urlfilter.txt, so I don't expect I will > > crawl things like images, javascript files, > > > > # skip image and other suffixes we can't yet parse > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$ > > > > > > Can you please tell me how to fix my problem? > > > > Thank you. 
> > > > On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > > > I read this article about nutch crawling: > > > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html > > > > > > How can I dump out the valid links which have been crawled? > > > This command described in the article does not work in nutch 0.9. What > > > should I use instead? > > > > > > bin/nutch readdb crawl-tinysite/db -dumplinks > > > > > > Thank you for any help. > > > > > > > > -- > "Conscious decisions by conscious minds are what make reality real" > -- "Conscious decisions by conscious minds are what make reality real"
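The scoring description above can be illustrated with a toy model (a deliberate simplification; real Nutch uses the OPIC scoring algorithm, not a raw inlink count): each discovery of a link bumps its score, and -topN returns the N highest-scored URLs:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "readdb -topN": every discovery of a link raises its
// score, and topN() returns the N highest-scored URLs. Real Nutch
// scoring (OPIC) is more involved than this raw count.
public class TopNDemo {
    private final Map<String, Integer> scores = new HashMap<String, Integer>();

    public void discover(String url) {
        Integer s = scores.get(url);
        scores.put(url, s == null ? 1 : s + 1);
    }

    public List<String> topN(int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(scores.entrySet());
        // Sort by descending score.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```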
Re: How to delete already stored indexed fields???
If you look into the BasicIndexingFilter.java plugin source you will see that this is where those default fields get indexed. So, you can either create a new plugin that is configurable for the properties you want to index, or remove this plugin. Here is the snippet of code that is in the filter:

if (host != null) {
  // add host as un-stored, indexed and tokenized
  doc.add(new Field("host", host, Field.Store.NO, Field.Index.TOKENIZED));
  // add site as un-stored, indexed and un-tokenized
  doc.add(new Field("site", host, Field.Store.NO, Field.Index.UN_TOKENIZED));
}

// url is both stored and indexed, so it's both searchable and returned
doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));

// content is indexed, so that it's searchable, but not stored in index
doc.add(new Field("content", parse.getText(), Field.Store.NO, Field.Index.TOKENIZED));

// anchors are indexed, so they're searchable, but not stored in index
try {
  String[] anchors = (inlinks != null ? inlinks.getAnchors() : new String[0]);
  for (int i = 0; i < anchors.length; i++) {
    doc.add(new Field("anchor", anchors[i], Field.Store.NO, Field.Index.TOKENIZED));
  }
} catch (IOException ioe) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("BasicIndexingFilter: can't get anchors for " + url.toString());
  }
}

On 4/3/07, Ratnesh,V2Solutions India <[EMAIL PROTECTED]> wrote: Exactly, of course, this is what I want. Do you have any solution for this?? Looking forward to your reply. Thnx, Siddharth. Jonathan wrote: > > Do you mean how do you get rid of some of the fields that are indexed by > default? eg. content, anchor text etc. > > Jonathan > On 4/2/07, Ratnesh,V2Solutions India > <[EMAIL PROTECTED]> > wrote: >> >> >> Hi, >> I have written a plugin, which finds the no. of Object tags in an html page and >> the corresponding urls. >> I am storing "objects" as fields and page urls as values. >> >> And finally I am interested in seeing the search related with "objects" >> indexed >> fields, not those which are already stored as indexed fields. >> >> So how shall I delete those index fields which are already stored? >> >> Looking forward to your reply (valuable >> inputs). >> >> Thnx to the Nutch Community >> -- >> View this message in context: >> http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9786377 >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9803792 Sent from the Nutch - User mailing list archive at Nabble.com. -- "Conscious decisions by conscious minds are what make reality real"
Re: How to dump all the valid links which has been crawled?
Those links are links that were discovered. It does not mean that they were fetched, they weren't. On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: I think I found out the answer to my previous question by doing this: bin/nutch readlinkdb crawl/linkdb/ -dump test But my next question is why the result shows URLs with 'gif', 'js', etc. I have this line in my crawl-urlfilter.txt, so I don't expect I will crawl things like images, javascript files: # skip image and other suffixes we can't yet parse -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$ Can you please tell me how to fix my problem? Thank you. On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote: > Hi, > > I read this article about nutch crawling: > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html > > How can I dump out the valid links which have been crawled? > This command described in the article does not work in nutch 0.9. What > should I use instead? > > bin/nutch readdb crawl-tinysite/db -dumplinks > > Thank you for any help. > -- "Conscious decisions by conscious minds are what make reality real"
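The suffix rule quoted above is a plain Java regex, and it does reject those URLs at fetch time; they show up in the linkdb only because discovered links are recorded there whether or not they are ever fetched. A minimal check of how the rule evaluates:

```java
import java.util.regex.Pattern;

// A crawl-urlfilter rule starting with '-' skips any URL the regex
// matches. This reproduces the suffix rule from the message above.
public class SuffixRule {
    private static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$");

    public static boolean skip(String url) {
        return SKIP.matcher(url).find();
    }
}
```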
Re: Forcing update of some URLs
From what I have gathered, you may want to keep multiple crawldbs for your crawls. So, you could have a crawldb for more frequent crawls and fire off nutch to read that db with the appropriate configs for that job. I was hoping for the same mechanism, but it looks like we need to write this ourselves. On 4/12/07, Arie Karhendana <[EMAIL PROTECTED]> wrote: Hi all, I'm a new user of Nutch. I use Nutch primarily to crawl blog and news sites. But I noticed that Nutch fetches pages only on some refresh interval (30 days default). Blog and news sites have the unique characteristic that some of their pages are updated very frequently (e.g. the main page) so they have to be refetched often, while other pages don't need to be refreshed / refetched at all (e.g. the news article pages, which eventually will become 'obsolete'). Is there any way to force update some URLs? Can I just 're-inject' the URLs to set the next fetch date to 'immediately'? Thank you, -- Arie Karhendana -- "Conscious decisions by conscious minds are what make reality real"
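The multiple-crawldb idea above can be sketched as a small scheduler (all names here are hypothetical) that maps each crawldb directory to its own refetch interval and reports which dbs are due for a run:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "multiple crawldbs" idea: each crawldb
// directory gets its own refetch interval, and due() computes which
// dbs should be crawled given when they last ran.
public class CrawlScheduler {
    private final Map<String, Long> intervalMillis = new HashMap<String, Long>();
    private final Map<String, Long> lastRun = new HashMap<String, Long>();

    public void register(String crawlDbDir, long interval) {
        intervalMillis.put(crawlDbDir, interval);
        lastRun.put(crawlDbDir, 0L);
    }

    public void markRun(String crawlDbDir, long now) {
        lastRun.put(crawlDbDir, now);
    }

    public List<String> due(long now) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Long> e : intervalMillis.entrySet()) {
            if (now - lastRun.get(e.getKey()) >= e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A driver script would then run the normal inject/generate/fetch/update cycle against whichever crawldb directories come back as due.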
Re: Nutch and Crawl Frequency
Cool, cool. Thanks! On 4/19/07, Gal Nitzan <[EMAIL PROTECTED]> wrote: As it is right now... You answered the question yourself :-) ... Separate db's and the whole ceremony... > -Original Message- > From: Briggs [mailto:[EMAIL PROTECTED] > Sent: Thursday, April 19, 2007 10:02 PM > To: nutch-user@lucene.apache.org > Subject: Nutch and Crawl Frequency > > Nutch 0.9 > > Anyone know if it is possible to be more granular regarding crawl > frequency? Meaning, that I would like some sites to be crawled more > often than others. Like, a news site should be crawled every day, but > your average business website should be crawled every 30 days. So, is > it possible to specify a crawl frequency for specific urls, or is it > only global for within the crawl db? I suppose I could have several > crawldbs or something like that, and deal with it.. but, just curious. > > Thanks > -- > "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Nutch and Crawl Frequency
Nutch 0.9 Anyone know if it is possible to be more granular regarding crawl frequency? Meaning, I would like some sites to be crawled more often than others. Like, a news site should be crawled every day, but your average business website should be crawled every 30 days. So, is it possible to specify a crawl frequency for specific urls, or is it only global for within the crawl db? I suppose I could have several crawldbs or something like that, and deal with it.. but, just curious. Thanks -- "Conscious decisions by conscious minds are what make reality real"
Re: Classpath and plugins question
I'll add that the PluginRepository is the class that recurses through your plugins directory, loads each plugin's descriptor file, and then loads all dependencies for each plugin within its own classloader. On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote: Look into org.apache.nutch.plugin. The custom plugin classloader and the resource loader reside in there. On 4/18/07, Antony Bowesman <[EMAIL PROTECTED]> wrote: > I'm looking to use the Nutch parsing framework in a separate Lucene project. > I'd like to be able to use the existing plugins directory structure as-is, so > wondered how Nutch sets up the class loading environment to find all the jar files > in the plugins directories. > > Any pointers to the Nutch class(es) that do the work? > > Thanks > Antony > > > > -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
Re: Classpath and plugins question
Look into org.apache.nutch.plugin. The custom plugin classloader and the resource loader reside in there. On 4/18/07, Antony Bowesman <[EMAIL PROTECTED]> wrote: I'm looking to use the Nutch parsing framework in a separate Lucene project. I'd like to be able to use the existing plugins directory structure as-is, so wondered how Nutch sets up the class loading environment to find all the jar files in the plugins directories. Any pointers to the Nutch class(es) that do the work? Thanks Antony -- "Conscious decisions by conscious minds are what make reality real"
Re: Source of Outlink and how to get Outlinks in 0.9
I am adding more info to my post from what I have been looking into... So, I have found the LinkDbReader and it seems to be able to dump text out to a file. But, unfortunately, it dumps to a file and I need to parse it (or I might have missed something). So, if this is the correct class, that will have to work... Here is a snippet of the output of the LinkDbReader from a page that I crawled on one of my test machines, which has the apache documentation installed. The output of the reader is:

http://httpd.apache.org/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server

http://httpd.apache.org/docs-project/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Documentation
 fromUrl: http://nutchdev-1/manual/ anchor:

http://www.apache.org/ Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache

http://www.apache.org/foundation/preFAQ.html Inlinks:
 fromUrl: http://nutchdev-1/ anchor: Apache web server

http://www.apache.org/licenses/LICENSE-2.0 Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0

So, am I to assume that the format shows outlinks first, then the Inlinks are where the links were found? I'll just have to figure out the format here so I can parse it. I'll probably write a wrapper that exports to xml or something to make transformation of this easier. Anyway, am I on the right track? Briggs. On 4/18/07, Briggs <[EMAIL PROTECTED]> wrote: Is it possible to determine from which domain(s) an outlink was located? The only way I know how is to limit the crawl to a single domain (so, I would know where the outlink came from). Also, I am having difficulty trying to figure out how in 0.9 (probably the same in 0.8) to easily get the outlinks for my segments. 
In nutch 0.7.* we used to do something like:

segmentReader = createSegmentReader(segment);
final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content = new Content();
final ParseData indexParseData = new ParseData();
final ParseText parseText = new ParseText();
while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
    extractOutlinksFromParseData(indexParseData, outlinks);
}

private void extractOutlinksFromParseData(final ParseData indexParseData, final Set outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}

I am finally making the plunge and attempting to get this thing (my application) up to date with the latest and greatest! Thanks for your time! And once I really get through this code I promise to start posting answers. Briggs. -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
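Since LinkDbReader dumps plain text, a hedged sketch of a parser for the format shown above (the exact dump layout may vary between Nutch versions, so treat the line-matching rules as assumptions) could look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of parsing a LinkDbReader text dump: a line ending in
// "Inlinks:" starts a new target URL, and each following
// "fromUrl: ... anchor: ..." line names a page linking to it.
// The assumed layout matches the sample dump quoted above.
public class LinkDumpParser {
    public static Map<String, List<String>> parse(String dump) {
        Map<String, List<String>> inlinks = new HashMap<String, List<String>>();
        String current = null;
        for (String line : dump.split("\n")) {
            line = line.trim();
            if (line.endsWith("Inlinks:")) {
                current = line.substring(0, line.length() - "Inlinks:".length()).trim();
                inlinks.put(current, new ArrayList<String>());
            } else if (line.startsWith("fromUrl:") && current != null) {
                int anchorIdx = line.indexOf(" anchor:");
                String from = (anchorIdx >= 0)
                    ? line.substring("fromUrl:".length(), anchorIdx).trim()
                    : line.substring("fromUrl:".length()).trim();
                inlinks.get(current).add(from);
            }
        }
        return inlinks;
    }
}
```

With the URL-to-inlinks map in hand, inverting it gives the "which domains did this outlink come from" view asked about above.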
Source of Outlink and how to get Outlinks in 0.9
Is it possible to determine from which domain(s) an outlink was located? The only way I know how is to limit the crawl to a single domain (so, I would know where the outlink came from). Also, I am having difficulty trying to figure out how in 0.9 (probably the same in 0.8) to easily get the outlinks for my segments. In nutch 0.7.* we used to do something like:

segmentReader = createSegmentReader(segment);
final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content = new Content();
final ParseData indexParseData = new ParseData();
final ParseText parseText = new ParseText();
while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
    extractOutlinksFromParseData(indexParseData, outlinks);
}

private void extractOutlinksFromParseData(final ParseData indexParseData, final Set outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}

I am finally making the plunge and attempting to get this thing (my application) up to date with the latest and greatest! Thanks for your time! And once I really get through this code I promise to start posting answers. Briggs. -- "Conscious decisions by conscious minds are what make reality real"
Re: Wildly different crawl results depending on environment...
Thanks, I'll look into it. Though, I have never really tried that level of granularity. So, I'll have to figure out what you just told me to do! hah. On 4/2/07, Enis Soztutar <[EMAIL PROTECTED]> wrote: Briggs wrote: > nutch 0.7.2 > > I have 2 scenarios (both using the exact same configurations): > > 1) Running the crawl tool from the command line: > >./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5 > > 2) Running the crawl tool from a web app somewhere in code like: > >final String[] args = new String[]{ >"-local", "/tmp/urlfile.txt", >"-dir", "/tmp/somedir", >"-depth", "5"}; > >CrawlTool.main(args); > > > When I run the first scenario, I may get thousands of pages, but when > I run the second scenario my results vary wildly. I mean, I get > perhaps 0, 1, 10+, 100+. But, I rarely ever get a good crawl from > within a web application. So, there are many things that could be > going wrong here > > 1) Is there some sort of parsing issue? An xml parser, regex, > timeouts... something? Not sure. But, it just won't crawl as well as > the 'standalone mode'. > > 2) Is it a bad idea to use many concurrent CrawlTools, or even reusing > a crawl tool (more than once) within an instance of a JVM? It seems to > have problems doing this. I am thinking there are some static > references that don't really like handling such use. But this is just > a wild accusation that I am not sure of. > > > Checking out the logs might help in this case. From my experience, I can say that there can be some classloading problem with the crawl running in a servlet container. I suggest you also try running the crawl stepwise, by first running inject, generate, fetch, etc. -- "Conscious decisions by conscious minds are what make reality real"
Wildly different crawl results depending on environment...
nutch 0.7.2

I have 2 scenarios (both using the exact same configurations):

1) Running the crawl tool from the command line:

./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5

2) Running the crawl tool from a web app, somewhere in code, like:

final String[] args = new String[]{
    "-local", "/tmp/urlfile.txt",
    "-dir", "/tmp/somedir",
    "-depth", "5"};

CrawlTool.main(args);

When I run the first scenario, I may get thousands of pages, but when I run the second scenario my results vary wildly. I mean, I get perhaps 0, 1, 10+, 100+. But, I rarely ever get a good crawl from within a web application. So, there are many things that could be going wrong here: 1) Is there some sort of parsing issue? An xml parser, regex, timeouts... something? Not sure. But, it just won't crawl as well as the 'standalone mode'. 2) Is it a bad idea to use many concurrent CrawlTools, or even reuse a crawl tool (more than once) within an instance of a JVM? It seems to have problems doing this. I am thinking there are some static references that don't really like handling such use. But this is just a wild accusation that I am not sure of. -- "Conscious decisions by conscious minds are what make reality real"
Re: Logger duplicates entries by the thousands
Status update... So, I have the logging 'fixed', removed appenders and such. But I can see that the logging issue was just a result of something else happening underneath. The memory consumption of the application still grows until an OutOfMemoryError (Java heap space) is thrown. So, still trying to find where that is happening... It's either Nutch or ActiveMQ stuff. Anyway, Have fun and Cheers! On 3/23/07, Briggs <[EMAIL PROTECTED]> wrote: Currently using 0.7.2. We have a process that runs crawltool from within an application, perhaps hundreds of times during the course of the day. The problem I am seeing is that over time the log statements from my application (I am using commons logging and Log4j) are also being logged within the nutch log. But, the real problem is that over time each log statement gets repeated by some factor that increases over time/calls. So, currently, if I have a debug statement after I call CrawlTool.main(), I will get 7500 entries in the log for that one statement. I see a 'memory leak' in the application as this happens because I eventually run out of it (1.5GB). Has anyone else seen this problem? I have to keep shutting down the app so I can continue. Any clues? Does nutch create log appenders in the crawler code, and is this causing the problem? -- "Conscious decisions by conscious minds are what make reality real" -- "Conscious decisions by conscious minds are what make reality real"
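The duplicated-entries symptom is typical of an appender being added to the same logger on every crawl run. A guard like the following (shown with JDK logging since the idea is the same either way; the Log4j analogue would inspect Logger.getAllAppenders before adding) keeps repeated initialization from stacking handlers:

```java
import java.util.logging.Handler;
import java.util.logging.Logger;

// Guard against the "7500 copies of one statement" failure mode:
// only install a handler if one of the same type is not already
// attached, so re-running the setup code does not duplicate output.
public class LogSetup {
    public static void addHandlerOnce(Logger logger, Handler handler) {
        for (Handler h : logger.getHandlers()) {
            if (h.getClass() == handler.getClass()) {
                return; // already installed; adding again would duplicate every entry
            }
        }
        logger.addHandler(handler);
    }
}
```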
Logger duplicates entries by the thousands
Currently using 0.7.2. We have a process that runs CrawlTool from within an application, perhaps hundreds of times during the course of the day. The problem I am seeing is that over time the log statements from my application (I am using commons logging and Log4j) are also being logged within the nutch log. But the real problem is that over time each log statement gets repeated by some factor that increases over time/calls. So, currently, if I have a debug statement after I call CrawlTool.main(), I will get 7500 entries in the log for that one statement. I see a 'memory leak' in the application as this happens, because I eventually run out of memory (1.5GB). Has anyone else seen this problem? I have to keep shutting down the app so I can continue. Any clues? Does nutch create log appenders in the crawler code, and is this causing the problem?

--
"Conscious decisions by conscious minds are what make reality real"
Re: Plugin ClassLoader issues...
Well, I found this:

http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading

Arrrgh. Well, looks like I am going to use JMX to have my plugin talk to my application. That way I won't have to have several copies of my "business" jars around.

On 1/31/07, Briggs <[EMAIL PROTECTED]> wrote:

So, I am having ClassLoader issues with plugins. It seems that the PluginRepository does some weird class loading (PluginClassLoader) when it starts up. Does this mean that my plugin will not inherit the classpath of the web application it is loaded within? A simple example is that my webapp contains spring-2.0.jar, but when I try to call a Spring class from within my plugin, I get a "NoClassDefFound" error. The real issue is that I need my plugins to have access to some business classes that are deployed within my web application. How does one go about this in a nice way?

--
"Conscious decisions by conscious minds are what make reality real"
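For anyone hitting the same wall: the NoClassDefFound behaviour falls out of classloader delegation. A loader that doesn't have the webapp loader as its parent simply cannot see webapp classes, no matter what is on the webapp classpath. This is not Nutch's PluginClassLoader itself, just a stdlib sketch of that isolation:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderDemo {
    // A loader with no URLs and a null parent delegates only to the
    // bootstrap loader -- roughly how an isolating plugin loader hides
    // webapp classes (spring-2.0.jar, business jars) from plugin code.
    static boolean visibleToIsolatedLoader(String className) {
        try {
            new URLClassLoader(new URL[0], null).loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Our own class is on the application classpath, yet invisible:
        System.out.println(visibleToIsolatedLoader("LoaderDemo"));       // false
        // Core classes still resolve through the bootstrap loader:
        System.out.println(visibleToIsolatedLoader("java.lang.String")); // true
    }
}
```

Hence the workarounds the wiki page suggests: communicate across the boundary (JMX, as above) or make the shared classes visible to both loaders.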
Plugin ClassLoader issues...
So, I am having ClassLoader issues with plugins. It seems that the PluginRepository does some weird class loading (PluginClassLoader) when it starts up. Does this mean that my plugin will not inherit the classpath of the web application it is loaded within? A simple example is that my webapp contains spring-2.0.jar, but when I try to call a Spring class from within my plugin, I get a "NoClassDefFound" error. The real issue is that I need my plugins to have access to some business classes that are deployed within my web application. How does one go about this in a nice way?

--
"Conscious decisions by conscious minds are what make reality real"
List Domains and adding Boost Values for Custom Fields
So (nutch 0.7.2), does anyone know if there is such a query in nutch that could somehow return a full list of all unique domains that have been crawled?

I was originally storing each domain's segment separately, but that ended up being a nightmare when it came to creating search beans, since the bean opens up each segment on init. So, I am working on an incremental segment merge tool to handle the thousands of segments I have and get them down to a few.

Also... what I really need is a pointer on how to do the following: I have several custom attributes/fields, say "business" and "confidential", added to a document when it was indexed. I want to assign a boost value to the custom fields and have nutch use those values when it is searching. Where might I look to find such a thing? I do not want to search by those fields; I only want them as part of nutch's scoring, so that documents with high boost values for those fields are pushed to the top.

Thanks again!

Briggs

--
"Conscious decisions by conscious minds are what make reality real"
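On the boost question, the arithmetic being asked for is simple: at scoring time, a matched field's contribution gets multiplied by the boost assigned to that field at index time, so a high boost on "business" or "confidential" floats those documents up without being a query criterion. This toy sketch is not the Nutch/Lucene API, just the weighting idea; the field names and boost values are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostSketch {
    // Each entry maps a field name to {raw similarity score, index-time boost}.
    // The document score is the boosted sum of field contributions.
    static double score(Map<String, double[]> fieldRawAndBoost) {
        double total = 0.0;
        for (double[] rb : fieldRawAndBoost.values()) {
            total += rb[0] * rb[1];  // raw similarity * field boost
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, double[]> fields = new LinkedHashMap<>();
        fields.put("content",      new double[]{0.40, 1.0});
        fields.put("business",     new double[]{0.20, 5.0});  // boosted custom field
        fields.put("confidential", new double[]{0.10, 3.0});  // boosted custom field
        System.out.println(score(fields));
    }
}
```

In Nutch the place to set such index-time boosts would be a custom indexing filter plugin, where the boost is attached to the field as the document is built.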
Re: Merging large sets of segments, help.
Cool, thanks for your responses! Next time I should probably mention that we are using 0.7.2. Not quite sure if we can even think about moving to something more current, as I don't really know the reasons to.

Most of this information is already available on the Nutch Wiki. All I can say is that there is certainly a limit to what you can do using the "local" mode - if you need to handle large numbers of pages you will need to migrate to the distributed setup.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

--
"Conscious decisions by conscious minds are what make reality real"
Re: Merging large sets of segments, help.
Are you running this in a distributed setup, or in "local" mode? Local mode is not designed to cope with such large datasets, so it's likely that you will be getting OOM errors during sorting ... I can only recommend that you use a distributed setup with several machines, and adjust RAM consumption with the number of reduce tasks.

Currently we are running in local mode. We do not have the setup for distributing; that is why I want to merge these segments. Would that not help? Instead of having potentially tens of thousands of segments, I want to create several large segments and index those. Sorry for my ignorance, but I'm not really sure how to scale nutch correctly. Do you know of a document, or some pointers, on how segment/index data should be stored?

--
"Conscious decisions by conscious minds are what make reality real"
Merging large sets of segments, help.
Has anyone written an API that can merge thousands of segments? The current segment merge tool cannot handle this much data, as there just isn't enough RAM available on the box. So, I was wondering if there was a better, incremental way to handle this. Currently I have one segment for each domain that was crawled, and I want to merge them all into several large segments. If anyone has any pointers I would appreciate it. Has anyone else attempted to keep segments at this granularity? It doesn't seem to work so well.

--
"Conscious decisions by conscious minds are what make reality real"
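One incremental approach, assuming the merge tool can handle a bounded number of segments at a time even if it chokes on thousands at once: split the segment list into fixed-size groups and run one merge per group, then optionally merge the intermediate results in a second pass. A sketch of the batching step (the merge call itself is left as a comment, since invoking Nutch's segment merge tool needs a real install and its exact arguments vary):

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentBatcher {
    // Split a long list of segment directories into fixed-size groups,
    // so each merge pass only has to hold one group's data in RAM.
    static List<List<String>> batches(List<String> segments, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < segments.size(); i += size) {
            out.add(new ArrayList<>(
                segments.subList(i, Math.min(i + size, segments.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> segs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) segs.add("segments/seg-" + i);
        for (List<String> group : batches(segs, 100)) {
            // Run one bounded merge per group here, e.g. invoke the
            // segment merge tool with this group plus an output dir.
            System.out.println("merge " + group.size() + " segments");
        }
    }
}
```

With 2,500 per-domain segments and a batch size of 100, this yields 25 intermediate segments, which a final pass can then merge into a handful.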