Re: Wildcard search with nutch distributed search
On 2010-05-06 22:39, JohnRodey wrote:
> I'm running the Distributed Search's IndexServer. I'm trying to figure out a way to make the index search work with wildcards. Indexed: name:Bobby
> Ex.: a query for name:Bob will return nothing.
> Ex.: a query for name:Bob* will be converted to the same as above and return nothing.

The Nutch query syntax doesn't support wildcards. This will change soon in Nutch trunk, where we will delegate query parsing to a particular type of search backend (e.g. Solr).

> Looks like the Lucene Query object does provide this, however Nutch's distributed search does not.

Neither local nor distributed Nutch search supports this - it's a (purposeful) limitation of the Nutch query syntax.

> Is there any solution (that hopefully doesn't require major refactoring) that could provide this functionality?

Use Nutch for crawling and indexing to Solr, and then use Solr directly for searching.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
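For illustration, once the same documents are indexed in Solr, a trailing wildcard works out of the box. A hypothetical request - the Solr URL and the 'name' field are assumptions, not from the thread:

```shell
# Hypothetical local Solr instance; prefix-wildcard query on an indexed
# 'name' field, so name:Bob* matches documents with name:Bobby.
curl 'http://localhost:8983/solr/select?q=name:Bob*&wt=json'
```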
Re: full text search for java sources and subversion repository
On 2010-05-09 12:23, Rafael Kubina wrote:
> Hi, I'm trying to do a full-text search on my Java sources (.java) via Nutch (1.0), svn and http (mod_dav_svn). Other documents like HTML are perfectly searchable; my sources are not. Currently the output is the following:
> fetching http://s025/svn/java/foo/trunk/src/main/java/Bar.java
> Pre-configured credentials with scope - host: s025; port: 80; found for url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java
> url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java; status code: 200; bytes received: 5829; Content-Length: 5829
> The content-type for this file is text/plain. There are no exceptions, no other problems. I really appreciate any help that I can get. Thanks a lot!

You need to check the following:

* parse_text in your segment (you can dump this with the readseg command). It should contain the plain-text content of your file.
* use Luke (www.getopt.org/luke) to examine your Lucene index. You should be able to retrieve terms coming from your Java documents - use Reconstruct & Edit in Luke.

--
Best regards,
Andrzej Bialecki
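The segment check can be scripted; a sketch with hypothetical paths (segment names are timestamps, so yours will differ, and bin/nutch is assumed to come from a Nutch 1.0 install):

```shell
# Dump the whole segment, then look at the ParseText sections of the dump
# to verify they contain the plain-text content of the fetched .java files.
bin/nutch readseg dump crawl/segments/20100509122300 dump_out
grep -A 2 'ParseText::' dump_out/dump | head
```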
Re: JobTracker gets stuck with DFS problems
On 2010-05-03 19:59, Emmanuel de Castro Santana wrote:
>> Unfortunately, no. You should at least crawl without parsing, so that when you download the content you can run the parsing separately, and repeat it if it fails.
> I've just found this in the FAQ, can it be done? http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

The first recovery method mentioned there works only (sometimes) for crawls performed using LocalJobTracker and the local file system. It does not work in any other case.

> By the way, about not parsing: isn't it necessary to parse the content anyway in order to generate links for the next segment? If that is true, one would have to run parse separately, which would amount to the same thing.

Yes, but if the parsing fails you still have the downloaded content, which you can re-parse after you fix the config or the code...

--
Best regards,
Andrzej Bialecki
Re: JobTracker gets stuck with DFS problems
On 2010-05-03 22:58, Emmanuel de Castro Santana wrote:
>> The first method of recovering that is mentioned there works only (sometimes) for crawls performed using LocalJobTracker and the local file system. It does not work in any other case.
> If I stop the crawling process, take the crawled content from the DFS onto my local disk, do the fix and then put it back into HDFS, would it work? Or would there be a problem with DFS replication of the new files?

Again, this procedure does NOT work when using HDFS - you won't even see the partial output (without some serious hacking).

>> Yes, but if the parsing fails you still have the downloaded content, which you can re-parse again after you fixed the config or the code...
> Interesting ... I did not see any option like -noParsing in the bin/nutch crawl command. Does that mean I will have to code my own .sh for crawling, one that uses the -noParsing option of the fetcher?

You can simply set the fetcher.parsing config option to false.

--
Best regards,
Andrzej Bialecki
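In nutch-site.xml that looks like the following (a config sketch; fetcher.parsing is the property named above, and `bin/nutch parse <segment>` can then run the parsing as a separate step):

```xml
<!-- nutch-site.xml: fetch without parsing; run parsing separately so a
     failed parse can be repeated without re-downloading the content. -->
<property>
  <name>fetcher.parsing</name>
  <value>false</value>
</property>
```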
Re: JobTracker gets stuck with DFS problems
On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
> Hi All. We are using Nutch to crawl ~500K pages with a 3-node cluster; each node has a dual-core processor, 4 GB RAM and circa 100 GB storage. All nodes run CentOS. These 500K pages are scattered across several sites, each having from 5k up to 200k pages. For each site we start a different crawl process (using bin/nutch crawl), but they are all started almost simultaneously. We are trying to tune Hadoop's configuration in order to have a reliable daily crawling process. After a while of crawling we see some problems occurring, mainly on the TaskTracker nodes; most are related to access to HDFS. We often see "Bad response 1 for block" and "Filesystem closed", among others. When these errors become more frequent, the JobTracker gets stuck and we have to run stop-all. If we lower the maximum number of map and reduce tasks, the process takes longer to get stuck, but we haven't found an adequate configuration yet. Given that setup, there are some questions we have been struggling to answer:
> 1. What could be the most probable reason for the HDFS problems?

I suspect the following issues, in this order:

* too small a number of file handles on your machines (run ulimit -n; this should be set to 16k or more, the default is 1k).
* do you use a SAN or other type of NAS as your storage?
* network equipment, such as a router, switch or network card: quite often low-end equipment cannot handle a high volume of traffic, even though it's equipped with all the right ports ;)

> 2. Is it better to start a single crawl with all sites inside, or to keep it the way we are doing it (i.e. start a different crawl process for each site)?

It's much, much better to crawl all sites at the same time. This allows you to benefit from parallel crawling - otherwise your fetcher will always be stuck in the politeness crawl delay.

> 3. When it all goes down, is there a way to restart crawling from where the process stopped?

Unfortunately, no. You should at least crawl without parsing, so that when you download the content you can run the parsing separately, and repeat it if it fails.

--
Best regards,
Andrzej Bialecki
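The first suspect is quick to check on each node; a small sketch (the 16k threshold is the recommendation above, and /etc/security/limits.conf is the usual place to raise it on Linux):

```shell
# Print the per-process open-file limit that the Hadoop daemons inherit.
# 16k or more is recommended; the common default is 1024.
nofile=$(ulimit -n)
echo "open-file limit: $nofile"
if [ "$nofile" != "unlimited" ] && [ "$nofile" -lt 16384 ]; then
  echo "too low - raise the nofile limit (e.g. in /etc/security/limits.conf)"
fi
```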
Re: Hadoop Disk Error
On 2010-04-26 22:31, Joshua J Pavel wrote:
> Sending this out to close the thread in case anyone else experiences this problem: Nutch 1.0 is not AIX-friendly (0.9 is). I'm not 100% sure which command it may be, but by modifying my path so that /opt/freeware/bin has precedence, I no longer get the Hadoop disk error. While I thought this meant the problem comes from the nutch script, not the code itself, manually pointing the system calls at /opt/freeware/bin didn't fix it. I assume that until detailed debugging is done, further releases will also require a workaround similar to what I'm doing.

Ahhh ... now I understand. The problem lies in Hadoop's use of utilities such as /bin/whoami, /bin/ls and /bin/df. These are used to obtain some filesystem and permissions information that is otherwise not available from the JVM. However, these utilities are expected to produce POSIX-style output on Unix, or Cygwin output on Windows. I guess the native commands on AIX conform to neither, so the output of these utilities can't be parsed, which ultimately results in errors - whereas the output of the /opt/freeware/bin utilities follows the POSIX format.

I'm not sure what the difference was in 0.9 that still made it work ... perhaps the parsing of these outputs was more lenient, or some errors were ignored. In any case, we in Nutch can't do anything about this; we can only add your workaround to the documentation. The problem should be reported to the Hadoop project.

--
Best regards,
Andrzej Bialecki
ANNOUNCE: Nutch becomes an Apache Top-Level Project (TLP)
Hi all, I'm happy to announce that the ASF Board has accepted the resolution to separate Nutch from the Lucene project and make it a top-level project (the full text of the resolution can be viewed here [1]). Thanks to all who voted and who worked on preparing this proposal!

This means that in the upcoming days/weeks we will start moving our web site and mailing lists to a new prefix, @nutch.apache.org. AFAIK it's possible to automatically move the mailing list subscriptions to the new addresses, so you won't have to do anything (apart from changing your mail filters, perhaps). This change also involves the Nutch repository eventually being moved under svn://svn.apache.org/repos/asf/nutch - we will let you know when this happens. The JIRA setup will remain the same.

[1] http://search.lucidimagination.com/search/document/443c3cf9f67b4f42/vote_2_board_resolution_for_nutch_as_tlp

--
Best regards,
Andrzej Bialecki
Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2
On 2010-04-26 16:24, David M. Cole wrote:
> At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>> Most folks that use Nutch are likely familiar with running ant IMHO.
> I guess then I fall into the category of not-most-folks. I have been running Nutch for about 14 months and I haven't a clue how to run ant. If there's a place to vote to suggest that compiled versions still be distributed, I vote for that.

Actually, we don't have a build target (yet) that produces a binary-only distribution that we can ship and that you can run out of the box (not counting build/nutch.job alone, because it needs the Hadoop infrastructure to run). The current mixed (source+binary) distribution has worked well enough so far, but the size of the distribution is becoming a concern, hence the idea to ship only the source. We may have been too hasty with that, though... What do others think?

--
Best regards,
Andrzej Bialecki
Re: How to do faceting on data indexed by Nutch
On 2010-04-25 15:03, KK wrote:
> Hi All, I might be repeating a question asked by someone else, but googling didn't help me track down any such responses. I'm familiar with Solr/Lucene and its basic architecture: I've done hit highlighting in Lucene and have an idea of the faceting support in Solr, but have never actually tried it. I want to implement faceting on Nutch's indexed data. I already have some MBs of data indexed by Nutch and just want faceting on top of it. Can someone give me pointers on how to proceed? Or do I have to query through the Solr interface and redirect all queries to the index already created by Nutch? What is the best and simplest way to achieve this? Please help.

Nutch has two indexing/searching backends: the one configured by default uses plain Lucene, and it does not support faceting. The other backend uses Solr, and then of course it supports faceting and all other Solr features. So in your case you need to switch to Solr indexing (and searching).

--
Best regards,
Andrzej Bialecki
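Concretely, with an existing crawl the switch might look like this - a sketch only: the paths, Solr URL and facet field are assumptions, and the solrindex command is available from Nutch 1.0 on (check `bin/nutch` usage on your version):

```shell
# Push an existing Nutch crawl into Solr, then run a faceted query.
# 'site' is a field from the Nutch example Solr schema (hypothetical here).
bin/nutch solrindex http://localhost:8983/solr \
  crawl/crawldb crawl/linkdb crawl/segments/*
curl 'http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=site&wt=json'
```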
Re: About Apache Nutch 1.1 Final Release
On 2010-04-17 05:45, Phil Barnett wrote:
> On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
>> More details on this (your environment, OS, JDK version) and logs/stacktraces would be highly appreciated! You mentioned that you have some scripts - if you could extract relevant portions from them (or copy the scripts) it would help us ensure that it's not a simple command-line error.
> I posted another thread tonight with the fixed code. See here: https://issues.apache.org/jira/browse/NUTCH-812
> Can you please commit it for all of us?

I'm traveling today ... Chris, can you perhaps apply the patch before you roll another RC?

--
Best regards,
Andrzej Bialecki
Re: About Apache Nutch 1.1 Final Release
On 2010-04-10 17:49, Phil Barnett wrote:
> On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote:
>> Hi there, well, as soon as we have 3 binding +1 VOTEs. Right now I'm the only PMC member that has VOTE'd +1 on the release. Hopefully in the next few days someone will have a chance to check...
> I tried to get the Release Candidate (latest nightly build) running yesterday and ran into problems with both of the scripts that I use to crawl with 1.0. The smaller bin/crawl method finished the crawl but then immediately hit a Java exception when starting the next step. Sorry I don't have more specifics - I'm at home, the setup is at work, and I had to revert to get things running again. But I have built a dev machine so I can play with 1.1 and get more specific.

More details on this (your environment, OS, JDK version) and logs/stacktraces would be highly appreciated! You mentioned that you have some scripts - if you could extract the relevant portions from them (or copy the scripts) it would help us ensure that it's not a simple command-line error.

--
Best regards,
Andrzej Bialecki
Re: [VOTE] Apache Nutch 1.1 Release Candidate #1
On 2010-04-07 07:14, Mattmann, Chris A (388J) wrote:
> Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/
> See the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid
> A Nutch 1.1 tag is at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.
> [ ] +1 Release the packages as Apache Nutch 1.1.
> [ ] -1 Do not release the packages because...

+1 - tested both local and distributed workflows, all looks good.

--
Best regards,
Andrzej Bialecki
[VOTE RESULTS] Nutch to become a top-level project (TLP)
Hi all, I'm happy to announce that this vote is closed and the proposal has passed with 4 binding +1 votes and 0 binding -1 votes - in fact, there were only +1s, both from the committers and from the community. Thanks to all who expressed their opinion - we will now proceed with the remaining formal steps to become a TLP.

--
Best regards,
Andrzej Bialecki
Re: Nutch segment merge is very slow
On 2010-04-05 16:54, ashokkumar.raveendi...@wipro.com wrote:
> Hi, thank you for your suggestion. I have around 500+ internet URLs configured for crawling, and the crawl process runs in the Amazon cloud. I have already reduced my depth to 8 and topN to 1000, increased the fetcher threads to 150, and limited it to 50 URLs per host using the generate.max.per.host property. With this configuration, Generate, Fetch, Parse and Update complete in at most 10 hrs. When it comes to the segment merge, it takes a lot of time. As a temporary solution I am skipping the segment merge and directly indexing the fetched segments; with this I am able to finish the crawl process within 24 hrs. Now I am looking for a long-term solution to optimize the segment merge process.

Segment merging is not strictly necessary unless you have a hundred segments or so. If this step takes too much time, but the number of segments is still well below a hundred, just don't merge them.

--
Best regards,
Andrzej Bialecki
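A minimal sketch of that workflow, skipping mergesegs and indexing the segments directly (paths are hypothetical, and the index command arguments follow the Nutch 1.x usage - run `bin/nutch index` without arguments to confirm them on your install):

```shell
# Skip 'bin/nutch mergesegs' entirely and index all fetched segments as-is.
# Usage: index <index> <crawldb> <linkdb> <segment> ...
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```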
[VOTE] Nutch to become a top-level project (TLP)
Hi all, following an earlier [DISCUSS] thread on the nutch-dev list, I'm calling a vote on the proposal to make Nutch a top-level project. To quickly recap the reasons for and consequences of such a move: the ASF board is concerned about the size and diversity of goals across the various subprojects under the Lucene TLP, and suggests that each subproject evaluate whether becoming its own TLP would better serve both the project itself and the Lucene TLP. We discussed this issue and expressed opinions that ranged from positive (easier management, better exposure, better focus on the mission, not really dependent on Lucene development) to neutral (no significant reason, a purely political change) to moderately negative (increased admin work, decreased exposure).

Therefore, the proposal is to separate Nutch from under the Lucene TLP and form a top-level project with its own PMC, own svn and own site. Please indicate one of the following:

[ ] +1 - yes, I vote for the proposal
[ ] -1 - no, I vote against the proposal (because ...)

(Please note that anyone in the Nutch community is invited to express their opinion, though only Nutch committers cast binding votes.)

--
Best regards,
Andrzej Bialecki
Re: [VOTE] Nutch to become a top-level project (TLP)
On 2010-04-01 19:40, Robert Hohman wrote:
> +1 yes, and I also vote that we try to somehow make Nutch easier to use with Maven-based projects. I've had a heck of a time integrating it (although I've more or less gotten it to work).

Patches are welcome - I realize this could be beneficial, but I'm not familiar with Maven, so I won't be able to make this change myself...

--
Best regards,
Andrzej Bialecki
Re: Can't open a nutch 1.0 index with luke
On 2010-04-01 21:09, Magnús Skúlason wrote:
> Hi, I am getting the following exception when I try to open a nutch 1.0 index (I am using the official release) with Luke (0.9.9.1):
>
> java.io.IOException: read past EOF
>   at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
>   at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>   at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
>   at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
>   at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:93)
>   at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:203)
>   at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:256)
>   at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
>   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
>   at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
>   at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
>   at org.apache.lucene.index.IndexReader.open(IndexReader.java:375)
>   at org.getopt.luke.Luke.openIndex(Unknown Source)
>   at org.getopt.luke.Luke.openOk(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at thinlet.Thinlet.invokeImpl(Unknown Source)
>   at thinlet.Thinlet.invoke(Unknown Source)
>   at thinlet.Thinlet.handleMouseEvent(Unknown Source)
>   at thinlet.Thinlet.processEvent(Unknown Source)
>   at java.awt.Component.dispatchEventImpl(Unknown Source)
>   at java.awt.Container.dispatchEventImpl(Unknown Source)
>   at java.awt.Component.dispatchEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
>   at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
>   at java.awt.Container.dispatchEventImpl(Unknown Source)
>   at java.awt.Window.dispatchEventImpl(Unknown Source)
>   at java.awt.Component.dispatchEvent(Unknown Source)
>   at java.awt.EventQueue.dispatchEvent(Unknown Source)
>   at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
>   at java.awt.EventDispatchThread.run(Unknown Source)
>
> Any ideas why this happens and how to fix it? Can Nutch itself open this index and use it?

I'm not getting any such errors with the above combination and a small test index ...

--
Best regards,
Andrzej Bialecki
Re: hamid sefrani
On 2010-03-29 16:52, Pedro Bezunartea López wrote:
> Are there any anti-spam measures? The same sender has posted a few spam messages already... Pedro.
> 2010/3/25 Mike Hays cpun...@hotmail.com: http://SPAM...porr...com/...lndex.html

Normally only users that have subscribed to the list are allowed to post, unless a moderator adds them. It appears that this user slipped through ... I'll try to forcibly unsubscribe him. Sorry!

--
Best regards,
Andrzej Bialecki
Re: Nutch Fetch Stuck
On 2010-03-13 00:12, Abhi Yerra wrote:
> So I had -noParsing set, so parsing was not part of the fetch. The pages have been crawled, but the reducers have crashed. If I restart the fetch, will it try to crawl all those pages again?

Yes. It would be good to first investigate why it crashed, otherwise it's likely to happen again. Are you running this on a cluster? Check the logs of the crashed tasks (in logs/userlogs/ on the respective tasktracker nodes).

--
Best regards,
Andrzej Bialecki
Re: Avoid indexing common html to all pages, promoting page titles.
On 2010-03-12 12:52, Pedro Bezunartea López wrote:
> Hi, I'm developing a site that shows its dynamic content in a <div id="content">; the rest of the page doesn't really change. I'd like to store and index only the contents of this <div>, basically to avoid re-indexing the same content (header, footer, menu) over and over. I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a couple of things:
> 1. Should I extend the parse-html plugin, or should I just replace it?

You should write an HtmlParseFilter and extract only the portions that you care about, and then replace the output parseText with your extracted text.

> 2. The example talks about finding a meta tag, extracting some information from it, and adding a field to the index. I think I just need to get rid of all HTML except the <div id="content"> tag, and index its content. Can someone point me in the right direction?

See above.

> And just one more thing: I'd like to give a higher score to pages where the search terms appear in the title. Right now pages that contain the terms in the body rank higher than those that contain the search terms in the title. How could I modify this behaviour?

You can define these weights in the configuration - look for the query boost properties.

--
Best regards,
Andrzej Bialecki
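For the title-weight question, the knob goes in nutch-site.xml; a sketch only - the property name is taken from the Nutch 1.x nutch-default.xml (verify it exists in your version) and the value 4.0 is just an illustrative choice:

```xml
<!-- nutch-site.xml: weigh title matches more heavily than body matches.
     Property name assumed from Nutch 1.x nutch-default.xml; value is
     illustrative, not a recommendation. -->
<property>
  <name>query.title.boost</name>
  <value>4.0</value>
</property>
```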
Re: Nutch Fetch Stuck
On 2010-03-12 23:39, Abhi Yerra wrote:
> Hi, we did a fetch and the maps are 100% done, but the reducers have crashed. We did a large fetch, so is there a way to restart the reducers without restarting the fetch?

Unfortunately, no. Was the fetcher in parsing mode? If so, I strongly recommend that you first fetch, and then run the parsing as a separate step.

--
Best regards,
Andrzej Bialecki
Re: Where are new linked entries added
On 2010-03-11 15:53, nikinch wrote:
> Hi everyone, I've been using Nutch for a while now and I've hit a snag. I'm trying to find where newly linked pages are added to the segment as specific entries. To be clear: I've been through the Fetcher class and the CrawlDbFilter and reducer, but I'm looking for the initial point where, for a given page, the links are transformed into segment entries. My objective is to pass the initial injected URL down to all the pages it links to, so that when an entry is created for the linked URLs of a webpage I can add metadata recording this originating URL. By the time I get to CrawlDbFilter I already have entries for the linked pages and have lost track of which seed URL brought us there. I thought the job would be done in the Fetcher, maybe in the output function, but I can't find where it happens. If anyone knows and could point me in the right direction, I'd appreciate it.

Currently the best place to do this is in your implementation of a ScoringFilter, in distributeScoreToOutlinks(). You can also modify one of the existing scoring plugins. I would advise against modifying the code directly in ParseOutputFormat - it's complex and fragile.

--
Best regards,
Andrzej Bialecki
Re: form-based authentication? Any progress
On 2010-03-10 19:26, conficio wrote:
> Susam Pal wrote:
>> Hi, indeed the answer is negative, and many people have asked this on this list. Martin has very nicely explained the problems and a possible solution. I'll just add what I have thought of. I have often wondered what it would take to create a nice, configurable cookie-based authentication feature. The following file would be needed: ... http://wiki.apache.org/nutch/HttpPostAuthentication
> I was wondering if any work has been done in this direction? I guess the answer is still no. Would the problem become easier if one first targeted particular types of sites, such as popular wikis, bug trackers, blogs, CMSes, forums, or document management systems?

I was involved in a project to implement this (as a proprietary plugin). In short, it requires a lot of effort, and there are no generic solutions. If it works with one site, it breaks with another, and eventually you end up with a nasty heap of hacks upon hacks. In that project we gave up after discovering that a large number of sites use Javascript to create and name the input controls, and that they used a challenge-response scheme with client-side scripts generating the response ... it was a total mess. So, if you target 10 sites, you can make it work. If you target 10,000 sites all using slightly different methods, forget it.

--
Best regards,
Andrzej Bialecki
Re: Content of redirected urls empty
On 2010-03-08 14:55, BELLINI ADAM wrote:
> Is there any idea, guys?
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: Content of redirected urls empty
> Date: Fri, 5 Mar 2010 22:01:05 +
> Hi, the content of my redirected URLs is empty ... but they still have the other metadata. I have an http URL that is redirected to https. In my index I find the http URL, but with empty content ... could you explain it, please?

There are two ways to redirect - one with the protocol, and the other with content (either a meta refresh or Javascript). When you dump the segment, is there really no content for the redirected URL?

--
Best regards,
Andrzej Bialecki
Re: New version of nutch?
On 2010-03-03 20:12, John Martyniak wrote:
> Does anybody have an idea of when a new version of Nutch will be available, specifically one supporting the latest version of Hadoop, and possibly HBase? Thank you for any information.

We should roll out a 1.1 soon (a few weeks); the nutch+hbase work is IMHO still a few months away.

--
Best regards,
Andrzej Bialecki
Re: Update on ignoring menu divs
On 2010-02-28 18:42, Ian M. Evans wrote:
> Using Nutch as a crawler for Solr. I've been digging around the nutch-user archives a bit and have seen some people discussing how to ignore menu items or other unnecessary div areas like common footers, etc., but I still haven't come across a full answer. Is there a way to specify a div by id that Nutch will strip out before tossing the content into Solr?

There is no such functionality out of the box. One direction worth pursuing would be to create an HtmlParseFilter plugin that wraps the Boilerpipe library: http://code.google.com/p/boilerpipe/ .

--
Best regards,
Andrzej Bialecki
Re: Nutch v0.4
On 2010-02-24 17:34, Pedro Bezunartea López wrote: Hi Ashley, Hi, I'm looking to reproduce program analysis results based on Nutch v0.4. I realize this is a very old release, but is it possible to obtain the source from somewhere? I see some of the classes I'm looking for in v0.7, but I need the older version to confirm it. Thanks, Ashley You can get version 0.6 and higher from apache's archive: http://archive.apache.org/dist/lucene/nutch/ ... but I haven't found anything older, AFAIK older releases of Nutch were archived only on that old SF site, and apparently that site no longer exists. Sorry :( However, you can still check out that code from CVS repository at nutch.sf.net . -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: SegmentFilter
On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki schrieb: On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored even 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 i believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Duplicate content is usually related to the fact that indeed the same content appears under different urls. This is common enough, so I don't see this necessarily as a bug in Nutch - we won't know that the content is identical until we actually fetch it... Urls may differ in certain systematic ways (e.g. by a set of URL params, such as sessionId, print=yes, etc) or completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId. i know. i have implemented several url filters to filter duplicate content. there is a difference here. the difference here is that in this case the same content is stored under the same url several times. it is stored under http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 and not under http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 the content for the latter url is empty. Content: Ok, then the answer can be found in the protocol status or parse status. You can get protocol status by doing a segment dump of only the crawl_fetch part (disable all other parts, then the output is less confusing). Similarly, parse status can be found in crawl_parse. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
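A sketch of the readseg commands for this (segment name and output paths are examples); the -no* switches disable every part except the one you want to inspect:

```shell
# protocol status only (crawl_fetch)
bin/nutch readseg -dump crawl/segments/20100220123456 dump_fetch \
  -nocontent -nogenerate -noparse -noparsedata -noparsetext

# parse status only (crawl_parse)
bin/nutch readseg -dump crawl/segments/20100220123456 dump_parse \
  -nocontent -nofetch -nogenerate -noparsedata -noparsetext
```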
Re: SegmentFilter
On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored even 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 i believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Duplicate content is usually related to the fact that indeed the same content appears under different urls. This is common enough, so I don't see this necessarily as a bug in Nutch - we won't know that the content is identical until we actually fetch it... Urls may differ in certain systematic ways (e.g. by a set of URL params, such as sessionId, print=yes, etc) or completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: About HBase Integration
On 2010-02-09 03:08, Hua Su wrote: Thanks. But heritrix is another project, right? Please see this Git repository, it contains the latest work in progress on Nutch+HBase: git://github.com/dogacan/nutchbase.git -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: merge not working anymore
On 2010-01-18 21:56, MilleBii wrote: Help !!! My production environment is blocked by this error. I deleted the segment altogether and restarted crawl/fetch/parse... and I'm still stuck, so I can not add segments anymore. Looking like a hdfs problem ??? 2010-01-18 19:53:00,785 WARN hdfs.DFSClient - DFS Read: java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735 file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx This error is commonly caused by running out of disk space on a datanode. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
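A quick way to confirm the diagnosis (the local paths are examples - check whatever directories back dfs.data.dir and hadoop.tmp.dir on each datanode):

```shell
# local disk usage on a datanode
df -h /tmp /data

# remaining HDFS capacity per datanode, as seen by the namenode
hadoop dfsadmin -report
```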
Re: Post Injecting ?
On 2010-01-15 20:09, MilleBii wrote: Inject is meant to seed the database at the start. But I would like to inject new urls on a production crawldb, I think it works but I was wondering if somebody could confirm that. Yes. New urls are merged with the old ones. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Adding additional metadata
On 2010-01-11 13:18, Erlend Garåsen wrote: First of all: I didn't know about the list archive, so sorry for not searching that resource before I sent a new post. MilleBii wrote: For lastModified just enable the index|query-more plugins it will do the job for you. Unfortunately not. Our pages include Dublin core metadata which has a Norwegian name. For other meta, search the mailing list; it's explained many times how to do it I found several posts concerning metadata, but for me, one question is still unanswered: Do I really have to create a lot of new classes/xml files in order to store the content of just two metadata? I have not managed to parse the content of the lastModified metadata after I tried to rewrite the HtmlParser class. So I tried to add hard coded metadata values in HtmlParser like this instead: entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008"); My modified MoreIndexingFilter managed to pick up the hard coded values, and the dates were successfully stored into my Solr Index after running the solrindex option. This means that it is not necessary to write a new MoreIndexingFilter class, but I'm still unsure about the HtmlParser class since I haven't managed to parse the content of the metadata. You can of course hack your way through HtmlParser and add/remove/modify as you see fit - it's straightforward and likely you will get the result that you want. However, as MilleBii suggests, the preferred way to do this would be to write a plugin. The reason is the cost of long-term maintenance - if you ever want to sync up your local modified version of Nutch with the newer public release, your hacked copy of HtmlParser won't merge nicely, whereas if you put your code in a separate plugin then it might. Another reason is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
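In plugin form, the parsing half of the task above might look like the sketch below. This is hypothetical, untested code against the Nutch 1.0 plugin API; "dato.endret" is the poster's own field name:

```java
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Copies a Dublin Core meta tag from the parsed HTML into the parse
// metadata, where an indexing filter (e.g. a MoreIndexingFilter variant)
// can later pick it up - no changes to HtmlParser itself required.
public class DcDateParseFilter implements HtmlParseFilter {
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String date = metaTags.getGeneralTags().get("dato.endret");
    if (date != null) {
      parseResult.get(content.getUrl()).getData()
                 .getParseMeta().set("dato.endret", date);
    }
    return parseResult;
  }
  // setConf()/getConf() omitted in this sketch
}
```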
Re: Help Needed with Error: java.lang.StackOverflowError
On 2010-01-11 18:40, Godmar Back wrote: On Mon, Jan 11, 2010 at 12:30 PM, Fuad Efendif...@efendi.ca wrote: Googling reveals http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you could try increasing the Java stack size in bin/nutch (-Xss), or use an alternate regexp if you can. Just out of curiosity, why does a performance critical program such as Nutch use Sun's backtracking-based regexp implementation rather than an efficient Thompson-based one? Do you need the additional expressiveness provided by PCRE? Very interesting point... we should use it for BIXO too. BTW, SUN has memory leaks with LinkedBlockingQueue, http://bugs.sun.com/view_bug.do?bug_id=6806875 http://tech.groups.yahoo.com/group/bixo-dev/message/329 I don't think we use this class in Nutch. And, of course, URL is synchronized; Apache Tomcat uses simplified version of URL class. And, RegexUrlNormalizer is synchronized in Nutch... And, in order to retrieve plain text from HTML we are creating fat DOM object (instead of using, for instance, filters in NekoHtml) We are creating a DOM tree because it's much easier to write filtering plugins that work with DOM tree than implement Neko filters. Besides, we provide an option to use TagSoup for HTML parsing, which is not only more resilient to HTML errors but also more efficient. Besides, Nutch is built around plugins. Deactivate parse-html and write your own HTML plugin that avoids these inefficiencies, and we'll be happy to include it in the distribution. And more... I'm no expert, but the reason I brought this up for discussion was that I recently encountered a paper that pointed out that regular expression matching accounts for a significant fraction of total runtime in search engine indexers [1] and thus it's something that's usually optimized. 
- Godmar [1] http://portal.acm.org/citation.cfm?id=1542275.1542284 This StackOverflow came probably from the urlfilter-regex, which indeed uses Java regex, definitely one of the worst implementations. The reason it's used by default in Nutch is that it's standard in JDK, FWIW. For high-performance crawlers I usually do the following: * avoid regex filtering completely, if possible, instead using a combination of prefix/suffix/domain/custom filters * use urlfilter-automaton, which is slightly less expressive but much much faster. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
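Switching filter implementations is a configuration change: urlfilter-regex and urlfilter-automaton are both plugins selected via plugin.includes. A sketch of the relevant nutch-site.xml override (the plugin list is illustrative; keep whatever other plugins your setup needs):

```xml
<property>
  <name>plugin.includes</name>
  <!-- urlfilter-automaton replaces urlfilter-regex here -->
  <value>protocol-http|urlfilter-automaton|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

urlfilter-automaton reads its rules from its own rules file (automaton-urlfilter.txt by default, if memory serves) and supports a more restricted regex dialect, which is what makes it fast.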
Re: Purging from Nutch after indexing with Solr
On 2010-01-09 10:18, MilleBii wrote: @Andrzej, To be more specific if one uses cached content (which I do), what is the minimal staff to keep, I guess : + crawl_fetch + parse_data + parse_text the rest is not used ... I guess, before I start testing could you confirm ? crawl_fetch you can ignore - it's just the status of fetching, which should be by that time already integrated into crawldb (if you ran updatedb). It's the content/ that you need to display cached view. @Ulysse, The other reason to keep all data is if you will need to reindex all segments, which does happen in development test phases, less in production though. Right. Also, a common practice is to keep the raw data for a while just to make sure that the parsing and indexing went smoothly (in case you need to re-parse the raw content). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Purging from Nutch after indexing with Solr
On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote: I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some questions regarding data redundancy with this setup. Considering the following sample segment: 2.0G content, 196K crawl_fetch, 152K crawl_generate, 376K crawl_parse, 392K parse_data, 441M parse_text 1. From what I have found through searches content holds the raw fetched content, is there any problem if I remove it, ie: does nutch need it to apply any sort of logic when re-crawling that content/url? No, they are no longer needed, unless you want to provide a cached view of the content. 2. Previous question applies to parse_data and parse_text after i've called nutch solrindex on that segment. Depends how you set up your search. If you search using NutchBean (i.e. the Nutch web application) then you need them. If you search using Solr, then you don't need them. 3. Using sample scripts and tutorials I'm always seeing invertlinks being called over all segments, but its output mentions merging, when I fetch/parse new segments can I call invertlinks only over them? Yes, invertlinks will incrementally merge the existing linkdb with new links from a new segment. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Dedup remove all duplicates
On 2010-01-06 18:56, Pascal Dimassimo wrote: Hi, After I run the index command, my index contains 2 documents with the same boost, digest, segment and title, but with different tstamp and url. When I run the dedup command on that index, both documents are removed. Should the document with the latest tstamp be kept? It should. Out of multiple documents with the same URL (url duplicates), only the most recent is retained - unless it was removed because there was another document in the index with the same content (a content duplicate). Could you please verify this on a minimal index (2 documents), and if the problem persists please report it in JIRA. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Accessing crawled data
On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'am aware of that. The problem is that i have some fields of the SolrDocument that i want to compute by text analysis (basically i want to do some smart keywords extraction) so i have to get in the middle between crawling and indexing! My actual solution is to dump the content in a file through the segreader, parse it and then use SolrJ to send the documents. Probably the best solution is to set my own analyzer for the field on solr side, and do keywords extraction there. Thanks for the script, you'll use it! Likely the solution that you are looking for is an IndexingFilter - this receives a copy of the document with all fields collected just before it's sent to the indexing backend - and you can freely modify the content of NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Accessing crawled data
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'am aware of that. The problem is that i have some fields of the SolrDocument that i want to compute by text analysis (basically i want to do some smart keywords extraction) so i have to get in the middle between crawling and indexing! My actual solution is to dump the content in a file through the segreader, parse it and then use SolrJ to send the documents. Probably the best solution is to set my own analyzer for the field on solr side, and do keywords extraction there. Thanks for the script, you'll use it! Likely the solution that you are looking for is an IndexingFilter - this receives a copy of the document with all fields collected just before it's sent to the indexing backend - and you can freely modify the content of NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc. This sounds very interesting. So the idea is to take the NutchDocument as it comes out of the crawling and modify it (inside of an IndexingFilter) before it's sent to indexing (inside of nutch), right? Correct - IndexingFilter-s work no matter whether you use Nutch or Solr indexing. So how does it relate to nutch schema and solr schema? Can you give me some pointers? Please take a look at how e.g. the index-more filter is implemented - basically you need to copy this filter and make whatever modifications you need ;) Keep in mind that any fields that you create in NutchDocument need to be properly declared in schema.xml when using Solr indexing. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
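A minimal sketch of such a filter, modeled loosely on index-more (hypothetical and untested; the "keywords" field name and the extraction logic are placeholders, and the field must be declared in schema.xml for Solr indexing):

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Adds a computed "keywords" field to each document just before it is
// handed to the indexing backend (Lucene or Solr alike).
public class KeywordsIndexingFilter implements IndexingFilter {
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String keywords = extractKeywords(parse.getText());
    if (keywords != null) {
      doc.add("keywords", keywords);
    }
    return doc;
  }

  // placeholder for the real text analysis
  private String extractKeywords(String text) {
    return text;
  }
  // setConf()/getConf() omitted in this sketch
}
```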
Re: Large files - nutch failing to fetch
On 2009-12-21 17:15, Sundara Kaku wrote: Hi, Nutch is throwing errors while fetching large files (files with size more than 100mb). I have a website with pages that point to large files (file size varies from 10mb to 500mb) and there are several large files in that website. I want to fetch all the files using Nutch, but nutch is throwing an outofmemory exception for large files (I have set heap size to 2500m); with heap memory 2500m files of 250mb are retrieved but larger than that are failing, and nutch takes a lot of time after printing -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 if there are three files with size 100mb each then it is failing (at the same depth, with heap size 2500m) to fetch files. i have set http.content.limit to -1 is there a way to fetch several large files using nutch.. I am using nutch as a webcrawler, i am not using Indexing. I want to download web resources and scan them for viruses using ClamAV. Probably Nutch is not the right tool for you - you should probably use wget. Nutch was designed to fetch many pages of limited size - as a temporary step it caches the downloaded content in memory, before flushing it out to disk. (I had to solve this limitation once for a specific case - the solution was to implement a variant of the protocol and Content that stored data into separate HDFS files without buffering in memory - but it was a brittle hack that only worked for that particular scenario). -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException
On 2009-12-17 10:13, Eran Zinman wrote: Hi, I'm getting a Nutch/Hadoop exception: AlreadyBeingCreatedException on some of Nutch's parser reduce tasks. I know this is a known issue with Nutch ( https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717058#action_12717058 ) And as far as I can see that patch wasn't committed yet because we wanted to examine it on the new Hadoop 0.20 version. I am using the latest Nutch with Hadoop 0.20 and I can confirm this exception still occurs (rarely - but it does) - maybe we should commit the change? Thanks for reporting this - could you perhaps try to apply that patch and see if it helps? I hesitated to commit it because it's really a workaround and not a solution ... but if it works for you then it's better than nothing. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: OR support
On 2009-12-14 16:05, BrunoWL wrote: Nobody? Please, any answer would be good. Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479 That's the current status, i.e. this functionality is available only as a patch. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Luke reading index in hdfs
On 2009-12-11 22:21, MilleBii wrote: Guys is there a way you can get Luke to read the index from hdfs:// ??? Or you have to copy it out to the local filesystem? Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x. Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1. Start Luke, dismiss the open dialog, and then go to Plugins / Hadoop, and enter the full URL of the index directory (including the hdfs:// part). You can also open multiple parts of the index (e.g. if you follow the Nutch naming convention, you can directly open the indexes/ directory that contains part-N partial indexes). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: NOINDEX, NOFOLLOW
On 2009-12-10 20:33, Kirby Bohling wrote: On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAMmbel...@msn.com wrote: hi, i have a page with <meta name="robots" content="noindex,nofollow" />, now i know that nutch obeys this tag because i dont find the content and the title in my index, but i was expecting that this document would not be present in the index at all. why does nutch keep the document in my index with no title and no content ?? i'm using the index-basic and index-more plugins, and i want to understand why nutch is still filling in the url, date, boost etc since it didnt do it for title and content. i was thinking that if nutch obeys nofollow and noindex it would skip the whole document ! or maybe i misunderstood something, can you plz explain this behavior to me? best regards. My guess is that the page is recorded to note that the page shouldn't be fetched, I'm guessing the status is one of the magic values. It probably re-fetches the page periodically to ensure it has the list. So the URL and the date make sense to me as to why they populate them. I don't know why it is computing the boost, other than the fact that it might be part of the OPIC scoring algorithm. If the scoring algorithm ever uses the scores/boost of the pages that you point at as a contributing factor, it would make total sense. So even though it doesn't index "http://example/foo/bar", knowing which pages point there, and what their scores are, could contribute to the scores of pages that you do index that contain an outlink to that page. Very good explanation, that's exactly the reason why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
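To drop such pages before they ever enter the crawldb, a rule in conf/regex-urlfilter.txt is usually enough. The patterns below are only illustrative; the first matching rule wins:

```
# Skip a section of the site entirely (hypothetical path)
-^http://www\.example\.com/private/
# Accept everything else
+.
```

A ScoringFilter can achieve a similar effect through scores, but URL filtering is the simpler tool when the unwanted pages are identifiable by URL.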
Re: domain vs www.domain?
On 2009-12-10 19:59, Jesse Hires wrote: I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically I am seeing www.domain.com and domain.combeing recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting, if not, what would you recomend doing to prevent this? This is a surprisingly difficult problem to solve in general case, because it's not always true that 'www.domain' equals 'domain'. If you do know this is true in your particular case, you can add a rule to regex-urlnormalizer that changes the matching urls to e.g. always lose the 'www.' part. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
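If 'www.domain.com' and 'domain.com' are known to serve identical content, a rule like the following (an illustrative sketch, added inside the existing <regex-normalize> root element of conf/regex-normalize.xml) makes every matching url lose the 'www.' part before it is stored:

```xml
<!-- normalize www.host to host; only safe when both names are equivalent -->
<regex>
  <pattern>^(https?://)www\.</pattern>
  <substitution>$1</substitution>
</regex>
```

With this in place, both variants collapse to the same crawldb entry, so the duplicates never get fetched twice.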
Re: Nutch Hadoop 0.20 - Exception
Eran Zinman wrote: Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Nutch 1.0 wml plugin
yangfeng wrote: I have completed the plugin for parsing wml (wireless markup language). I hope to add it to lucene - what should i do? The best long-term option would be to submit this work to the Tika project - see http://lucene.apache.org/tika/. If you already implemented this as a Nutch plugin, please create a JIRA issue in Nutch, and attach the patch. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: How does generate work ?
MilleBii wrote: Oops, continuing previous mail. So I wonder if there would be a better 'generate' algorithm which would maintain a constant rate of hosts per 100 urls ... Below a certain threshold it stops, or better, starts including URLs of lower scores. That's exactly how the max.urls.per.host limit works. Using scores is de-optimizing the fetching process... Having said that I should first read the code and try to understand it. That wouldn't hurt in any case ;) There is also a method in ScoringFilter-s (e.g. the default scoring-opic) where it determines the priority of a URL during generation. See ScoringFilter.generatorSortValue(..), you can modify this method in scoring-opic (or in your own scoring filter) to prioritize certain urls over others. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
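A sketch of what such an override might look like in a custom scoring filter (hypothetical and untested; in scoring-opic the default is essentially datum.getScore() * initSort, and the host check below is just an example):

```java
// Prioritize urls from a preferred host at generate time: urls with a
// higher sort value are selected first when building a fetchlist.
public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  float sort = datum.getScore() * initSort;
  if (url.toString().contains("://www.example.com/")) {
    sort *= 10.0f;  // these urls sort ahead of everything else
  }
  return sort;
}
```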
Re: org.apache.hadoop.util.DiskChecker$DiskErrorExceptio
BELLINI ADAM wrote: hi, i have this error when crawling org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out Most likely you ran out of tmp disk space. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: odd warnings
Jesse Hires wrote: What is segments.gen and segments_2 ? The warning I am getting happens when I dedup two indexes. I create index1 and index2 through generate/fetch/index/...etc index1 is an index of 1/2 the segments. index2 is an index of the other 1/2 The warning is happening on both datanodes. The command I am running is bin/nutch dedup crawl/index1 crawl/index2 If segments.gen and segments_2 are supposed to be directories, then why are they created as files? They are created as files from the start bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX crawl/segments/YYY I don't see any errors or warnings about creating the index. The command that you quote above produces multiple partial indexes, located in crawl/index1/part-N and only in these subdirectories the Lucene indexes can be found. However, the deduplication process doesn't accept partial indexes, so you need to specify each /part- dir as an input to dedup. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
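For example (the part names are illustrative - list whatever part-NNNNN directories actually exist under each index):

```shell
bin/nutch dedup crawl/index1/part-00000 crawl/index1/part-00001 \
                crawl/index2/part-00000 crawl/index2/part-00001
```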
Re: Nutch frozen but not exiting
Paul Tomblin wrote: My nutch crawl just stopped. The process is still there, and doesn't respond to a kill -TERM or a kill -HUP, but it hasn't written anything to the log file in the last 40 minutes. The last thing it logged was some calls to my custom url filter. Nothing has been written in the hadoop directory or the crawldir/crawldb or the segments dir in that time. How can I tell what's going on and why it's stopped? If you run in distributed / pseudo-distributed mode, you can check the status in the JobTracker UI. If you are running in local mode, then it's likely that the process is in a (single) reduce phase sorting the data - with larger jobs in local mode the sorting phase may take very long time, due to a heavy disk IO (and in disk-wait state it may be uninterruptible). Try to generate a thread dump to see what code is being executed. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch frozen but not exiting
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: -bash-3.2$ jstack -F 32507 Attaching to process ID 32507, please wait... Hm, I can't see anything obviously wrong with that thread dump. What's the CPU and swap usage, and loadavg? The process is using a lot of CPU. loadavg is up over 5. top - 15:12:19 up 22 days, 4:06, 2 users, load average: 5.01, 5.00, 4.93 Tasks: 48 total, 2 running, 45 sleeping, 0 stopped, 1 zombie Cpu(s): 1.0% us, 99.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 3170584k total, 2231700k used, 938884k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 32507 discover 16 0 1163m 974m 8604 S 394.7 31.5 719:40.71 java Actually, the memory is a real annoyance - the hosting company doesn't give me any swap, so when hadoop does a fork/exec just to do a whoami, I have to leave as much memory free as the crawl reserves with -Xmx for itself. Hm, the curious thing here is that the java process is sleeping, and 99% of cpu is in system time ... usually this would indicate swapping, but since there is no swap in your setup I'm stumped. Still, this may be related to the weird memory/swap setup on that machine - try decreasing the heap size and see what happens. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
Re: Encoding the content got from Fetcher
Santiago Pérez wrote: Yes, I tried in that configuration file setting the latin encoding Windows-1250, but the value of this property does not affect the encoding of the content (I also tried with a nonexistent encoding and the result is the same...) <property> <name>parser.character.encoding.default</name> <value>Windows-1250</value> <description>The character encoding to fall back to when no other information is available</description> </property> Has anyone had the same problem? (Hungarian or Polish people surely...) The appearance of the characters that you quoted in your other email indicates that the problem may be the opposite - your pages seem to use UTF-8, and you are trying to convert them using Windows-1250 ... Try putting UTF-8 in this property, and see what happens. Generally speaking, pages should declare their encoding, either in HTTP headers or in meta tags, but often this declaration is either missing or completely wrong. Nutch uses the ICU4J CharsetDetector plus its own heuristics (in util.EncodingDetector and in HtmlParser) that try to detect the character encoding if it's missing or even if it's wrong - but this is a tricky issue and sometimes results are unpredictable. -- Best regards, Andrzej Bialecki (http://www.sigram.com)
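The effect is easy to reproduce with nothing but the JDK - the sketch below (the sample text is arbitrary) decodes genuine UTF-8 bytes with the Windows-1250 fallback and gets mojibake, while decoding them as UTF-8 recovers the original:

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        String original = "Zażółć gęślą jaźń";   // Polish pangram
        byte[] utf8Bytes = original.getBytes("UTF-8");

        // wrong fallback: every non-ASCII char turns into two junk chars
        String garbled = new String(utf8Bytes, Charset.forName("windows-1250"));
        // correct fallback: round-trips cleanly
        String decoded = new String(utf8Bytes, "UTF-8");

        System.out.println(garbled);   // mojibake
        System.out.println(decoded);   // identical to the original
    }
}
```

This is why setting parser.character.encoding.default to UTF-8 helps when pages really are UTF-8 but fail to declare it.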
Re: 100 fetches per second?
MilleBii wrote: Interesting updates on the current run of 450K urls : + 30 minutes @ 3Mbits/s + drop to 1Mbit/s (1/X shape) + gradual improvement to 1.5 Mbit/s and steady for 7 hours + sudden drop to 0.9 Mbits/s and steady for 4 hours + up to 1.7 Mbits for 1 hour + staircasing down to 0.5 Mbit/s by steps of 1 hour I don't know what to take as a conclusion, but it is quite strange to have those sudden variations of bandwidth, and overall it is very slow. I can post the graph if people are interested. This most likely comes from the allocation of urls to map tasks, and the maximum number of map tasks that you can run on your cluster. When tasks finish their run, you see a sudden drop in speed, until the next task starts running. Initially, I suspect, you have more tasks available than the capacity of your cluster, so it's easy to fill the slots and max out the speed. Later on, slow map tasks tend to hang around, but still some of them finish and make space for new tasks. As time goes on, the majority of your tasks become slow tasks, so the overall speed continues to drop. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: 100 fetches per second?
MilleBii wrote: You mean map/reduce tasks ??? Yes. Being in pseudo-distributed / single node I only have two maps during the fetch phase... so it would be back to the URL distribution. Well, yes, but my explanation is still valid. Which unfortunately doesn't change the situation. Next week I will be working on integrating the patches from Julien, and if time permits I could perhaps start working on speed monitoring to lock out slow servers. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Broken segments ?
Mischa Tuffield wrote: Hello All, http://people.apache.org/~hossman/#threadhijack When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: hi, my two urls point to the same page ! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use the readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run a unix diff utility. can you tell me plz more about TextProfileSignature ? how should i use it Configure this type of signature in your nutch-site.xml - please see the nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
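For reference, switching the signature implementation is a single nutch-site.xml override; a sketch using the property name as documented in nutch-default.xml:

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The implementation of the page signature class,
  used by Generator/Fetcher/Indexer for deduplication.</description>
</property>
```

As noted above, existing segments must be re-parsed and the db updated before the new signatures take effect.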
Re: Nutch config IOException
Mischa Tuffield wrote: Hello Again, Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. <!-- 2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:176) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:164) at org.apache.hadoop.hdfs.protocol.FSConstants.<clinit>(FSConstants.java:51) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) --> Any pointers would be great, I wonder is there a way for me to validate my conf options before I deploy nutch? This exception is innocuous - it helps to debug at which points in the code the Configuration instances are being created. And you wouldn't have seen this if you didn't turn on the DEBUG logging. ;) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: 100 fetches per second?
MilleBii wrote: I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then guess what : got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only get 3Mbit/s max before (nota: bits and not bytes as I said before). A few samples show that I was running at 50 fetches/sec ... not bad. But why this high speed on this run I haven't got the faintest idea. Then it drops and I get this kind of logs 2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120 2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120 2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 Don't fully understand why it is oscillating between two queue sizes, never mind, but it is likely the end of the run since hadoop shows 99.99% complete for the 2 maps it generated. Would that be explained by a better URL mix? I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very very low due to the limits at their end (either physical or artificial). The solution in that case would be to track a minimum avg. speed per FetchQueue, and lock out the queue if this number drops below the threshold (similarly to what we do when we discover a crawl-delay that is too high). In the meantime, you could add the number of FetchQueue-s to that diagnostic output, to see how many unique hosts are in the current working set. -- Best regards, Andrzej Bialecki ___. 
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. in my case i don't have the same URLs but still have the same content for those URLs. i give you an example: i have three urls that have the same content 1- www.domaine/folder/ 2- www.domaine/folder/index.html 3- www.domaine/folder/index.html?lang=fr but i find all of them in my index :( i was expecting that dedup would delete 1 and 2. the dedup won't work correctly !! Please check the value of the Signature field for all the above urls in your crawldb. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: yes i checked the signatures and they're not the same !! it's really weird, the url www.domaine/folder/index.html?lang=fr is just the same page as www.domaine/folder/index.html Apparently it isn't a bit-exact replica of the page, so its MD5 hash is different. You need to use a more relaxed Signature implementation, e.g. TextProfileSignature. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
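The difference between a bit-exact signature and a relaxed one can be sketched in a few lines of Java. Note this is a simplified illustration of the idea only - the real TextProfileSignature builds a term-frequency profile of the parsed text, not the naive whitespace normalization used here:

```java
import java.security.MessageDigest;

// A single extra space changes the MD5 hash (so MD5Signature sees two
// different pages), while a relaxed signature that normalizes the text
// first maps trivial variations to the same value.
public class SignatureDemo {
    public static String md5(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(text.getBytes("UTF-8"))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Simplified "relaxed" signature: lowercase, collapse whitespace.
    public static String relaxedSignature(String text) {
        return md5(text.toLowerCase().replaceAll("\\s+", " ").trim());
    }

    public static void main(String[] args) {
        String a = "Hello  World";
        String b = "Hello World ";
        System.out.println(md5(a).equals(md5(b)));                           // false
        System.out.println(relaxedSignature(a).equals(relaxedSignature(b))); // true
    }
}
```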
Re: AbstractFetchSchedule
reinhard schwab wrote: there is some piece of code i don't understand public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages with too long fetchInterval are adjusted so that they fit within // maximum fetchInterval (segment retention period). if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) { datum.setFetchInterval(maxInterval * 0.9f); datum.setFetchTime(curTime); } if (datum.getFetchTime() > curTime) { return false; // not time yet } return true; } First, concerning the segment retention - we want to enforce that pages that were not refreshed longer than maxInterval should be retried, no matter what their status is - because we want to obtain a copy of the page in a newer segment in order to be able to delete the old segment. why is the fetch time set here to curTime? Because we want to fetch it now - see the next line where this condition is checked. and why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval? Hm, indeed this looks like a bug - we should instead do it like this: if (datum.getFetchInterval() > maxInterval) { datum.setFetchInterval(maxInterval * 0.9f); } -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
Dennis Kubes wrote: I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :) Great, thanks! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
John Martyniak wrote: Does anybody know of any concrete plans to update Nutch to Hadoop 0.20, 0.21? Something like a Nutch 1.1 release, get in some bug fixes and get current on Hadoop? I think that should be one of the goals. My 2 cents. I'm planning to do this upgrade soon (~a week) - and I agree that we should have a 1.1 release in the near future. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch near future - strategic directions
Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does plain vanilla mean here? Do you mean the current DB implementation? That's the idea, we should aim for an abstract layer that can accommodate both HBase and plain MapFile-s. -split into reusable components with nice and clean public api -publish mvn artifacts so developers can directly use mvn, ivy etc to pull required dependencies for their specific crawler +1, with slight preference towards ivy. My biggest concern is in execution of this (or any other) plan. Some of the changes or improvements that have been proposed are quite heavy in nature and would require large changes. I am just wondering whether it would still be better to make a fresh start instead of trying to do this incrementally on top of the existing code base. Well ... that's (almost) what Dogacan did with the HBase port. I agree that we should not feel too constrained by the existing code base, but it would be silly to throw everything away and start from scratch - we need to find a middle ground. The crawler-commons and Tika projects should help us to get rid of the ballast and significantly reduce the size of our code. In the history of Nutch this approach is not something new (remember map reduce?) and in my opinion it worked nicely then. Perhaps it is different this time since the changes we are discussing now have many abstract things hanging in the air, even fundamental ones. Nutch 0.7 to 0.8 reused a lot of the existing code. Of course the rewrite approach means that it will take some time before we actually get to the point where we can start adding real substance (meaning new features etc). So to summarize, I would go ahead and put together a branch nutch N.0 that would consist of (a.k.a my wish list, hope I am not being too aggressive here): -runs on top of plain hadoop See above - what do you mean by that? 
-use osgi (or some other more optimal extension mechanism that fits and is easy to use) -basic http/https crawling functionality (with db abstraction or hbase directly and smart data structures that allow flexible and efficient usage of the data) -basic solr integration for indexing/search -basic parsing with tika After the basics are ok we would start adding and promoting any of the hidden gems we might have, or some solutions for the interesting challenges. I believe that's more or less where Dogacan's port is right now, except it's not merged with the OSGI port. ps. many of the interesting challenges in your proposal seem to fall in the category of data analysis and manipulation that are mostly used after the data has been crawled or between the fetch cycles, so many of those could be implemented into the current code base also; somehow I just feel that things could be made more efficient and understandable if the foundation (eg. data structures, extensibility for example) was in better shape. Also if written nicely other projects could use them too! Definitely agree with this. Example: the PageRank package - it works quite well with the current code, but its design is obscured by the ScoringFilter api and the need to maintain its own extended DB-s. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
Dennis Kubes wrote: I would like to get a couple things in this release as well. Let me know if you want help with the upgrade. You mean you want to do the Hadoop upgrade? I won't stand in your way :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch near future - strategic directions
Subhojit Roy wrote: Hi, Would it be possible to include in Nutch the ability to crawl/download a page only if the page has been updated since the last crawl? I had read sometime back that there were plans to include such a feature. It would be a very useful feature to have IMO. This of course depends on the last-modified timestamp being present on the webpage that is being crawled, which I believe is not mandatory. Still, those who do set it would benefit. This is already implemented - see the Signature / MD5Signature / TextProfileSignature. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: decoding nutch readseg -dump 's output
Yves Petinot wrote: Hi, I'm trying to build a small perl (could be any scripting language) utility that takes nutch readseg -dump 's output as its input, decodes the content field to utf-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little bit of experimentation, i find myself unable to decode the content field, even when i try using the various charset hints that are available either in the content metadata, or in the raw content itself. I was wondering if someone on the list has already succeeded in building this type of functionality, or is the content returned by readseg using a specific encoding that i don't know of ? The dump functionality is not intended to provide a bit-by-bit copy of the segment, it's mostly for debugging purposes. It uses System.out, which in turn uses the default platform encoding - any characters outside this encoding will be replaced by question marks. If you want to get an exact copy of the raw binary content then please use the SegmentReader API. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Scalability for one site
Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Your Hadoop cluster does not increase the scalability of the target server and that's the crux of the matter - whether you use Hadoop or not, multiple threads or a single thread, if you want to be polite you will be able to do just 1 req/sec and that's it (at that rate, 1-2 million pages works out to roughly 12-23 days). You can prioritize certain pages for fetching so that you get the most interesting pages first (whatever interesting means). I know that URLs from one domain are assigned to one fetch segment, and polite crawling is enforced. Should I use lower-level parts of Nutch? The built-in limits are there to avoid causing pain for inexperienced search engine operators (and webmasters who are their victims). The source code is there; if you choose, you can modify it to bypass these restrictions, just be aware of the consequences (and don't use Nutch as your user agent ;) ). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Hadoop question
TuxRacer69 wrote: Hi Eran, mapreduce has to store its data on the HDFS file system. More specifically, it needs read/write access to a shared filesystem. If you are brave enough you can use NFS, too, or any other type of filesystem that can be mounted locally on each node (e.g. a NetApp). But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts. You can run two separate clusters even on the same set of machines, just configure them to use different ports AND different local paths. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Synonym Filter with Nutch
Dharan Althuru wrote: Hi, We are trying to incorporate a synonym filter during indexing using Nutch. As per my understanding Nutch doesn't have a synonym indexing plug-in by default. Can we extend IndexFilter in Nutch to incorporate the synonym filter plug-in available in Lucene using WordNet or a custom synonym plug-in, without any negative impacts to existing Nutch indexing (i.e., considering bigrams etc.)? Synonym expansion should be done when the text is analyzed (using Analyzers), so you can reuse Lucene's synonym filter. Unfortunately, this happens at different stages depending on whether you use the built-in Lucene indexer or the Solr indexer. If you use the Lucene indexer, this happens in LuceneWriter, and the only way to affect it is to implement an analysis plugin, so that it's returned from AnalyzerFactory, and use your analysis plugin instead of the default one. See e.g. analysis-fr for an example of how to implement such a plugin. However, when you index to Solr you need to configure Solr's analysis chain, i.e. in your schema.xml you need to define for your fieldType that it has the synonym filter in its indexing analysis chain. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
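As a sketch of the Solr side, a fieldType in schema.xml with a synonym filter in its index-time analysis chain might look like this (the fieldType name and synonyms.txt file are assumptions, not from the thread):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Expand synonyms at index time from a flat synonyms file -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With index-time expansion (expand="true") the query-time chain deliberately omits the synonym filter, which is the usual recommendation to avoid double expansion.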
Re: Problems with Hadoop source
Pablo Aragón wrote: Hej, I am developing a project based on Nutch. It works great (in Eclipse) but due to new requirements I have to change the library hadoop-0.12.2-core.jar to the original source code. I successfully downloaded that code from: http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. After adding it to the project in Eclipse everything seems correct, but the execution shows: Exception in thread "main" java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157) at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91) at org.apache.nutch.crawl.Crawl.main(Crawl.java:103) Any idea? Yes - when you worked with a pre-built jar it contained an embedded hadoop-default.xml that defines the implementation of the file:// scheme FileSystem. Now you probably forgot to put hadoop-default.xml on your classpath. Go to Build Path and add this file to your classpath, and all should be ok. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Nutch near future - strategic directions
We should make Nutch an attractive platform for such users, and we should discuss what this entails. Also, if we refactor Nutch in the way I described above, it will be easier for such users to contribute back to Nutch and other related projects. 3. Provide a platform for solving the really interesting issues --- Nutch has many bits and pieces that implement really smart algorithms and heuristics to solve difficult issues that occur in crawling. The problem is that they are often well hidden and poorly documented, and their interaction with the rest of the system is far from obvious. Sometimes this is related to premature performance optimizations, in other cases this is just a poorly abstracted design. Examples would include the OPIC scoring, meta-tags metadata handling, deduplication, redirection handling, etc. Even though these components are usually implemented as plugins, this lack of transparency and poor design makes it difficult to experiment with Nutch. I believe that improving this area will result in many more users contributing back to the project, both from business and from academia. And there are quite a few interesting challenges to solve: * crawl scheduling, i.e. determining the order and composition of fetchlists to maximize the crawling speed. * spam / junk detection (I won't go into details on this, there are tons of literature on the subject) * crawler trap handling (e.g. the classic calendar page that generates an infinite number of pages). * enterprise-specific ranking and scoring. This includes users' feedback (explicit and implicit, e.g. click-throughs) * pagelet-level crawling (e.g. portals, RSS feeds, discussion fora) * near-duplicate detection, and the closely related issue of extraction of the main content from a templated page. * URL aliasing (e.g. www.a.com == a.com == a.com/index.html == a.com/default.asp), and what happens with inlinks to such aliased pages. 
Also related to this is the problem of temporary/permanent redirects and complete mirrors. Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an attractive platform to develop and experiment with such components. - Briefly ;) that's what comes to my mind when I think about the future of Nutch. I invite you all to share your thoughts and suggestions! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: changing/addding field in existing index
fa...@butterflycluster.net wrote: hi all, i have an existing index - we have a custom field that needs to be added or changed in every currently indexed document ; whats the best way to go about this without recreating the index again? There are ways to do it directly on the index, but this is complicated and involves hacking the low-level Lucene format. Alternatively, you could build a parallel index with just these fields, but synchronized internal docId-s, open both indexes with ParallelReader, and then create a new index using IndexWriter.addIndexes(). I suggest recreating the index. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Direct Access to Cached Data
Hugo Pinto wrote: Hello, I am using Nutch for mirroring, rather than crawling and indexing. I need to access directly the cached data in my Nutch index, but I am unable to find an easy way to do so. I browsed the documentation(wiki, javadocs, and skimmed the code), but found no straightforward way to do it. Would anyone suggest a place to look for more information, or perhaps have done this before and could share a few tips? Most likely what you need is not the Lucene index, but the segments (shards), right? There's a utility called SegmentReader (available from cmd-line as readseg), and you can use its API to retrieve either all or individual records from a segment (using URL as key). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
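From the command line, a single record can be pulled out of a segment by URL; a hedged sketch (the segment path and URL below are placeholders, not from the thread):

```shell
# Dump all parts of one record (content, parse data, etc.) for a given URL
bin/nutch readseg -get crawl/segments/20091126120000 http://example.com/page.html
```

The same operation is available programmatically through the SegmentReader API, as noted above.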
Re: updatedb is talking long long time
Kalaimathan Mahenthiran wrote: I forgot to add the detail... The segment i'm trying to do updatedb on has 1.3 million urls fetched and 1.08 million urls parsed.. Any help related to this would be appreciated... On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran matha...@gmail.com wrote: hi everyone I'm using nutch 1.0. I have fetched successfully and am currently on the updatedb process. I'm doing updatedb and it's taking so long. I don't know why it's taking this long. I have a new machine with a quad core processor and 8 gb of ram. I believe this system is really good in terms of processing power. I don't think processing power is the problem here. I noticed that all the ram is getting used up, close to 7.7gb, by the updatedb process. The computer is becoming really slow. The updatedb process has been running for the last 19 days continually with the message merging segment data into db.. Does anyone know why it's taking so long... Is there any configuration setting i can do to increase the speed of the updatedb process... First, this process normally takes just a few minutes, depending on the hardware, and not several days - so something is wrong. * do you run this in local or pseudo-distributed mode (i.e. running a real jobtracker and tasktracker)? Try the pseudo-distributed mode, because then you can monitor the progress in the web UI. * how many reduce tasks do you have? with large updates it helps if you run > 1 reducer, to split the final sorting. * if the task appears to be completely stuck, please generate a thread dump (kill -SIGQUIT) and see where it's stuck. This could be related to urlfilter-regex or urlnormalizer-regex - you can identify if these are problematic by removing them from the config and re-running the operation. * minor issue - when specifying the path names of segments and crawldb, do NOT append the trailing slash - it's not harmful in this particular case, but you could have a nasty surprise when doing e.g. copy / mv operations ... 
-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: including code between plugins
Eran Zinman wrote: Hi, I've written my own plugin that's doing some custom parsing. I've needed language parsing in that plugin and the language-identifier plugin is working great for my needs. However, I can't use the language identifier plugin as it is, since I want to parse only a small portion of the webpage. I've used the language identifier functions and it worked great in eclipse, but when I try to compile my plugin I'm unable to compile it since it depends on the language-identifier source code. My question is - how can I include the language identifier code in my plugin code without actually using the language-identifier plugin? You need to add the language-identifier plugin to the requires section in your plugin.xml, like this: <requires> <import plugin="nutch-extensionpoints"/> <import plugin="language-identifier"/> </requires> -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: could you unsubscribe me from this mailing list pls. tks
Nico Sabbi wrote: Il giorno lun, 02/11/2009 alle 10.04 +0100, Heiko Dietze ha scritto: Hello, there is no Administrator. But you can do the unsubscribe your-self. On the Nutch Maling-List information site http://lucene.apache.org/nutch/mailing_lists.html you can find the following E-Mail address: nutch-user-unsubscr...@lucene.apache.org Then your unsubscribe requests should work. regards, Heiko Dietze doesn't work, as reported by me and others last week. Thanks, Did you get the message with the subject of confirm unsubscribe from nutch-user@lucene.apache.org and did you respond to it from the same email account that you were subscribed from? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)
Andrzej Bialecki wrote: doesn't work, as reported by me and others last week. Thanks, Did you get the message with the subject of confirm unsubscribe from nutch-user@lucene.apache.org and did you respond to it from the same email account that you were subscribed from? .. I just verified that this process works correctly - I subscribed and unsubscribed successfully. Please make sure that you complete the unsubscription process as listed below: 1. make sure you are sending requests from the same email address that you were subscribed from! 2. send email to nutch-user-unsubscr...@lucene.apache.org . 3. you will get a confirm unsubscribe message - make sure your anti-spam filters don't block this message, and make sure you are still using the correct email account when responding. 4. you need to reply to the confirm unsubscribe message (duh...) 5. you will get a GOODBYE message. Now, let me understand this clearly: did you go through all 5 steps listed above, and you are still getting messages from this list? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: unbalanced fetching
Jesse Hires wrote: I have a setup with two datanodes and one namenode. One of my datanodes is slower than the other, causing the fetch to run significantly longer on it. Is there a way to balance this out?

Most likely the number of URLs per host is unbalanced, meaning that the tasktracker that takes the longest is assigned a lot of URLs from a single host. A workaround for this is to limit the maximum number of URLs per host (in nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever works best for you. -- Best regards, Andrzej Bialecki
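A minimal nutch-site.xml sketch of this workaround. The property name generate.max.per.host is the one Nutch 1.0 uses for the per-host cap at fetchlist generation time; the value below is illustrative, not taken from the thread:

```xml
<!-- nutch-site.xml: cap the number of URLs selected per host when
     generating a fetchlist, so that one large host cannot dominate
     a single fetch task. Default is -1 (unlimited). -->
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
</property>
```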
Re: Nutch indexes fewer pages than it fetches
caezar wrote: Some more information. Debugging the reduce method I've noticed that before the code

if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks }

my page has fetchDatum, parseText and parseData not null, but dbDatum is null. That's why it's skipped :) Any ideas about the reason?

Yes - you should run updatedb with this segment, and also run invertlinks with this segment, _before_ trying to index. Otherwise the db status won't be updated properly. -- Best regards, Andrzej Bialecki
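A sketch of the command order this implies. The segment name and directory layout below are placeholders, not taken from the thread; adjust to your crawl directory:

```shell
# Illustrative Nutch 1.0 command order - updatedb and invertlinks
# must run on the segment before indexing it.
SEG=crawl/segments/20091001000000          # placeholder: the segment just fetched and parsed

bin/nutch updatedb crawl/crawldb "$SEG"    # refresh CrawlDb status (fills dbDatum)
bin/nutch invertlinks crawl/linkdb "$SEG"  # build the link database
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$SEG"
```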
Re: Deleting stale URLs from Nutch/Solr
Gora Mohanty wrote: On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in the Nutch crawldb to prevent their re-discovery (through stale links pointing to these URLs from other pages). If you really want to remove them from the CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). [...]

Thank you for your help. Your suggestions look promising, but I think that I did not make myself adequately clear. Once we have completed a site crawl with Nutch, ideally I would like to be able to find stale links without doing a complete recrawl, i.e., only by restarting the crawl from where it last left off. Is that possible? I tried a simple test on a local webserver with five pages in a three-level hierarchy. The crawl completes, and discovers all five URLs as expected. Now, I remove a tertiary page. Ideally, I would like to be able to run a recrawl and have Nutch discover the now-missing URL. However, when I try that, it finds no new links and exits.

I assume you mean that the generate step produces no new URLs to fetch? That's expected, because pages become eligible for re-fetching only after Nutch considers them expired, i.e. after fetchTime + fetchInterval, and the default fetchInterval is 30 days. You can pretend that time has moved on using the -adddays parameter. Then Nutch will generate a new fetchlist, and when it discovers that the page is missing it will mark it as gone - actually, you could then take that information directly from the Nutch segment, and instead of processing the CrawlDb you could process the segment to collect a partial list of gone pages. -- Best regards, Andrzej Bialecki
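A sketch of the -adddays trick. The -adddays option is a real Generator flag; the paths are placeholders:

```shell
# Pretend 31 days have passed, so pages fetched during the original
# crawl become due for re-fetching even though the default 30-day
# fetchInterval has not really elapsed.
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
# Then fetch, parse and updatedb the new segment as usual; pages that
# now return 404 will be marked gone.
```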
Re: How to index files only with specific type
Dmitriy Fundak wrote: If I disable the HTML parser (remove parse-html from the plugin.includes property), HTML files don't get parsed, so I don't get outlinks to KML files from HTML pages, so I can't parse and index the KML files. I might not be right, but I have a feeling that it's not possible without modifying source code.

It's possible to do this with a custom indexing filter - see other indexing filters to get a feeling of what's involved. Or you could do this with a scoring filter too, although the scoring API looks more complicated. Either way, when you execute the Indexer these filters are run in a chain, and if one of them returns null then that document is discarded, i.e. it's not added to the output index. So it's easy to examine the content type (or just the URL of the document) in your indexing filter and either pass the document on or reject it by returning null. -- Best regards, Andrzej Bialecki
Re: Deleting stale URLs from Nutch/Solr
Gora Mohanty wrote: Hi, We are using Nutch to crawl an internal site and index the content to Solr. The issue is that the site is run through a CMS, and occasionally pages are deleted, so that the corresponding URLs become invalid. Is there any way that Nutch can discover stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove such stale content from Solr? I am stumped by this problem, and would appreciate any pointers, or even thoughts on this.

Hi, Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in the Nutch crawldb to prevent their re-discovery (through stale links pointing to these URLs from other pages). If you really want to remove them from the CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). Now, when it comes to removing them from Solr... The simplest (no coding) way would be to dump the CrawlDb, use some scripting tools to collect just the URLs with the status GONE, and send them as a delete command to Solr. A slightly more involved solution would be to implement a tool that reads such URLs directly from the CrawlDb (using e.g. the CrawlDbReader API) and then uses the SolrJ API to send the same delete requests plus a commit. -- Best regards, Andrzej Bialecki
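A sketch of the no-coding route. The dump layout assumed below (URL at the start of a record, followed by a Status line containing db_gone) should match a typical bin/nutch readdb crawl/crawldb -dump output, but verify it against your Nutch version; the Solr URL and unique-key field are placeholders:

```shell
# Extract URLs marked gone from a CrawlDb dump read on stdin.
# Assumed record shape per URL (check against your dump):
#   http://host/page  Version: ...
#   Status: 3 (db_gone)
#   ...
extract_gone() {
  awk '/^http/ { url = $1 } /db_gone/ { print url }'
}

# Then, for each gone URL, send a delete to Solr - something like
# (endpoint and <id> field are illustrative):
#   curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
#        --data-binary "<delete><id>$url</id></delete>"
# followed by a <commit/>.
```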
Re: Targeting Specific Links
Eric Osgood wrote: Andrzej, Based on what you suggested below, I have begun to write my own scoring plugin:

Great!

In distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag to the metaData in parseData (KEEP, true). How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

The flag should have been automagically added to the target CrawlDatum metadata after you have updated your crawldb (see the details in CrawlDbReducer). Then in generatorSortValue() you can check for the presence of this flag by using datum.getMetaData(). BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any special way... I thought it did. It's easy to add this, though - in Generator.java:161 just add this: if (sort == Float.MIN_VALUE) { return; } -- Best regards, Andrzej Bialecki
Re: Accessing an Index from a shared location
JusteAvantToi wrote: Hi all, I am new to Nutch and I found that Nutch is really good. I have a problem and hope somebody can shed some light on it. I have built an index and a web application that makes use of that index. I plan to have two web application servers running the application. Since I do not want to replicate the application and the index on each web application server, I put the application and the index in a shared location and configured nutch-site.xml as follows:

<property> <name>searcher.dir</name> <value>\\111.111.111.111\folder\index</value> <description>Path to root of crawl</description> </property>
<property> <name>plugin.folders</name> <value>\\111.111.111.111\folder\plugins</value> <description></description> </property>

However, it seems that my application cannot find the index. I have checked that the web application servers have access to the shared location. Is there something that I missed here? Does Nutch allow us to put the index on a network location?

UNC paths are not supported in Java - you need to mount this location as a local volume. -- Best regards, Andrzej Bialecki
Re: Extending HTML Parser to create subpage index documents
malcolm smith wrote: I am looking to create a parser for a groupware product that would read pages from a message-board type web site (think phpBB). But rather than creating a single Content item which is parsed and indexed into a single Lucene document, I am planning to have the parser create a master document (for the original post) and an additional document for each reply. I've reviewed the code for protocol plugins, parser plugins and indexing plugins, but each interface allows only a single document or content object to be passed around. Am I missing something simple? My best bet at the moment is to implement some kind of new fake protocol for the reply items: I would use the http client plugin for the first request to the page and generate outlinks like fakereplyto://originalurl/reply1, fakereplyto://originalurl/reply2, then go back through and fetch the sub-page content. But this seems round-about and would probably generate an HTTP request for each reply on the original page. Perhaps there is a way to look up the original page in the segment db before requesting it again. Needless to say, it would seem more straightforward to tackle this in some kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least, conceptually a plugin for the indexer might be able to take a set of custom metadata for a replies collection and index it as separate Lucene documents - but I can't find a way to do this given the interfaces in the indexer plugins. Thanks in advance, Malcolm Smith

What version of Nutch are you using? This should already be possible using the 1.0 release or a nightly build. ParseResult (which is what parsers produce) can hold multiple Parse objects, each with its own URL.
The common approach to handling whole-part relationships (like zip/tar archives, RSS, and other compound documents) is to split them in the parser and parse each part, then give each sub-document its own URL (e.g. file.tar!myfile.txt) and add the original URL to the metadata, to keep track of the parent URL. The rest should be handled automatically, although there are some other complications that need to be handled as well (e.g. don't recrawl sub-documents). -- Best regards, Andrzej Bialecki
Re: ERROR: current leaseholder is trying to recreate file.
Eric Osgood wrote: This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.

Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692 Apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem. (The patch will be applied to trunk anyway, since others confirmed that it fixes this issue.)

Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a Nutch/Hadoop combo?

It is. This problem is rare - I think I have crawled cumulatively ~500 million pages in various configs and it never occurred to me personally. It requires a few things to go wrong (see the issue comments). -- Best regards, Andrzej Bialecki
Re: Nutch Enterprise
Dennis Kubes wrote: Depending on what you want to do, Solr may be a better choice as an enterprise search server. If you need crawling, you can use Nutch or attach a different crawler to Solr. If you want to do more full-web search, then Nutch is a better option. What are your requirements? Dennis

fredericoagent wrote: Does anybody have any information on using Nutch for enterprise search, and what would I need? Is it just a case of the current Nutch package, or do you need other add-ons? And how does that compare against Google Enterprise? Thanks

I agree with Dennis - use Nutch if you need to do larger-scale discovery, such as when you crawl the web, but if you already know all target pages in advance then Solr will be a much better (and much easier to handle) platform. -- Best regards, Andrzej Bialecki
Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException
Jesse Hires wrote: Does anyone have any insight into the following error I am seeing in the Hadoop logs? Is this something I should be concerned with, or is it expected to show up in the logs from time to time? If it is not expected, where can I look for more information on what is going on? 2009-10-16 17:02:43,061 ERROR datanode.DataNode - DatanodeRegistration(192.168.1.7:50010, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):DataXceiver org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_90983736382565_3277 is valid, and cannot be written to.

Are you sure you are running a single datanode process per machine? -- Best regards, Andrzej Bialecki
Re: How to run a complete crawl?
Vincent155 wrote: I have a virtual machine running (VMware 1.0.7). Both host and guest run Fedora 10. In the virtual machine I have Nutch installed. I can index directories on my host as if they were websites. Now I want to compare Nutch with another search engine. For that, I want to index some 2,500 files in a directory. But when I execute a command like crawl urls -dir crawl.test -depth 3 -topN 2500, or leave out the topN parameter, only some 50 to 75 files are indexed.

Check in your nutch-site.xml the value of db.max.outlinks.per.page; the default is 100. When crawling filesystems, each file in a directory is treated as an outlink, and this limit is then applied. -- Best regards, Andrzej Bialecki
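A minimal nutch-site.xml fragment for this fix. The value below is illustrative - anything comfortably above the directory size works, and a negative value means "keep all outlinks":

```xml
<!-- nutch-site.xml: raise the per-page outlink cap (default 100) so a
     directory listing with ~2,500 entries is followed in full. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
  <description>Max outlinks kept per page; each file in a crawled
  directory counts as one outlink.</description>
</property>
```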
Re: http keep alive
Marko Bauhardt wrote: Hi, is there a way to use HTTP keep-alive with Nutch? Does protocol-http or protocol-httpclient support keep-alive? I can't find any use of HTTP keep-alive in the code or in the configuration files.

protocol-httpclient can support keep-alive. However, I think that it won't help you much. Please consider that the Fetcher needs to wait some time between requests, and in the meantime it will issue requests to other sites. This means that if you want to use keep-alive connections, the number of open connections will climb up quickly, depending on the number of unique sites in your fetchlist, until you run out of available sockets. On the other hand, if the number of unique sites is small, then most of the time the Fetcher will wait anyway, so the benefit from keep-alives (for you as a client) will be small - though there will still be some benefit for the server side. -- Best regards, Andrzej Bialecki
Re: Incremental Whole Web Crawling
Eric Osgood wrote: OK, I think I am on the right track now, but just to be sure: the code I want is in the branches section of svn, under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/, correct?

No, you need the trunk, from here: http://svn.apache.org/repos/asf/lucene/nutch/trunk -- Best regards, Andrzej Bialecki
Re: Incremental Whole Web Crawling
Eric Osgood wrote: So the trunk contains the most recent nightly update?

It's the other way around - the nightly build is created from a snapshot of the trunk. The trunk is always the most recent. -- Best regards, Andrzej Bialecki
Re: indexing just certain content
MilleBii wrote: Andrzej, The use case you are thinking of is: at the parsing stage, filter out garbage content and index only the rest. I have a different use case: I want to keep everything as standard indexing _AND_ also extract a part to be indexed in a dedicated field (which will be boosted at search time). In a document, certain parts have more importance than others in my case. So I would like either: 1. to access the HTML representation at indexing time... not possible, or I did not find out how; or 2. to create a dual representation of the document: the plain standard one, plus the filtered document. I think option 2 is much better because it better fits the model and allows for a lot of other use cases.

Actually, creativecommons provides hints on how to do this... but to be more explicit:

* In your HtmlParseFilter you need to extract the parts that you want from the DOM tree, and put them inside ParseData.metadata. This way you will preserve both the original text and the special parts that you extracted.
* In your IndexingFilter you retrieve the parts from ParseData.metadata and add them as additional index fields (don't forget to specify the indexing backend options).
* In your QueryFilter plugin.xml you declare that QueryParser should pass your special fields through without treating them as terms, and in the implementation you create a BooleanClause to be added to the translated query.

-- Best regards, Andrzej Bialecki
Re: How to ignore search results that don't have related keywords in main body?
winz wrote: Venkateshprasanna wrote: Hi, You can very well think of doing that if you know that you will crawl and index only a selected set of web pages which follow the same design. Otherwise, it turns out to be a never-ending process - i.e., finding out the sections, frames, divs, spans, CSS classes and the like - for each of the web pages. Scalability would obviously be an issue.

Hi, Could I please know how we can ignore template items like the header, footer and menu/navigation while crawling and indexing pages which follow the same design? I'm using a content management system called Infoglue to develop my website. A standard template is applied to all the pages on the website. The search results from Nutch show content from the menu/navigation bar multiple times. I need to get rid of menu/navigation content from the search results.

If all you index is this particular site, then you know the positions of the navigation items, right? Then you can remove these elements in your HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these elements. -- Best regards, Andrzej Bialecki
Re: How to ignore search results that don't have related keywords in main body?
BELLINI ADAM wrote: Hi guys, it's just what I'm talking about in my post 'indexing just certain content'... you can read it, maybe it could help you... I was asking how to get rid of the garbage sections in a document and to parse only the important data... so I guess you will create your own parser and indexer... but the problem is how we could delete those garbage sections from the HTML... try to read my post... maybe we can merge our two posts... I don't know if we can merge posts on this mailing list... to keep tracking only one post...

What is garbage? Can you define it in terms of a regex pattern or an XPath expression that points to specific elements in the DOM tree? If you crawl a single site (or a few) with well-defined templates, then you can hardcode some rules for removing unwanted parts of the page. If you can't do this, then there are some heuristic methods to solve this. There are two groups of methods:

* Page at a time (local): this group of methods considers only the current page that you analyze. The quality of filtering is usually limited.
* Groups of pages (e.g. per site): these methods consider many pages at a time, and try to find a recurring theme among them. Since you first need to accumulate some pages, this can't be done on the fly, i.e. it requires a separate post-processing step.

The easiest to implement in Nutch is the first approach (page at a time). There are many possible implementations - e.g. based on text patterns, on the visual position of elements, on DOM tree patterns, on block-of-content characteristics, etc. Here's, for example, a simple method:

* Collect text from the page in blocks, where each block fits within structural tags (div and table tags). Collect also the number of links in each block.
* Remove a percentage of the smallest blocks where the link count is high - these are likely navigational elements.
* Reconstruct the whole page from the remaining blocks.

-- Best regards, Andrzej Bialecki
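The simple block method sketched above can be illustrated as a self-contained class. This is not Nutch API code - a real version would live in an HtmlParseFilter and walk the DOM tree rather than split raw markup with a regex - and the class name and threshold are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "page at a time" block heuristic: split the page into
 * text blocks at div/table boundaries, then drop short, link-heavy
 * blocks (likely navigation) and keep the rest.
 */
public class BlockFilter {

    // Split the page into non-empty text blocks at <div> and <table> boundaries.
    static List<String> blocks(String html) {
        List<String> out = new ArrayList<>();
        for (String part : html.split("(?i)</?(?:div|table)[^>]*>")) {
            if (!part.trim().isEmpty()) out.add(part.trim());
        }
        return out;
    }

    // Count anchor tags in a block.
    static int linkCount(String block) {
        int n = 0, i = 0;
        String lower = block.toLowerCase();
        while ((i = lower.indexOf("<a ", i)) >= 0) { n++; i += 3; }
        return n;
    }

    // Crude tag stripper, good enough for a sketch.
    static String stripTags(String block) {
        return block.replaceAll("<[^>]+>", " ").trim();
    }

    // Keep blocks that are link-free, or long relative to their link count;
    // minCharsPerLink is the tunable "link density" threshold.
    public static String filter(String html, int minCharsPerLink) {
        StringBuilder sb = new StringBuilder();
        for (String block : blocks(html)) {
            String text = stripTags(block);
            int links = linkCount(block);
            if (links == 0 || text.length() / links >= minCharsPerLink) {
                sb.append(text).append('\n');
            }
        }
        return sb.toString().trim();
    }
}
```

For example, a block containing only two links ("Home", "About") is dropped at a threshold of 40 characters per link, while a link-free paragraph survives untouched.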
Re: indexing just certain content
BELLINI ADAM wrote: Hi, thanks for your detailed answer... you saved me a lot of time, I was thinking of starting to write an HTML tag filter class. Maybe I can create my own HTML parser, as I did for parsing and indexing Dublin Core metadata... it sounds possible, don't you think so? I just have to create, or find, a class which can filter an HTML page and delete certain tags from it.

Guys, please take a look at how HtmlParseFilters are implemented - for example the creativecommons plugin. I believe that's exactly the functionality that you are looking for. -- Best regards, Andrzej Bialecki
Re: Targeting Specific Links
Eric Osgood wrote: Andrzej, How would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(); that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching. And you would put the flag into the CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks(). Maybe this explanation can shed some light:

Ideally, I would like to check the list of links for each page, but still needing a total of X links per page: if I find the links I want, I add them to the list up until X; if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets. As I said, you can also use plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no decision that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter. -- Best regards, Andrzej Bialecki
Re: Targeting Specific Links
Eric Osgood wrote: Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include/exclude? That is the ideal remedy to my problem.

Yes, look at ParseOutputFormat; you can make this decision there. There are two standard extension points where you can hook in - URLFilters and ScoringFilters. Please note that if you use URLFilters to filter out URLs too early, then they will be rediscovered again and again. A better method to handle this, but also a more complicated one, is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin. -- Best regards, Andrzej Bialecki