Re: Nutch and Hadoop not working proper
MilleBii wrote: HELP!!! Stuck for 3 days, not able to start any nutch job. HDFS works fine, i.e. I can put and look at files. When I start a nutch crawl, I get the following error: Job initialization failed: java.lang.IllegalArgumentException: Pathname /d:/Bii/nutch/logs/history/user/_logs/history/localhost_1245788245191_job_200906232217_0001_pc-%5C_inject+urls It is looking for the file at the wrong location. Indeed, in my case the correct location is /d:/Bii/nutch/logs/history, so why is history/user/_logs added, and how can I fix that?

2009/6/21 MilleBii mille...@gmail.com: Looks like I just needed to transfer from the local filesystem to HDFS. Is it safe to transfer a crawl directory (and subdirectories) from the local file system to HDFS and start crawling again? 1. hadoop fs -put crawl crawl 2. nutch generate crawl/crawldb crawl/segments -topN 500 (where now it should use HDFS) -MilleBii-

2009/6/21 MilleBii mille...@gmail.com: I have newly installed hadoop in a distributed single-node configuration. When I run nutch commands it looks for files in my user home directory and not in the nutch directory. How can I change this?

I suspect your hadoop-site.xml uses a relative path somewhere, and not an absolute path (with a leading slash). Also, /d: looks suspiciously like a Windows pathname, in which case you should either use a full URI (file:///d:/) or just the disk name d:/ without the leading slash. Please also note that if you are running this on Windows under Cygwin then in your config files you MUST NOT use the Cygwin paths (like /cygdrive/d/...) because Java can't see them.

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
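The fix suggested above might look like this in hadoop-site.xml (the property chosen and the value shown are illustrative examples only, not a recommended setup):

```xml
<!-- hadoop-site.xml fragment: use absolute paths or full file: URIs,
     never relative paths and never Cygwin-style /cygdrive/... paths. -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- On Windows either a full URI (file:///d:/...) or d:/... without a
       leading slash; on Unix an absolute path such as /var/hadoop/tmp. -->
  <value>file:///d:/Bii/hadoop-tmp</value>
</property>
```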
Re: Nutch and Hadoop not working proper
MilleBii wrote: What I've also discovered: + hadoop (script) works with unix-like paths and works fine on windows + nutch (script) works with Windows paths

bin/nutch works with Windows paths? I think this could happen only by accident - both scripts work with Cygwin paths. On the other hand, arguments passed to the JVM must be regular Windows paths. Could it be that there is some incompatibility because one uses unix-like paths and the other doesn't?

Both scripts work fine for me on Windows XP + Cygwin, without any special settings - I suspect there is something strange in your environment or config... Please note that Hadoop and Nutch scripts are regular shell scripts, so they are aware of Cygwin path conventions; in fact they don't accept un-escaped Windows paths as arguments (i.e. you need to use forward slashes, or you need to put double quotes around a Windows path).

-- Best regards, Andrzej Bialecki
[ANN] Luke + Hadoop, alpha version
Hi all, I prepared a special edition of Luke, the Lucene Index Toolbox, that works with Lucene indexes located on any filesystem supported by Hadoop 0.19.1. At the moment I'm looking for feedback how to best integrate this functionality with various bits and pieces of Luke. You can download the jar file from a direct link: http://www.getopt.org/luke/lukeall-0.9.3.jar This JAR contains all dependencies needed to connect to HDFS, KFS or S3/S3n filesystems, although I tested it only with HDFS so far. Note: this version of Luke still uses Lucene 2.4.1, I didn't start integrating 2.9-dev yet. Quick info for the impatient: yes, you can browse the content, view terms and documents, perform searching, explaining, etc. See below for more details. The initial Open dialog is not integrated yet with this functionality. After you start Luke, you need to dismiss this dialog, go to Plugins / Hadoop Plugin, and enter the full URI of the index in the textfield, and then press the Open button. There is no filesystem browsing for now - you need to know the full URI in advance. Current functionality is as follows: - you can open a single index or partial (sharded) indexes located in part-N/ subdirectories (this is a typical layout resulting from using common map-reduce output formats). In the latter case you will get a single view of partial indexes, thanks to MultiReader. - access is read-only - most FileSystem-s don't support file updates, so it was easiest to disable write access altogether for now. - most of Luke functionality works properly, thanks to the excellent design of IndexReader API. Some operations are disabled due to read-only access, some other information (like top terms) is not populated by default due to a high IO cost, but can be requested explicitly. - the plugin keeps track of the amount of IO reads - I found this very comforting when opening large indexes over a slow VPN line ... 
There is a Clear button on the plugin's tab that resets the counters - this is useful to see how much IO is needed to complete a specific operation. - a lot of code has been reworked to avoid UI stalls when doing slow IO, which means that you can see the amount of IO being done, but the UI is blocked with a modal dialog. It's a bit unwieldy, but other solutions would require too much refactoring. Any feedback is welcome - please keep in mind that this is an early preview. Also, various UI glitches are probably related to the Thinlet toolkit - again, one day I may re-write Luke using something else, but for now I don't have the strength to do it. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
lei wang wrote: anyone help? so disappointed. On Fri, Jul 10, 2009 at 4:29 PM, lei wang nutchmaill...@gmail.com wrote: Yes, I am also running into this problem. Can anyone help? On Sun, Jul 5, 2009 at 11:33 PM, xiao yang yangxiao9...@gmail.com wrote: I often get this error message while crawling the intranet. Is it a network problem? What can I do about it?

$bin/nutch crawl urls -dir crawl -depth 3 -topN 4 crawl started in: crawl rootUrlDir = urls threads = 10 depth = 3 topN = 4 Injector: starting Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: done Generator: Selecting best-scoring urls due for fetch. Generator: starting Generator: segment: crawl/segments/20090705212324 Generator: filtering: true Generator: topN: 4 Generator: Partitioning selected urls by host, for politeness. Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.crawl.Generator.generate(Generator.java:524) at org.apache.nutch.crawl.Generator.generate(Generator.java:409) at org.apache.nutch.crawl.Crawl.main(Crawl.java:116)

If you are running a large crawl on a single machine, you could be running out of file descriptors - please check ulimit -n, the value should be much, much larger than 1024. Also, please check the hadoop.log for clues why shuffle fetching failed - this could be something as trivial as a blocked port, or a routing problem, or a DNS resolution problem, or the problem I mentioned above.

-- Best regards, Andrzej Bialecki
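The file-descriptor check suggested above can be done from the shell; the raised value below is an arbitrary example, not a recommendation:

```shell
# Print the current per-process open-file limit; values around the common
# default of 1024 are too low for a busy Hadoop node doing shuffle fetches.
ulimit -n

# To raise it for the current shell session (example value; the hard limit,
# shown by `ulimit -Hn`, must allow it):
# ulimit -n 65536
```

On most Linux systems a permanent change also requires editing /etc/security/limits.conf for the user running the Hadoop daemons.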
Re: Why cant I inject a google link to the database?
Brian Ulicny wrote: 1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory 4. Do: bin/nutch crawl urls

Please note, we are not saying that it's impossible to do this with Nutch (e.g. by setting the agent string to mimic a browser), but we insist on saying that it's RUDE to do this. Anyway, Google monitors such attempts, and after you issue too many requests your IP will be blocked for some time - so no matter whether you go the polite or the impolite way, you won't be able to do this.

-- Best regards, Andrzej Bialecki
Re: nutch -threads in hadoop
Brian Tingle wrote: Hey, I'm playing around the nutch on hadoop; when I go hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl -threads ... is that threads per node or total threads for all nodes? Threads per map task - if you run multiple map tasks per node then you will get numThreads * numMapTasks per node. So be careful to set it to a number that doesn't overwhelm your network ;) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: nutch -threads in hadoop
Brian Tingle wrote: Thanks, I eventually found where the job trackers were in the :50030 web page of the cloudera thing, and I saw it said 10 threads for each crawler in the little status update box where it was telling me how far along each crawl was. I have to say, this whole thing (nutch/hadoop) is pretty flipping awesome. Great work. I'm running on aws EC2 us-east and spidering sites that should be hosted on the CENIC network in California, do you have any suggestions on what a good number of threads to try per crawler might be in that situation (I'm guessing it might be hard to saturate the bandwidth)? I'm thinking I'll bump it up to at least 25. You need to be careful when running large crawls on someone else's infrastructure. While the raw bandwidth may be enough, the DNS infra may be insufficient - both on the side of the target domains as well as the local resolver. I strongly recommend setting up a local caching DNS. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Gracefull stop in the middle of a fetch phase ?
Alex McLintock wrote: I am not sure if it solves your problem but you might do something like disconnect your machines from the internet - preferably by making your dns server return dont know that domain This will relatively quickly cause the remaining part of the fetch to fail. Just a suggestion... I solved this once by implementing a check in Fetcher.run() for a marker file on HDFS. If the presence of this file was detected, the FetcherThreads would be stopped one by one (again, by setting a flag in their run() methods to terminate the loop). It's a hack but it works well. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
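The flag-and-stop hack described above can be simulated in a standalone way: worker threads poll a shared flag (in the Nutch case this was the presence of a marker file on HDFS) and finish their loops cleanly when it is set. All names here are illustrative; this is not Nutch API code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulStopDemo {
    // Shared stop flag; in the Fetcher hack this was a check for a marker
    // file on HDFS performed in Fetcher.run().
    static final AtomicBoolean stopRequested = new AtomicBoolean(false);

    static Thread startWorker(String name) {
        Thread t = new Thread(() -> {
            while (!stopRequested.get()) {
                // ... fetch one URL here, then re-check the flag ...
                try { Thread.sleep(5); } catch (InterruptedException e) { return; }
            }
            System.out.println(name + " stopped cleanly");
        });
        t.start();
        return t;
    }

    public static void main(String[] args) {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < 3; i++) workers.add(startWorker("worker-" + i));
        try {
            Thread.sleep(25);          // let the "fetch" loops run for a while
            stopRequested.set(true);   // the equivalent of creating the marker file
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("all workers done");
    }
}
```

The same one-by-one shutdown applies to FetcherThreads: each thread notices the flag at the top of its loop and exits, so in-flight requests complete instead of being killed.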
Re: Host specific parsing
Koch Martina wrote: Hi, has anyone built a parsing plugin which decides on a per-host basis how the content of a document should be parsed? For example, if the title of a document is in the first h1-tag of a page for host1, but the title for a document of host2 is in the third h2-tag, the plugin would extract the title differently depending on the host. In my opinion something like a dispatcher plugin would be needed: - Identify the host of a document - Read and cache instructions on how to get the information for that host (database or config file) - Execute the host-specific plugin Do you have any suggestions on how to implement such a scenario efficiently? Has anyone implemented something similar and can point out possible performance issues or other critical issues to be considered?

Yes, and yes. With the current plugin system you can create a new dispatcher plugin, and then add other necessary plugins as import elements. This way they will be accessible from the same classloader, so that you can instantiate them directly in your dispatcher plugin. As for the lookup ... many solutions are possible. DB connections from map tasks may be problematic, both because of latency and the cost of setting up so many DB connections. OTOH, if you add local caching (using JCS or Ehcache) the hit/miss ratio should be decent enough. If the mapping of host names to plugins can be expressed by rules then maybe a simple rule set would be enough.

-- Best regards, Andrzej Bialecki
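The dispatch-plus-cache idea above can be sketched in a standalone way: look up a host-specific extraction rule (here just "which heading tag holds the title"), caching lookups so the slow backing store (database or config file) is hit only once per host. All names are hypothetical, not part of the Nutch plugin API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostDispatcher {
    // Cache of host -> extraction rule, so the expensive lookup below runs
    // at most once per host per task (the JCS/Ehcache role in the reply).
    private final Map<String, String> ruleCache = new ConcurrentHashMap<>();

    // Stand-in for an expensive DB or config-file lookup.
    private String loadRule(String host) {
        if (host.equals("host1.example.com")) return "h1";
        if (host.equals("host2.example.com")) return "h2";
        return "title"; // default rule for unknown hosts
    }

    public String ruleFor(String host) {
        return ruleCache.computeIfAbsent(host, this::loadRule);
    }
}
```

A real dispatcher plugin would then instantiate and invoke the host-specific parser (imported into the same classloader, as described above) according to the cached rule.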
Re: Meaning of ProtocolStatus.ACCESS_DENIED
Otis Gospodnetic wrote: I don't know of an elegant way, but if you want to hack Nutch sources, you could set its refetch time to some point in time veeey far in the future, for example. Or introduce additional status. This won't work, because the pages will be checked again after a maximum.fetch.interval. Pages that return ACCESS_DENIED may do so only for some time, so Nutch needs to check their status periodically. In a sense, no page is ever truly GONE, if only for the reason that we somehow need to represent nonexistent targets of stale links - if we removed these URLs from the db they would be soon rediscovered and added again. The gory details of maximum.fetch.interval follow .. Nutch periodically checks the status of all pages in CrawlDb, no matter what their state, including GONE, ACCESS_DENIED, ROBOTS_DENIED, etc. If you use some adaptive re-fetch strategy (AdaptiveFetchSchedule) then the re-fetch interval will be set at maximum value in a few cycles, so the checking won't occur too often. You may be tempted to set this to infinity, i.e. to never check these URLs again. However, the purpose of having a specific value for maximum refetch interval is to be able to phase out old segments, so that you can be sure that you can delete old segments after N days, because all their pages have been surely scheduled for refetching and will be found in a newer segment. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch updatedb Crash
MoD wrote: Julien, I did try with 2048M per task child, no luck - I still have two reduces that don't go through. Is it somehow related to the number of reduces? On this cluster I have 4 servers: - dual Xeon dual core (8 cores) - 8 GB RAM - 4 disks I set mapred.reduce.tasks and mapred.map.tasks to 16, because: 4 servers with 4 disks (what do you think?). Maybe this job is too big for my cluster - could adding reduce tasks subdivide the problem into smaller reduces? Indeed I think not, because I guess the input key is per domain? So my two last reduce tasks are the biggest domains of my DB?

This is likely caused by a large number of inlinks for certain urls - the updatedb reduce collects this list in memory, and this sometimes leads to memory exhaustion. Please try limiting the max. number of inlinks per url (see nutch-default.xml for details).

-- Best regards, Andrzej Bialecki
Re: Nutch.SIGNATURE_KEY
Paul Tomblin wrote: On Wed, Aug 19, 2009 at 1:00 PM, Ken Kruglerkkrugler_li...@transpac.com wrote: Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? I believe the hashes would be the same for either raw MD5 or text signature, yes. So on the search side these would get collapsed. Don't know about what else you mean as far as same page - e.g. one entry in the CrawlDB? If so, then somebody else with more up-to-date knowledge of Nutch would need to chime in here. Older versions of Nutch would still have these as separate entries, FWIR. Actually, I just checked some of my own pages, and http://xcski.com/ and http://xcski.com/index.html have different signatures, in spite of them being the same page. So I guess the answer to that is no, even if there were logic to make them the same page in CrawlDB, it wouldn't work. There is nothing magic about the process of calculating a signature - eg. MD5Signature just takes Content.getContent() (array of bytes) and runs it through MD5. So if you get different MD5 values, then your content was indeed different (even if it was only an advertisement link somewhere on the page). You could use urlnormalizer to collapse www.example.com/ and www.example.com/index.html into a single entry, in fact there is a commented-out rule like that in urlnormalizer config file. But as you observed above, there may be cases when these two are not really the same page, so you need to be careful ... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
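The point above can be demonstrated directly: an MD5Signature-style hash is purely a function of the raw content bytes, so pages that differ by even a single byte (a rotating ad link, a timestamp) yield different signatures. This is a standalone sketch, not the actual Nutch MD5Signature class:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureDemo {
    // Hash raw content bytes the way MD5Signature does: MD5 over the bytes.
    static String md5hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] a = "<html><body>same page</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] b = "<html><body>same page </body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5hex(a)); // differs from the next line...
        System.out.println(md5hex(b)); // ...although only one byte changed
    }
}
```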
Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote: Me again, I forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) into a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile and you can perform the wget on this db and archive the files

I'd argue with this advice. The goal here is to obtain the HTML pages. If you have crawled them, then why do it again? You already have their content locally. However, page content is NOT stored in crawldb, it's stored in segments. So you need to dump the content from segments, and not the content of crawldb. The command 'bin/nutch readseg -dump segmentName output' should do the trick.

-- Best regards, Andrzej Bialecki
Re: how to upgrade a java application with nutch?
Jaime Martín wrote: Hi! I've a java application that I would like to upgrade with nutch. What jars should I add to my application's lib to make it possible to use nutch features from some of my app pages and business logic classes? I've tried with the nutch-1.0.jar generated by the war target, without success. I wonder what is the proper nutch build.xml target I should execute for this, and which of the generated jars are to be included in my app. Maybe apart from nutch-1.0.jar all the nutch-1.0\lib jars are compulsory, or just a few of them? thanks in advance!

Nutch is not designed for embedding in other applications, so you may face numerous problems. I did such an integration once, and it was far from obvious. A lot also depends on whether you want to run it on a distributed cluster or in a single JVM (local mode). Take a look at build/nutch*.job, it's a jar file that contains all dependencies needed to run Nutch except for the Hadoop libraries (which are also required).

-- Best regards, Andrzej Bialecki
Re: Nutch randomly skipping locations during crawl
tsmori wrote: This is strange. I manage the webservers for a large university library. On our site we have a staff directory where each user has a location for information. The URLs take the form of: http://mydomain.edu/staff/userid I've added the staff URL to the urls seed file. But even with a crawl set to depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems to only fetch about 50% of the locations in this area of the site. What should I look for to find out why this is happening? * Check that the pages there are not forbidden by robot rules (which may be embedded inside HTML meta tags of index.html, or the top-level robots.txt). * check that your crawldb actually contains entries for these pages - perhaps they are being filtered out. * check your segments whether these URLs were scheduled for fetching, and if so, then what was the status of fetching. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote: hi, but how to dump the content? I tried this command: ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto and it said: Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate but the crawl_generate is in this path: /usr/local/nutch-1.0/crawl/segments/20091001120102 and not in this one: /usr/local/nutch-1.0/crawl/segments/20091001120102/content can you please just give me the correct command?

This command will dump just the content part (note that the argument is the segment directory itself, not its content/ subdirectory): ./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

-- Best regards, Andrzej Bialecki
Re: Nutch randomly skipping locations during crawl
tsmori wrote: Both good ideas. Unfortunately, the content for each user is the same. It's a static php file that simply calls information out of our LDAP. It's very strange because I cannot see any difference between the user files/directories that are fetched and those that aren't. In checking both the crawl log and the hadoop log, the missing users are not even fetched. Check the segment's crawl_generate and crawl_fetch, and also check your crawldb for status. Logs don't always contain this information. The issue seems to be that they're not fetched and there's no indication in the logs why they aren't. See above. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Targeting Specific Links for Crawling
Eric wrote: Does anyone know if it possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality but I don't know where to start. URLFilter plugins may be what you want. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Eric wrote: My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's then crawl the links generated from the TLD's in increments of 100K? Yes. Make sure that you have the generate.update.db property set to true, and then generate 16 segments each having 100k urls. After you finish generating them, then you can start fetching. Similarly, you can do the same for the next level, only you will have to generate more segments. This could be done much simpler with a modified Generator that outputs multiple segments from one job, but it's not implemented yet. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Eric wrote: Andrzej, Just to make sure I have this straight, set the generate.update.db property to true then bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times? Yes. When this property is set to true, then each fetchlist will be different, because the records for those pages that are already on another fetchlist will be temporarily locked. Please note that this lock holds only for 1 week, so you need to fetch all segments within one week from generating them. You can fetch and updatedb in arbitrary order, so once you fetched some segments you can run the parsing and updatedb just from these segments, without waiting for all 16 segments to be processed. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
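Scripted, the generate-then-fetch cycle described above might look like the following sketch (paths and the -topN value are illustrative, and generate.update.db must be set to true in nutch-site.xml; in Nutch 1.0 fetching with parsing enabled makes the separate parse step unnecessary):

```sh
# Generate 16 fetchlists; with generate.update.db=true each one locks its
# records, so every segment gets different urls.
for i in $(seq 1 16); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done

# Fetch and updatedb each segment within a week of generating it, in any order:
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch updatedb crawl/crawldb "$seg"
done
```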
Re: Targeting Specific Links
Eric Osgood wrote: Is there a way to inspect the list of links that nutch finds per page and then at that point choose which links I want to include / exclude? That is the ideal remedy to my problem.

Yes, look at ParseOutputFormat, you can make this decision there. There are two standard extension points where you can hook up - URLFilters and ScoringFilters. Please note that if you use URLFilters to filter out URL-s too early then they will be rediscovered again and again. A better method to handle this, but also more complicated, is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.

-- Best regards, Andrzej Bialecki
Re: Targeting Specific Links
Eric Osgood wrote: Andrzej, How would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(), that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching. And you would put the flag in CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks(). Maybe this explanation can shed some light: Ideally, I would like to check the list of links for each page, but still needing a total of X links per page; if I find the links I want, I add them to the list up until X, and if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets. As I said, you can also use just plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no decision that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter.

-- Best regards, Andrzej Bialecki
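The selection policy Eric describes (prefer relevant outlinks up to a per-page cap X, then top up with other links) can be sketched standalone. In Nutch this logic would live in the collection manipulation inside ScoringFilter.distributeScoreToOutlinks(); the class and method names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkSelector {
    // Pick up to x outlinks: relevant ones first, then fill remaining slots
    // with whatever else the page offers.
    static List<String> select(List<String> outlinks, Predicate<String> relevant, int x) {
        List<String> picked = new ArrayList<>();
        for (String url : outlinks)                        // first pass: relevant links
            if (picked.size() < x && relevant.test(url)) picked.add(url);
        for (String url : outlinks)                        // second pass: top up to x
            if (picked.size() < x && !picked.contains(url)) picked.add(url);
        return picked;
    }
}
```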
Re: indexing just certain content
BELLINI ADAM wrote: Hi, thanks for your detailed answer... you saved me a lot of time, I was about to start creating an HTML tag filter class. Maybe I can create my own HTML parser, as I do for parsing and indexing DublinCore metadata... it sounds possible, don't you think so? I just have to create, or to find, a class which could filter an HTML page and delete certain tags from it.

Guys, please take a look at how HtmlParseFilters are implemented - for example the creativecommons plugin. I believe that's exactly the functionality that you are looking for.

-- Best regards, Andrzej Bialecki
Re: indexing just certain content
MilleBii wrote: Andrzej, The use case you are thinking of is: at the parsing stage, filter out garbage content and index only the rest. I have a different use case: I want to keep everything as in standard indexing _AND_ also extract a part to be indexed in a dedicated field (which will be boosted at search time). In a document, certain parts have more importance than others in my case. So I would like either 1. to access the html representation at indexing time... not possible, or I did not find out how 2. to create a dual representation of the document: plain standard, plus the filtered document. I think option 2 is much better because it better fits the model and allows for a lot of different other use cases.

Actually, creativecommons provides hints how to do this .. but to be more explicit: * in your HtmlParseFilter you need to extract from the DOM tree the parts that you want, and put them inside ParseData.metadata. This way you will preserve both the original text, and the special parts that you extracted. * in your IndexingFilter you retrieve the parts from ParseData.metadata and add them as additional index fields (don't forget to specify indexing backend options). * in your QueryFilter plugin.xml you declare that QueryParser should pass your special fields without treating them as terms, and in the implementation you create a BooleanClause to be added to the translated query.

-- Best regards, Andrzej Bialecki
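The first step above (extract a special part while keeping the full text) can be illustrated standalone: pull the first h1 out of the HTML into a metadata map, the way an HtmlParseFilter would stash it in ParseData.metadata for an IndexingFilter to pick up later. The class name, the metadata key, and the regex-based extraction (a real filter would walk the DOM tree) are all illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartExtractor {
    static final Pattern H1 = Pattern.compile("<h1>(.*?)</h1>", Pattern.CASE_INSENSITIVE);

    // Returns the "special parts" found in the page; the full text is left
    // untouched elsewhere, so both representations survive.
    static Map<String, String> extract(String html) {
        Map<String, String> metadata = new HashMap<>();
        Matcher m = H1.matcher(html);
        if (m.find()) metadata.put("special.title", m.group(1));
        return metadata;
    }
}
```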
Re: How to ignore search results that don't have related keywords in main body?
winz wrote: Venkateshprasanna wrote: Hi, You can very well think of doing that if you know that you would crawl and index only a selected set of web pages, which follow the same design. Otherwise, it would turn out to be a never ending process - i.e., finding out the sections, frames, divs, spans, css classes and the likes - from each of the web pages. Scalability would obviously be an issue. Hi, Could I please know how we can ignore template items like header, footer and menu/navigations while crawling and indexing pages which follow the same design?? I'm using a content management system called Infoglue to develop my website. A standard template is applied for all the pages on the website. The search results from Nutch shows content from menu/navigation bar multiple times. I need to get rid of menu/navigation content from the search result. If all you index is this particular site, then you know the positions of navigation items, right? Then you can remove these elements in your HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these elements. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: How to ignore search results that don't have related keywords in main body?
BELLINI ADAM wrote: hi guyes it's just what im talking about in my post 'indexing just certain content'... you can read it mabe it could help you... i was asking how to get rid of the garbage sections in a document and to parse only the important data...so i guess you will create your own parser and indexer...but the problem is how could we delete those garbage section from an html...try to read my post...mabe we can gather our two posts...i dont know if we can gather posts on thsi mailing list...to keep tracking only one post... What is garbage? Can you define it in terms of regex pattern or XPath expression that points to specific elements in DOM tree? If you crawl a single (or few) sites with well defined templates then you can hardcode some rules for removing unwanted parts of the page. If you can't do this, then there are some heuristic methods to solve this. There are two groups of methods: * page at a time (local): this group of methods considers only the current page that you analyze. The quality of filtering is usually limited. * groups of pages (e.g. per site): these methods consider many pages at a time, and try to find recurring theme among them. Since you first need to accumulate some pages it can't be done on the fly, i.e. this requires a separate post-processing step. The easiest to implement in Nutch is the first approach (page at a time). There are many possible implementations - e.g. based on text patterns, on visual position of elements, on DOM tree patterns, on block of content characteristics, etc. Here's for example a simple method: * collect text from the page in blocks, where each block fits within structural tags (div and table tags). Collect also the number of a links in each block. * remove a percentage of the smallest blocks, where link number is high - these are likely navigational elements. * reconstruct the whole page from the remaining blocks. -- Best regards, Andrzej Bialecki ___. 
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
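A minimal, self-contained sketch of that page-at-a-time method; the thresholds and the regex-based tag handling are arbitrary simplifications (a real implementation would walk the DOM tree):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the "page at a time" heuristic described above: split the page
 * into blocks at structural tags, then drop small blocks that are link-heavy
 * (likely navigation), and reconstruct the page from the remaining blocks.
 */
public class BlockFilter {
  // Block boundaries: div and table tags (a real implementation walks the DOM).
  private static final Pattern BLOCK_BOUNDARY =
      Pattern.compile("(?i)</?(?:div|table)[^>]*>");
  private static final Pattern LINK = Pattern.compile("(?i)<a\\s[^>]*href");

  public static String filter(String html, int minChars, int maxLinks) {
    StringBuilder kept = new StringBuilder();
    for (String block : BLOCK_BOUNDARY.split(html)) {
      String text = block.replaceAll("<[^>]+>", " ").trim();
      // Drop a block only when it is both short and link-heavy.
      if (text.length() >= minChars || countLinks(block) < maxLinks) {
        if (!text.isEmpty()) kept.append(text).append('\n');
      }
    }
    return kept.toString();
  }

  private static int countLinks(String block) {
    Matcher m = LINK.matcher(block);
    int n = 0;
    while (m.find()) n++;
    return n;
  }

  public static void main(String[] args) {
    String page = "<div><a href=a>Home</a> <a href=b>About</a></div>"
        + "<div>This is the main article text, long enough to keep as content.</div>";
    // Only the second (content) block survives.
    System.out.println(filter(page, 40, 2));
  }
}
```

The cut-off values (minimum text length, maximum link count) would need tuning per corpus.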
Re: Incremental Whole Web Crawling
Eric Osgood wrote: Ok, I think I am on the right track now, but just to be sure: the code I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct? No, you need the trunk from here: http://svn.apache.org/repos/asf/lucene/nutch/trunk -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Incremental Whole Web Crawling
Eric Osgood wrote: So the trunk contains the most recent nightly update? It's the other way around - nightly build is created from a snapshot of the trunk. The trunk is always the most recent. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: http keep alive
Marko Bauhardt wrote: hi. Is there a way to use http-keep-alive with Nutch? Does protocol-http or protocol-httpclient support keep-alive? I can't find any use of http-keep-alive in the code or in the configuration files.

protocol-httpclient can support keep-alive. However, I think that it won't help you much. Please consider that the Fetcher needs to wait some time between requests, and in the meantime it will issue requests to other sites. This means that if you want to use keep-alive connections, then the number of open connections will climb quickly, depending on the number of unique sites on your fetchlist, until you run out of available sockets. On the other hand, if the number of unique sites is small, then most of the time the Fetcher will wait anyway, so the benefit from keep-alives (for you as a client) will be small - though there will still be some benefit for the server side. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Enterprise
Dennis Kubes wrote: Depending on what you want to do, Solr may be a better choice as an Enterprise search server. If you need crawling you can use Nutch or attach a different crawler to Solr. If you want to do more full-web type search, then Nutch is a better option. What are your requirements? Dennis

fredericoagent wrote: Does anybody have any information on using Nutch as Enterprise search, and what would I need? Is it just a case of the current nutch package or do you need other addons? And how does that compare against Google Enterprise? thanks

I agree with Dennis - use Nutch if you need to do larger-scale discovery, such as when you crawl the web, but if you already know all target pages in advance then Solr will be a much better (and much easier to handle) platform. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException
Jesse Hires wrote: Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on? 2009-10-16 17:02:43,061 ERROR datanode.DataNode - DatanodeRegistration(192.168.1.7:50010, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):DataXceiver org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_90983736382565_3277 is valid, and cannot be written to. Are you sure you are running a single datanode process per machine? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: How to run a complete crawl?
Vincent155 wrote: I have a virtual machine running (VMware 1.0.7). Both host and guest run Fedora 10. In the virtual machine, I have Nutch installed. I can index directories on my host as if they are websites. Now I want to compare Nutch with another search engine. For that, I want to index some 2,500 files in a directory. But when I execute a command like crawl urls -dir crawl.test -depth 3 -topN 2500, or leave out the -topN option, still only some 50 to 75 files are indexed.

Check in your nutch-site.xml what the value of db.max.outlinks.per.page is; the default is 100. When crawling filesystems each file in a directory is treated as an outlink, and this limit is then applied. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
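For example, raising the limit in nutch-site.xml could look like this (the value 5000 is an arbitrary choice that comfortably covers the ~2,500 files mentioned above):

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
  <description>Maximum number of outlinks processed per page.</description>
</property>
```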
Re: Extending HTML Parser to create subpage index documents
malcolm smith wrote: I am looking to create a parser for a groupware product that would read pages from a message-board type web site (think phpBB). But rather than creating a single Content item which is parsed and indexed to a single Lucene document, I am planning to have the parser create a master document (for the original post) and an additional document for each reply item. I've reviewed the code for protocol plugins, parser plugins and indexing plugins, but each interface allows for a single document or content object to be passed around. Am I missing something simple? My best bet at the moment is to implement some kind of new fake protocol for the reply items: I would use the http client plugin for the first request to the page and generate outlinks such as fakereplyto://originalurl/reply1, fakereplyto://originalurl/reply2 to go back through and fetch the sub-page content. But this seems round-about and would probably generate an http request for each reply on the original page. Perhaps there is a way to look up the original page in the segment db before requesting it again. Needless to say, it would seem more straightforward to tackle this in some kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least, conceptually a plugin for the indexer might be able to take a set of custom metadata for a replies collection and index it as separate Lucene documents - but I can't find a way to do this given the interfaces in the indexer plugins. Thanks in advance, Malcolm Smith

What version of Nutch are you using? This should already be possible using the 1.0 release or a nightly build. ParseResult (which is what parsers produce) can hold multiple Parse objects, each with its own URL.
The common approach to handling whole-part relationships (like zip/tar archives, RSS, and other compound docs) is to split them in the parser and parse each part, then give each sub-document its own URL (e.g. file.tar!myfile.txt) and add the original URL in the metadata, to keep track of the parent URL. The rest should be handled automatically, although there are some other complications that need to be handled as well (e.g. don't recrawl sub-documents). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
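A toy illustration of that whole-part convention - the class name and the metadata key below are made up for the example, not the actual Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustration of the whole-part convention described above: each
 * sub-document gets its own URL derived from the container URL, and the
 * parent URL is stored in the sub-document's metadata.
 */
public class SubDocument {
  public final String url;
  public final Map<String, String> metadata = new HashMap<String, String>();

  public SubDocument(String containerUrl, String partName) {
    this.url = containerUrl + "!" + partName;         // e.g. file.tar!myfile.txt
    this.metadata.put("container.url", containerUrl); // keep track of the parent
  }
}
```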
Re: ERROR: current leaseholder is trying to recreate file.
Eric Osgood wrote: This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4 node hadoop cluster running nutch 1.0. org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file. Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692 Apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem. (the patch will be applied to trunk anyway, since others confirmed that it fixes this issue). Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a nutch hadoop combo? It is. This problem is rare - I think I crawled cumulatively ~500mln pages in various configs and it didn't occur to me personally. It requires a few things to go wrong (see the issue comments). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Accessing an Index from a shared location
JusteAvantToi wrote: Hi all, I am new to using Nutch and I found that Nutch is really good. I have a problem and hope somebody can shed some light. I have built an index and a web application that makes use of that index. I plan to have two web application servers running the application. Since I do not want to replicate the application and the index on each web application server, I put the application and the index on a shared location and configured nutch-site.xml as follows:

<property>
  <name>searcher.dir</name>
  <value>\\111.111.111.111\folder\index</value>
  <description>Path to root of crawl</description>
</property>
<property>
  <name>plugin.folders</name>
  <value>\\111.111.111.111\folder\plugins</value>
  <description></description>
</property>

However, it seems that my application cannot find the index. I have checked that the web application servers have access to the shared location. Is there something that I missed here? Does Nutch allow us to put the index on a network location?

UNC paths are not supported in Java - you need to mount this location as a local volume. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
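For example, once the share is mounted as a local volume (the drive letter below is hypothetical), the same property would point at the mounted path instead of the UNC path:

```xml
<property>
  <name>searcher.dir</name>
  <value>Z:/folder/index</value>
</property>
```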
Re: Targeting Specific Links
Eric Osgood wrote: Andrzej, based on what you suggested below, I have begun to write my own scoring plugin: Great! In distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag to the metaData in parseData (KEEP, true). How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

The flag should have been automagically added to the target CrawlDatum metadata after you have updated your crawldb (see the details in CrawlDbReducer). Then in generatorSortValue() you can check for the presence of this flag by using datum.getMetaData(). BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any special way ... I thought it did. It's easy to add this, though - in Generator.java:161 just add this:

if (sort == Float.MIN_VALUE) { return; }

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Deleting stale URLs from Nutch/Solr
Gora Mohanty wrote: Hi, We are using Nutch to crawl an internal site, and index content to Solr. The issue is that the site is run through a CMS, and occasionally pages are deleted, so that the corresponding URLs become invalid. Is there any way that Nutch can discover stale URLs during recrawls, or is the only solution a completely fresh crawl? Also, is it possible to have Nutch automatically remove such stale content from Solr? I am stumped by this problem, and would appreciate any pointers, or even thoughts on this. Hi, Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If you really want to remove them from CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). Now when it comes to removing them from Solr ... The simplest (no coding) way would be to dump the CrawlDb, use some scripting tools to collect just the URL-s with the status GONE, and send them as a delete command to Solr. A slightly more involved solution would be to implement a tool that reads such URLs directly from CrawlDb (using e.g. CrawlDbReader API) and then uses SolrJ API to send the same delete requests + commit. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
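The scripted route described above could be sketched along these lines; the tab-separated "URL&lt;TAB&gt;status" dump format and the status string are simplified assumptions, and a real tool would use the CrawlDbReader and SolrJ APIs instead of string handling:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "no coding" route described above: read CrawlDb dump lines,
 * pick out the URLs whose status is GONE, and build a Solr delete command.
 */
public class StaleUrlCleaner {
  public static List<String> goneUrls(List<String> dumpLines) {
    List<String> gone = new ArrayList<String>();
    for (String line : dumpLines) {
      String[] parts = line.split("\t");
      // Assumed dump format: URL, tab, status string (e.g. "db_gone").
      if (parts.length == 2 && parts[1].contains("db_gone")) gone.add(parts[0]);
    }
    return gone;
  }

  /** Builds the XML delete command to send to Solr's update handler. */
  public static String deleteCommand(List<String> urls) {
    StringBuilder sb = new StringBuilder("<delete>");
    for (String url : urls) sb.append("<id>").append(url).append("</id>");
    return sb.append("</delete>").toString();
  }
}
```

The resulting command would be POSTed to Solr's update handler, followed by a commit.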
Re: Deleting stale URLs from Nutch/Solr
Gora Mohanty wrote: On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If you really want to remove them from CrawlDb you can filter them out (using CrawlDbMerger with just one input db, and setting your URLFilters appropriately). [...] Thank you for your help. Your suggestions look promising, but I think that I did not make myself adequately clear. Once we have completed a site crawl with Nutch, ideally I would like to be able to find stale links without doing a complete recrawl, i.e., only through restarting the crawl from where it last left off. Is that possible? I tried a simple test on a local webserver with five pages in a three-level hierarchy. The crawl completes, and discovers all five URLs as expected. Now, I remove a tertiary page. Ideally, I would like to be able to run a recrawl, and have Nutch discover the now-missing URL. However, when I try that, it finds no new links, and exits.

I assume you mean that the generate step produces no new URL-s to fetch? That's expected, because they become eligible for re-fetching only after Nutch considers them expired, i.e. after the fetchTime + fetchInterval, and the default fetchInterval is 30 days. You can pretend that the time moved on using the -adddays parameter. Then Nutch will generate a new fetchlist, and when it discovers that the page is missing it will mark it as gone - actually, you could then take that information directly from the Nutch segment, and instead of processing the CrawlDb you could process the segment to collect a partial list of gone pages. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: How to index files only with specific type
Dmitriy Fundak wrote: If I disable the html parser (remove parse-html from the plugin.includes property), html files don't get parsed, so I don't get outlinks to kml files from html pages, and so I can't parse and index kml files. I might not be right, but I have a feeling that it's not possible without modifying source code.

It's possible to do this with a custom indexing filter - see other indexing filters to get a feeling of what's involved. Or you could do this with a scoring filter, too, although the scoring API looks more complicated. Either way, when you execute the Indexer, these filters are run in a chain, and if one of them returns null then that document is discarded, i.e. it's not added to the output index. So, it's easy to examine the content type (or just the URL of the document) in your indexing filter and either pass the document on or reject it by returning null. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
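The pass-or-reject behavior described here can be modeled in miniature; this is not the real Nutch IndexingFilter interface (which receives the document, parse data and URL), just the chain-with-null-discard pattern, here keeping only .kml URLs:

```java
import java.util.List;

/**
 * Miniature model of the indexing-filter chain described above: filters run
 * in sequence, and a null return discards the document from the index.
 */
public class FilterChainDemo {
  interface Filter { String filter(String url); } // stand-in for IndexingFilter

  // Example filter: pass .kml documents through, reject everything else.
  static final Filter KML_ONLY =
      new Filter() {
        public String filter(String url) {
          return url.endsWith(".kml") ? url : null;
        }
      };

  /** Returns null as soon as any filter in the chain rejects the document. */
  public static String run(List<Filter> chain, String url) {
    for (Filter f : chain) {
      url = f.filter(url);
      if (url == null) return null; // document discarded, not indexed
    }
    return url;
  }
}
```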
Re: Nutch indexes less pages, then it fetches
caezar wrote: Some more information. Debugging the reduce method, I've noticed that before the code if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) { return; // only have inlinks } my page has fetchDatum, parseText and parseData not null, but dbDatum is null. That's why it's skipped :) Any ideas about the reason?

Yes - you should run updatedb with this segment, and also run invertlinks with this segment, _before_ trying to index. Otherwise the db status won't be updated properly. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: unbalanced fetching
Jesse Hires wrote: I have a two datanode and one namenode setup. One of my datanodes is slower than the other, causing the fetch to run significantly longer on it. Is there a way to balance this out? Most likely the number of URLs/host is unbalanced, meaning that the tasktracker that takes the longest is assigned a lot of URLs from a single host. A workaround for this is to limit the max number of URLs per host (in nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever works best for you. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
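In nutch-site.xml, that workaround might look like the following (the property name matches the generator limit as of Nutch 1.0; the value is just an example):

```xml
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
  <description>Maximum number of URLs per host in a single fetchlist.</description>
</property>
```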
Re: updatedb is talking long long time
Kalaimathan Mahenthiran wrote: I forgot to add the detail... The segment I'm trying to do updatedb on has 1.3 million urls fetched and 1.08 million urls parsed... Any help related to this would be appreciated... On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran matha...@gmail.com wrote: hi everyone, I'm using nutch 1.0. I have fetched successfully and am currently on the updatedb process. I'm doing updatedb and it's taking so long. I don't know why it's taking this long. I have a new machine with a quad core processor and 8 GB of RAM. I believe this system is really good in terms of processing power. I don't think processing power is the problem here. I noticed that all the RAM is getting used up, close to 7.7 GB, by the updatedb process. The computer is becoming really slow. The updatedb process has been running for the last 19 days continually with the message merging segment data into db.. Does anyone know why it's taking so long... Is there any configuration setting I can change to increase the speed of the updatedb process...

First, this process normally takes just a few minutes, depending on the hardware, and not several days - so something is wrong.

* do you run this in local or pseudo-distributed mode (i.e. running a real jobtracker and tasktracker)? Try the pseudo-distributed mode, because then you can monitor the progress in the web UI.
* how many reduce tasks do you have? With large updates it helps if you run more than 1 reducer, to split the final sorting.
* if the task appears to be completely stuck, please generate a thread dump (kill -SIGQUIT) and see where it's stuck. This could be related to urlfilter-regex or urlnormalizer-regex - you can identify whether these are problematic by removing them from the config and re-running the operation.
* minor issue - when specifying the path names of segments and crawldb, do NOT append the trailing slash - it's not harmful in this particular case, but you could have a nasty surprise when doing e.g. copy / mv operations ... 
-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: including code between plugins
Eran Zinman wrote: Hi, I've written my own plugin that's doing some custom parsing. I needed language parsing in that plugin, and the language-identifier plugin is working great for my needs. However, I can't use the language identifier plugin as it is, since I want to parse only a small portion of the webpage. I've used the language identifier functions and it worked great in Eclipse, but when I try to compile my plugin I'm unable to, since it depends on the language-identifier source code. My question is - how can I include the language identifier code in my plugin code without actually using the language-identifier plugin?

You need to add the language-identifier plugin to the requires section in your plugin.xml, like this:

<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="language-identifier"/>
</requires>

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: could you unsubscribe me from this mailing list pls. tks
Nico Sabbi wrote: On Mon, 02/11/2009 at 10:04 +0100, Heiko Dietze wrote: Hello, there is no Administrator, but you can do the unsubscribe yourself. On the Nutch Mailing-List information site http://lucene.apache.org/nutch/mailing_lists.html you can find the following E-Mail address: nutch-user-unsubscr...@lucene.apache.org Then your unsubscribe requests should work. regards, Heiko Dietze doesn't work, as reported by me and others last week. Thanks,

Did you get the message with the subject confirm unsubscribe from nutch-user@lucene.apache.org, and did you respond to it from the same email account that you were subscribed from? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)
Andrzej Bialecki wrote: doesn't work, as reported by me and others last week. Thanks, Did you get the message with the subject of confirm unsubscribe from nutch-user@lucene.apache.org and did you respond to it from the same email account that you were subscribed from? .. I just verified that this process works correctly - I subscribed and unsubscribed successfully. Please make sure that you complete the unsubscription process as listed below:

1. make sure you are sending requests from the same email address that you were subscribed from!
2. send email to nutch-user-unsubscr...@lucene.apache.org .
3. you will get a confirm unsubscribe message - make sure your anti-spam filters don't block this message, and make sure you are still using the correct email account when responding.
4. you need to reply to the confirm unsubscribe message (duh...)
5. you will get a GOODBYE message.

Now, let me understand this clearly: did you go through all 5 steps listed above, and you are still getting messages from this list? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Direct Access to Cached Data
Hugo Pinto wrote: Hello, I am using Nutch for mirroring, rather than crawling and indexing. I need to access directly the cached data in my Nutch index, but I am unable to find an easy way to do so. I browsed the documentation(wiki, javadocs, and skimmed the code), but found no straightforward way to do it. Would anyone suggest a place to look for more information, or perhaps have done this before and could share a few tips? Most likely what you need is not the Lucene index, but the segments (shards), right? There's a utility called SegmentReader (available from cmd-line as readseg), and you can use its API to retrieve either all or individual records from a segment (using URL as key). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Nutch near future - strategic directions
. We should make Nutch an attractive platform for such users, and we should discuss what this entails. Also, if we refactor Nutch in the way I described above, it will be easier for such users to contribute back to Nutch and other related projects.

3. Provide a platform for solving the really interesting issues --- Nutch has many bits and pieces that implement really smart algorithms and heuristics to solve difficult issues that occur in crawling. The problem is that they are often well hidden and poorly documented, and their interaction with the rest of the system is far from obvious. Sometimes this is related to premature performance optimizations, in other cases this is just a poorly abstracted design. Examples would include the OPIC scoring, meta-tags metadata handling, deduplication, redirection handling, etc. Even though these components are usually implemented as plugins, this lack of transparency and poor design makes it difficult to experiment with Nutch. I believe that improving this area will result in many more users contributing back to the project, both from business and from academia. And there are quite a few interesting challenges to solve:

* crawl scheduling, i.e. determining the order and composition of fetchlists to maximize the crawling speed.
* spam/junk detection (I won't go into details on this, there are tons of literature on the subject)
* crawler trap handling (e.g. the classic calendar page that generates an infinite number of pages).
* enterprise-specific ranking and scoring. This includes users' feedback (explicit and implicit, e.g. click-throughs)
* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)
* near-duplicate detection, and the closely related issue of extraction of the main content from a templated page.
* URL aliasing (e.g. www.a.com == a.com == a.com/index.html == a.com/default.asp), and what happens with inlinks to such aliased pages. 
Also related to this is the problem of temporary/permanent redirects and complete mirrors. Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an attractive platform to develop and experiment with such components. - Briefly ;) that's what comes to my mind when I think about the future of Nutch. I invite you all to share your thoughts and suggestions! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: changing/adding field in existing index
fa...@butterflycluster.net wrote: hi all, i have an existing index - we have a custom field that needs to be added or changed in every currently indexed document ; whats the best way to go about this without recreating the index again? There are ways to do it directly on the index, but this is complicated and involves hacking the low-level Lucene format. Alternatively, you could build a parallel index with just these fields, but synchronized internal docId-s, open both indexes with ParallelReader, and then create a new index using IndexWriter.addIndexes(). I suggest recreating the index. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Problems with Hadoop source
Pablo Aragón wrote: Hej, I am developing a project based on Nutch. It works great (in Eclipse), but due to new requirements I have to change the library hadoop-0.12.2-core.jar to the original source code. I successfully downloaded that code from http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. After adding it to the project in Eclipse everything seems correct, but the execution shows:

Exception in thread main java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?

Yes - when you worked with a pre-built jar it contained an embedded hadoop-default.xml that defines the FileSystem implementation for the file:// scheme. Now you probably forgot to put hadoop-default.xml on your classpath. Go to Build Path and add this file to your classpath, and all should be ok. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Hadoop question
TuxRacer69 wrote: Hi Eran, mapreduce has to store its data on an HDFS filesystem. More specifically, it needs read/write access to a shared filesystem. If you are brave enough you can use NFS, too, or any other type of filesystem that can be mounted locally on each node (e.g. a NetApp). But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts. You can run two separate clusters even on the same set of machines - just configure them to use different ports AND different local paths. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
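A sketch of how the two clusters' hadoop-site.xml files could be kept apart on the same machines - hostnames, ports and paths below are illustrative only:

```xml
<!-- cluster A (conf-a/hadoop-site.xml) -->
<property><name>fs.default.name</name><value>node1:9000</value></property>
<property><name>mapred.job.tracker</name><value>node1:9001</value></property>
<property><name>hadoop.tmp.dir</name><value>/data/cluster-a</value></property>

<!-- cluster B (conf-b/hadoop-site.xml): same machines, different ports and local paths -->
<property><name>fs.default.name</name><value>node1:9100</value></property>
<property><name>mapred.job.tracker</name><value>node1:9101</value></property>
<property><name>hadoop.tmp.dir</name><value>/data/cluster-b</value></property>
```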
Re: Synonym Filter with Nutch
Dharan Althuru wrote: Hi, We are trying to incorporate a synonym filter during indexing using Nutch. As per my understanding, Nutch doesn't have a synonym indexing plug-in by default. Can we extend IndexingFilter in Nutch to incorporate the synonym filter plug-in available in Lucene using WordNet, or a custom synonym plug-in, without any negative impacts on existing Nutch indexing (i.e., considering bigrams etc.)?

Synonym expansion should be done when the text is analyzed (using Analyzers), so you can reuse Lucene's synonym filter. Unfortunately, this happens at different stages depending on whether you use the built-in Lucene indexer or the Solr indexer. If you use the Lucene indexer, this happens in LuceneWriter, and the only way to affect it is to implement an analysis plugin, so that it's returned from AnalyzerFactory, and use your analysis plugin instead of the default one. See e.g. analysis-fr for an example of how to implement such a plugin. However, when you index to Solr you need to configure Solr's analysis chain, i.e. in your schema.xml you need to define for your fieldType that it has the synonym filter in its indexing analysis chain. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
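For the Solr route, the schema.xml fieldType could look like this sketch (the field type name, tokenizer choice and synonyms filename are illustrative):

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```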
Re: Nutch near future - strategic directions
Subhojit Roy wrote: Hi, Would it be possible to include in Nutch the ability to crawl/download a page only if the page has been updated since the last crawl? I had read some time back that there were plans to include such a feature. It would be a very useful feature to have, IMO. This of course depends on the last-modified timestamp being present on the webpage that is being crawled, which I believe is not mandatory. Still, those who do set it would benefit.

This is already implemented - see Signature / MD5Signature / TextProfileSignature. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
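The signature implementation is selectable in nutch-site.xml; for example, switching from the default MD5Signature to the fuzzier TextProfileSignature might look like this:

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```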
Re: decoding nutch readseg -dump 's output
Yves Petinot wrote: Hi, I'm trying to build a small perl (could be any scripting language) utility that takes the output of nutch readseg -dump as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little bit of experimentation, I find myself unable to decode the content field, even when I try using the various charset hints that are available either in the content metadata or in the raw content itself. I was wondering if someone on the list has already succeeded in building this type of functionality, or is the content returned by readseg using a specific encoding that I don't know of? The dump functionality is not intended to provide a bit-by-bit copy of the segment, it's mostly for debugging purposes. It uses System.out, which in turn uses the default platform encoding - any characters outside this encoding will be replaced by question marks. If you want to get an exact copy of the raw binary content then please use the SegmentReader API. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
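The question-mark replacement described above is plain Java behavior and easy to reproduce outside Nutch - a minimal sketch, not Nutch code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLoss {
    // Round-trip a string through a charset that cannot represent all of
    // its characters. Java's String.getBytes() substitutes '?' for every
    // unmappable character - exactly what happens when segment content is
    // printed via System.out on a platform whose default encoding is
    // narrower than the page's actual encoding.
    static String through(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String polish = "za\u017c\u00f3\u0142\u0107"; // "zażółć"
        // Non-ASCII letters come back as question marks:
        System.out.println(through(polish, StandardCharsets.US_ASCII)); // za????
    }
}
```

This is why the dump cannot be reliably re-decoded downstream: the information is already destroyed before the bytes reach the perl script.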
Re: Scalability for one site
Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Your Hadoop cluster does not increase the scalability of the target server, and that's the crux of the matter - whether you use Hadoop or not, multiple threads or a single thread, if you want to be polite you will be able to do just 1 req/sec and that's it. You can prioritize certain pages for fetching so that you get the most interesting pages first (whatever "interesting" means). I know that URLs from one domain are assigned to one fetch segment, and polite crawling is enforced. Should I use lower-level parts of Nutch? The built-in limits are there to avoid causing pain for inexperienced search engine operators (and the webmasters who are their victims). The source code is there; if you choose, you can modify it to bypass these restrictions, just be aware of the consequences (and don't use Nutch as your user agent ;) ). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
John Martyniak wrote: Does anybody know of any concrete plans to update Nutch to Hadoop 0.20, 0.21? Something like a Nutch 1.1 release, get in some bug fixes and get current on Hadoop? I think that should be one of the goals. My 2 cents. I'm planning to do this upgrade soon (~a week) - and I agree that we should have a 1.1 release in the near future. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch near future - strategic directions
Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does it mean plain vanilla here? Do you mean the current DB implementation? That's the idea, we should aim for an abstract layer that can accommodate both HBase and plain MapFile-s. -split into reusable components with nice and clean public api -publish mvn artifacts so developers can directly use mvn, ivy etc to pull required dependencies for their specific crawler +1, with slight preference towards ivy. My biggest concern is in execution of this (or any other) plan. Some of the changes or improvements that have been proposed are quite heavy in nature and would require large changes. I am just thinking that would it still be better to take a fresh start instead of trying to do this incrementally on top of existing code base. Well ... that's (almost) what Dogacan did with the HBase port. I agree that we should not feel too constrained by the existing code base, but it would be silly to throw everything away and start from scratch - we need to find a middle ground. The crawler-commons and Tika projects should help us to get rid of the ballast and significantly reduce the size of our code. In the history of Nutch this approach is not something new (remember map reduce?) and in my opinion it worked nicely then. Perhaps it is different this time since the changes we are discussing now have many abstract things hanging in the air, even fundamental ones. Nutch 0.7 to 0.8 reused a lot of the existing code. Of course the rewrite approach means that it will take some time before we actually get into the point where we can start adding real substance (meaning new features etc). So to summarize, I would go ahead and put together a branch nutch N.0 that would consist of (a.k.a my wish list, hope I am not being too aggressive here): -runs on top of plain hadoop See above - what do you mean by that? 
-use osgi (or some other more optimal extension mechanism that fits and is easy to use) -basic http/https crawling functionality (with db abstraction or hbase directly and smart data structures that allow flexible and efficient usage of the data) -basic solr integration for indexing/search -basic parsing with tika After the basics are ok we would start adding and promoting any of the hidden gems we might have, or some solutions for the interesting challenges. I believe that's more or less where Dogacan's port is right now, except it's not merged with the OSGI port. ps. many of the interesting challenges in your proposal seem to fall in the category of data analysis and manipulation that are mostly used after the data has been crawled, or between the fetch cycles, so many of those could be implemented in the current code base also; somehow I just feel that things could be made more efficient and understandable if the foundation (e.g. data structures, extensibility) was in better shape. Also if written nicely other projects could use them too! Definitely agree with this. Example: the PageRank package - it works quite well with the current code, but its design is obscured by the ScoringFilter api and the need to maintain its own extended DB-s. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
Dennis Kubes wrote: I would like to get a couple things in this release as well. Let me know if you want help with the upgrade. You mean you want to do the Hadoop upgrade? I won't stand in your way :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
Dennis Kubes wrote: I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :) Great, thanks! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: AbstractFetchSchedule
reinhard schwab wrote: there is some piece of code I don't understand:

public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    datum.setFetchInterval(maxInterval * 0.9f);
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false; // not time yet
  }
  return true;
}

First, concerning the segment retention - we want to enforce that pages that were not refreshed for longer than maxInterval are retried, no matter what their status is - because we want to obtain a copy of the page in a newer segment in order to be able to delete the old segment. why is the fetch time set here to curTime? Because we want to fetch it now - see the next condition, where this is checked. and why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval? Hm, indeed this looks like a bug - we should instead do this:

if (datum.getFetchInterval() > maxInterval) {
  datum.setFetchInterval(maxInterval * 0.9f);
}

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
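Putting the method and the proposed fix together, the corrected scheduling logic can be sketched standalone - the getters are inlined into plain parameters, and the 90-day maxInterval is an assumed value (in Nutch it comes from configuration), so this is an illustration rather than the actual AbstractFetchSchedule source:

```java
public class FetchScheduleSketch {
    // Segment retention period in seconds; 90 days is an assumed default,
    // configured in Nutch rather than hardcoded.
    static final int MAX_INTERVAL = 90 * 24 * 3600;

    // The fixed interval adjustment: only shrink intervals that actually
    // exceed maxInterval, leaving shorter intervals untouched.
    static float clampInterval(float fetchInterval) {
        if (fetchInterval > MAX_INTERVAL) {
            return MAX_INTERVAL * 0.9f;
        }
        return fetchInterval;
    }

    // Mirrors shouldFetch(): force a refetch when the scheduled fetch time
    // has drifted more than maxInterval past "now" (so the page lands in a
    // newer segment), otherwise fetch only once the scheduled time arrives.
    static boolean shouldFetch(long fetchTimeMs, long curTimeMs) {
        if (fetchTimeMs - curTimeMs > (long) MAX_INTERVAL * 1000L) {
            return true; // overdue beyond the retention period: fetch now
        }
        return fetchTimeMs <= curTimeMs;
    }
}
```

The `(long)` cast matters: without it, `MAX_INTERVAL * 1000` overflows int arithmetic before being compared against the millisecond difference.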
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. In my case I don't have the same URLs, but I still have the same content for those URLs. I'll give you an example: I have three urls that have the same content: 1- www.domaine/folder/ 2- www.domaine/folder/index.html 3- www.domaine/folder/index.html?lang=fr but I find all of them in my index :( I was expecting that dedup would delete 1 and 2. dedup doesn't work correctly!! Please check the value of the Signature field for all the above urls in your crawldb. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: yes, I checked the signatures and they're not the same!! it's really weird, the url www.domaine/folder/index.html?lang=fr is just this one www.domaine/folder/index.html Apparently it isn't a bit-exact replica of the page, so its MD5 hash is different. You need to use a more relaxed Signature implementation, e.g. TextProfileSignature. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: dedup dont delete duplicates !
BELLINI ADAM wrote: hi, my two urls point to the same page! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use the readseg -dump utility to retrieve the page content from the segment, extract just the two pages from the dump, and run the unix diff utility. can you tell me more about TextProfileSignature, please? how should I use it? Configure this type of signature in your nutch-site.xml - please see nutch-default.xml for instructions. Please note that you will have to re-parse segments and update the db in order to update the signatures. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
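For reference, switching the signature implementation is a single property in nutch-site.xml - a sketch along these lines (check nutch-default.xml for the exact property name and class in your version):

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Use a fuzzy, text-based signature instead of the default
  MD5Signature, so that near-identical pages hash to the same value and
  get removed by dedup.</description>
</property>
```

As noted above, existing entries keep their old signatures until the segments are re-parsed and the db is updated.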
Re: Nutch config IOException
Mischa Tuffield wrote: Hello Again, Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. <!-- 2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:176) at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:164) at org.apache.hadoop.hdfs.protocol.FSConstants.<clinit>(FSConstants.java:51) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) --> Any pointers would be great, I wonder is there a way for me to validate my conf options before I deploy nutch? This exception is innocuous - it helps to debug at which points in the code the Configuration instances are being created. And you wouldn't have seen it if you hadn't turned on DEBUG logging. ;) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: 100 fetches per second?
MilleBii wrote: I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then guess what: got ultra-high speed, 8Mbits/s sustained for 1 hour, where I could only get 3Mbit/s max before (note: bits, not bytes, as I said before). A few samples show that I was running at 50 fetches/sec ... not bad. But why this high speed on this run I haven't got the faintest idea. Then it drops and I get this kind of logs: 2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120 2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120 2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516 I don't fully understand why it is oscillating between two queue sizes, but never mind, it is likely the end of the run since hadoop shows 99.99% percent complete for the 2 maps it generated. Would that be explained by a better URL mix? I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very very low due to the limits at their end (either physical or artificial). The solution in that case would be to track a minimum avg. speed per FetchQueue, and lock out the queue if this number drops below the threshold (similarly to what we do when we discover a crawl-delay that is too high). In the meantime, you could add the number of FetchQueue-s to that diagnostic output, to see how many unique hosts are in the current working set. -- Best regards, Andrzej Bialecki ___.
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
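The per-queue speed lockout proposed above could be sketched like this - an illustrative moving-average check, not code from the Nutch Fetcher; the threshold, the smoothing factor, and the class name are all assumptions:

```java
public class QueueSpeedMonitor {
    private final double minBytesPerSec;   // lockout threshold (assumed config value)
    private double avgBytesPerSec = -1;    // exponential moving average, -1 = unset
    private static final double ALPHA = 0.3; // smoothing factor (arbitrary choice)

    QueueSpeedMonitor(double minBytesPerSec) {
        this.minBytesPerSec = minBytesPerSec;
    }

    /**
     * Record one finished fetch from this host's queue.
     * Returns true if the queue's average speed has fallen below the
     * minimum and the queue should be locked out.
     */
    boolean update(long bytes, long millis) {
        double speed = bytes * 1000.0 / Math.max(millis, 1);
        avgBytesPerSec = avgBytesPerSec < 0
            ? speed
            : ALPHA * speed + (1 - ALPHA) * avgBytesPerSec;
        return avgBytesPerSec < minBytesPerSec;
    }
}
```

The moving average keeps one slow response from locking out an otherwise healthy host, while a host that consistently trickles bytes is eventually dropped from the working set.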
Re: Broken segments ?
Mischa Tuffield wrote: Hello All, http://people.apache.org/~hossman/#threadhijack When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Encoding the content got from Fetcher
Santiago Pérez wrote: Yes, I tried setting the latin encoding Windows-1250 in that configuration file, but the value of this property does not affect the encoding of the content (I also tried with a nonexistent encoding and the result is the same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people, surely...) The appearance of the characters that you quoted in your other email indicates that the problem may be the opposite - your pages seem to use UTF-8, and you are trying to convert them using Windows-1250 ... Try putting UTF-8 in this property and see what happens. Generally speaking, pages should declare their encoding, either in HTTP headers or in meta tags, but often this declaration is either missing or completely wrong. Nutch uses the ICU4J CharsetDetector plus its own heuristics (in util.EncodingDetector and in HtmlParser) that try to detect the character encoding if it's missing or even if it's wrong - but this is a tricky issue and sometimes results are unpredictable. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: 100 fetches per second?
MilleBii wrote: Interesting updates on the current run of 450K urls : + 30 minutes @ 3Mbits/s + drop to 1Mbit/s (1/X shape) + gradual improvement to 1.5 Mbit/s and steady for 7 hours + sudden drop to 0.9 Mbits/s and steady for 4 hours + up to 1.7 Mbits for 1 hour + staircasing down to 0.5 Mbit/s by steps of 1 hour I don't know what to take as a conclusion, but it is quite strange to have those sudden variations of bandwidth, and overall it's very slow. I can post the graph if people are interested. This most likely comes from the allocation of urls to map tasks, and the maximum number of map tasks that you can run on your cluster. When tasks finish their run, you see a sudden drop in speed, until the next task starts running. Initially, I suspect, you have more tasks available than the capacity of your cluster, so it's easy to fill the slots and max out the speed. Later on, slow map tasks tend to hang around, but still some of them finish and make space for new tasks. As time goes on, the majority of your tasks become slow tasks, so the overall speed continues to drop. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: 100 fetches per second?
MilleBii wrote: You mean map/reduce tasks ??? Yes. Being in pseudo-distributed / single node I only have two maps during the fetch phase... so it would be back to the URLs distribution. Well, yes, but my explanation is still valid. Which unfortunately doesn't change the situation. Next week I will be working on integrating the patches from Julien, and if time permits I could perhaps start working on a speed monitoring to lock out slow servers. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch frozen but not exiting
Paul Tomblin wrote: My nutch crawl just stopped. The process is still there, and doesn't respond to a kill -TERM or a kill -HUP, but it hasn't written anything to the log file in the last 40 minutes. The last thing it logged was some calls to my custom url filter. Nothing has been written in the hadoop directory or the crawldir/crawldb or the segments dir in that time. How can I tell what's going on and why it's stopped? If you run in distributed / pseudo-distributed mode, you can check the status in the JobTracker UI. If you are running in local mode, then it's likely that the process is in a (single) reduce phase sorting the data - with larger jobs in local mode the sorting phase may take very long time, due to a heavy disk IO (and in disk-wait state it may be uninterruptible). Try to generate a thread dump to see what code is being executed. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch frozen but not exiting
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: -bash-3.2$ jstack -F 32507 Attaching to process ID 32507, please wait... Hm, I can't see anything obviously wrong with that thread dump. What's the CPU and swap usage, and loadavg? The process is using a lot of CPU. loadavg is up over 5. top - 15:12:19 up 22 days, 4:06, 2 users, load average: 5.01, 5.00, 4.93 Tasks: 48 total, 2 running, 45 sleeping, 0 stopped, 1 zombie Cpu(s): 1.0% us, 99.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 3170584k total, 2231700k used, 938884k free,0k buffers Swap:0k total,0k used,0k free,0k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 32507 discover 16 0 1163m 974m 8604 S 394.7 31.5 719:40.71 java Actually, the memory is a real annoyance - the hosting company doesn't give me any swap, so when hadoop does a fork/exec just to do a whoami, I have to leave as much memory free as the crawl reserves with -Xmx for itself. Hm, the curious thing here is that the java process is sleeping, and 99% of cpu is in system time ... usually this would indicate swapping, but since there is no swap in your setup I'm stumped. Still, this may be related to the weird memory/swap setup on that machine - try decreasing the heap size and see what happens. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: odd warnings
Jesse Hires wrote: What is segments.gen and segments_2 ? The warning I am getting happens when I dedup two indexes. I create index1 and index2 through generate/fetch/index/...etc index1 is an index of 1/2 the segments. index2 is an index of the other 1/2 The warning is happening on both datanodes. The command I am running is bin/nutch dedup crawl/index1 crawl/index2 If segments.gen and segments_2 are supposed to be directories, then why are they created as files? They are created as files from the start bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX crawl/segments/YYY I don't see any errors or warnings about creating the index. The command that you quote above produces multiple partial indexes, located in crawl/index1/part-N subdirectories, and only in these subdirectories can the Lucene indexes (to which the segments.gen and segments_N files belong) be found. However, the deduplication process doesn't accept partial indexes, so you need to specify each part-N dir as an input to dedup. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: org.apache.hadoop.util.DiskChecker$DiskErrorExceptio
BELLINI ADAM wrote: hi, i have this error when crawling org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out Most likely you ran out of tmp disk space. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: How does generate work ?
MilleBii wrote: Oops, continuing previous mail. So I wonder if there would be a better 'generate' algorithm which would maintain a constant ratio of hosts per 100 urls ... Below a certain threshold it stops, or better, starts including URLs of lower scores. That's exactly how the max.urls.per.host limit works. Using scores is de-optimizing the fetching process... Having said that, I should first read the code and try to understand it. That wouldn't hurt in any case ;) There is also a method in ScoringFilter-s (e.g. the default scoring-opic) where it determines the priority of a URL during generation. See ScoringFilter.generatorSortValue(..); you can modify this method in scoring-opic (or in your own scoring filter) to prioritize certain urls over others. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
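The kind of prioritization meant here can be sketched as a standalone sort-value function - a simplified stand-in for a real ScoringFilter plugin, where both the URL pattern and the boost factor are arbitrary illustrations:

```java
public class GenerateSortSketch {
    // Boost the sort value of URLs we care about so the generator selects
    // them ahead of others with the same base score. The ".html" pattern
    // and the 2x multiplier are hypothetical choices, not Nutch defaults.
    static float sortValue(String url, float initScore) {
        if (url.endsWith(".html")) {
            return initScore * 2.0f; // prefer plain pages in this sketch
        }
        return initScore;
    }
}
```

In an actual scoring filter the same idea would live in the generator sort-value hook, so the generate step orders candidates by the adjusted score rather than the raw one.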
Re: Nutch 1.0 wml plugin
yangfeng wrote: I have completed a plugin for parsing WML (Wireless Markup Language). I hope to add it to Lucene - what should I do? The best long-term option would be to submit this work to the Tika project - see http://lucene.apache.org/tika/. If you have already implemented this as a Nutch plugin, please create a JIRA issue in Nutch, and attach the patch. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Hadoop 0.20 - Exception
Eran Zinman wrote: Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: NOINDEX, NOFOLLOW
On 2009-12-10 20:33, Kirby Bohling wrote: On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM mbel...@msn.com wrote: hi, I have a page with <meta name="robots" content="noindex,nofollow" />, now I know that nutch obeys this tag because I don't find the content and the title in my index, but I was expecting that this document would not be present in the index at all. why does it keep the document in my index with no title and no content?? I'm using the index-basic and index-more plugins, and I want to understand why nutch is still filling in the url, date, boost etc. when it didn't do so for title and content. I was thinking that if nutch obeys nofollow and noindex it would skip the whole document! or maybe I misunderstood something, can you please explain this behavior to me? best regards. My guess is that the page is recorded to note that the page shouldn't be fetched, I'm guessing the status is one of the magic values. It probably re-fetches the page periodically to ensure it has the list. So the URL and the date make sense to me as to why they populate them. I don't know why it is computing the boost, other than the fact that it might be part of the OPIC scoring algorithm. If the scoring algorithm ever uses the scores/boost of the pages that you point at as a contributing factor, it would make total sense. So even though it doesn't index "http://example/foo/bar", knowing which pages point there, and what their scores are, could contribute to the scores of pages that you do index that contain an outlink to that page. Very good explanation, that's exactly the reason why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: domain vs www.domain?
On 2009-12-10 19:59, Jesse Hires wrote: I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting; if not, what would you recommend doing to prevent this? This is a surprisingly difficult problem to solve in the general case, because it's not always true that 'www.domain' equals 'domain'. If you do know this is true in your particular case, you can add a rule to regex-urlnormalizer that changes the matching urls to e.g. always lose the 'www.' part. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
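Such a rule in regex-normalize.xml could look like this - a hypothetical sketch, only safe under the assumption stated above, i.e. that both host forms really serve the same content:

```xml
<!-- Strip a leading "www." from host names so that
     http://www.domain.com/... and http://domain.com/...
     normalize to the same URL. -->
<regex>
  <pattern>^(https?://)www\.</pattern>
  <substitution>$1</substitution>
</regex>
```

Since normalization happens before URLs enter the crawldb, applying this to an existing crawl only prevents new duplicates; entries already stored under both forms remain until deduplicated.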
Re: Luke reading index in hdfs
On 2009-12-11 22:21, MilleBii wrote: Guys is there a way you can get Luke to read the index from hdfs:// ??? Or you have to copy it out to the local filesystem? Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x. Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1. Start Luke, dismiss the open dialog, and then go to Plugins / Hadoop, and enter the full URL of the index directory (including the hdfs:// part). You can also open multiple parts of the index (e.g. if you follow the Nutch naming convention, you can directly open the indexes/ directory that contains part-N partial indexes). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: OR support
On 2009-12-14 16:05, BrunoWL wrote: Nobody? Please, any answer would good. Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479 That's the current status, i.e. this functionality is available only as a patch. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException
On 2009-12-17 10:13, Eran Zinman wrote: Hi, I'm getting a Nutch/Hadoop exception: AlreadyBeingCreatedException on some of Nutch's parser reduce tasks. I know this is a known issue with Nutch ( https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717058#action_12717058 ) and as far as I can see that patch wasn't committed yet because we wanted to examine it on the new Hadoop 0.20 version. I am using the latest Nutch with Hadoop 0.20 and I can confirm this exception still occurs (rarely - but it does) - maybe we should commit the change? Thanks for reporting this - could you perhaps try to apply that patch and see if it helps? I hesitated to commit it because it's really a workaround and not a solution ... but if it works for you then it's better than nothing. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Large files - nutch failing to fetch
On 2009-12-21 17:15, Sundara Kaku wrote: Hi, Nutch is throwing errors while fetching large files (files larger than 100mb). I have a website with pages that point to large files (file size varies from 10mb to 500mb) and there are several large files on that website. I want to fetch all the files using Nutch, but nutch is throwing an outofmemory exception for large files (I have set the heap size to 2500m); with 2500m of heap, files of 250mb are retrieved, but larger ones are failing, and nutch takes a lot of time after printing -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 if there are three files of 100mb each then it fails (at the same depth, with heap size 2500m) to fetch the files. I have set http.content.limit to -1. Is there a way to fetch several large files using nutch? I am using nutch as a webcrawler, I am not using indexing. I want to download web resources and scan them for viruses using ClamAV. Probably Nutch is not the right tool for you - you should probably use wget. Nutch was designed to fetch many pages of limited size - as a temporary step it caches the downloaded content in memory, before flushing it out to disk. (I had to solve this limitation once for a specific case - the solution was to implement a variant of the protocol and Content that stored data into separate HDFS files without buffering in memory - but it was a brittle hack that only worked for that particular scenario.) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Accessing crawled data
On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some smart keyword extraction), so I have to get in the middle between crawling and indexing! My actual solution is to dump the content to a file through the segment reader, parse it and then use SolrJ to send the documents. Probably the best solution is to set my own analyzer for the field on the Solr side and do the keyword extraction there. Thanks for the script, I'll use it! Likely the solution that you are looking for is an IndexingFilter - it receives a copy of the document with all fields collected just before it's sent to the indexing backend, and you can freely modify the content of the NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Accessing crawled data
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some smart keyword extraction), so I have to get in the middle between crawling and indexing! My actual solution is to dump the content to a file through the segment reader, parse it and then use SolrJ to send the documents. Probably the best solution is to set my own analyzer for the field on the Solr side and do the keyword extraction there. Thanks for the script, I'll use it! Likely the solution that you are looking for is an IndexingFilter - it receives a copy of the document with all fields collected just before it's sent to the indexing backend, and you can freely modify the content of the NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc. This sounds very interesting. So the idea is to take the NutchDocument as it comes out of the crawl and modify it (inside an IndexingFilter) before it's sent to indexing (inside Nutch), right? Correct - IndexingFilters work no matter whether you use Nutch or Solr indexing. So how does it relate to the Nutch schema and the Solr schema? Can you give me some pointers? Please take a look at how e.g. the index-more filter is implemented - basically you need to copy this filter and make whatever modifications you need ;) Keep in mind that any fields that you create in the NutchDocument need to be properly declared in schema.xml when using Solr indexing. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
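A real IndexingFilter is a Java plugin against the Nutch API, but the concept being discussed - receive the document's fields just before indexing and add a computed field - can be sketched language-agnostically. This is an illustrative Python analogy only (the function and field names are made up, not the Nutch API), showing a crude "smart keywords" field of the kind Claudio wants:

```python
# Illustrative sketch only: a pretend "indexing filter" that sees the
# document as a dict of fields and adds a computed 'keywords' field.
# A real Nutch IndexingFilter is a Java plugin operating on NutchDocument.
from collections import Counter
import re

def keyword_filter(doc, top_n=3):
    """doc: dict of field name -> value. Adds a 'keywords' field
    holding the top_n most frequent words (>= 4 letters) in 'content'."""
    words = re.findall(r"[a-z]{4,}", doc.get("content", "").lower())
    doc["keywords"] = [w for w, _ in Counter(words).most_common(top_n)]
    return doc
```

As Andrzej notes, any such computed field must also be declared in Solr's schema.xml when indexing with Solr.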
Re: Dedup remove all duplicates
On 2010-01-06 18:56, Pascal Dimassimo wrote: Hi, After I run the index command, my index contains 2 documents with the same boost, digest, segment and title, but with different tstamp and url. When I run the dedup command on that index, both documents are removed. Should the document with the latest tstamp be kept? It should - out of multiple documents with the same URL (URL duplicates), only the most recent is retained, unless it was removed because there was another document in the index with the same content (a content duplicate). Could you please verify this on a minimal index (2 documents), and if the problem persists please report it in JIRA. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
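The expected dedup rule described in the reply can be sketched as two passes. This is a hedged Python illustration of the policy (field names are illustrative; it is not the actual Lucene-based dedup code): first collapse URL duplicates to the latest tstamp, then collapse content duplicates by digest:

```python
# Sketch of the dedup policy: among docs sharing a URL keep only the most
# recent tstamp; among survivors sharing a content digest keep only one.
def dedup(docs):
    """docs: list of dicts with 'url', 'digest', 'tstamp' keys."""
    # Pass 1: URL duplicates -> keep the latest tstamp per URL.
    by_url = {}
    for d in docs:
        cur = by_url.get(d["url"])
        if cur is None or d["tstamp"] > cur["tstamp"]:
            by_url[d["url"]] = d
    # Pass 2: content duplicates -> keep one (most recent) doc per digest.
    by_digest = {}
    for d in by_url.values():
        cur = by_digest.get(d["digest"])
        if cur is None or d["tstamp"] > cur["tstamp"]:
            by_digest[d["digest"]] = d
    return sorted(by_digest.values(), key=lambda d: d["url"])
```

In Pascal's case (same digest, different URLs), this policy keeps exactly one document; removing both would indeed be a bug worth a JIRA report.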
Re: Purging from Nutch after indexing with Solr
On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote: I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came up with some questions regarding data redundancy in this setup. Consider the following sample segment: 2.0G content, 196K crawl_fetch, 152K crawl_generate, 376K crawl_parse, 392K parse_data, 441M parse_text. 1. From what I have found through searches, content holds the raw fetched content. Is there any problem if I remove it, i.e. does Nutch need it to apply any sort of logic when re-crawling that content/url? No, it is no longer needed, unless you want to provide a cached view of the content. 2. Does the previous question also apply to parse_data and parse_text after I've called nutch solrindex on that segment? Depends how you set up your search. If you search using NutchBean (i.e. the Nutch web application) then you need them. If you search using Solr, then you don't need them. 3. In sample scripts and tutorials I always see invertlinks being called over all segments, but its output mentions merging - when I fetch/parse new segments, can I call invertlinks only over them? Yes, invertlinks will incrementally merge the existing linkdb with the new links from a new segment. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Purging from Nutch after indexing with Solr
On 2010-01-09 10:18, MilleBii wrote: @Andrzej, to be more specific: if one uses cached content (which I do), what is the minimal stuff to keep? I guess: + crawl_fetch + parse_data + parse_text - the rest is not used ... I guess; before I start testing, could you confirm? crawl_fetch you can ignore - it's just the status of fetching, which should by that time already be integrated into the crawldb (if you ran updatedb). It's the content/ that you need to display the cached view. @Ulysse, the other reason to keep all the data is if you will need to reindex all segments, which does happen in development and test phases, less so in production. Right. Also, a common practice is to keep the raw data for a while just to make sure that the parsing and indexing went smoothly (in case you need to re-parse the raw content). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Adding additional metadata
On 2010-01-11 13:18, Erlend Garåsen wrote: First of all: I didn't know about the list archive, so sorry for not searching that resource before I sent a new post. MilleBii wrote: For lastModified, just enable the index-more/query-more plugins; they will do the job for you. Unfortunately not. Our pages include Dublin Core metadata which has a Norwegian name. For other metadata, search the mailing list - it's explained many times how to do it. I found several posts concerning metadata, but for me one question is still unanswered: do I really have to create a lot of new classes/xml files in order to store the content of just two metadata fields? I have not managed to parse the content of the lastModified metadata after I tried to rewrite the HtmlParser class. So I tried to add hard-coded metadata values in HtmlParser like this instead: entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008"); My modified MoreIndexingFilter managed to pick up the hard-coded values, and the dates were successfully stored in my Solr index after running the solrindex option. This means that it is not necessary to write a new MoreIndexingFilter class, but I'm still unsure about the HtmlParser class since I haven't managed to parse the content of the metadata. You can of course hack your way through HtmlParser and add/remove/modify as you see fit - it's straightforward and likely you will get the result that you want. However, as MilleBii suggests, the preferred way to do this would be to write a plugin. The reason is the cost of long-term maintenance - if you ever want to sync up your locally modified version of Nutch with a newer public release, your hacked copy of HtmlParser won't merge nicely, whereas if you put your code in a separate plugin then it might. Another reason is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do. -- Best regards, Andrzej Bialecki ___.
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Help Needed with Error: java.lang.StackOverflowError
On 2010-01-11 18:40, Godmar Back wrote: On Mon, Jan 11, 2010 at 12:30 PM, Fuad Efendif...@efendi.ca wrote: Googling reveals http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you could try increasing the Java stack size in bin/nutch (-Xss), or use an alternate regexp if you can. Just out of curiosity, why does a performance-critical program such as Nutch use Sun's backtracking-based regexp implementation rather than an efficient Thompson-based one? Do you need the additional expressiveness provided by PCRE? Very interesting point... we should use it for BIXO too. BTW, Sun has memory leaks with LinkedBlockingQueue, http://bugs.sun.com/view_bug.do?bug_id=6806875 http://tech.groups.yahoo.com/group/bixo-dev/message/329 I don't think we use this class in Nutch. And, of course, URL is synchronized; Apache Tomcat uses a simplified version of the URL class. And RegexUrlNormalizer is synchronized in Nutch... And, in order to retrieve plain text from HTML, we create a fat DOM object (instead of using, for instance, filters in NekoHTML). We create a DOM tree because it's much easier to write filtering plugins that work with a DOM tree than to implement Neko filters. Besides, we provide an option to use TagSoup for HTML parsing, which is not only more resilient to HTML errors but also more efficient. Besides, Nutch is built around plugins. Deactivate parse-html and write your own HTML plugin that avoids these inefficiencies, and we'll be happy to include it in the distribution. And more... I'm no expert, but the reason I brought this up for discussion was that I recently encountered a paper pointing out that regular expression matching accounts for a significant fraction of total runtime in search engine indexers [1] and thus it's something that's usually optimized.
- Godmar [1] http://portal.acm.org/citation.cfm?id=1542275.1542284 This StackOverflowError probably came from urlfilter-regex, which indeed uses Java regex - definitely one of the worst implementations. The reason it's used by default in Nutch is that it's standard in the JDK, FWIW. For high-performance crawlers I usually do the following: * avoid regex filtering completely, if possible, instead using a combination of prefix/suffix/domain/custom filters * use urlfilter-automaton, which is slightly less expressive but much, much faster. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
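The filtering strategy in the reply can be illustrated with a small sketch. This is a hedged Python analogy (not Nutch code; the prefix/suffix lists and the regex are made up for the demo): cheap prefix and suffix checks reject the bulk of URLs first, so the expensive backtracking regex runs only as a last resort:

```python
# Illustrative layered URL filter: cheap checks first, regex last.
import re

ALLOWED_PREFIXES = ("http://", "https://")          # scheme whitelist
EXCLUDED_SUFFIXES = (".jpg", ".gif", ".zip")        # binary/media types
# Example of an "expensive" pattern kept for the few URLs that get this far.
EXPENSIVE_RE = re.compile(r"sessionid=[0-9a-f]{16,}")

def accept(url: str) -> bool:
    url = url.lower()
    if not url.startswith(ALLOWED_PREFIXES):        # prefix filter
        return False
    if url.endswith(EXCLUDED_SUFFIXES):             # suffix filter
        return False
    return EXPENSIVE_RE.search(url) is None         # regex only at the end
```

urlfilter-automaton achieves a similar effect inside the regex step itself by compiling patterns to a DFA, trading some expressiveness (no backreferences) for linear-time matching.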
Re: Post Injecting ?
On 2010-01-15 20:09, MilleBii wrote: Inject is meant to seed the database at the start, but I would like to inject new URLs into a production crawldb. I think it works, but I was wondering if somebody could confirm that. Yes. New URLs are merged with the old ones. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: merge not working anymore
On 2010-01-18 21:56, MilleBii wrote: Help !!! My production environment is blocked by this error. I deleted the segment altogether and restarted crawl/fetch/parse... and I'm still stuck, so I cannot add segments anymore. Looks like an HDFS problem? 2010-01-18 19:53:00,785 WARN hdfs.DFSClient - DFS Read: java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735 file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx This error is commonly caused by running out of disk space on a datanode. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: About HBase Integration
On 2010-02-09 03:08, Hua Su wrote: Thanks. But Heritrix is another project, right? Please see this Git repository; it contains the latest work in progress on Nutch+HBase: git://github.com/dogacan/nutchbase.git -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: SegmentFilter
On 2010-02-20 22:45, reinhard schwab wrote: The content of one page is stored as many as 7 times: http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 I believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Duplicate content is usually related to the fact that the same content indeed appears under different URLs. This is common enough, so I don't see this necessarily as a bug in Nutch - we won't know that the content is identical until we actually fetch it... URLs may differ in certain systematic ways (e.g. by a set of URL params, such as sessionId, print=yes, etc.) or be completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
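The systematic case (same page under different parameter values) is commonly handled by a URL normalizer that strips the offending parameters before URLs enter the crawldb. A hedged sketch in Python (g2_highlightId comes from this thread; the other parameter names are common examples, and this is not actual Nutch urlnormalizer code):

```python
# Sketch of a normalizer that drops URL parameters known to produce
# duplicate content, so variant URLs collapse to one canonical form.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NOISE_PARAMS = {"g2_highlightid", "sessionid", "print"}  # illustrative list

def normalize(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in NOISE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

In Nutch this kind of rule would typically live in urlnormalizer-regex configuration rather than custom code.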
Re: SegmentFilter
On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki schrieb: On 2010-02-20 22:45, reinhard schwab wrote: The content of one page is stored as many as 7 times: http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 I believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Duplicate content is usually related to the fact that the same content indeed appears under different URLs. This is common enough, so I don't see this necessarily as a bug in Nutch - we won't know that the content is identical until we actually fetch it... URLs may differ in certain systematic ways (e.g. by a set of URL params, such as sessionId, print=yes, etc.) or be completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId. I know - I have implemented several URL filters to filter duplicate content. But there is a difference here: in this case the same content is stored under the same URL several times. It is stored under http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 and not under http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 - the content for the latter URL is empty (Content:). OK, then the answer can be found in the protocol status or the parse status. You can get the protocol status by doing a segment dump of only the crawl_fetch part (disable all other parts, then the output is less confusing). Similarly, the parse status can be found in crawl_parse. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch v0.4
On 2010-02-24 17:34, Pedro Bezunartea López wrote: Hi Ashley, Hi, I'm looking to reproduce program analysis results based on Nutch v0.4. I realize this is a very old release, but is it possible to obtain the source from somewhere? I see some of the classes I'm looking for in v0.7, but I need the older version to confirm it. Thanks, Ashley You can get version 0.6 and higher from Apache's archive: http://archive.apache.org/dist/lucene/nutch/ ... but I haven't found anything older. AFAIK older releases of Nutch were archived only on the old SourceForge site, and apparently that site no longer exists. Sorry :( However, you can still check out that code from the CVS repository at nutch.sf.net. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Update on ignoring menu divs
On 2010-02-28 18:42, Ian M. Evans wrote: Using Nutch as a crawler for Solr. I've been digging around the nutch-user archives a bit and have seen some people discussing how to ignore menu items or other unnecessary div areas like common footers, etc. I still haven't come across a full answer yet. Is there a way to define a div by id that Nutch will strip out before tossing the content into Solr? There is no such functionality out of the box. One direction that is worth pursuing would be to create an HtmlParseFilter plugin that wraps the Boilerpipe library: http://code.google.com/p/boilerpipe/ . -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: New version of nutch?
On 2010-03-03 20:12, John Martyniak wrote: Does anybody have an idea of when a new version of Nutch will be available, specifically one supporting the latest version of Hadoop - and possibly HBase? Thank you for any information. We should roll out a 1.1 soon (a few weeks); Nutch+HBase is IMHO still a few months away. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Content of redirected urls empty
On 2010-03-08 14:55, BELLINI ADAM wrote: Any ideas, guys? From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: Content of redirected urls empty Date: Fri, 5 Mar 2010 22:01:05 + Hi, the content of my redirected URLs is empty... but they still have the other metadata... I have an http URL that is redirected to https. In my index I find the http URL, but with empty content... could you explain it, please? There are two ways to redirect - one is with the protocol, and the other is with content (either a meta refresh or JavaScript). When you dump the segment, is there really no content for the redirected URL? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: form-based authentication? Any progress
On 2010-03-10 19:26, conficio wrote: Susam Pal wrote: Hi, Indeed the answer is negative, and many people have often asked this on this list. Martin has very nicely explained the problems and possible solutions. I'll just add what I have thought of. I have often wondered what it would take to create a nice configurable cookie-based authentication feature. The following file would be needed:- ... http://wiki.apache.org/nutch/HttpPostAuthentication I was wondering if any work has been done in this direction? I guess the answer is still no. Would the problem become easier if one targeted particular types of sites first, such as popular wikis, bug trackers, blogs, CMSes, forums, or document management systems? I was involved in a project to implement this (as a proprietary plugin). In short, it requires a lot of effort, and there are no generic solutions. If it works with one site, it breaks with another, and eventually you end up with a nasty heap of hacks upon hacks. In that project we gave up after discovering that a large number of sites use JavaScript to create and name the input controls, and that they used a challenge-response scheme with client-side scripts generating the response ... it was a total mess. So, if you target 10 sites, you can make it work. If you target 10,000 sites all using slightly different methods, then forget it. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Where are new linked entries added
On 2010-03-11 15:53, nikinch wrote: Hi everyone. I've been using Nutch for a while now and I've come upon a snag. I'm trying to find where newly linked pages are added to the segment as specific entries. To make myself clear: I've been through the Fetcher class and the CrawlDbFilter and reducer, but I'm looking for the initial point where, for a given page, the links are transformed into segment entries. My objective is to pass down the initial inject URL to all its linked pages, so when I create an entry for the linked URLs of a webpage I'll add metadata to their definition giving them this originating URL. By the time I get to CrawlDbFilter I already have entries for the linked pages and have lost the notion of which seed URL brought us here. I thought the job would be done in the Fetcher, maybe in the output function, but I can't find where it happens. So if anyone knows and could point me in the right direction, I'd appreciate it. Currently the best place to do this is in your implementation of a ScoringFilter, in distributeScoreToOutlinks(). You can also modify one of the existing scoring plugins. I would advise against modifying the code directly in ParseOutputFormat; it's complex and fragile. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
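The idea behind using distributeScoreToOutlinks() for this - each outlink record inherits an "origin" value from its parent, so the seed URL propagates down the whole link tree - can be shown with a tiny language-agnostic sketch. This is an illustrative Python analogy only; the record shape and the "origin" key are made up, not the Nutch ScoringFilter API:

```python
# Sketch: propagate the originating seed URL to every outlink record.
# A seed (no 'origin' metadata yet) stamps its own URL; deeper pages
# simply pass their inherited origin along unchanged.
def outlink_records(parent, outlinks):
    """parent: dict with 'url' and 'meta'; outlinks: list of URL strings."""
    origin = parent["meta"].get("origin", parent["url"])
    return [{"url": u, "meta": {"origin": origin}} for u in outlinks]
```

Because each child carries the same origin its parent carried, the seed URL survives to arbitrary depth without any lookup back into the crawldb.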
Re: Avoid indexing common html to all pages, promoting page titles.
On 2010-03-12 12:52, Pedro Bezunartea López wrote: Hi, I'm developing a site that shows its dynamic content in a div id=content; the rest of the page doesn't really change. I'd like to store and index only the contents of this div, basically to avoid re-indexing the same content (header, footer, menu) over and over. I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a couple of things: 1. Should I extend the parse-html plugin, or should I just replace it? You should write an HtmlParseFilter, extract only the portions that you care about, and then replace the output parse text with your extracted text. 2. The example talks about finding a meta tag, extracting some information from it, and adding a field to the index. I think I just need to get rid of all HTML except the div id=content tag and index its content. Can someone point me in the right direction? See above. And just one more thing: I'd like to give a higher score to pages in which the search terms appear in the title. Right now pages that contain the terms in the body rank higher than those that contain the search terms in the title; how could I modify this behaviour? You can define these weights in the configuration - look for the query boost properties. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
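The extraction step the HtmlParseFilter would perform - keep only the text inside div id=content - can be sketched outside Nutch. A hedged, standard-library-only Python illustration (a real HtmlParseFilter is a Java plugin and would work on the DOM tree Nutch already built, not re-parse the HTML):

```python
# Sketch: collect text only inside <div id="content">, tracking nested
# divs so the extraction ends at the matching closing tag.
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    def __init__(self, div_id):
        super().__init__()
        self.div_id, self.depth, self.text = div_id, 0, []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":          # nested div inside the target
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1            # entered the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1           # left a div; 0 means target closed

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

def extract(html, div_id="content"):
    p = DivExtractor(div_id)
    p.feed(html)
    return "".join(p.text).strip()
```

In the plugin, this extracted text would replace the parse text, so only the div's content reaches the index.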