Re: How to create patch?
Take a look at this page from the wiki: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
It shows how to create a patch from SVN. To apply a patch to your source code, you would use the patch command (on Linux) like this:

patch -p0 < your_patch_file.patch

Dennis Kubes

Manoharam Reddy wrote:
> I have seen some patches being exchanged on the list. I want to know how
> a patch is created and how it is applied. Any pointers to tutorials on the
> net or wiki, or a plain reply here, would be helpful.
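Putting both halves together, a minimal end-to-end sketch (the file name NUTCH-XXX.patch is a made-up placeholder):

  # From the root of your SVN working copy, capture your changes:
  svn diff > NUTCH-XXX.patch

  # On the receiving side, dry-run first to check the patch applies cleanly:
  patch -p0 --dry-run < NUTCH-XXX.patch

  # Then apply it for real, from the same directory level it was created at:
  patch -p0 < NUTCH-XXX.patch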
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500603 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

From what I understand of the MapFile.Writer code in Hadoop, if you pass a CompressionType as a constructor argument, it overrides the compression setting in the config. So, since Nutch manually sets parse_text and parse_data to RECORD compression (and crawl_parse to NONE), we will not get the advantages of BLOCK compression even if we set it in the config. BLOCK compression seems to work really well if you have the native libraries in place, so IMHO it would be better not to set CompressionType manually and to let people set whatever they want in the config.

> OutputFormat implementations should pass on Progressable
> ---------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>         Attachments: NUTCH-392.patch
>
> OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
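To illustrate the behavior being described, a simplified sketch; the constructor and method shapes follow the Hadoop API of that era from memory, so treat the exact signatures as assumptions:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class CompressionChoice {
    public static MapFile.Writer open(Configuration conf, FileSystem fs, String dir)
        throws IOException {
      // Hard-coding a CompressionType here silently overrides whatever the
      // user put in io.seqfile.compression.type (e.g. BLOCK):
      //   new MapFile.Writer(conf, fs, dir, Text.class, Text.class,
      //                      SequenceFile.CompressionType.RECORD);

      // Reading the type from the config instead leaves the choice to the user:
      return new MapFile.Writer(conf, fs, dir, Text.class, Text.class,
          SequenceFile.getCompressionType(conf));
    }
  }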
Re: How to create patch?
http://wiki.apache.org/nutch/HowToContribute

On 6/1/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> I have seen some patches being exchanged on the list. I want to know how
> a patch is created and how it is applied. Any pointers to tutorials on the
> net or wiki, or a plain reply here, would be helpful.
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

Good point. We can change it to use the following pattern (the one Hadoop uses internally), e.g.:

contentOut = new MapFile.Writer(job, fs, content.toString(), Text.class, Content.class, SequenceFile.getCompressionType(job), progress);

However, the original patch had some merit, too. Some types of data are not that compressible in themselves (using RECORD compression), i.e. compressing and decompressing them takes more effort than the space savings are worth. In the case of crawl_parse and crawl_fetch it would make sense to enforce the BLOCK or NONE compression type, and disallow the RECORD type. I know that BLOCK compression gives better space savings, and may incidentally increase the writing speed. But I'm not sure what the performance impact of BLOCK-compressed MapFile-s is when doing random reads; this is the scenario in LinkDbInlinks, FetchedSegments and similar places. Could you perhaps test it? The original patch used RECORD compression for MapFile-s, probably for this reason.
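A minimal sketch of the "enforce BLOCK or NONE, disallow RECORD" idea; the fallback to NONE is my assumption for illustration, not code from the actual patch:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.SequenceFile.CompressionType;

  public final class CrawlOutputCompression {
    /** Returns the configured compression type, but never RECORD: for
     *  crawl_parse/crawl_fetch entries, per-record compression costs
     *  more than it saves, so it is degraded to NONE here. */
    public static CompressionType forCrawlData(Configuration job) {
      CompressionType type = SequenceFile.getCompressionType(job);
      return (type == CompressionType.RECORD) ? CompressionType.NONE : type;
    }
  }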
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728 ]

Andrzej Bialecki commented on NUTCH-392:
----------------------------------------

> I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*,
> content, parse_data, because I don't think that people will need fast
> random access on anything but parse_text.

LinkDb is accessed on-line randomly through LinkDbInlinks, when users request anchors. Similarly, parse_data is accessed when requesting explain, and may also be accessed to retrieve other hit metadata. Content is accessed randomly when displaying the cached preview. I think in all these cases we can use at most RECORD compression, or NONE.
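To make the resulting policy concrete, a hypothetical helper; the class, method, and part names are mine, purely illustrative of the split discussed above:

  import org.apache.hadoop.io.SequenceFile.CompressionType;

  public final class SegmentCompression {
    public static CompressionType forPart(String part) {
      // Batch-processed only (updatedb, indexing): BLOCK is safe here.
      if (part.equals("crawldb") || part.equals("crawl_parse")
          || part.equals("crawl_fetch")) {
        return CompressionType.BLOCK;
      }
      // Read randomly at query time (parse_text summaries, linkdb anchors,
      // parse_data explain, content cached preview): at most RECORD, or NONE.
      return CompressionType.RECORD;
    }
  }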
Re: Plugins and Thread Safety
Briggs wrote:
> What is the design contract on plugins when it comes to thread safety? I
> was under the assumption that plugins should be thread safe, but I have
> been running into concurrent modification exceptions from the language
> identifier plugin while indexing.

They should be thread-safe. E.g. Fetcher runs many threads in parallel, each thread using plugins to handle fetching, parsing, url filtering, etc., etc.

> My application is a bit different from the normal Nutch way. I have many
> crawls going on concurrently within an application. So, that means I would
> also have many concurrent indexing tasks. So, if I can't be guaranteed
> that plugins are threadsafe, I may need to do a nasty thing and
> synchronize my index() method (ouch). Here is the exception, just for info:
>
> java.util.ConcurrentModificationException
>     at java.util.HashMap$HashIterator.nextEntry(HashMap.java:787)
>     at java.util.HashMap$ValueIterator.next(HashMap.java:817)
>     at org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:277)

This is a bug. My guess is that NGramProfile.getSorted() should be synchronized. Could you please test if this works?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
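For reference, a simplified sketch of the suspected race and the suggested fix; this is schematic, not the actual NGramProfile source:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class ProfileSketch {
    // Shared, lazily-populated state: the hazard when many indexing
    // threads share one profile instance.
    private final Map<String, Integer> ngrams = new HashMap<String, Integer>();
    private List<String> sorted; // cache built on first use

    /** Declaring this method synchronized (the change suggested above)
     *  serializes the lazy build, so no thread can iterate the map while
     *  another thread is still mutating it, which is what produces the
     *  ConcurrentModificationException in the trace. */
    synchronized List<String> getSorted() {
      if (sorted == null) {
        for (String token : tokens()) {       // mutates the shared map...
          Integer n = ngrams.get(token);
          ngrams.put(token, n == null ? 1 : n + 1);
        }
        sorted = new ArrayList<String>(ngrams.keySet());
        Collections.sort(sorted);             // ...then iterates it
      }
      return sorted;
    }

    private List<String> tokens() {
      return Collections.singletonList("example");
    }
  }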
Plugins and Thread Safety
What is the design contract on plugins when it comes to thread safety? I was under the assumption that plugins should be thread safe, but I have been running into concurrent modification exceptions from the language identifier plugin while indexing.

My application is a bit different from the normal Nutch way. I have many crawls going on concurrently within an application. So, that means I would also have many concurrent indexing tasks. So, if I can't be guaranteed that plugins are threadsafe, I may need to do a nasty thing and synchronize my index() method (ouch).

Here is the exception, just for info:

java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextEntry(HashMap.java:787)
    at java.util.HashMap$ValueIterator.next(HashMap.java:817)
    at org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:277)
    at org.apache.nutch.analysis.lang.NGramProfile.analyze(NGramProfile.java:244)
    at org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:409)
    at org.apache.nutch.analysis.lang.LanguageIndexingFilter.filter(LanguageIndexingFilter.java:84)
    at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:131)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:240)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)

--briggs

Conscious decisions by conscious minds are what make reality real
Re: Plugins and Thread Safety
Oh, you want me to change the getSorted method to be synchronized? I'll put a lock in there and see what happens, if that is what you are referring to.

On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Briggs wrote:
> > What is the design contract on plugins when it comes to thread safety?
> > [...]
>
> This is a bug. My guess is that NGramProfile.getSorted() should be
> synchronized. Could you please test if this works?

--
Conscious decisions by conscious minds are what make reality real
Re: Plugins and Thread Safety
Briggs wrote:
> Oh, you want me to change the getSorted method to be synchronized? I'll
> put a lock in there and see what happens, if that is what you are
> referring to.

Yes, please try this change.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Plugins and Thread Safety
I will get back to you. It isn't the easiest bug to test, so I will let you know soon!

On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Briggs wrote:
> > Oh, you want me to change the getSorted method to be synchronized?
> > [...]
>
> Yes, please try this change.

--
Conscious decisions by conscious minds are what make reality real
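Races like this rarely show up on demand; a small stress harness can force the issue. A hypothetical example (the harness names are mine; the task would wrap a call such as LanguageIdentifier.identify() from the stack trace above):

  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.atomic.AtomicReference;

  public class RaceHarness {
    /** Runs the same task from many threads at once, many times over,
     *  and returns the first failure seen (null if the run survived). */
    public static Throwable hammer(final Runnable task, int threads,
        final int iterations) throws InterruptedException {
      final CountDownLatch start = new CountDownLatch(1);
      final AtomicReference<Throwable> failure = new AtomicReference<Throwable>();
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      for (int t = 0; t < threads; t++) {
        pool.execute(new Runnable() {
          public void run() {
            try {
              start.await(); // release all threads simultaneously
              for (int i = 0; i < iterations; i++) task.run();
            } catch (Throwable e) {
              failure.compareAndSet(null, e); // keep only the first failure
            }
          }
        });
      }
      start.countDown();
      pool.shutdown();
      pool.awaitTermination(60, TimeUnit.SECONDS);
      return failure.get();
    }
  }

A passing run proves little, of course; rerun it a few times before and after the synchronized change.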
[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ]

Doug Cutting commented on NUTCH-392:
------------------------------------

Anchors, explain, and the cache are used relatively infrequently, considerably less than once per query, and hence *much* less than once per displayed hit. So it might be acceptable if they're somewhat slower. Block compression should still be fast enough for interactive use, and these uses would never dominate CPU use in an application, would they?
[PATCH] Moving HitDetails construction to a HitDetails constructor (v2).
This is a fixed version of the previous patch. Please don't ignore me =). I'm trying to use Lucene queries with Nutch, and this patch will help. The patch also removes a deprecated API usage and eliminates useless object creation and array copying. Thanks!

Index: src/java/org/apache/nutch/searcher/IndexSearcher.java
===================================================================
--- src/java/org/apache/nutch/searcher/IndexSearcher.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/IndexSearcher.java	(working copy)
@@ -21,6 +21,8 @@
 
 import java.util.ArrayList;
 import java.util.Enumeration;
+import java.util.Iterator;
+import java.util.List;
 
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
@@ -105,20 +107,8 @@
   }
 
   public HitDetails getDetails(Hit hit) throws IOException {
-    ArrayList fields = new ArrayList();
-    ArrayList values = new ArrayList();
-
     Document doc = luceneSearcher.doc(hit.getIndexDocNo());
-
-    Enumeration e = doc.fields();
-    while (e.hasMoreElements()) {
-      Field field = (Field)e.nextElement();
-      fields.add(field.name());
-      values.add(field.stringValue());
-    }
-
-    return new HitDetails((String[])fields.toArray(new String[fields.size()]),
-                          (String[])values.toArray(new String[values.size()]));
+    return new HitDetails(doc);
   }
 
   public HitDetails[] getDetails(Hit[] hits) throws IOException {
Index: src/java/org/apache/nutch/searcher/HitDetails.java
===================================================================
--- src/java/org/apache/nutch/searcher/HitDetails.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/HitDetails.java	(working copy)
@@ -21,8 +21,11 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.List;
 
 import org.apache.hadoop.io.*;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
 import org.apache.nutch.html.Entities;
 
 /** Data stored in the index for a hit.
@@ -52,7 +55,23 @@
     this.fields[1] = url;
     this.values[1] = url;
   }
+
+  /** Construct from Lucene document. */
+  public HitDetails(Document doc)
+  {
+    List<?> ff = doc.getFields();
+    length = ff.size();
+
+    fields = new String[length];
+    values = new String[length];
+    for (int i = 0; i < length; i++) {
+      Field field = (Field) ff.get(i);
+      fields[i] = field.name();
+      values[i] = field.stringValue();
+    }
+  }
+
   /** Returns the number of fields contained in this. */
   public int getLength() { return length; }
Re: [PATCH] Moving HitDetails construction to a HitDetails constructor (v2).
Nicolás Lichtmaier wrote:
> This is a fixed version of the previous patch.

In the future, please use the JIRA bug tracking system to submit patches.

> Please, don't ignore me =).

We don't - but there's only so much you can do in 24 hrs/day, and Nutch developers have their own lives to attend to... ;)

> I'm trying to use Lucene queries with Nutch and this patch will help.
> This patch also removes a deprecated API usage, removes useless object
> creation and array copying.

I believe the conversion from Document to HitDetails was separated this way on purpose. Please note that the front-end Nutch API has no dependencies on Lucene classes. If we applied your patch, HitDetails would all of a sudden become dependent on Lucene, causing front-end applications to become dependent on Lucene, too. We can certainly fix the use of the deprecated API as you suggested. As for the rest of the patch, in my opinion it should not be applied.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
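For illustration, a hedged sketch of the middle ground hinted at here: drop the deprecated Document.fields() Enumeration inside IndexSearcher.getDetails() while leaving HitDetails free of Lucene imports. The getFields() return type varies across Lucene versions, so the cast is an assumption:

  // Inside IndexSearcher; uses java.util.List, org.apache.lucene.document.Document
  // and org.apache.lucene.document.Field, all already imported by the patch above.
  public HitDetails getDetails(Hit hit) throws IOException {
    Document doc = luceneSearcher.doc(hit.getIndexDocNo());

    List<?> ff = doc.getFields();            // non-deprecated accessor
    String[] fields = new String[ff.size()];
    String[] values = new String[ff.size()];
    for (int i = 0; i < ff.size(); i++) {
      Field field = (Field) ff.get(i);
      fields[i] = field.name();
      values[i] = field.stringValue();
    }
    // Reuse the existing String[]-based constructor, so the front-end API
    // keeps no dependency on Lucene classes.
    return new HitDetails(fields, values);
  }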