Re: How to create patch?

2007-06-01 Thread Dennis Kubes

Take a look at this from the wiki:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

It shows how to create a patch from SVN.  To apply a patch to your
source code, you would use the patch command (on Linux) like this:


patch -p0 < your_patch_file.patch

Dennis Kubes

Manoharam Reddy wrote:

I have seen some patches being exchanged on the list.

I want to know how these patches are created and how they are applied. Any
pointers to tutorials on the net or the wiki, or a plain reply here, would be
helpful.


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500603
 ] 

Doğacan Güney commented on NUTCH-392:
-

From what I understand of the MapFile.Writer code in Hadoop, if you give a
CompressionType as an argument to its constructor, it overrides the
compression value in the config. So since Nutch manually sets parse_text and
parse_data to RECORD compression (and crawl_parse to NONE), we will not get
the advantages of BLOCK compression even if we set it in the config.

BLOCK compression seems to work really well if you have the native libraries
in place, so IMHO it would be better not to set the CompressionType manually
and to allow people to set it to whatever they want in the config.
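
To make the distinction concrete, here is a minimal sketch of the two call
patterns (assuming the MapFile.Writer constructors of the Hadoop version in
use at the time; the class, method and variable names below are illustrative
only, not the actual Nutch code):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.parse.ParseText;

public class CompressionChoiceSketch {

  // Hard-coded: the output stays RECORD-compressed no matter what
  // io.seqfile.compression.type says in the config.
  static MapFile.Writer hardCoded(JobConf job, FileSystem fs, Path out)
      throws IOException {
    return new MapFile.Writer(job, fs, out.toString(),
        Text.class, ParseText.class, CompressionType.RECORD);
  }

  // Config-driven: the writer picks up NONE/RECORD/BLOCK from the config,
  // so BLOCK compression can be enabled without a code change.
  static MapFile.Writer configDriven(JobConf job, FileSystem fs, Path out)
      throws IOException {
    return new MapFile.Writer(job, fs, out.toString(),
        Text.class, ParseText.class, SequenceFile.getCompressionType(job));
  }
}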

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: https://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: How to create patch?

2007-06-01 Thread Marcin Okraszewski

http://wiki.apache.org/nutch/HowToContribute



On 6/1/07, Manoharam Reddy [EMAIL PROTECTED] wrote:

I have seen some patches being exchanged on the list.

I want to know how these patches are created and how they are applied. Any
pointers to tutorials on the net or the wiki, or a plain reply here, would be
helpful.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635
 ] 

Andrzej Bialecki  commented on NUTCH-392:
-

Good point. We can change it to use the following pattern (as Hadoop uses 
internally), e.g.:

contentOut = new MapFile.Writer(job, fs, content.toString(), Text.class, 
Content.class, SequenceFile.getCompressionType(job), progress);

However, the original patch had some merits, too. Some types of data are not
that compressible in themselves (using RECORD compression), i.e. it takes more
effort to compress/decompress them than the space savings are worth. In the
case of crawl_parse and crawl_fetch it would make sense to enforce the BLOCK
or NONE compression type, and disallow the RECORD type.

I know that BLOCK compression gives better space savings, and incidentally
may increase the writing speed. But I'm not sure what the performance impact
of using BLOCK-compressed MapFile-s is when doing random reads - this is the
scenario in LinkDbInlinks, FetchedSegments and similar places. Could you
perhaps test it? The original patch used RECORD compression for MapFile-s,
probably for this reason.
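
(For reference, a minimal sketch of how the compression type would then be
chosen per deployment rather than in code. This assumes the standard
io.seqfile.compression.type property that SequenceFile.getCompressionType()
reads; setting it programmatically below is only for illustration - normally
it would go into hadoop-site.xml / nutch-site.xml.)

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;

public class CompressionTypeConfigSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Equivalent to setting io.seqfile.compression.type=BLOCK in the config files.
    job.set("io.seqfile.compression.type", "BLOCK");
    // Writers created with SequenceFile.getCompressionType(job) will now use BLOCK.
    System.out.println(SequenceFile.getCompressionType(job));
  }
}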

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: https://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728
 ] 

Andrzej Bialecki  commented on NUTCH-392:
-

 I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*,
 content, parse_data. Because I don't think that people will need fast
 random-access on anything but parse_text.

LinkDb is accessed randomly on-line through LinkDbInlinks, when users request
anchors. Similarly, parse_data is accessed when requesting "explain", and may
also be accessed to retrieve other hit metadata. Content is accessed randomly
when displaying the cached preview. I think in all these cases we can use at
most RECORD compression, or NONE.

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: https://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki

Briggs wrote:

What is the design contract on plugins when it comes to thread safety?
I was under the assumption that plugins should be thread safe, but I
have been running into concurrent modification exceptions from the
language identifier plugin while indexing.  My application is a bit


They should be thread-safe. E.g. Fetcher runs many threads in parallel, 
each thread using plugins to handle fetching, parsing, url filtering, 
etc, etc.




different from the normal Nutch way.  I have many crawls going on
concurrently within an application.  So, that means I would also have
many concurrent indexing tasks.  So, if I can't be guaranteed that
plugins are thread-safe, I may need to do a nasty thing and synchronize
my index() method (ouch).


Here is the exception, just for info:

java.util.ConcurrentModificationException
   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:787)
   at java.util.HashMap$ValueIterator.next(HashMap.java:817)
   at 
org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:277) 


This is a bug. My guess is that NGramProfile.getSorted() should be 
synchronized. Could you please test if this works?
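
For illustration only, here is a hypothetical sketch of the kind of change
being suggested - not the actual NGramProfile source, whose real method
signature and body differ. The point is simply that making the accessor
synchronized keeps one indexing thread from iterating the shared map while
another thread is modifying it:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramProfileSketch {

  // Shared, mutable state touched by several indexing threads at once.
  private final Map<String, Integer> ngrams = new HashMap<String, Integer>();
  private List<String> sorted;

  // Adding 'synchronized' serializes access to the shared map, which is what
  // was triggering the ConcurrentModificationException under concurrent indexing.
  public synchronized List<String> getSorted() {
    if (sorted == null) {
      sorted = new ArrayList<String>(ngrams.keySet());
      Collections.sort(sorted);
    }
    return sorted;
  }

  public synchronized void add(String ngram) {
    Integer count = ngrams.get(ngram);
    ngrams.put(ngram, count == null ? 1 : count + 1);
    sorted = null;  // invalidate the cached ordering
  }
}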


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Plugins and Thread Safety

2007-06-01 Thread Briggs

What is the design contract on plugins when it comes to thread safety?
I was under the assumption that plugins should be thread safe, but I
have been running into concurrent modification exceptions from the
language identifier plugin while indexing.  My application is a bit
different from the normal Nutch way.  I have many crawls going on
concurrently within an application.  So, that means I would also have
many concurrent indexing tasks.  So, if I can't be guaranteed that
plugins are thread-safe, I may need to do a nasty thing and synchronize
my index() method (ouch).


Here is the exception, just for info:

java.util.ConcurrentModificationException
   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:787)
   at java.util.HashMap$ValueIterator.next(HashMap.java:817)
   at 
org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:277)
   at 
org.apache.nutch.analysis.lang.NGramProfile.analyze(NGramProfile.java:244)
   at 
org.apache.nutch.analysis.lang.LanguageIdentifier.identify(LanguageIdentifier.java:409)
   at 
org.apache.nutch.analysis.lang.LanguageIndexingFilter.filter(LanguageIndexingFilter.java:84)
   at 
org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:131)
   at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:240)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
   at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)


--briggs


Conscious decisions by conscious minds are what make reality real


Re: Plugins and Thread Safety

2007-06-01 Thread Briggs

Oh, you want me to change the getSorted method to be synchronized?
I'll put a lock in there and see what happens, if that is what you are
referring to.


On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Briggs wrote:
 What is the design contract on plugins when it comes to thread safety?
 I was under the assumption that plugins should be thread safe, but I
 have been running into concurrent modification exceptions from the
 language identifier plugin while indexing.  My application is a bit

They should be thread-safe. E.g. Fetcher runs many threads in parallel,
each thread using plugins to handle fetching, parsing, url filtering,
etc, etc.


 different from the normal Nutch way.  I have many crawls going on
 concurrently within an application.  So, that means I would also have
 many concurrent indexing tasks.  So, if I can't be guaranteed that
 plugins are thread-safe, I may need to do a nasty thing and synchronize
 my index() method (ouch).


 Here is the exception, just for info:

 java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:787)
at java.util.HashMap$ValueIterator.next(HashMap.java:817)
at
 org.apache.nutch.analysis.lang.NGramProfile.normalize(NGramProfile.java:277)

This is a bug. My guess is that NGramProfile.getSorted() should be
synchronized. Could you please test if this works?

--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Conscious decisions by conscious minds are what make reality real


Re: Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki

Briggs wrote:

Oh, you want me to change the getSorted method to be synchronized?
I'll put a lock in there and see what happens, if that is what you are
referring to.


Yes, please try this change.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Plugins and Thread Safety

2007-06-01 Thread Briggs

I will get back to you.  It isn't the easiest bug to test.  So, I will
let you know soon!

On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Briggs wrote:
 Oh, you want me to change the getSorted method to be synchronized?
 I'll put a lock in there and see what happens, if that is what you are
 referring to.

Yes, please try this change.


--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Conscious decisions by conscious minds are what make reality real


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822
 ] 

Doug Cutting commented on NUTCH-392:


Anchors, explain, and the cache are used relatively infrequently, considerably 
less than once per query, and hence *much* less than once per displayed hit.  
So it might be acceptable if they're somewhat slower.  Block compression should 
still be fast enough for interactive use, and these uses would never dominate 
CPU use in an application, would they?

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: https://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-01 Thread Nicolás Lichtmaier
This is a fixed version of the previous patch. Please, don't ignore me
=). I'm trying to use Lucene queries with Nutch, and this patch will
help. This patch also removes a deprecated API usage and eliminates useless
object creation and array copying.


Thanks!

Index: src/java/org/apache/nutch/searcher/IndexSearcher.java
===
--- src/java/org/apache/nutch/searcher/IndexSearcher.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/IndexSearcher.java	(working copy)
@@ -21,6 +21,8 @@
 
 import java.util.ArrayList;
 import java.util.Enumeration;
+import java.util.Iterator;
+import java.util.List;
 
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
@@ -105,20 +107,8 @@
   }
 
   public HitDetails getDetails(Hit hit) throws IOException {
-    ArrayList fields = new ArrayList();
-    ArrayList values = new ArrayList();
-
     Document doc = luceneSearcher.doc(hit.getIndexDocNo());
-
-    Enumeration e = doc.fields();
-    while (e.hasMoreElements()) {
-      Field field = (Field)e.nextElement();
-      fields.add(field.name());
-      values.add(field.stringValue());
-    }
-
-    return new HitDetails((String[])fields.toArray(new String[fields.size()]),
-                          (String[])values.toArray(new String[values.size()]));
+    return new HitDetails(doc);
   }
 
   public HitDetails[] getDetails(Hit[] hits) throws IOException {
Index: src/java/org/apache/nutch/searcher/HitDetails.java
===
--- src/java/org/apache/nutch/searcher/HitDetails.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/HitDetails.java	(working copy)
@@ -21,8 +21,11 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.List;
 
 import org.apache.hadoop.io.*;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
 import org.apache.nutch.html.Entities;
 
 /** Data stored in the index for a hit.
@@ -52,7 +55,23 @@
     this.fields[1] = url;
     this.values[1] = url;
   }
+  
+  /** Construct from Lucene document. */
+  public HitDetails(Document doc)
+  {
+    List<?> ff = doc.getFields();
+    length = ff.size();
+
+    fields = new String[length];
+    values = new String[length];
 
+    for(int i = 0 ; i < length ; i++) {
+      Field field = (Field)ff.get(i);
+      fields[i] = field.name();
+      values[i] = field.stringValue();
+    }
+  }
+
   /** Returns the number of fields contained in this. */
   public int getLength() { return length; }
 


Re: [PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-01 Thread Andrzej Bialecki

Nicolás Lichtmaier wrote:

This is a fixed version of the previous patch.


In the future, please use the JIRA bug tracking system to submit patches.


Please, don't ignore me =).


We don't - but there's only so much you can do in 24 hrs/day, and Nutch
developers have their own lives to attend to ... ;)



I'm trying to use Lucene queries with Nutch and this patch will 
help. This patch also removes a deprecated API usage, removes useless 
object creation and array copying.


I believe the conversion from Document to HitDetails was separated this
way on purpose. Please note that the front-end Nutch API has no dependencies
on Lucene classes. If we applied your patch, HitDetails would suddenly
become dependent on Lucene, causing front-end applications to become
dependent on Lucene, too.


We can certainly fix the use of the deprecated API as you suggested. As for
the rest of the patch, in my opinion it should not be applied.
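
For what it's worth, here is a hedged sketch of what the deprecated-API-only
fix might look like, keeping the Document handling inside IndexSearcher so
that HitDetails stays free of Lucene imports. The helper name is hypothetical
and the exact Lucene field types depend on the Lucene version in use:

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class GetDetailsSketch {

  // Build the parallel field/value arrays from doc.getFields() instead of the
  // deprecated Enumeration-based doc.fields(), then hand plain Strings to
  // HitDetails as before, so the front-end API keeps no Lucene dependency.
  static String[][] fieldsAndValues(Document doc) {
    List<?> ff = doc.getFields();
    String[] fields = new String[ff.size()];
    String[] values = new String[ff.size()];
    for (int i = 0; i < ff.size(); i++) {
      Field field = (Field) ff.get(i);
      fields[i] = field.name();
      values[i] = field.stringValue();
    }
    return new String[][] { fields, values };
  }
}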


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com