date:20100702


[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624 ]

Julien Nioche commented on NUTCH-835:
-

This patch has been marked for 1.2 but has been committed to trunk only (2.0). 
Shall we also apply it to /nutch/branches/branch-1.2 ?

 document deduplication (exact duplicates) failed using MD5Signature
 ---

 Key: NUTCH-835
 URL: https://issues.apache.org/jira/browse/NUTCH-835
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0, 1.1
 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
 Fix For: 1.2, 2.0


 The MD5Signature class calculates different signatures for identical documents.
 The reason is that
   byte[] data = content.getContent();
   ... StringBuilder().append(data) ...
 uses java.lang.Object.toString() to get a string representation of the (binary)
 content, which results in unique hash codes (e.g., [...@30dc9065) even for two
 byte arrays with identical content.
 A solution would be to take the MD5 sum of the binary content as the first part
 of the final signature calculation (the parsed content is the second part):
   ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
 Of course, there are many other solutions...
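
The behaviour described in the report can be reproduced with a small standalone sketch. It uses only the JDK: java.security.MessageDigest stands in for Hadoop's MD5Hash, and the toHex helper stands in for StringUtil.toHexString. The class and method names are hypothetical and this is not the committed patch. Hashing the concatenated string at the end is also an assumption about how the "final signature calculation" combines the two parts.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class; not part of Nutch.
final class SignatureSketch {

    // Hex-encodes a byte array (stand-in for StringUtil.toHexString).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // MD5 digest via the JDK (stand-in for MD5Hash.digest(data).getDigest()).
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Reported bug: append(byte[]) resolves to append(Object), so the string
    // holds the array's identity ("[B@...") instead of its content.
    static String brokenSignature(byte[] data, String text) {
        String combined = new StringBuilder().append(data).append(text).toString();
        return toHex(md5(combined.getBytes(StandardCharsets.UTF_8)));
    }

    // Suggested fix: the hex MD5 of the raw bytes is the first part,
    // the parsed text is the second part.
    static String stableSignature(byte[] data, String text) {
        String combined = new StringBuilder().append(toHex(md5(data))).append(text).toString();
        return toHex(md5(combined.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] first = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] second = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        String text = "same page";

        // Two distinct array objects with equal content.
        System.out.println("broken equal: "
                + brokenSignature(first, text).equals(brokenSignature(second, text)));
        System.out.println("stable equal: "
                + stableSignature(first, text).equals(stableSignature(second, text)));
    }
}
```

The stable variant depends only on the bytes and the parsed text, so two fetches of the same document yield the same signature. The broken variant depends on object identity, so it will almost always differ between two arrays.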

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Andrzej Bialecki


Hi,

(This question is mostly to Dogacan & Enis, but I encourage anyone 
familiar with the code to join the threads with [Nutchbase] - the sooner 
the better ;) ).


I'm looking at src/gora/webpage.avsc and WebPage.java & friends... 
presumably the java code was autogenerated from avsc using Gora? If so, 
we should put this autogeneration step in our build.xml. Or am I missing 
something?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Julien Nioche


 (This question is mostly to Dogacan & Enis, but I encourage anyone familiar
 with the code to join the threads with [Nutchbase] - the sooner the better
 ;) ).

 I'm looking at src/gora/webpage.avsc and WebPage.java & friends...
 presumably the java code was autogenerated from avsc using Gora? If so, we
 should put this autogeneration step in our build.xml. Or am I missing
 something?


correct. if we keep the generated java classes in svn then we probably want
to make this task optional i.e. it would not be done as part of the build
tasks OR we can add it to the build but remove it from svn (or better add to
svn ignore or whatever-it-is-called).

J.
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

[jira] Created: (NUTCH-840) Port tests from parse-html to parse-tika

Port tests from parse-html to parse-tika


 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


We don't have tests for HTML in parse-tika, so I'll copy them from the old 
parse-html plugin.


Re: Nutch 2.0 : Design issue

2010-07-02 Thread Julien Nioche

On 2 July 2010 12:22, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-07-02 12:42, Julien Nioche wrote:

 Hi guys,

 You've probably seen that there has been some progress on 2.0 lately. We've
 updated the nutchbase svn branch with the latest developments done on
 Dogacan's Github i.e. using GORA as a storage layer.
 One of the main issues [1] I raised after using nutchbase was that :

 NutchBase currently marks entries in the table to be fetched | parsed |
 etc... and needs to go through the whole table at every step. As the table
 gets bigger it takes more and more time to read through the entries and
 check their marks which is not a viable option. NutchBase is currently
 slower than Nutch 1.1 (might be issues with Gora but still...)
 I suggest instead that we create fetchlists in separate tables, fetch and
 parse in these tables then merge the entries back to the main table. The
 segment tables could then be deleted if necessary. We would then have a
 linear processing time for fetching + parsing + updating depending on the
 size of the segments and NOT on the size of the main table. This would be an
 improvement compared to 1.1 where the processing time in the updates is
 relative to the size of the crawldb.


 Doing this requires being able to separate the name of a schema from the
 name of a table in Gora [2], which should not be a big problem.
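
The proposal above can be sketched with plain in-memory maps. This is only an illustration under stated assumptions: the Row class, the status strings and all method names are hypothetical, and nothing here uses the Gora or Nutch APIs. It shows that the fetch/parse step visits only the rows of the batch table, while the main table is touched only when generating the batch and when merging it back.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of "one table per fetch batch"; not Gora code.
final class BatchTableSketch {

    // Minimal stand-in for a web table row.
    static final class Row {
        String status = "unfetched";
        String content;
    }

    // Generate step: copy up to batchSize unfetched keys into a separate table.
    // A real generator would also mark the selected rows in the main table.
    static Map<String, Row> generateBatch(Map<String, Row> mainTable, int batchSize) {
        Map<String, Row> batch = new LinkedHashMap<>();
        for (Map.Entry<String, Row> entry : mainTable.entrySet()) {
            if (batch.size() >= batchSize) {
                break;
            }
            if ("unfetched".equals(entry.getValue().status)) {
                batch.put(entry.getKey(), new Row());
            }
        }
        return batch;
    }

    // Fetch + parse step: iterates over the batch table only.
    // Returns the number of rows visited, which equals the batch size.
    static int fetchAndParse(Map<String, Row> batch) {
        int visited = 0;
        for (Map.Entry<String, Row> entry : batch.entrySet()) {
            entry.getValue().content = "content of " + entry.getKey();
            entry.getValue().status = "parsed";
            visited++;
        }
        return visited;
    }

    // Update step: merge batch rows back into the main table by key.
    static void mergeBack(Map<String, Row> mainTable, Map<String, Row> batch) {
        mainTable.putAll(batch);
    }

    public static void main(String[] args) {
        Map<String, Row> mainTable = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) {
            mainTable.put("http://example.com/page" + i, new Row());
        }
        Map<String, Row> batch = generateBatch(mainTable, 10);
        int visited = fetchAndParse(batch);
        mergeBack(mainTable, batch);
        System.out.println("rows visited during fetch/parse: " + visited);
        System.out.println("rows in main table: " + mainTable.size());
    }
}
```

In this sketch the batch map can simply be discarded after the merge, which mirrors the suggestion that the per-batch tables could be deleted once their entries are back in the main table.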


 I think this is a good idea - this model is conceptually close to the
 current model, and I bet it will be easier to debug problems when changes
 are limited to a separate table... we could create 1 table per segment.

 (Oh, and let's stop calling them segments, please - maybe call them a batch
 or crawl cycle or something. The name segments caused a lot of confusion
 already, and it doesn't convey any useful meaning..)


Makes sense



 As for the time savings .. this remains to be seen. At the end of the
 fetching/parsing job we need to merge this data back into the main table,
 which is a massive update that also takes time.


True





 On second thought I was wondering whether it would also make sense to
 actually keep the segments as they currently are i.e. stored as
 NutchWritables in HDFS. The advantages of doing this would be that we'd keep
 exactly the same code for the fetching + parsing + would only need to modify
 the generation and update steps + would be able to easily port pre-2.0
 segments to the webtable. The drawbacks being that there would be a dual
 storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.


 The fetcher code is already ported in nutchbase not to use the plain files.
 I doubt there would be many users who want to jump to Nutch 2.0 and still
 want to hold on to their old segments... so I think this is not useful. Dual
 storage .. *shudder* that's asking for trouble.


Right, + am not too keen on keeping the legacy objects. Another advantage of
having the GORA-based tables for the segments (or fetch_cycles ;-) ) is that
it makes it easier to restart an interrupted fetch or parse.

Forget about the HDFS based storage, let's just do it with GORA




 Note that it would not change anything in the content of the main webtable
 nor the operations done on it. Maybe it would make sense to do that anyway,
 at least as a transition while we make the webtable and GORA operations
 stable and then see if there is an advantage in storing the segments as GORA
 tables as well.

 I am pretty confident that we need to address the point raised in [1]
 anyway. What do you guys think?

 [1] http://github.com/dogacan/nutchbase/issues#issue/8
 [2] http://github.com/enis/gora/issues#issue/30


 +1 to both points, -1 to the dual storage.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

[Nutch Wiki] Trivial Update of PluginCentral by AlexMc

2010-07-02 Thread Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PluginCentral page has been changed by AlexMc.
The comment on this change is: adding a couple of external tutorials relating 
to plugins (more welcome!!!).
http://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=60&rev2=61

--

   * [[WritingPluginExample-0.9]] - Step-by-step example of how to write a plugin for the current development.
   * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
   * [[http://wiki.media-style.com/display/nutchDocu/Write+a+plugin|Writing Plugins]] - by Stefan
+  * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin] by Sujitpal
+  * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister
  
  == Plugins that Come with Nutch (0.9) ==

[jira] Updated: (NUTCH-840) Port tests from parse-html to parse-tika


 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840.patch

Patch which adds the HTML tests to the Tika Parser

The tests currently rely on some DOM-related code from Neko-HTML, which 
introduces a dependency on the plugin lib-nekohtml.
Apart from parse-tika, lib-nekohtml is used only in clustering-carrot, which will 
be removed shortly. Once this is done we can delete lib-nekohtml as well and then 
either: 
a) add the neko jar to the parse-tika lib via IVY
b) replace it with another implementation already available from the tika 
dependencies or the main Nutch dependencies (e.g. dom4j)






[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884671#action_12884671
 ] 

Julien Nioche commented on NUTCH-837:
-

I think we can also get rid of:

* docs/
* WAR related tasks in ANT
* src/web/
* src/xmlcatalog/
* src/engines/


 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 


Re: [Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Mattmann, Chris A (388J)

Hey Guys,

Since they are generated, +1 to:


 *   adding a filepattern to svn:ignore to ignore them
 *   updating build.xml to autogenerate

Cheers,
Chris



On 7/2/10 3:24 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:



(This question is mostly to Dogacan & Enis, but I encourage anyone familiar 
with the code to join the threads with [Nutchbase] - the sooner the better ;) ).

I'm looking at src/gora/webpage.avsc and WebPage.java & friends... presumably 
the java code was autogenerated from avsc using Gora? If so, we should put this 
autogeneration step in our build.xml. Or am I missing something?


correct. if we keep the generated java classes in svn then we probably want to 
make this task optional i.e. it would not be done as part of the build tasks OR 
we can add it to the build but remove it from svn (or better add to svn ignore 
or whatever-it-is-called).

J.


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884691#action_12884691
 ] 

Chris A. Mattmann commented on NUTCH-837:
-

Hey Julien:

How are we going to replace the Nutch webapp? 

Cheers,
Chris


[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

[
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884712#action_12884712
]

Chris A. Mattmann commented on NUTCH-837:
-

I'm not sure I agree :)

The Nutch webapp is just a set of web pages that let someone know that Search
is working. They are decent web pages, have a great look and feel and are
something I've seen nearly every newbie Nutch user I've been around leverage to
tell whether or not Nutch installed correctly.

I'm also a fan of the "let's not lose functionality on a technology upgrade
task" mantra. That is, we are reorganizing the architecture of Nutch to improve
it, not to take away functionality. We should at least support the baseline of
functionality that was present in 1.x.

That said, I'm not sure the existing webapp should be maintained in its current
form. Maybe we should take a pass at updating the webapp to work with the Nutch
2.0 architecture underneath. I'm happy to pick up a shovel and dig on that one.

Cheers,
Chris


[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884718#action_12884718
 ] 

Chris A. Mattmann commented on NUTCH-837:
-

Hey Julien,

Yep that's the point. Solr != Nutch, so Solr's Webapp can't be expected to be = 
Nutch's webapp. The example you cited about cached data is a great one, because 
Solr's webapp doesn't really support that (nor should it IMHO).

So, I think we should still have a Nutch webapp and in my mind it's a must-have 
for a 2.0 release...not to worry though I'm volunteering to help do it! :)

Cheers,
Chris


[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies


 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: NUTCH-837.patch

Updated patch against r959954 (after NUTCH-836).


[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies


 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: (was: NUTCH-837.patch)


[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884729#action_12884729
 ] 

Andrzej Bialecki  commented on NUTCH-837:
-

bq. So, I think we should still have a Nutch webapp and in my mind it's a 
must-have for a 2.0 release...

I agree. But for the moment it's better to delete the old webapp stuff that we 
know for sure doesn't work with the current Nutch, and it will be completely 
reimplemented anyway.


[jira] Created: (NUTCH-841) Nutch 2.0 webapp

Nutch 2.0 webapp


 Key: NUTCH-841
 URL: https://issues.apache.org/jira/browse/NUTCH-841
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
 Environment: Nutch 2.0
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
 Fix For: 2.0


In light of the conversation on NUTCH-837, we are removing the old Nutch webapp 
and will replace it with a 2.0 one that works with GORA + Solr. 


[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884734#action_12884734
 ] 

Julien Nioche commented on NUTCH-837:
-

:-)


[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies


[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884731#action_12884731
 ] 

Chris A. Mattmann commented on NUTCH-837:
-

Okey dok, I created NUTCH-841 to track it. Julien, Andrzej, you have my +1 to 
take your axe to the old one :)


[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies