[jira] Updated: (NUTCH-814) SegmentMerger bug

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-814:


Attachment: merger.patch

Patch fixing the issue, and a unit test. I will commit this shortly.

 SegmentMerger bug
 -

 Key: NUTCH-814
 URL: https://issues.apache.org/jira/browse/NUTCH-814
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: merger.patch


 Dennis reported:
 {quote}
 In the SegmentMerger.java file, at about line 150, we have this:
 final SequenceFile.Reader reader =
   new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job);
 Then at about line 166, in the record reader, we have this:
 boolean res = reader.next(key, w);
 If I am reading that right, that would mean that the map task would loop
 over all records for a given file and not just a given split.
 {quote}
 Right - this should instead use SequenceFileRecordReader, which already has the 
 logic to handle splits. Patch coming shortly - thanks for spotting this! This 
 could be the reason for the out-of-disk-space errors that many users reported.
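
For reference, a minimal sketch of the split-aware approach, assuming the old Hadoop "mapred" API that Nutch 1.1 uses; the surrounding class and the key/value types are illustrative, not the actual patch:

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class SplitAwareReaderSketch {
  public static void readSplit(JobConf job, FileSplit fSplit) throws IOException {
    // SequenceFileRecordReader seeks to the start of the split and returns
    // false once the split end is reached, so each map task only sees its
    // own slice of the file instead of looping over the whole file.
    SequenceFileRecordReader<Text, Writable> reader =
        new SequenceFileRecordReader<Text, Writable>(job, fSplit);
    Text key = reader.createKey();
    Writable value = reader.createValue();
    while (reader.next(key, value)) {
      // process one record within the split boundaries
    }
    reader.close();
  }
}
{code}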

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work stopped: (NUTCH-466) Flexible segment format

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-466 stopped by Andrzej Bialecki .

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: ParseFilters.java, segmentparts.patch


 In many situations it is necessary to store more data associated with pages 
 than is possible now with the current segment format. Quite often it is 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData, the other is to use an 
 external independent database using page ID-s as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.
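
To make the idea concrete, here is a minimal sketch of writing such an arbitrarily named part with plain Hadoop MapFile I/O; the part name "parse_thumbs" and the value type are illustrative only, not part of the proposal:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SegmentPartSketch {
  public static void writePart(Configuration conf, Path segment) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // the custom part lives next to content/, parse_data/, etc. in the segment
    String partDir = new Path(segment, "parse_thumbs").toString();
    MapFile.Writer writer = new MapFile.Writer(conf, fs, partDir,
        Text.class, BytesWritable.class);
    // keys and values only need to be Writable, as required above
    byte[] thumb = new byte[0]; // thumbnail bytes would go here
    writer.append(new Text("http://example.com/"), new BytesWritable(thumb));
    writer.close();
  }
}
{code}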

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-812:


Affects Version/s: 1.1
 Priority: Critical  (was: Major)

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The fix is in line 131 of Crawl.java: generate no longer returns segments 
 like it used to; now it returns segs. Line 131 needs to read
   if (segs == null)
 instead of the current
   if (segments == null)
 After that change and a recompile, crawl is working just fine.
 {quote}
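
For context, the corrected check would sit in the crawl loop roughly as below; this is a sketch paraphrased from the report, not the exact Crawl.java source:

{code:java}
// generate() returns the paths of the new segments, or null when there is
// nothing left to fetch - the old code tested the wrong variable
Path[] segs = generator.generate(crawlDb, segments, -1, topN,
    System.currentTimeMillis());
if (segs == null) {
  LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
  break;
}
{code}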

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[VOTE] Board resolution for Nutch as TLP

2010-04-12 Thread Andrzej Bialecki
Hi,

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.

[] +1.  Request the Board make Nutch a TLP
[] +0.  I don't feel strongly about it, but I'm okay with this.
[] -1.  No, don't request the Board make Nutch a TLP, and here are my
 reasons...

This is a majority count vote (i.e. no vetoes). The vote is open for 72
hours.

Here's my +1.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web search
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Lucene Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
===


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Hold on... (Re: [VOTE] Board resolution for Nutch as TLP)

2010-04-12 Thread Andrzej Bialecki
On 2010-04-12 12:57, Andrzej Bialecki wrote:
 Hi,
 
 Following the discussion, below is the text of the proposed Board
 Resolution to vote upon.

Ehh, scrap that ... I missed one occurrence of the "crawling platform" wording.
Resending...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Andrzej Bialecki
Hi,

Take two, after s/crawling/search/ ...

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.

[] +1.  Request the Board make Nutch a TLP
[] +0.  I don't feel strongly about it, but I'm okay with this.
[] -1.  No, don't request the Board make Nutch a TLP, and here are my
 reasons...

This is a majority count vote (i.e. no vetoes). The vote is open for 72
hours.

Here's my +1.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web search
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web search platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Lucene Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
===


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Andrzej Bialecki
On 2010-04-10 04:13, Mattmann, Chris A (388J) wrote:
 Hi Andrzej,
 
 +1, with the following amendment:
 

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Nutch Project are hereafter discharged.
 
 This should read:
 
 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

Good catch, thanks.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Andrzej Bialecki
On 2010-04-10 15:32, Jukka Zitting wrote:
 Hi,
 
 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.
 
 Would it make sense to simplify the scope to "... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public"?

Yes, that's a good change too.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[DISCUSS] Board resolution for Nutch as TLP

2010-04-09 Thread Andrzej Bialecki
Hi,

I was told that the next step is to come up with the proposed Board
resolution and vote it among committers. Here's the proposed text
(shameless copypaste from Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing
Nutch committers that haven't been active for more than 1 year, with the
intention of moving them to Emeritus status. If any one of these people
feels left out and would like to become an active committer in the
project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and
optionally comment on the salient points of the text; the rest is
boilerplate. If there's an overall consensus I will call for a formal
vote to submit this proposal to the Board.


==
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web crawling
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
=




-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,
 
 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what the status of the nutchbase is - it has missed a lot of
 fixes and changes in trunk since it was last touched ...

 
 I know... But I still intend to finish it, I just need to schedule
 some time for it.
 
 My vote would be to go with nutchbase.

Hmm .. this puzzles me, do you think we should port changes from 1.1 to
nutchbase? I thought we should do it the other way around, i.e. merge
nutchbase bits to trunk.


 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

 This IMHO is promising - it could open the door to small-to-medium
 installations that are currently too cumbersome to handle.

 
 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

Again, the advantage of DataNucleus is that we don't have to handcraft
all the mid- to low-level mappings, just the mid-level ones (JOQL or
whatever) - the cost of maintenance is lower, and the number of backends
that are supported out of the box is larger. Of course, this is just
IMHO - we won't know for sure until we try to use both your custom ORM
and DataNucleus...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 19:24, Enis Söztutar wrote:

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

  
 
 So, it seems that at some point, we need to bite the bullet, and
 refactor plugins, dropping backwards compatibility.

Right, that was my point - now is the time to break it, with the
cut-over to 2.0, leaving the 1.1 branch in good shape to serve well
enough in the interim period.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-06 Thread Andrzej Bialecki
On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,
 
 I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

I'm not sure what the status of the nutchbase is - it has missed a lot of
fixes and changes in trunk since it was last touched ...

 
 Talking about features, what else would we add apart from :
 
 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

This IMHO is promising - it could open the door to small-to-medium
installations that are currently too cumbersome to handle.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

Plus better handling of redirects, detection of duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 
 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

Definitely. :)

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Question: Nutch 0.8.2 and Nutch 0.7.3?

2010-04-04 Thread Andrzej Bialecki
On 2010-04-04 02:59, Mattmann, Chris A (388J) wrote:
 Hey Guys,
 
 Question. I see 2 releases that haven't been cut in JIRA:
 
 0.8.2: 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064
 
 0.7.3:
 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176
 
 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door.
 However, I have a question: is this Nutch 0.8.2 in SVN?
 
 http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/

That's the code that was intended to become 0.8.2 ...

However, I'm not sure whether there's any benefit in releasing either of
these. Those who really had the need to track this branch (or 0.7)
likely used the code from this branch even though it wasn't released.
And I believe we are not interested in maintaining a new release based
on this code...?


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Andrzej Bialecki
On 2010-04-02 16:14, Mattmann, Chris A (388J) wrote:

 * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
 OK, Nutchers?

Yes - thanks!


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851331#action_12851331
 ] 

Andrzej Bialecki  commented on NUTCH-789:
-

There are no diffs, so it's difficult to figure out what's changed ... I think 
that Tika will soon release v. 0.7, which may also impact this patch if we 
decide to upgrade before our release. I asked the Tika guys about their 
release - let's wait a couple of days more.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850896#action_12850896
 ] 

Andrzej Bialecki  commented on NUTCH-784:
-

This should have been reviewed first - I don't question the usefulness of this 
class, but I think it should have been added as an option to CrawlDbReader. As 
it is now, we get a new tool with a cryptic name that performs a variant of 
what another existing tool already does...

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know what url we want to have a 
 look at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries, e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp .+amazon.com.* and having a status of db_fetched

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850931#action_12850931
 ] 

Andrzej Bialecki  commented on NUTCH-785:
-

+1. The scoring API should allow us to set this metadata in one call, but 
changing the API now would be problematic.

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following the redirections, the Fetcher does not copy the metadata from 
 the original URL to the new one or calls the method scfilters.initialScore

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850939#action_12850939
 ] 

Andrzej Bialecki  commented on NUTCH-779:
-

In CrawlDbReducer, the cramped line {{if (metaFromParse!=null){}} needs some 
whitespace fixing.

Other than that, +1.
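
For clarity, the whitespace-fixed version of that line would simply be:

{code:java}
if (metaFromParse != null) {
{code}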

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The patch attached allows passing parse metadata to the corresponding entry 
 of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848090#action_12848090
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

I just noticed that the new Generator uses different config property names 
('generator.' vs. 'generate.'), and the older versions are now marked with 
'(Deprecated)'. However, this doesn't reflect reality - properties with the 
old names are simply ignored now, whereas 'deprecated' implies that they 
should still work. For back-compat reasons I think they should still work - 
the current (admittedly awkward) prefix is good enough, and I think that 
changing it in a minor release would create confusion. I suggest reverting to 
the old names where appropriate, and adding new properties with the same 
prefix, i.e. 'generate.'.

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate / update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the Db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can max the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for maxing the URLs and 
 for partitioning; however as we can't count the max number of URLs by IP 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value specified, e.g. if not enough URLs are available for fetching 
 and they fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848110#action_12848110
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

bq. If we want to replace the old generator altogether - which I think would be 
a good option

I think this makes sense now, since the new Generator in your latest patch is a 
strict superset of the old one. 

bq. I don't have strong feelings on whether or not to modify the prefix in a 
minor release. 

I do :) - see also here: 
http://en.wikipedia.org/wiki/Principle_of_least_astonishment

IMHO it's all about breaking or not breaking existing installs after a minor 
upgrade. I suspect most users won't be aware of the subtle change between 
'generate.' and 'generator.', especially since the command-line of the new 
Generator is compatible with the old one. So they will try to use the new 
Generator while keeping their existing configs.
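
As an illustration, the back-compat behavior argued for here could be as simple as honoring the old key first; the second property name below is only an example of a possible new name, not the actual patch:

{code:java}
// sketch: keep reading the documented legacy 'generate.' key so existing
// configs keep working, and fall back to a hypothetical new name otherwise
int maxPerHost = conf.getInt("generate.max.per.host",
    conf.getInt("generator.max.per.host", -1));
{code}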

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate / update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the Db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can max the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for maxing the URLs and 
 for partitioning; however as we can't count the max number of URLs by IP 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value specified, e.g. if not enough URLs are available for fetching 
 and they fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848173#action_12848173
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

bq. The change of prefix also reflected that we now use 2 different parameters 
to specify how to count the URLs (host or domain) and the max number of URLs. 
We can of course maintain the old parameters as well for the sake of 
compatibility, except that generate.max.per.host.by.ip won't be of much use 
anymore as we don't count per IP.

Ok.

bq. Have just noticed that 'crawl.gen.delay' is not documented in 
nutch-default.xml, and does not seem to be used outside the Generator. What is 
it supposed to be used for? 

Ah, a bit of ancient magic .. ;) This value, expressed in days, defines how 
long we should keep the lock on records in CrawlDb that were just selected for 
fetching. If these records are not updated in the meantime, the lock is 
canceled, i.e. they become eligible for selection again. The default value is 
7 days.
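
A sketch of how such a value could be read, assuming the standard Hadoop Configuration API; the exact handling inside Generator may differ:

{code:java}
// crawl.gen.delay is expressed in days (default 7, as described above);
// converted here to milliseconds for comparison against record timestamps
long days = conf.getLong("crawl.gen.delay", 7L);
long genDelayMillis = days * 24L * 60L * 60L * 1000L;
{code}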

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate / update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the Db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can max the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for maxing the URLs and 
 for partitioning; however as we can't count the max number of URLs by IP 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value specified, e.g. if not enough URLs are available for fetching 
 and they fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847291#action_12847291
 ] 

Andrzej Bialecki  commented on NUTCH-693:
-

Thanks for the pointer to the article. Indeed, the issue is muddy at best. So 
far Nutch has adhered to a strict interpretation, where links with this 
attribute are deleted from the page's outlinks immediately (so they are not 
only not followed, but also don't affect out-degree metrics). If there is 
general agreement in the Nutch community towards relaxing this behavior, we 
can develop this patch further - at the moment I don't see such support. 
Consequently, I propose to discuss it, and in the meantime to move this issue 
to a later release.

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow - 
 ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 
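
A sketch of how such a switch might gate outlink extraction; the property name comes from the description above, while the surrounding loop and the hasRelNofollow() helper are purely illustrative:

{code:java}
boolean ignoreNofollow =
    conf.getBoolean("parser.html.outlinks.ignore_nofollow", false);
// hasRelNofollow() stands in for whatever check the parser performs on the
// anchor's rel attribute; current behavior drops such links entirely
if (hasRelNofollow(anchor) && !ignoreNofollow) {
  continue; // skip this outlink, as Nutch does today
}
{code}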

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-693:


Assignee: (was: Otis Gospodnetic)

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow - 
 ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-797:
---

Assignee: Andrzej Bialecki 

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to
 +   // target so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +   // how to assemble URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +   // default URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +   // incorrectly dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception
 +   // similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws
 +  MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + String baseRightMost = "";
 + int baseRightMostIdx = basePath.lastIndexOf("/");
 + if (baseRightMostIdx != -1)
 + {
 + baseRightMost = basePath.substring(baseRightMostIdx + 1);
 + }
 +
 + if (target.startsWith("?"))
 + target

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847300#action_12847300
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

If there are no further comments I'm going to commit the current patch, with a 
TODO to revisit this code if/when it's refactored to an external dependency.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to
 +   // target so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +   // how to assemble URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +   // default URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +   // incorrectly dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception
 +   // similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target) throws
 +  MalformedURLException
 +  {
 + if (!target.startsWith("?"))
 + return new URL(base, target);
 +
 + String basePath = base.getPath();
 + String baseRightMost = "";
 + int baseRightMostIdx = basePath.lastIndexOf("/");
 + if (baseRightMostIdx != -1

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.1.

2010-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-787:


Assignee: Andrzej Bialecki 
 Summary: Upgrade Lucene to 3.0.1.  (was: Upgrade Lucene to 3.0.0.)

We're shooting for 3.0.1 now.

 Upgrade Lucene to 3.0.1.
 

 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Assignee: Andrzej Bialecki 
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-787.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-03-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847315#action_12847315
 ] 

Andrzej Bialecki  commented on NUTCH-787:
-

Using Lucene 3.0.1 artifacts I verified that your patch passes all tests and 
produces correct searchable indexes. I'll commit this shortly.

 Upgrade Lucene to 3.0.0.
 

 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-787.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-787) Upgrade Lucene to 3.0.1.

2010-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-787.
---

Resolution: Fixed

Committed. Thanks Dawid!

 Upgrade Lucene to 3.0.1.
 

 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Assignee: Andrzej Bialecki 
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-787.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-803) Upgrade Hadoop to 0.20.2

2010-03-19 Thread Andrzej Bialecki (JIRA)
Upgrade Hadoop to 0.20.2


 Key: NUTCH-803
 URL: https://issues.apache.org/jira/browse/NUTCH-803
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.1


Per subject. We are currently using 0.20.1, so there are no API changes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-803) Upgrade Hadoop to 0.20.2

2010-03-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-803.
---

Resolution: Fixed

All tests pass - committed.

 Upgrade Hadoop to 0.20.2
 

 Key: NUTCH-803
 URL: https://issues.apache.org/jira/browse/NUTCH-803
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.1


 Per subject. We are currently using 0.20.1, so there are no API changes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[DISCUSS] Nutch as a top level project (TLP)?

2010-03-19 Thread Andrzej Bialecki

Hi devs,

The ASF Board indicated recently that so-called "umbrella projects", 
i.e. projects that host many significant sub-projects, should examine 
their structure with a view towards simplification, such as merging or 
splitting out sub-projects.


Lucene TLP is such a project. Recently the Lucene PMC accepted the merge 
of Solr and Lucene core projects. Mahout project will most likely split 
to its own TLP soon. Which leaves Nutch as a sort of odd duck ;)


Moving Nutch to its own TLP has some advantages, mostly an easier 
decision process - voting on new committers and new releases then 
involves only those who participate directly in Nutch development, 
i.e. the Nutch community.


Also, from the coding point of view, Nutch is not intrinsically tied to 
Lucene development in a way that would require careful coordination 
between the two - we just use Lucene as one of many dependencies, and in 
fact we aim to cleanly separate the Nutch search API from the Lucene-based 
API. I can easily imagine Nutch dropping the low-level Lucene-based 
components completely and moving to a more general search fabric (e.g. 
SolrCloud).


Being its own TLP could also give Nutch more exposure and help to 
crystallize our mission.


There are some disadvantages to such a split, too: we would need to 
spend some more effort on various administrative tasks, maintain a 
separate web site (under Apache, but not under Lucene), and probably 
take on other tasks that I'm not yet aware of. It would also mean that 
Nutch has to stand on its own merit, which, considering the small 
number of active committers, may be challenging.


Let's discuss this, and after we collect some pros and cons I'm going to 
call for a vote.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846923#action_12846923
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

That's one option, at least until crawler-commons produces any artifacts... 
Eventually I think that this code and other related code (e.g. deciding which 
URL is canonical in the presence of redirects, url normalization and 
filtering) should end up in crawler-commons.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target)
 +      throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846927#action_12846927
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

In my experience the IP-based fetching was only (rarely) needed when there was 
a large number of urls from virtual hosts hosted at the same ISP. In other 
words, not a common case - others may have different experience depending on 
their typical crawl targets... IMHO we don't have to reimplement this.

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate -> update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for 
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and 
 for partitioning; however, as we can't count the max number of URLs by IP, 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve 
 performance, as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 <numFetchers>] [-adddays <numDays>] [-noFilter] [-noNorm] [-maxNumSegments <num>]
 where most parameters are similar to the default Generator's - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value if e.g. not enough URLs are available for fetching and they fit 
 in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reopened NUTCH-802:
-

  Assignee: Andrzej Bialecki 

Submitting a patch is not fixing; it's fixed when the patch is accepted and 
applied.

 Problems managing outlinks with large url length
 

 Key: NUTCH-802
 URL: https://issues.apache.org/jira/browse/NUTCH-802
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Pablo Aragón
Assignee: Andrzej Bialecki 
 Attachments: ParseOutputFormat.patch


 Nutch can become idle during the collection of outlinks if the URL address of 
 the outlink is too large.
 The maximum URL sizes for the main web servers are:
 * Apache: 4,000 bytes
 * Microsoft Internet Information Server (IIS): 16,384 bytes
 * Perl HTTP::Daemon: 8,000 bytes
 URL address sizes bigger than 4,000 bytes are problematic, so the limit should 
 be set in the nutch-default.xml configuration file.
 I attached a patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846932#action_12846932
 ] 

Andrzej Bialecki  commented on NUTCH-802:
-

We already have a general way to control this and other aspects of URLs, 
namely URLFilters. I agree that this functionality could be useful, but in 
the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or 
urlfilter-validator).
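
For illustration, such a filter could look like the sketch below - assuming 
the standard URLFilter contract (return the URL to keep it, null to discard 
it). The class name and property name are hypothetical; this is not the 
attached patch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical urlfilter-maxlength plugin: discards overly long URLs.
public class MaxLengthURLFilter implements URLFilter {
  private Configuration conf;
  private int maxLength;

  public String filter(String urlString) {
    // returning null tells Nutch to drop the URL
    return (urlString != null && urlString.length() > maxLength)
        ? null : urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // hypothetical property name; 4000 mirrors the Apache httpd limit below
    this.maxLength = conf.getInt("urlfilter.maxlength.limit", 4000);
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}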

 Problems managing outlinks with large url length
 

 Key: NUTCH-802
 URL: https://issues.apache.org/jira/browse/NUTCH-802
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Pablo Aragón
Assignee: Andrzej Bialecki 
 Attachments: ParseOutputFormat.patch


 Nutch can become idle during the collection of outlinks if the URL address of 
 the outlink is too large.
 The maximum URL sizes for the main web servers are:
 * Apache: 4,000 bytes
 * Microsoft Internet Information Server (IIS): 16,384 bytes
 * Perl HTTP::Daemon: 8,000 bytes
 URL address sizes bigger than 4,000 bytes are problematic, so the limit should 
 be set in the nutch-default.xml configuration file.
 I attached a patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-796.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

Patch applied in rev. 924945. Thanks for reporting it.

 Zero results problems difficult to troubleshoot due to lack of logging
 --

 Key: NUTCH-796
 URL: https://issues.apache.org/jira/browse/NUTCH-796
 Project: Nutch
  Issue Type: Improvement
  Components: searcher, web gui
Affects Versions: 1.0.0, 1.1
 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1
Reporter: Jesse Hires
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: logging.patch


 There are a few places where search can fail in a distributed environment, 
 but when configuration is not quite right, there are no indications of errors 
 or logging.
 Increased logging of failures would help troubleshoot such problems, as well 
 as reduce the "I get 0 results, why?" questions that come across the mailing 
 lists. 
 Areas where logging would be helpful:
 search app cannot locate search-servers.txt
 search app cannot find searcher node listed in search-server.txt
 search app cannot connect to port on searcher specified in search-server.txt
 searcher (bin/nutch server...) cannot find index
 searcher cannot find segments
 Access denied in any of the above scenarios.
 There are probably more that would be helpful, but I am not yet familiar 
 enough to know all the points of possible failure between the webpage and a 
 search node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847071#action_12847071
 ] 

Andrzej Bialecki  commented on NUTCH-800:
-

I'm puzzled by your problem description. Is Nutch affected by potentially 
malicious URL data? URL form encoding is just a transport encoding; it doesn't 
make a URL inherently safe (or unsafe).

 Generator builds a URL list that is not encoded
 ---

 Key: NUTCH-800
 URL: https://issues.apache.org/jira/browse/NUTCH-800
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 
 1.0.0, 1.1
Reporter: Jesse Campbell

 The URL string that is grabbed by the generator when creating the fetch list 
 does not get encoded, which could potentially allow unsafe execution, and 
 breaks reading improperly encoded URLs from the scraped pages.
 Since we a) cannot guarantee that any site we scrape is not malicious, and b) 
 likely do not have control over all content providers, we are currently 
 forced to use a regex normalizer to perform the same function as a built-in 
 java class (it would be unsafe to leave alone).
 A quick solution would be to update Generator.java to utilize the 
 java.net.URLEncoder static class:
 line 187: 
 old: String urlString = url.toString();
 new: String urlString = URLEncoder.encode(url.toString(), "UTF-8");
 line 192:
 old: u = new URL(url.toString());
 new: u = new URL(urlString);
 The use of URLEncoder.encode could also be applied at the updatedb stage.
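
For context (an aside, not part of the report): java.net.URLEncoder performs 
application/x-www-form-urlencoded encoding of a whole string, so applied to a 
complete URL it also escapes the scheme and path separators - which is why the 
comment above questions whether it is the right tool:

{code}
// URLEncoder form-encodes every reserved character, including ':' and '/',
// so the result is no longer a parseable URL:
String encoded = URLEncoder.encode("http://example.com/a b?x=1", "UTF-8");
// encoded == "http%3A%2F%2Fexample.com%2Fa+b%3Fx%3D1"
// new URL(encoded) would throw MalformedURLException ("no protocol")
{code}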

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847074#action_12847074
 ] 

Andrzej Bialecki  commented on NUTCH-693:
-

This patch is controversial in the sense that a) Nutch strives to adhere to 
Internet standards and netiquette, which say that robots should obey nofollow, 
and b) most Nutch users want a well-behaved robot. You are of course free to 
modify the source as you did. Therefore I think that this functionality is not 
applicable to the majority of Nutch users, and I vote -1 on including it in 
Nutch.

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow- 
 Ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-795) Add ability to maintain nofollow attribute in linkdb

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847075#action_12847075
 ] 

Andrzej Bialecki  commented on NUTCH-795:
-

Please see my comment to that issue. Or is there some other use case that you 
have in mind?

 Add ability to maintain nofollow attribute in linkdb
 

 Key: NUTCH-795
 URL: https://issues.apache.org/jira/browse/NUTCH-795
 Project: Nutch
  Issue Type: New Feature
  Components: linkdb
Affects Versions: 1.1
Reporter: Sammy Yu
 Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files

2010-03-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847094#action_12847094
 ] 

Andrzej Bialecki  commented on NUTCH-780:
-

Is the purpose of this issue to make Crawl.java usable via a strongly-typed 
API instead of the generic main(), e.g. something like this:

{code}
public class Crawl extends Configured {

  public int crawl(Path output, Path seedDir, int threads, int numCycles,
                   int topN, ...) {
    ...
  }
}
{code}
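
If so, a caller could then drive a crawl directly, without assembling a 
String[] for main() - hypothetical usage of the sketch above:

{code}
Configuration conf = NutchConfiguration.create();
Crawl crawl = new Crawl();
crawl.setConf(conf);
// output dir, seed dir, threads, cycles, topN - per the sketched signature
int res = crawl.crawl(new Path("crawl"), new Path("urls"), 10, 3, 50000);
{code}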

 Nutch crawler did not read configuration files
 --

 Key: NUTCH-780
 URL: https://issues.apache.org/jira/browse/NUTCH-780
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Vu Hoang
 Attachments: NUTCH-780.patch


 Nutch searcher can read properties in its constructor ...
 {code:java|title=NutchSearcher.java|borderStyle=solid}
 NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
 ... // put search engine code here
 {code}
 ... but the Nutch crawler cannot; it only reads data from its arguments.
 {code:java|title=NutchCrawler.java|borderStyle=solid}
 StringBuilder builder = new StringBuilder();
 builder.append(domainlist + SPACE);
 builder.append(ARGUMENT_CRAWL_DIR);
 builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
 builder.append(ARGUMENT_CRAWL_THREADS);
 builder.append(threads + SPACE);
 builder.append(ARGUMENT_CRAWL_DEPTH);
 builder.append(depth + SPACE);
 builder.append(ARGUMENT_CRAWL_TOPN);
 builder.append(topN + SPACE);
 Crawl.main(builder.toString().split(SPACE));
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846402#action_12846402
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

Thanks for reporting this, and for providing a patch. An updated revision of 
the standard, RFC 3986 (section 5.4.1, example 7), follows the same reasoning. 
I'll fix this shortly.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target)
 +      throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";
 +    int baseRightMostIdx = basePath.lastIndexOf("/");
 +    if (baseRightMostIdx != -1)

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846418#action_12846418
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

Hm, actually the picture is more complicated than I thought - if we apply both 
methods (fixEmbeddedParams and fixPureQueryTargets) then some of the test cases 
from the RFC fail. However, all tests succeed if we apply only 
fixPureQueryTargets!

Looking at the origin of the fixEmbeddedParams method (NUTCH-436), something 
must have been fixed in java.net.URL, because the test case mentioned in that 
issue now passes if we apply only fixPureQueryTargets. The same holds for the 
test cases in the near-duplicate issue NUTCH-566.

Consequently I'm going to remove fixEmbeddedParams. I added all the tests from 
RFC 3986 section 5.4.1, and they all pass now. I'll attach an updated patch 
shortly.
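
For reference, the pure-query case those tests exercise is example 7 in 
RFC 3986 section 5.4.1; in terms of the method names from the patch above, it 
amounts to this (a sketch):

{code}
// RFC 3986, 5.4.1: resolving "?y" against "http://a/b/c/d;p?q" must keep
// the rightmost path segment, yielding "http://a/b/c/d;p?y".
URL base = new URL("http://a/b/c/d;p?q");
URL resolved = fixPureQueryTargets(base, "?y");
assertEquals("http://a/b/c/d;p?y", resolved.toString());
{code}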

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))

[jira] Updated: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-797:


Attachment: pureQueryUrl-2.patch

Updated patch with some refactoring and unit tests. If no objections I'll 
commit this shortly.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target)
 +      throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";
 +    int baseRightMostIdx = basePath.lastIndexOf("/");
 +    if (baseRightMostIdx != -1)
 +    {
 +      baseRightMost = basePath.substring(baseRightMostIdx + 1);

[jira] Updated: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-796:


Attachment: logging.patch

I propose this patch. If there are no objections I'll commit it shortly.
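
For a flavour of the change: each failure point listed in the description 
below gets an explicit log message instead of silently returning 0 results - 
illustrative only, not the literal patch:

{code}
if (!fs.exists(serversFile)) {
  LOG.error("Can't find " + serversFile +
      " - distributed search will return 0 results." +
      " Check searcher.dir in your configuration.");
}
{code}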

 Zero results problems difficult to troubleshoot due to lack of logging
 --

 Key: NUTCH-796
 URL: https://issues.apache.org/jira/browse/NUTCH-796
 Project: Nutch
  Issue Type: Improvement
  Components: searcher, web gui
Affects Versions: 1.0.0, 1.1
 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1
Reporter: Jesse Hires
 Attachments: logging.patch


 There are a few places where search can fail in a distributed environment, 
 but when configuration is not quite right, there are no indications of errors 
 or logging.
 Increased logging of failures would help troubleshoot such problems, as well 
 as reduce the "I get 0 results, why?" questions that come across the mailing 
 lists. 
 Areas where logging would be helpful:
 search app cannot locate search-servers.txt
 search app cannot find searcher node listed in search-server.txt
 search app cannot connect to port on searcher specified in search-server.txt
 searcher (bin/nutch server...) cannot find index
 searcher cannot find segments
 Access denied in any of the above scenarios.
 There are probably more that would be helpful, but I am not yet familiar 
 enough to know all the points of possible failure between the webpage and a 
 search node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846428#action_12846428
 ] 

Andrzej Bialecki  commented on NUTCH-787:
-

Lucene 3.0.1 is out now... I'll test this patch with the 3.0.1 artifacts and 
report back.

 Upgrade Lucene to 3.0.0.
 

 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-787.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846437#action_12846437
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

Unfortunately, the way your fix was applied there it is not reusable (a 
private method in HtmlParser... ugh :( ). So for the time being I think we'll 
go with our utility class... which we should really move to crawler-commons 
anyway!

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target)
 +      throws MalformedURLException
 +  {
 +    if (!target.startsWith("?"))
 +      return new URL(base, target);
 +
 +    String basePath = base.getPath();
 +    String baseRightMost = "";
 +    int baseRightMostIdx = basePath.lastIndexOf("/");

[jira] Assigned: (NUTCH-774) Retry interval in crawl date is set to 0

2010-03-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-774:
---

Assignee: Andrzej Bialecki 

 Retry interval in crawl date is set to 0
 

 Key: NUTCH-774
 URL: https://issues.apache.org/jira/browse/NUTCH-774
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-774.patch, NUTCH-774_2.patch


 When I fetch and parse a feed with the feed plugin,
 http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
 another crawl datum is generated for
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
 and after fetching a second round
 the dump of the crawl db still shows a retry interval with value 0:
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Wed Dec 02 12:48:22 CET 2009
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 0 seconds (0 days)
 Score: 1.084
 Signature: db9ab2193924cd2d0b53113a500ca604
 Metadata: _pst_: success(1), lastModified=0
 a check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in 
 the setFetchSchedule method
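
A minimal sketch of such a check, e.g. overriding setFetchSchedule in 
DefaultFetchSchedule - assuming the 1.x FetchSchedule signature and 
AbstractFetchSchedule's defaultInterval field (the attached patches may 
differ in detail):

{code}
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime,
    long fetchTime, long modifiedTime, int state) {
  // guard against entries created with a zero retry interval
  if (datum.getFetchInterval() <= 0) {
    datum.setFetchInterval(defaultInterval); // db.fetch.interval.default
  }
  return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
      fetchTime, modifiedTime, state);
}
{code}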

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846527#action_12846527
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

A few issues with this:

* does this mean that the fixes would be applied to links found in other 
content types as well, not just html (the fixup code in TIKA-287 is located in 
HtmlParser)?

* we need this also in other places, e.g. in the redirection handling code 
(both meta-refresh, javascript location.href and protocol-level redirect)

* for a while we still need this in the parse-html plugin that does not use 
Tika.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Priority: Minor
 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect. Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +    // handle params that are embedded into the base url - move them to
 +    // target so the URL class constructs the new url properly
 +    if (base.toString().indexOf(';') > 0)
 +      return fixEmbeddedParams(base, target);
 +
 +    // handle the case that there is a target that is a pure query.
 +    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on
 +    // how to assemble URLs, but I've seen this in numerous places, for
 +    // example at
 +    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by
 +    // default URL constructs the base+target combo as
 +    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1,
 +    // incorrectly dropping the Search.aspx target.
 +    //
 +    // Browsers handle these just fine; they must have an exception
 +    // similar to this.
 +    if (target.startsWith("?"))
 +    {
 +      return fixPureQueryTargets(base, target);
 +    }
 +
 +    return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target)
 +      throws MalformedURLException

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846133#action_12846133
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

It appears this class is not a strict superset - the generate.update.crawldb 
functionality is not there. This is a regression in useful functionality, so I 
think it needs to be added back.

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate -> update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for 
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and 
 for partitioning; however, as we can't count the max number of URLs by IP, 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve 
 performance, as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 <numFetchers>] [-adddays <numDays>] [-noFilter] [-noNorm] [-maxNumSegments <num>]
 where most parameters are similar to the default Generator's - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value if e.g. not enough URLs are available for fetching and they fit 
 in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846174#action_12846174
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

For users generating just one segment at a time this is an unexpected loss of 
flexibility. You can't run this version of the Generator twice without first 
completing _both_ fetching & updating of all segments from the previous run - 
because some of the same urls would be generated in the next round. The point 
of generate.update.crawldb is to be able to freely interleave generate/update 
steps.

E.g. the following scenario breaks in a non-obvious way:

* generate 10 segments
* fetch & update 8 of them
* realize you need more rounds due to e.g. gone pages
* generate 10 additional segments

..kaboom! Now the new segments partially overlap with the 2 unfetched segments 
from the previous generation, and you are going to fetch some urls twice. 

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate -> update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the db only once 
 for several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for 
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and 
 for partitioning; however, as we can't count the max number of URLs by IP, 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve 
 performance, as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via: nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
 <numFetchers>] [-adddays <numDays>] [-noFilter] [-noNorm] [-maxNumSegments <num>]
 where most parameters are similar to the default Generator's - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value if e.g. not enough URLs are available for fetching and they fit 
 in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4

2010-03-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843555#action_12843555
 ] 

Andrzej Bialecki  commented on NUTCH-798:
-

+1, preferably before the 1.1 freeze so that we can test it.

 Upgrade to SOLR1.4
 --

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


 in particular SOLR 1.4 has a StreamingUpdateSolrServer which would simplify 
 the way we buffer the docs before sending them to the SOLR instance.
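
For example, the SolrJ 1.4 client buffers and flushes documents itself - a 
sketch with a placeholder URL and queue sizes, where doc is a 
SolrInputDocument:

{code}
// queues up to 1000 docs and sends them from 4 background threads
SolrServer solr =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
solr.add(doc);   // returns quickly; batching is handled internally
solr.commit();   // once, when the whole job is done
{code}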

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843587#action_12843587
 ] 

Andrzej Bialecki  commented on NUTCH-801:
-

Definitely +1; the only reason they lingered so long was the lack of a 
suitable replacement.

 Remove RTF and MP3 parse plugins
 

 Key: NUTCH-801
 URL: https://issues.apache.org/jira/browse/NUTCH-801
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Fix For: 1.1


 *Parse-rtf* and *parse-mp3* are not built by default due to licensing 
 issues. Since we now have *parse-tika* to handle these formats I would be in 
 favour of removing these 2 plugins altogether to keep things nice and simple. 
 The other plugins will probably be phased out only after the release of 1.1  
 when parse-tika will have been tested a lot more.
 Any reasons not to?
 Julien

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: 1.1 release?

2010-03-09 Thread Andrzej Bialecki

On 2010-03-09 18:17, Julien Nioche wrote:

> Hi Chris,
>
> Excellent idea! There have been quite a few changes since 1.0 and it's
> probably the right time to have a new release.


+1. Let's just check JIRA and make sure we didn't forget anything 
important ...




> Not really a blocker but https://issues.apache.org/jira/browse/NUTCH-762
> would be nice to have in 1.1, just needs a bit of reviewing / testing I
> suppose. Otherwise this can wait until after 1.1


I'll try to test it before the weekend.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841790#action_12841790
 ] 

Andrzej Bialecki  commented on NUTCH-799:
-

I think it's ok to do it this way - the commit per reducer may actually be 
harmful if the commit succeeds but the task is killed for some reason and re-run.

Note: the patch has some formatting errors.

 SOLRIndexer to commit once all reducers have finished
 -

 Key: NUTCH-799
 URL: https://issues.apache.org/jira/browse/NUTCH-799
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-799.patch


 What about doing only one SOLR commit after the MR job has finished in 
 SOLRIndexer, instead of doing it at the end of every Reducer? 
 I ran into timeout exceptions in some of my reducers and I suspect that this 
 was due to the fact that other reducers had already finished and called 
 commit. 
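 To make the proposal concrete, here is a rough sketch of the idea (not the 
 attached patch; the Solr URL is hypothetical) - the job driver commits 
 exactly once after the MapReduce job returns, instead of each reducer 
 committing:
 {code}
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

 public class CommitOnceSketch {
   public static void main(String[] args) throws Exception {
     // ... run the indexing MapReduce job to completion first ...
     // then issue a single commit from the client, after all reducers are done:
     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
     server.commit();
   }
 }
 {code}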

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832250#action_12832250
 ] 

Andrzej Bialecki  commented on NUTCH-766:
-

+1 to commit this - please remember to update nutch-default.xml to switch to 
the tika plugin, perhaps add a comment about the deprecated parse-* plugins - 
most people look here and not in the parse-plugins, where this change is 
documented...

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins, which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress; your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities, we only 
 need to put tika-core at the main lib level, whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling; this also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are no longer limited to HTML 
 documents, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com
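 For readers unfamiliar with Tika, here is a minimal, self-contained sketch of 
 the kind of call the plugin builds on (not the plugin code itself; this 
 assumes a Tika version that provides ParseContext):
 {code}
 import java.io.ByteArrayInputStream;
 import java.io.InputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.sax.BodyContentHandler;

 public class TikaParseSketch {
   public static void main(String[] args) throws Exception {
     InputStream in = new ByteArrayInputStream(
         "<html><body>hello</body></html>".getBytes("UTF-8"));
     Metadata md = new Metadata();
     // collects the XHTML SAX events emitted by the parser as plain text
     BodyContentHandler handler = new BodyContentHandler();
     new AutoDetectParser().parse(in, handler, md, new ParseContext());
     System.out.println(md.get(Metadata.CONTENT_TYPE)); // detected mime type
     System.out.println(handler.toString());            // extracted text
   }
 }
 {code}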

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830065#action_12830065
 ] 

Andrzej Bialecki  commented on NUTCH-673:
-

+1 on both counts. Upgrade to Lucene 3.0 may involve more work than expected 
because of deprecated 2.x APIs that are no longer available in 3.0.

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.1


 Version 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before the 1.0 release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must now be used, but this 
 is also required by Hadoop 0.19, so it wouldn't be the only reason for the 
 switch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-775) Enhance Searcher interface

2010-01-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806031#action_12806031
 ] 

Andrzej Bialecki  commented on NUTCH-775:
-

IMHO this could go as it is ... one suggestion though: this Query/QueryContext 
now resembles SolrQuery/SolrParams. Perhaps we could rename QueryContext to 
QueryParams?

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.
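 To see what such a call might look like from the caller's side, here is a 
 purely illustrative fragment (it cannot compile against any released Nutch, 
 since the proposed method does not exist yet, and the context keys are made 
 up):
 {code}
 // Hypothetical usage of the proposed search(Query, Metadata) method.
 Metadata context = new Metadata();
 context.add("numHits", "10");       // illustrative key names, not an agreed API
 context.add("dedupField", "site");
 context.add("sortField", "date");
 Hits hits = searcher.search(query, context);
 {code}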

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804558#action_12804558
 ] 

Andrzej Bialecki  commented on NUTCH-766:
-

I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent 
deprecation note, but I feel equally strongly that we should not prolong their 
life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. 
We simply don't have the resources to maintain so many duplicate plugins; 
instead we should direct our efforts to improving those in Tika.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins, which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress; your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities, we only 
 need to put tika-core at the main lib level, whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling; this also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are no longer limited to HTML 
 documents, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175
 ] 

Andrzej Bialecki  commented on NUTCH-779:
-

Personally I would use ScoringFilters because I'm familiar with the API, but 
the approach that you propose is certainly more user friendly especially for 
novice users.

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
 Attachments: NUTCH-779


 The patch attached allows passing parse metadata to the corresponding entry 
 of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801875#action_12801875
 ] 

Andrzej Bialecki  commented on NUTCH-779:
-

You can already achieve this with ScoringFilters, although it requires using 
three methods instead ... I would also rename the status to parse_meta; it's 
less cryptic this way. The property needs some documentation in 
nutch-default.xml, plus a sensible default.

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
 Attachments: NUTCH-779


 The patch attached allows passing parse metadata to the corresponding entry 
 of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-655) Injecting Crawl metadata

2010-01-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797013#action_12797013
 ] 

Andrzej Bialecki  commented on NUTCH-655:
-

I'm not sure about the latest addition (the score option). If we go this route, 
then I suggest taking the last minor step and recognizing reserved metadata keys 
to do other useful things as well, like setting the fetch interval. I.e. define 
and recognize nutch.score and nutch.fetchInterval, and document them properly 
somewhere ... (wiki? javadoc? cmd-line synopsis?).
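For illustration, a seed line using such reserved keys might look like this 
(tab-separated, following the NUTCH-655 input format; the URL and values are 
made up):
{noformat}
http://www.example.com/  \t  nutch.score=2.5  \t  nutch.fetchInterval=2592000
{noformat}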

 Injecting Crawl metadata
 

 Key: NUTCH-655
 URL: https://issues.apache.org/jira/browse/NUTCH-655
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
 Fix For: 1.1

 Attachments: Injector.patch, NUTCH-655.v2


 the patch attached allows injecting metadata into the crawlDB. The input file 
 has to contain fields separated by tabs, with the URL being in the first 
 column. The metadata names and values are separated by '='. An input line 
 might look like this:
 http://www.myurl.com  \t  categ=value1 \t categ2=value2
 This functionality can be useful to store external knowledge and index it 
 with a custom plugin

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791979#action_12791979
 ] 

Andrzej Bialecki  commented on NUTCH-666:
-

Do you think it was related to the quality of language models that you built 
(presumably the ones in the patch?) versus the ones in the Nutch plugin, or due 
to a different classification algorithm? I'm trying to understand the source of 
such a big difference, because AFAIK the algorithm in textcat is essentially 
the same as the one we use.

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-775) Enhance Searcher interface

2009-12-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791411#action_12791411
 ] 

Andrzej Bialecki  commented on NUTCH-775:
-

+1. I would suggest creating a subclass of Metadata, where we can guarantee the 
presence of some required parameters, e.g.:

{code}
public class SearchContext extends Metadata {
  protected int numHits;
  protected String sortField;
  protected String dedupField;
  ...
  // setters and getters for the above
}
{code}

and change the QueryFilter interface to use SearchContext too.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790225#action_12790225
 ] 

Andrzej Bialecki  commented on NUTCH-666:
-

Dennis, what's the status of this patch (especially the missing part, the new 
language identifier)?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: State of nutchbase

2009-12-06 Thread Andrzej Bialecki

Alban Mouton wrote:

Hello,

I have looked a little into the nutch code and mailing lists. I think the 
nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is 
very interesting, with a good potential to improve code clarity and 
flexibility (I find the data structures quite obscure in the current version). 
The issue has been untouched since last August, so my question is: can 
nutchbase really be part of nutch 1.1? 


Definitely no. Release 1.1 will be an update to 1.0, with no major 
design changes. However, we intend to integrate the nutchbase branch 
with trunk at some point - but since this would be a major change it 
would come under 2.0 branch or so ...



Is there still much work to do 
or is it almost ready ? Is it a worthy issue for an interested developer 
with a (still !) limited knowledge of the project ?


Please contact Dogacan, who is leading the work on this branch. AFAIK 
he's going to update the design soon.




So far I have only tried to run nutchbase in eclipse by applying the 
tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0) but I ran 
into errors when building, mostly from Parser and tests. I may start by 
cleaning this up.


See above - please coordinate with Dogacan to avoid duplication of effort.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-767:


Remaining Estimate: 0h
 Original Estimate: 0h

I applied the patch, and I'm closing this issue - we will track the test 
failures when we upgrade to Tika 0.6, which is imminent.

 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767-part2.patch, NUTCH-767.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 The version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place the 
 tika-core.jar in the main nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reopened NUTCH-767:
-


 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767.patch


 The version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place the 
 tika-core.jar in the main nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-02 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784790#action_12784790
 ] 

Andrzej Bialecki  commented on NUTCH-767:
-

Reopening this issue, because TestContent is failing now - after fixing a 
trivial compilation problem, the problem now seems to be that the type for 
empty content is auto-detected as text/plain, and this value overrides the 
hint from the Content-Type header.

 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767.patch


 The version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place the 
 tika-core.jar in the main nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-12-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784206#action_12784206
 ] 

Andrzej Bialecki  commented on NUTCH-768:
-

+1.

Minor nit: file lib/hsqldb-1.8.0.10.LICENSE.txt uses Windows EOL style, this 
should be probably corrected before commit.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-770) Timebomb for Fetcher

2009-12-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784250#action_12784250
 ] 

Andrzej Bialecki  commented on NUTCH-770:
-

Fixed in rev. 885776. Thank you!

 Timebomb for Fetcher
 

 Key: NUTCH-770
 URL: https://issues.apache.org/jira/browse/NUTCH-770
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, 
 NUTCH-770.patch


 This patch provides the Fetcher with a timebomb mechanism. By default the 
 timebomb is not activated; it can be set using the parameter 
 fetcher.timebomb.mins. The number of minutes is relative to the start of the 
 Fetch job. When that number of minutes is reached, the QueueFeeder skips all 
 remaining entries and all active queues are purged. This helps keep the 
 Fetch step under control and works well in combination with NUTCH-769
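 The mechanics can be sketched roughly as follows (a schematic illustration of 
 the pattern only, not the attached patch; all names except the property are 
 invented):
 {code}
 // Schematic sketch of a fetch time limit; not the actual NUTCH-770 patch.
 public class TimeLimitSketch {
   public static void main(String[] args) throws InterruptedException {
     int timeLimitMins = 1;              // would come from fetcher.timebomb.mins
     long deadline = (timeLimitMins > 0)
         ? System.currentTimeMillis() + timeLimitMins * 60L * 1000L
         : Long.MAX_VALUE;               // disabled by default
     while (true) {                      // stands in for the fetcher's main loop
       if (System.currentTimeMillis() >= deadline) {
         // here the real patch makes the QueueFeeder skip remaining entries
         // and purges all active fetch queues
         break;
       }
       Thread.sleep(100);                // stands in for fetching work
     }
   }
 }
 {code}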

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-770) Timebomb for Fetcher

2009-12-01 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-770.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Timebomb for Fetcher
 

 Key: NUTCH-770
 URL: https://issues.apache.org/jira/browse/NUTCH-770
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, 
 NUTCH-770.patch


 This patch provides the Fetcher with a timebomb mechanism. By default the 
 timebomb is not activated; it can be set using the parameter 
 fetcher.timebomb.mins. The number of minutes is relative to the start of the 
 Fetch job. When that number of minutes is reached, the QueueFeeder skips all 
 remaining entries and all active queues are purged. This helps keep the 
 Fetch step under control and works well in combination with NUTCH-769

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-12-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784260#action_12784260
 ] 

Andrzej Bialecki  commented on NUTCH-769:
-

I had to apply this patch by hand, due to NUTCH-770. I also added 
conf/nutch-default.xml documentation. This was committed in rev. 885785 - 
thanks!

 Fetcher to skip queues for URLS getting repeated exceptions  
 -

 Key: NUTCH-769
 URL: https://issues.apache.org/jira/browse/NUTCH-769
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-769-2.patch, NUTCH-769.patch


 As discussed on the mailing list (see 
 http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this 
 patch allows clearing URL queues in the Fetcher when more than a set number 
 of exceptions have been encountered in a row. This can speed up the fetching 
 substantially in cases where target hosts are not responsive (as a 
 TimeoutException would be thrown) and limits cases where a whole Fetch step 
 is slowed down because of a few queues.
 By default the parameter fetcher.max.exceptions.per.queue has a value of -1, 
 i.e. the feature is deactivated.  
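 The idea can be sketched schematically like this (not the attached patch; the 
 class and field names are invented):
 {code}
 // Schematic sketch of the NUTCH-769 idea; not the actual patch.
 import java.util.LinkedList;
 import java.util.Queue;

 public class FetchQueueSketch {
   private final Queue<String> urls = new LinkedList<String>();
   private int consecutiveExceptions = 0;
   private final int maxExceptions;   // from fetcher.max.exceptions.per.queue

   public FetchQueueSketch(int maxExceptions) { this.maxExceptions = maxExceptions; }

   public void recordSuccess() { consecutiveExceptions = 0; }

   public void recordException() {
     consecutiveExceptions++;
     // -1 disables the check, matching the documented default
     if (maxExceptions >= 0 && consecutiveExceptions > maxExceptions) {
       urls.clear();   // drop the rest of this host's queue
     }
   }
 }
 {code}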

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions

2009-12-01 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-769.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Fetcher to skip queues for URLS getting repeated exceptions  
 -

 Key: NUTCH-769
 URL: https://issues.apache.org/jira/browse/NUTCH-769
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-769-2.patch, NUTCH-769.patch


 As discussed on the mailing list (see 
 http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this 
 patch allows clearing URL queues in the Fetcher when more than a set number 
 of exceptions have been encountered in a row. This can speed up the fetching 
 substantially in cases where target hosts are not responsive (as a 
 TimeoutException would be thrown) and limits cases where a whole Fetch step 
 is slowed down because of a few queues.
 By default the parameter fetcher.max.exceptions.per.queue has a value of -1, 
 i.e. the feature is deactivated.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-01 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-767.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki   (was: Chris A. Mattmann)

 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767.patch


 The version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place the 
 tika-core.jar in the main nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2009-12-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784337#action_12784337
 ] 

Andrzej Bialecki  commented on NUTCH-767:
-

Fixed in rev. 885869. Thank you!

 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767.patch


 The version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place the 
 tika-core.jar in the main nutch lib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-770) Timebomb for Fetcher

2009-11-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783638#action_12783638
 ] 

Andrzej Bialecki  commented on NUTCH-770:
-

bq.   time limit is definitely better than timebomb (but not as amusing). 

:) let's go for the informative and less confusing name now ... Could you 
please also add the nutch-default.xml property and its documentation?

Re: FetchQueues - ok, you have a point here.

Re: code style - yes.

 Timebomb for Fetcher
 

 Key: NUTCH-770
 URL: https://issues.apache.org/jira/browse/NUTCH-770
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
 Attachments: log-770, NUTCH-770.patch


 This patch provides the Fetcher with a timebomb mechanism. By default the 
 timebomb is not activated; it can be set using the parameter 
 fetcher.timebomb.mins. The number of minutes is relative to the start of the 
 Fetch job. When that number of minutes is reached, the QueueFeeder skips all 
 remaining entries and all active queues are purged. This helps keep the 
 Fetch step under control and works well in combination with NUTCH-769

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: wrong wiki front page

2009-11-30 Thread Andrzej Bialecki

Alban Mouton wrote:
No reaction? Isn't the Wiki admin on this mailing list? I don't see 
any link on the Wiki to contact the admin.


The French front page is still the generic MoinMoin wiki home page, and 
that can make a bad impression on newcomers!


We have little control over the MoinMoin config (AFAIK it's configured 
for multiple projects), and what you noticed is probably a fallout of 
the recent wiki upgrade - please create a JIRA issue here: 
https://issues.apache.org/jira/browse/INFRA (don't forget to mention the 
project name).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-770) Timebomb for Fetcher

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783283#action_12783283
 ] 

Andrzej Bialecki  commented on NUTCH-770:
-

I propose to change the name of this functionality - timebomb is not 
self-explanatory, and it suggests that if you misbehave then your cluster may 
explode ;) Instead I would use time limit, rename all vars and methods to 
follow this naming, and document it properly in nutch-default.xml.

A few comments to the patch:

* it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy 
to resolve; see also the next point.

* why change the code in FetchQueues at all? The time limit is a global condition; 
we could just break the main loop in run() and ignore the QueueFeeder (or not 
start it at all if the time limit has already passed when run() starts).

* the patch does not follow the code style (notably whitespace in for/while 
loops and assignments).

 Timebomb for Fetcher
 

 Key: NUTCH-770
 URL: https://issues.apache.org/jira/browse/NUTCH-770
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
 Attachments: log-770, NUTCH-770.patch


 This patch provides the Fetcher with a timebomb mechanism. By default the 
 timebomb is not activated; it can be set using the parameter 
 fetcher.timebomb.mins. The number of minutes is relative to the start of the 
 Fetch job. When that number of minutes is reached, the QueueFeeder skips all 
 remaining entries and all active queues are purged. This helps keep the 
 Fetch step under control and works well in combination with NUTCH-769

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-746.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing 
 resource leak in the container.
 

 Key: NUTCH-746
 URL: https://issues.apache.org/jira/browse/NUTCH-746
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or 
 Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
Reporter: Kirby Bohling
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-746.patch


 NutchBeanConstructor is not cleaning up upon application shutdown 
 (contextDestroyed()).   It leaves open the SegmentUpdater, and potentially 
 other resources.  This prevents the WebApp's classloader from being GC'ed in 
 Tomcat, which after repeated restarts leads to a PermGen error.
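 The usual shape of such a fix is a listener that releases resources on 
 shutdown; a minimal sketch only (assuming the bean exposes some close/stop 
 method, which is what the attached patch would hook up):
 {code}
 // Minimal sketch of cleanup on webapp shutdown; not the attached patch.
 import javax.servlet.ServletContextEvent;
 import javax.servlet.ServletContextListener;

 public class CleanupListenerSketch implements ServletContextListener {
   public void contextInitialized(ServletContextEvent sce) {
     // create the NutchBean and store it in the ServletContext here
   }
   public void contextDestroyed(ServletContextEvent sce) {
     // look up the bean stored at init time and release its resources,
     // e.g. close the SegmentUpdater; this lets Tomcat GC the webapp classloader
   }
 }
 {code}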

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783287#action_12783287
 ] 

Andrzej Bialecki  commented on NUTCH-746:
-

Fixed in rev. 885148. Thanks!

 NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing 
 resource leak in the container.
 

 Key: NUTCH-746
 URL: https://issues.apache.org/jira/browse/NUTCH-746
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or 
 Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode)
Reporter: Kirby Bohling
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-746.patch


 NutchBeanConstructor is not cleaning up upon application shutdown 
 (contextDestroyed()).   It leaves open the SegmentUpdater, and potentially 
 other resources.  This prevents the WebApp's classloader from being GC'ed in 
 Tomcat, which after repeated restarts leads to a PermGen error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-738.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 Close SegmentUpdater when FetchedSegments is closed
 ---

 Key: NUTCH-738
 URL: https://issues.apache.org/jira/browse/NUTCH-738
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Martina Koch
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: FetchedSegments.patch, NUTCH-738.patch


 Currently FetchedSegments starts a SegmentUpdater, but never closes it when 
 FetchedSegments is closed.
 (The problem was described in this mailing: 
 http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-739.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 SolrDeleteDuplications too slow when using hadoop
 -

 Key: NUTCH-739
 URL: https://issues.apache.org/jira/browse/NUTCH-739
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: hadoop cluster with 3 nodes
 Map Task Capacity: 6
 Reduce Task Capacity: 6
 Indexer: one instance of solr server (on the one of slave nodes)
Reporter: Dmitry Lihachev
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch


 In my environment I always have many warnings like this on the dedup step
 {noformat}
 Task attempt_200905270022_0212_r_03_0 failed to report status for 600 
 seconds. Killing!
 {noformat}
 solr logs:
 {noformat}
 INFO: [] webapp=/solr path=/update 
 params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2}
  status=0 QTime=173741
 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor 
 finish
 INFO: {optimize=} 0 173599
 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/update 
 params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2}
  status=0 QTime=173599
 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
 INFO: Closing Searcher@2ad9ac58 main
 May 27, 2009 10:29:27 AM 
 org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
 WARNING: Could not getStatistics on info bean 
 org.apache.solr.search.SolrIndexSearcher
 org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
 
 {noformat}
 So I think the problem is in the piece of code on line 301 of 
 SolrDeleteDuplications ( solr.optimize() ), because we have a few job tasks, 
 each of which tries to optimize the solr indexes before closing.
 The simplest way to avoid this bug is to remove this line and send an 
 <optimize/> message directly to the solr server after the dedup step
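 In the spirit of this suggestion (and mirroring the single-commit idea from 
 NUTCH-799), the client could trigger one optimize after the dedup job 
 completes; a sketch with a hypothetical Solr URL:
 {code}
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

 public class OptimizeOnceSketch {
   public static void main(String[] args) throws Exception {
     // ... run the dedup MapReduce job to completion first ...
     // then send a single <optimize/> from the client instead of one per task:
     SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
     server.optimize();
   }
 }
 {code}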

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783290#action_12783290
 ] 

Andrzej Bialecki  commented on NUTCH-739:
-

Fixed in rev. 885152. Thank you!

 SolrDeleteDuplications too slow when using hadoop
 -

 Key: NUTCH-739
 URL: https://issues.apache.org/jira/browse/NUTCH-739
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: hadoop cluster with 3 nodes
 Map Task Capacity: 6
 Reduce Task Capacity: 6
 Indexer: one instance of solr server (on the one of slave nodes)
Reporter: Dmitry Lihachev
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch


 In my environment I always have many warnings like this on the dedup step
 {noformat}
 Task attempt_200905270022_0212_r_03_0 failed to report status for 600 
 seconds. Killing!
 {noformat}
 solr logs:
 {noformat}
 INFO: [] webapp=/solr path=/update 
 params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2}
  status=0 QTime=173741
 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor 
 finish
 INFO: {optimize=} 0 173599
 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/update 
 params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2}
  status=0 QTime=173599
 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
 INFO: Closing Searcher@2ad9ac58 main
 May 27, 2009 10:29:27 AM 
 org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
 WARNING: Could not getStatistics on info bean 
 org.apache.solr.search.SolrIndexSearcher
 org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
 
 {noformat}
 So I think the problem is in the piece of code on line 301 of 
 SolrDeleteDuplications ( solr.optimize() ), because we have a few job tasks, 
 each of which tries to optimize the solr indexes before closing.
 The simplest way to avoid this bug is to remove this line and send an 
 <optimize/> message directly to the solr server after the dedup step

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-755) DomainURLFilter crashes on malformed URL

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-755.
---

Resolution: Cannot Reproduce
  Assignee: Andrzej Bialecki 

 DomainURLFilter crashes on malformed URL
 

 Key: NUTCH-755
 URL: https://issues.apache.org/jira/browse/NUTCH-755
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Tomcat 6.0.14
 Java 1.6.0_14
 Linux
Reporter: Mike Baranczak
Assignee: Andrzej Bialecki 

 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply 
 filter on url: http:/comments.php
 java.lang.NullPointerException
 at 
 org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173)
 at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
 at 
 org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200)
 at 
 org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113)
 at 
 org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
 at 
 org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
 at 
 org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
 Expected behavior would be to recognize the URL as malformed, and reject it.
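 The expected behavior can be illustrated with a defensive check; this is a 
 sketch only, borrowing the URLFilter convention that returning null rejects a 
 URL:
 {code}
 // Sketch of rejecting malformed URLs up front; not the DomainURLFilter source.
 import java.net.MalformedURLException;
 import java.net.URL;

 public class MalformedUrlGuard {
   /** Returns the url if it is well-formed, or null to reject it. */
   public static String filter(String urlString) {
     try {
       URL url = new URL(urlString);
       String host = url.getHost();
       if (host == null || host.length() == 0) {
         return null;  // e.g. "http:/comments.php" parses but has no host
       }
       return urlString;
     } catch (MalformedURLException e) {
       return null;    // reject instead of failing later with an NPE
     }
   }
 }
 {code}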

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783299#action_12783299
 ] 

Andrzej Bialecki  commented on NUTCH-755:
-

I could not verify that the filter indeed crashes - it simply prints the 
exception and then returns null, as you suggested.

 DomainURLFilter crashes on malformed URL
 

 Key: NUTCH-755
 URL: https://issues.apache.org/jira/browse/NUTCH-755
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Tomcat 6.0.14
 Java 1.6.0_14
 Linux
Reporter: Mike Baranczak
Assignee: Andrzej Bialecki 

 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply 
 filter on url: http:/comments.php
 java.lang.NullPointerException
 at 
 org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173)
 at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
 at 
 org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200)
 at 
 org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113)
 at 
 org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
 at 
 org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
 at 
 org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
 Expected behavior would be to recognize the URL as malformed, and reject it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783302#action_12783302
 ] 

Andrzej Bialecki  commented on NUTCH-692:
-

We should review this issue after the upgrade to Hadoop 0.20 - task output mgmt 
differs there, and the problem may be nonexistent.

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-692.patch


 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list on similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do if the problems persist with 0.19
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and I will confirm this issue  
 J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783304#action_12783304
 ] 

Andrzej Bialecki  commented on NUTCH-741:
-

Fixed in rev. 885156. Thank you!

 Job file includes multiple copies of nutch config files.
 

 Key: NUTCH-741
 URL: https://issues.apache.org/jira/browse/NUTCH-741
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Kirby Bohling
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: removeJobDupConf.diff


 From a clean checkout, running ant tar will create a .job file.  The .job 
 file includes two copies of the nutch-site.xml and nutch-default.xml file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-741) Job file includes multiple copies of nutch config files.

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-741.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Job file includes multiple copies of nutch config files.
 

 Key: NUTCH-741
 URL: https://issues.apache.org/jira/browse/NUTCH-741
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Kirby Bohling
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: removeJobDupConf.diff


 From a clean checkout, running ant tar will create a .job file.  The .job 
 file includes two copies of the nutch-site.xml and nutch-default.xml file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-11-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-712.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 ParseOutputFormat should catch java.net.MalformedURLException coming from 
 normalizers
 -

 Key: NUTCH-712
 URL: https://issues.apache.org/jira/browse/NUTCH-712
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: ParseOutputFormat-NUTCH712v2.patch


 ParseOutputFormat should catch java.net.MalformedURLException coming from 
 normalizers; otherwise the whole parsing step crashes instead of simply 
 ignoring dodgy outlinks
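 The shape of the fix can be sketched as follows (schematic only; a stand-in 
 interface replaces Nutch's URLNormalizers, whose normalize method does 
 declare MalformedURLException):
 {code}
 // Schematic sketch of guarding normalization per outlink; not the attached patch.
 import java.net.MalformedURLException;

 public class OutlinkGuardSketch {
   interface Normalizer {   // stand-in for Nutch's URLNormalizers
     String normalize(String url, String scope) throws MalformedURLException;
   }

   static void writeOutlinks(String[] outlinks, Normalizer normalizers) {
     for (String toUrl : outlinks) {
       try {
         toUrl = normalizers.normalize(toUrl, "outlink");
       } catch (MalformedURLException e) {
         continue;          // skip the dodgy outlink instead of crashing the step
       }
       // ... filter and collect toUrl here ...
     }
   }
 }
 {code}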

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers

2009-11-28 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783306#action_12783306
 ] 

Andrzej Bialecki  commented on NUTCH-712:
-

Fixed in rev. 885159. Thank you!

 ParseOutputFormat should catch java.net.MalformedURLException coming from 
 normalizers
 -

 Key: NUTCH-712
 URL: https://issues.apache.org/jira/browse/NUTCH-712
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: ParseOutputFormat-NUTCH712v2.patch


 ParseOutputFormat should catch java.net.MalformedURLException coming from 
 normalizers; otherwise the whole parsing step crashes instead of simply 
 ignoring dodgy outlinks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1

2009-11-25 Thread Andrzej Bialecki (JIRA)
Upgrade Nutch to use Lucene 2.9.1
-

 Key: NUTCH-772
 URL: https://issues.apache.org/jira/browse/NUTCH-772
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Upgrade Nutch to the latest Lucene release.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java

2009-11-25 Thread Andrzej Bialecki

david.stu...@progressivealliance.co.uk wrote:
 While you are doing changes and commits in this area: I have been 
waiting for this patch https://issues.apache.org/jira/browse/NUTCH-760 
of mine to be incorporated for a while now. Is it possible to get it in?


It's on my agenda - I'll apply the patch either today or tomorrow, time 
permitting.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Closed: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java

2009-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-773.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 some minor bugs in AbstractFetchSchedule.java
 -

 Key: NUTCH-773
 URL: https://issues.apache.org/jira/browse/NUTCH-773
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-773.patch


 fixes some minor bugs in AbstractFetchSchedule.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java

2009-11-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782509#action_12782509
 ] 

Andrzej Bialecki  commented on NUTCH-773:
-

That was a nasty bug - fixed in rev. 884198. Thanks!

 some minor bugs in AbstractFetchSchedule.java
 -

 Key: NUTCH-773
 URL: https://issues.apache.org/jira/browse/NUTCH-773
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-773.patch


 fixes some minor bugs in AbstractFetchSchedule.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice

2009-11-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782516#action_12782516
 ] 

Andrzej Bialecki  commented on NUTCH-753:
-

Fixed in rev. 884203 - thanks!

 Prevent new Fetcher to retrieve the robots twice
 

 Key: NUTCH-753
 URL: https://issues.apache.org/jira/browse/NUTCH-753
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-753.patch


 The new Fetcher, which is now used by default, handles the robots file 
 directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING 
 and Protocol.CHECK_ROBOTS are set to false to prevent fetching robots.txt 
 twice (once in the Fetcher and once in the protocol), which avoids the call 
 to robots.isAllowed(). However, in practice the robots file is still fetched 
 twice, because a call to robots.getCrawlDelay() a bit further down is not 
 covered by the if (Protocol.CHECK_ROBOTS) check.
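
A minimal sketch of the shape of the fix, assuming the Protocol/RobotRules
API of Nutch 1.0 (getRobotRules, isAllowed, getCrawlDelay); the class and the
sentinel value below are illustrative, not the attached patch:

{code:java}
import java.net.URL;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRules;

// Sketch only: fetch the rules once and answer both questions (is the
// URL allowed? what is the crawl delay?) from the same object, so that
// robots.txt is retrieved at most once.
public class RobotsOnceSketch {

  /** Returns the crawl delay in ms, or -1 if robots.txt blocks the URL. */
  public long checkRobotsOnce(Protocol protocol, Text url, CrawlDatum datum)
      throws Exception {
    RobotRules rules = protocol.getRobotRules(url, datum);
    if (!rules.isAllowed(new URL(url.toString()))) {
      return -1; // hypothetical sentinel: blocked by robots.txt
    }
    return rules.getCrawlDelay(); // no second fetch of robots.txt
  }
}
{code}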

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice

2009-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-753.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Prevent new Fetcher to retrieve the robots twice
 

 Key: NUTCH-753
 URL: https://issues.apache.org/jira/browse/NUTCH-753
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-753.patch


 The new Fetcher, which is now used by default, handles the robots file 
 directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING 
 and Protocol.CHECK_ROBOTS are set to false to prevent fetching robots.txt 
 twice (once in the Fetcher and once in the protocol), which avoids the call 
 to robots.isAllowed(). However, in practice the robots file is still fetched 
 twice, because a call to robots.getCrawlDelay() a bit further down is not 
 covered by the if (Protocol.CHECK_ROBOTS) check.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2009-11-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782524#action_12782524
 ] 

Andrzej Bialecki  commented on NUTCH-762:
-

This class offers a strict superset of the current Generator functionality. 
Maintaining both tools would be cumbersome and error-prone. I propose to 
replace Generator with MultiGenerator (under the current name Generator).

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-762-MultiGenerator.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate - update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one pass over the crawlDB, then updating the DB only 
 once for all of these segments. The existing Generator allows several 
 successive runs by generating a copy of the crawlDB and marking the URLs to 
 be fetched. In practice this approach does not work well, as we need to read 
 the whole crawlDB as many times as we generate segments.
 The patch attached contains an implementation of a MultiGenerator which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score (see the sketch below)
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDB is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and 
 for partitioning; however, since we can't count the max number of URLs by IP, 
 another unit must be chosen when partitioning by IP. 
 We found that using a filter on the score can dramatically improve 
 performance, as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via : nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options :
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from : 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the requested maximum if, e.g., not enough URLs are available for fetching 
 and they fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
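
A minimal sketch of the score filter mentioned above, under the assumption
that filtering happens in the map phase, before any data is shuffled; the
class and names are illustrative, not the attached patch:

{code:java}
import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: drop low-scoring entries in the map phase so that less
// data reaches the reducers. A NaN threshold disables the filter.
public class ScoreFilterSketch {
  private final float minScore;

  public ScoreFilterSketch(float minScore) {
    this.minScore = minScore;
  }

  /** Returns true if the entry should be considered for the fetchlist. */
  public boolean accept(CrawlDatum datum) {
    return Float.isNaN(minScore) || datum.getScore() >= minScore;
  }
}
{code}

Filtering this early is what cuts the shuffle volume: the reducers then only
sort and partition the surviving entries.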
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

2009-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-761.
---

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Andrzej Bialecki 

 Avoid cloning CrawlDatum in CrawlDbReducer 
 --

 Key: NUTCH-761
 URL: https://issues.apache.org/jira/browse/NUTCH-761
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: optiCrawlReducer.patch


 In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in 
 its reduce phase, and these will be the entries coming from the crawlDB and 
 not present in the segments.
 The patch attached optimizes the reduce step by avoiding an unnecessary 
 cloning of the CrawlDatum fields when there is only one CrawlDatum in the 
 values. This has more impact as the crawlDB gets larger; we noticed an 
 improvement of around 25-30% in the time spent in the reduce phase.
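
A minimal sketch of the idea, assuming CrawlDatum.set() performs the field
copy; illustrative only, not the attached patch:

{code:java}
import java.util.Iterator;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: pay for a deep copy of the fields only when there is
// more than one value to merge in the reduce call.
public class SingleValueSketch {

  public CrawlDatum pick(Iterator<CrawlDatum> values) {
    CrawlDatum first = values.next();
    if (!values.hasNext()) {
      return first;      // the common case: a single value, no cloning
    }
    CrawlDatum result = new CrawlDatum();
    result.set(first);   // multiple values: clone before merging
    while (values.hasNext()) {
      CrawlDatum next = values.next();
      // ... merge next into result according to the usual rules ...
    }
    return result;
  }
}
{code}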

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

2009-11-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782537#action_12782537
 ] 

Andrzej Bialecki  commented on NUTCH-761:
-

I applied the patch with some changes - reversed the logic in the name of the 
boolean var, and applied the same method to other cases of non-multiple values. 
Committed in rev. 884224 - thanks!

 Avoid cloning CrawlDatum in CrawlDbReducer 
 --

 Key: NUTCH-761
 URL: https://issues.apache.org/jira/browse/NUTCH-761
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: optiCrawlReducer.patch


 In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in 
 its reduce phase, and these will be the entries coming from the crawlDB and 
 not present in the segments.
 The patch attached optimizes the reduce step by avoiding an unnecessary 
 cloning of the CrawlDatum fields when there is only one CrawlDatum in the 
 values. This has more impact as the crawlDB gets larger; we noticed an 
 improvement of around 25-30% in the time spent in the reduce phase.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-760) Allow field mapping from nutch to solr index

2009-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-760.
---

   Resolution: Fixed
Fix Version/s: 1.1

 Allow field mapping from nutch to solr index
 

 Key: NUTCH-760
 URL: https://issues.apache.org/jira/browse/NUTCH-760
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: David Stuart
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: solrindex_schema.patch, solrindex_schema.patch, 
 solrindex_schema.patch, solrindex_schema.patch


 I am using nutch to crawl sites and have combined it
 with solr, pushing the nutch index using the solrindex command. I have
 set it up as specified on the wiki, using the copyField from url to id in the
 schema. Whilst this works fine, it stuffs up my inputs from other
 sources in solr (e.g. using the solr data import handler), as they have
 both ids and urls. I have a patch that implements a nutch xml schema
 defining what the basic nutch fields map to in your solr push.
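
For illustration, such a mapping file might look along these lines; the exact
element names in the attached patch may differ, so treat this as a sketch of
the idea (routing nutch's url field onto solr's id) rather than the patch
itself:

{code:xml}
<!-- Hypothetical sketch of a nutch-to-solr field mapping file; element
     names illustrate the idea and are not copied from the patch. -->
<mapping>
  <fields>
    <field source="content" dest="content"/>
    <field source="title" dest="title"/>
    <!-- map nutch's url field onto solr's unique id, instead of a
         schema-level copyField that collides with other data sources -->
    <field source="url" dest="id"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
{code}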

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


