RE: How to get score in search.jsp

2007-02-14 Thread Anton Potekhin
I have found a solution: I've added a score variable to Hit.

-Original Message-
From: Anton Potekhin [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 14, 2007 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: How to get score in search.jsp
Importance: High

Hi Nutch Gurus!

I have a small problem: I need to make some changes to search.jsp. I want to
take the first 50 results and sort them differently, re-scoring each result
with the formula "new_score = nutch_score + domain_score_from_my_db". But I
don't understand how to get nutch_score in search.jsp.
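For what it's worth, the combine-and-resort step described above can be sketched in plain Java. The Hit class and domainScore() lookup below are simplified stand-ins, not the actual Nutch search API; they only illustrate the re-scoring formula:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: re-rank the first N hits by adding a per-domain
// score from an external source to the engine's own score.
public class ReRank {

    static class Hit {
        final String url;
        final float nutchScore;   // score as returned by the search engine
        float newScore;           // nutch_score + domain_score_from_my_db
        Hit(String url, float nutchScore) {
            this.url = url;
            this.nutchScore = nutchScore;
        }
    }

    // Stand-in for the per-domain score lookup from the database.
    static float domainScore(String url) {
        return url.contains("example.org") ? 1.0f : 0.0f;
    }

    // Apply the formula to each hit, then sort descending by combined score.
    static List<Hit> rerank(List<Hit> hits) {
        for (Hit h : hits) {
            h.newScore = h.nutchScore + domainScore(h.url);
        }
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.newScore).reversed());
        return sorted;
    }
}
```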

For now I am using a workaround: I get the nutch_score via the getValue()
method of the org.apache.lucene.search.Explanation class. But I think this is
a very slow approach.

Can anybody help me to find a solution for this problem?

P.S. I hope that I described my problem clearly. Thanks in advance.

Sorry for the duplicated mail; I think I had some problems with my mail
account.






Re: Injector checking for other than STATUS_INJECTED

2007-02-14 Thread Andrzej Bialecki

Gal Nitzan wrote:

Hi Andrzej,

Does it mean that when you inject a URL that already exists in the crawldb,
its status changes to STATUS_DB_UNFETCHED?

  


With the current version of Injector - it won't. With previous versions 
- it might, depending on the order of values received in reduce().


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: Injector checking for other than STATUS_INJECTED

2007-02-14 Thread Gal Nitzan
Hi Andrzej,

Does it mean that when you inject a URL that already exists in the crawldb,
its status changes to STATUS_DB_UNFETCHED?

Gal

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 8:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Injector checking for other than STATUS_INJECTED

[EMAIL PROTECTED] wrote:
> Hi All,
>
> I think I am missing something.  In the Injector reduce code we have the
> following.
>
> 
> while (values.hasNext()) {
>   CrawlDatum val = (CrawlDatum)values.next();
>   if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
> injected = val;
> injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>   } else {
> old = val;
>   }
> }
>
> CrawlDatum res = null;
> if (old != null) res = old; // don't overwrite existing value
> else res = injected;
> 
>
> Basically, if it is not just injected, then don't overwrite.  But I am not
> seeing how the input could be such that the CrawlDatum wasn't just
> injected and could have previous values.  Is this just in case someone
> uses the Injector as a Reducer and not a Mapper, or am I missing how this
> condition can occur?
>   

This handles an important case, when you inject URLs that already exist 
in the DB - then you have both the old value and the newly created value 
under the same key. In previous versions of Injector CrawlDatum-s for 
such URLs could be overwritten with new values, and you could lose 
valuable metadata accumulated in old values.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Re: Injector checking for other than STATUS_INJECTED

2007-02-14 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:

Hi All,

I think I am missing something.  In the Injector reduce code we have the
following.


while (values.hasNext()) {
  CrawlDatum val = (CrawlDatum)values.next();
  if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
injected = val;
injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
old = val;
  }
}

CrawlDatum res = null;
if (old != null) res = old; // don't overwrite existing value
else res = injected;


Basically, if it is not just injected, then don't overwrite.  But I am not
seeing how the input could be such that the CrawlDatum wasn't just
injected and could have previous values.  Is this just in case someone
uses the Injector as a Reducer and not a Mapper, or am I missing how this
condition can occur?
  


This handles an important case, when you inject URLs that already exist 
in the DB - then you have both the old value and the newly created value 
under the same key. In previous versions of Injector CrawlDatum-s for 
such URLs could be overwritten with new values, and you could lose 
valuable metadata accumulated in old values.
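The old-value-wins rule described above can be illustrated with a stand-alone simulation. The CrawlDatum below is a stripped-down stand-in (status field only), not the real Nutch class; the merge() method mirrors the shape of the quoted reduce() body:

```java
import java.util.Iterator;
import java.util.List;

// Minimal simulation of the Injector reduce merge: when a URL is injected
// that already exists in the crawldb, both the old datum and the freshly
// injected one arrive under the same key, and the old one must win.
public class InjectMerge {

    static final int STATUS_INJECTED = 0;
    static final int STATUS_DB_UNFETCHED = 1;
    static final int STATUS_DB_FETCHED = 2;

    static class CrawlDatum {
        int status;
        CrawlDatum(int status) { this.status = status; }
    }

    // Same logic as the reduce() body quoted above: promote the injected
    // datum to DB_UNFETCHED, but keep the old datum if one exists.
    static CrawlDatum merge(List<CrawlDatum> values) {
        CrawlDatum injected = null, old = null;
        for (Iterator<CrawlDatum> it = values.iterator(); it.hasNext();) {
            CrawlDatum val = it.next();
            if (val.status == STATUS_INJECTED) {
                injected = val;
                injected.status = STATUS_DB_UNFETCHED;
            } else {
                old = val;
            }
        }
        return old != null ? old : injected; // don't overwrite existing value
    }
}
```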


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Injector checking for other than STATUS_INJECTED

2007-02-14 Thread nutch-dev
Hi All,

I think I am missing something.  In the Injector reduce code we have the
following.


while (values.hasNext()) {
  CrawlDatum val = (CrawlDatum)values.next();
  if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
injected = val;
injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
old = val;
  }
}

CrawlDatum res = null;
if (old != null) res = old; // don't overwrite existing value
else res = injected;


Basically, if it is not just injected, then don't overwrite.  But I am not
seeing how the input could be such that the CrawlDatum wasn't just
injected and could have previous values.  Is this just in case someone
uses the Injector as a Reducer and not a Mapper, or am I missing how this
condition can occur?

Dennis Kubes








[jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-14 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295
 ] 

Dennis Kubes commented on NUTCH-247:


I think the idea here is to NOT allow people to run fetchers for which they 
haven't configured an agent name, email, etc.  There may be a better way to 
do this than simply logging severe and then stopping.  I think it would be best 
to provide some sort of feedback mechanism to the user, either via the command 
line or an explicit exception that tells the user to configure the agent name 
and email in their nutch-*.xml file.  If this is the direction we want to 
go, I can come up with a patch for this.
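A minimal sketch of the explicit-exception idea might look like the following. AgentNotConfiguredException and checkAgent are hypothetical names for illustration, not existing Nutch code:

```java
// Hedged sketch: fail fast with a self-explanatory exception instead of
// LOG.severe, so the user is told exactly what to configure.
public class AgentCheck {

    static class AgentNotConfiguredException extends RuntimeException {
        AgentNotConfiguredException(String msg) { super(msg); }
    }

    // Throws if the advertised agent name is missing or blank, naming the
    // properties to set in nutch-*.xml before fetching.
    static void checkAgent(String agentName) {
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new AgentNotConfiguredException(
                "No agent name configured: set 'http.agent.name' and "
              + "'http.robots.agents' in your nutch-*.xml before fetching.");
        }
    }
}
```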

> robot parser to restrict.
> -
>
> Key: NUTCH-247
> URL: https://issues.apache.org/jira/browse/NUTCH-247
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Minor
> Fix For: 0.9.0
>
>
> If the agent name and the robots agents are not properly configured, the
> robot rule parser uses LOG.severe to log the problem, even though it also
> resolves it.
> Later on, the fetcher thread checks for severe errors and stops if there is one.
> RobotRulesParser:
> if (agents.size() == 0) {
>   agents.add(agentName);
>   LOG.severe("No agents listed in 'http.robots.agents' property!");
> } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
>   agents.add(0, agentName);
>   LOG.severe("Agent we advertise (" + agentName
>  + ") not listed first in 'http.robots.agents' property!");
> }
> Fetcher.FetcherThread:
>  if (LogFormatter.hasLoggedSevere()) // something bad happened
> break;  
> I suggest using warn or something similar instead of severe to log this 
> problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-14 Thread Dennis Kubes
It may fix the problem, or it may not.  There have been many changes in 
Hadoop since 0.4; I think they are now on 0.11.x.  So if you are 
upgrading an existing DFS deployment that already holds content, that 
is something to take into consideration.  That being said, the changes in 
Hadoop from 0.4 to the present may very well have fixed the error you are 
seeing, and to use the most recent version of Hadoop you will need the 
NUTCH-437 patch.


Looking at your output below, though, my first thought is that the error 
is caused by something in the PDF parser rather than by Hadoop.  Nutch 
uses the PDFBox library to parse PDF files, so you may want to take the 
specific file and see if it parses correctly outside of Nutch using PDFBox.


Dennis Kubes

Armel T. Nene wrote:

Dennis

I was wondering if this patch could fix my problem, which is, if not the
same, very similar to this one. I am using Nutch 0.8.2-dev; I made a
checkout from SVN a while ago but never updated it again. I was able to crawl
1 xml files before with no errors whatsoever. These are the
errors I get when fetching:

INFO parser.custom: Custom-parse: Parsing content
file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of
file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with:
java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0
pages/s, 0 kb/s, 
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException

07/02/12 22:09:17 FATAL fetcher.Fetcher: at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher
caught:java.lang.NullPointerException

One of the problems is that my Hadoop version says the following:
hadoop-0.4.0-patched. I don't know if that means I am running the
0.4.0 version; it seems a bit confusing. Once you clarify that
for me, I will be able to apply the patch to my version. 


Best Regards,

Armel

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: 13 February 2007 21:09

To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually I take it back.  I don't think it is the same problem but I do 
think it is the right solution.


Dennis Kubes

Dennis Kubes wrote:
This has to do with HADOOP-964.  Replace the jar files in your Nutch 
versions with the most recent versions from Hadoop.  You will also need 
to apply NUTCH-437 patch to get Nutch to work with the most recent 
changes to the Hadoop codebase.


Dennis Kubes

Gal Nitzan wrote:

Hi,

Does anybody use Nutch trunk?

I am running nutch 0.9 and unable to fetch.

after 50-60K urls I get NPE in
org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.

I was wondering if anyone has a workaround, or maybe something is wrong with
my setup.

I have opened a new issue in jira
http://issues.apache.org/jira/browse/hadoop-1008 for this.

Any clue?

Gal






[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v7.patch

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473184
 ] 

Doğacan Güney commented on NUTCH-443:
-

Andrzej:

Why does the fetcher need to synchronize? Why does the order in which the fetcher outputs key/value pairs matter?

Sami:

> I opened an issue for this NUTCH-434 and I am now recommending that the patch 
> in this issue 
> doesn't try to take the world in one piece :) 

Right. I just realized how much this patch changes, and most of it is 
not necessary for the proposed API change. So I am going to post a version that 
uses ObjectWritable in Fetcher, doesn't remove FetcherOutputFormat, and only 
changes parse-rss so that it works with the new API (sorry about that, Renaud, 
but parse-rss can be updated after this patch).




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473148
 ] 

Sami Siren commented on NUTCH-443:
--

>> Didn't know this, will change this too. (Why is Nutch not using this class 
>> in Indexer?)

>Inertia, and lack of committer time ... ;) 

IIRC you actually cannot use GenericWritable, because it requires wrapped 
objects to be Writables, and Lucene objects obviously aren't. But you can 
imitate it and make a similar object capable of storing plain Objects (since 
those writables are not persisted in the indexer).

I opened an issue for this, NUTCH-434, and I am now recommending that the patch 
in this issue doesn't try to take on the world in one piece :)






[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473147
 ] 

Doğacan Güney commented on NUTCH-443:
-

> Hmm, actually this is an important question. I don't think FetcherOutput is 
> persisted anywhere, it's just an aggregate class to 
> keep things together before they hit the disk. I propose to leave a comment 
> in MapWritable like this "// code -123 was 
> reserved  for FetcherOutput - no longer in use". As for the class itself - 
> again, since it's not persisted we don't have to keep it 
> around, just remove it.

I implemented this approach in one of the earlier patches. The problem is that 
the code in MapWritable does this:

addIdEntry((byte) (-128 + CLASS_ID_MAP.size() + ++fIdCount), // ...

Now, I don't claim to understand the code perfectly, but because of the "-128 + 
CLASS_ID_MAP.size()" part I think CLASS_ID_MAP must always have consecutive 
values, so not having -123 breaks it. IIRC, removing that line and running 
TestMapWritable fails.
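The consecutiveness requirement can be seen in a toy model of the id scheme (simplified here: the fIdCount term is ignored, and this is not the real MapWritable code). If the reserved -123 entry is dropped, the next class added is handed id -123 again, so old data in which -123 meant FetcherOutput would decode as the wrong class:

```java
import java.util.Map;

// Toy model of MapWritable's class-id assignment: ids are handed out as
// (-128 + CLASS_ID_MAP.size()), which only works if the assigned ids form
// a consecutive run starting at -128.
public class IdScheme {

    // Id the scheme would assign to the next class added to the table.
    static byte nextId(Map<Byte, String> classIdMap) {
        return (byte) (-128 + classIdMap.size());
    }
}
```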

> Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that 
> output the data need to be synchronized now - 
> output.collect() is no longer a single atomic operation. Perhaps it's better 
> to leave FetcherOutput after all?

This causes key ordering problems. See my (admittedly, could-have-been-clearer) 
2nd comment.

Anyway, I am assuming that you are OK with removing 
ParseUtil.getFirstParseEntry and just using Map.get?




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473141
 ] 

Andrzej Bialecki  commented on NUTCH-443:
-

> Didn't know this, will change this too. (Why is Nutch not using this class in 
> Indexer?) 

Inertia, and lack of committer time ... ;)

> Since this patch removes the FetcherOutput class, what to put there instead 
> of it?

Hmm, actually this is an important question. I don't think FetcherOutput is 
persisted anywhere, it's just an aggregate class to keep things together before 
they hit the disk. I propose to leave a comment in MapWritable like this "// 
code -123 was reserved for FetcherOutput - no longer in use". As for the class 
itself - again, since it's not persisted we don't have to keep it around, just 
remove it.

Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that output 
the data need to be synchronized now - output.collect() is no longer a single 
atomic operation. Perhaps it's better to leave FetcherOutput after all?
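The point about collect() no longer being atomic can be illustrated stand-alone. The "collector" below is just a shared list rather than Hadoop's OutputCollector, but the grouping issue is the same: once one logical fetch result becomes several collect() calls, the calls must sit inside a synchronized block or entries from different fetcher threads can interleave:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-alone illustration: multiple per-result emissions into a shared
// collector, grouped by a synchronized block so each thread's entries
// stay contiguous.
public class GroupedCollect {

    static void emitGroup(List<String> collector, String thread, int parses) {
        // Synchronizing on the collector keeps one thread's group together;
        // without this block, another thread's entries could land in between.
        synchronized (collector) {
            for (int i = 0; i < parses; i++) {
                collector.add(thread + ":" + i);
            }
        }
    }

    // True if every thread's entries appear as one contiguous run.
    static boolean groupsContiguous(List<String> out) {
        String prev = null;
        Set<String> seen = new HashSet<>();
        for (String s : out) {
            String t = s.split(":")[0];
            if (!t.equals(prev) && !seen.add(t)) return false;
            prev = t;
        }
        return true;
    }
}
```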




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129
 ] 

Doğacan Güney commented on NUTCH-443:
-

Andrzej:

Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most 
> cases this is a HashMap, there is no predictable way to get the first entry 
> added to the map ... I propose also that we should use a specialized class 
> instead of general-purpose Map; and then we can record in that class which 
> entry was the first.

ParseUtil.getFirstParseEntry is only a convenience method used by plugins to 
get the first (and only) entry in a map when the plugin knows it will create a 
one-entry parse map (with the original URL as the key); it is mostly used in a 
plugin's main method to get the parse and print it. It is not used in any core 
part of Nutch. 

Anyway, it is very incorrectly named. What we meant was 
ParseUtil.getOnlyParseEntry. Hmm, that doesn't make any sense either :D

Instead of creating a specialized class, how about removing the method and just 
using parseMap.get(key)? Most plugins will use it like 
parseMap.get(content.getUrl()). 

> Also, the naming of some methods seems a bit awkward - why should we insist 
> that we createSingleEntryMap while we create an ordinary Map, and we don't 
> use this special-case knowledge later? I suggest to simply name it 
> createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces 
> ObjectWritable when classes are known in advance, and provides a more 
> compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in 
Indexer?)

> Changes to MapWritable must preserve old code values, at most adding some new 
> ones - otherwise the new code will get 
> confused when working with older data.

I see your point, but I am not sure how to fix this. Since this patch removes 
the FetcherOutput class, what should we put there instead? I guess we can just 
keep FetcherOutput as it is and update its javadoc to reflect the fact that it 
is not used anymore.

> CrawlDbReducer, TODO item: this should be the time stored under 
> Nutch.FETCH_TIME_KEY, no?
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.






[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473114
 ] 

Andrzej Bialecki  commented on NUTCH-443:
-

The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most 
cases this is a HashMap, there is no predictable way to get the first entry 
added to the map ... I propose also that we should use a specialized class 
instead of general-purpose Map; and then we can record in that class which 
entry was the first. Also, the naming of some methods seems a bit awkward - why 
should we insist that we createSingleEntryMap while we create an ordinary Map, 
and we don't use this special-case knowledge later? I suggest to simply name it 
createParseMap.

In recent versions of Hadoop there is a GenericWritable class - it replaces 
ObjectWritable when classes are known in advance, and provides a more compact 
representation.
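A rough way to see the size difference: ObjectWritable writes the fully-qualified class name in front of every value, while a GenericWritable-style wrapper with a fixed, agreed-upon class table only needs a one-byte index. The sketch below just compares header sizes with plain java.io and does not use Hadoop itself:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Back-of-the-envelope comparison of per-value header overhead between an
// ObjectWritable-style header (class name as a UTF string) and a
// GenericWritable-style header (one byte indexing a known class table).
public class HeaderSize {

    // ObjectWritable-style: writeUTF emits a 2-byte length plus the name bytes.
    static int nameHeaderBytes(String className) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF(className);
        return buf.size();
    }

    // GenericWritable-style: a single byte suffices when the class set is fixed.
    static int indexHeaderBytes() {
        return 1;
    }
}
```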

Changes to MapWritable must preserve old code values, at most adding some new 
ones - otherwise the new code will get confused when working with older data.

CrawlDbReducer, TODO item: this should be the time stored under 
Nutch.FETCH_TIME_KEY, no?

If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

The new model for returning results from parse plugins allows a much better 
approach to parsing archives (e.g. zip files) containing multiple documents in 
supported formats - although this should be a separate patch.




[jira] Commented: (NUTCH-437) MapFile in Hadoop Trunk has changed, must update references

2007-02-14 Thread Armel Nene (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473064
 ] 

Armel Nene commented on NUTCH-437:
--

I was wondering if this patch could fix my problem, which is, if not the same, 
very similar to this one. I am using Nutch 0.8.2-dev; I made a checkout 
from SVN a while ago but never updated it again. I was able to crawl 1 xml 
files before with no errors whatsoever. These are the errors I get 
when fetching:

INFO parser.custom: Custom-parse: Parsing content 
file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of 
file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: 
java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 
kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at 
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at 
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher 
caught:java.lang.NullPointerException

One of the problems is that my Hadoop version says the following: 
hadoop-0.4.0-patched. I don't know if that means I am running the 0.4.0 
version; it seems a bit confusing. Once you clarify that for me, 
I will be able to apply the patch to my version. 

Best Regards,

Armel


> MapFile in Hadoop Trunk has changed, must update references
> ---
>
> Key: NUTCH-437
> URL: https://issues.apache.org/jira/browse/NUTCH-437
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8.2, 0.9.0
> Environment: windows xp and java
>Reporter: Dennis Kubes
> Assigned To: Andrzej Bialecki 
> Fix For: 0.8.2, 0.9.0
>
> Attachments: nutch-hadoop-0.10.2-mapfile.patch
>
>
> The MapFile.Writer signature has changed in hadoop trunk (version 10.x +) to 
> include a Configuration object.  Object in the Nutch codebase that reference 
> MapFile.Writer will need to be updated.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.