Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Andrzej Bialecki
Zaheed Haque wrote: what about the following: http://issues.apache.org/jira/browse/NUTCH-125 On its way ... ;-) I'll add it during this week. -- Best regards, Andrzej Bialecki

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more

[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping

2005-12-14 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ] Stefan Groschupf commented on NUTCH-140: From my point of view this makes things more complicated, why not just use the extension id, where would be the advantage of

Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
What do people think about collecting a list of issues and holding a round of voting? +1

vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf
The full list of open issues, with complete descriptions, can be found here: http://issues.apache.org/jira/secure/IssueNavigator.jspa? view=fulltempMax=30 Please add a +1 under each issue you vote for. Please keep in mind that this will be more of a maintenance release. NUTCH-141

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Matthias Jaekle
NUTCH-134 Summarizer doesn't select the best snippets +1 NUTCH-98 RobotRulesParser interprets robots.txt incorrectly +1 NUTCH-120 one bad link on a page kills parsing +1 NUTCH-95 DeleteDuplicates depends on the order of input segments +1 NUTCH-13If dns points to

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf
My personal fav. list. In a day or so I will count all votes and post them. NUTCH-141 jobdetails.jsp doesnt work on webbrowser safari +1 NUTCH-140 Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping NUTCH-139 Standard metadata property names in

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Marko Bauhardt
NUTCH-141 jobdetails.jsp doesnt work on webbrowser safari +1 :-) Marko.

translation of Nutch search page

2005-12-14 Thread hind
Hi, I would like to translate the Nutch index page into Arabic. I translated the five files concerned: header, about, search, help and search_lang.properties. But I didn't find any documents explaining how to make the translation take effect. I am asking if you have an idea about how to make it possible to search

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Andrew McNabb
NUTCH-127 uncorrect values using -du, or ls does not return items NUTCH-127 +1 NUTCH-121 SegmentReader for mapred NUTCH-121 +1 NUTCH-115 jobtracker.jsp shows too much information NUTCH-115 +1 NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting
Andrzej Bialecki wrote: I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly. I thought about this, but I could not see a simple way to achieve
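
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of stopping a collector by throwing a RuntimeException subclass. All names in it (StopCollecting, LimitedCollector, EarlyStopDemo) are hypothetical and are not the IndexOptimizer or Lucene classes. The sketch also assumes one common way to reduce the cost of the throw, a preallocated exception that skips stack-trace capture; the thread excerpt does not say whether this approach was adopted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stop signal; not a Lucene or Nutch class. */
final class StopCollecting extends RuntimeException {
    // One shared instance, so no allocation happens at throw time.
    static final StopCollecting INSTANCE = new StopCollecting();

    private StopCollecting() {
        super("hit limit reached");
    }

    // Skip stack-trace capture, which is usually the expensive part of a throw.
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

/** Hypothetical collector that keeps at most maxHits document ids. */
final class LimitedCollector {
    private final int maxHits;
    final List<Integer> docs = new ArrayList<>();

    LimitedCollector(int maxHits) {
        this.maxHits = maxHits;
    }

    void collect(int doc) {
        if (docs.size() >= maxHits) {
            throw StopCollecting.INSTANCE; // abort the enclosing search loop
        }
        docs.add(doc);
    }
}

final class EarlyStopDemo {
    /** Feeds doc ids 0..totalDocs-1 to the collector until it signals a stop. */
    static List<Integer> search(int totalDocs, int maxHits) {
        LimitedCollector collector = new LimitedCollector(maxHits);
        try {
            for (int doc = 0; doc < totalDocs; doc++) {
                collector.collect(doc);
            }
        } catch (StopCollecting stop) {
            // Expected: the collector has gathered enough hits.
        }
        return collector.docs;
    }

    public static void main(String[] args) {
        System.out.println(search(1_000_000, 1000).size());
    }
}
```

The exception is used here only because the collect callback has no return value through which to request a stop; an API that lets the callback return a boolean would avoid the throw entirely.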

mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
When doing a one-pass crawl, I noticed that when I inject more than ~16000 urls, the fetcher only fetches a subset of the set initially injected. I use 1 master and 3 slaves with the following properties: mapred.map.tasks = 30 mapred.reduce.tasks = 6 generate.max.per.host = -1 I tried to inject

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
in my case). Now, the results. I collected all test results in a spreadsheet (OpenDocument or PDF format), you can download it from: http://www.getopt.org/nutch/20051214/nutchPerf.ods http://www.getopt.org/nutch/20051214/nutchPerf.pdf For MAX_HITS=1000 the performance increase was ca. 40

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Stefan Groschupf
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate method yes, this line is the one you need to change. The other stuff can be as it is for now. Do I only need to change the last line to using HashPartitioner.class, or do I need to modify the other 2 references as well?

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
AWESOME !! =:) Stefan Groschupf wrote: So, with your patch, did you see 100% of urls *attempting* a fetch ? 100% ! :-)

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Doug Cutting
Stefan Groschupf wrote: - job.setPartitionerClass(PartitionUrlByHost.class); in the generate method yes, this line is the one you need to change. The other stuff can be as it is for now. I don't recommend this change. It makes your crawler impolite, since multiple tasks may reference
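
The trade-off between the two partitioners can be illustrated with a small, self-contained sketch. It does not reproduce the actual PartitionUrlByHost or HashPartitioner code; the class PartitionDemo, its methods, and the example.com URLs are hypothetical. It assumes the politeness concern is that hashing the whole URL spreads one host's URLs over several partitions, so several fetch tasks could contact that host at once.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration; not the Nutch partitioner classes. */
final class PartitionDemo {
    /** Partition chosen from the host only: one host always maps to one partition. */
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Partition chosen from the full URL: one host may map to many partitions. */
    static int byUrl(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Ten made-up URLs that all live on the same host. */
    static List<String> sampleUrls() {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://example.com/page" + i);
        }
        return urls;
    }

    /** Returns the set of partitions that the given URLs are assigned to. */
    static Set<Integer> partitionsUsed(List<String> urls, int numPartitions, boolean hostBased) {
        Set<Integer> used = new TreeSet<>();
        for (String url : urls) {
            used.add(hostBased ? byHost(url, numPartitions) : byUrl(url, numPartitions));
        }
        return used;
    }

    public static void main(String[] args) {
        List<String> urls = sampleUrls();
        System.out.println("by host: " + partitionsUsed(urls, 6, true));
        System.out.println("by url:  " + partitionsUsed(urls, 6, false));
    }
}
```

With host-based partitioning, all ten URLs land in a single partition, so one task can enforce a per-host delay. With URL-based hashing they are spread over more than one partition, which balances load but leaves no single task in control of the request rate to that host.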