Zaheed Haque wrote:
what about the following:
http://issues.apache.org/jira/browse/NUTCH-125
On its way ... ;-) I'll add it during this week.
--
Best regards,
Andrzej Bialecki
Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more
[ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ]
Stefan Groschupf commented on NUTCH-140:
From my point of view this makes things more complicated, why not just use the
extension id, where would be the advantage of
What do people think about collecting a list of issues and holding a
voting round?
+1
Full list of open issues
complete description can be found here :
http://issues.apache.org/jira/secure/IssueNavigator.jspa?view=full&tempMax=30
Please add a +1 in case you vote for the issue under this issue.
Please keep in mind that this will be more a maintenance release.
NUTCH-141
NUTCH-134 Summarizer doesn't select the best snippets
+1
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
+1
NUTCH-120 one bad link on a page kills parsing
+1
NUTCH-95 DeleteDuplicates depends on the order of input segments
+1
NUTCH-13 If dns points to
My personal fav. list
In a day or so I will count all votes and post them.
NUTCH-141 jobdetails.jsp doesnt work on webbrowser safari
+1
NUTCH-140 Add alias capability in parse-plugins.xml file that
allows mimeType-extensionId mapping
NUTCH-139 Standard metadata property names in
NUTCH-141 jobdetails.jsp doesnt work on webbrowser safari
+1
:-)
Marko.
Hi,
I would like to translate the Nutch index page into Arabic. I translated the
five files concerned: header, about, search, help and
search_lang.properties. But I didn't find any documents explaining how to make
the translation effective. Do you have an idea about how to make it
possible to search
NUTCH-127 uncorrect values using -du, or ls does not return items
NUTCH-127 +1
NUTCH-121 SegmentReader for mapred
NUTCH-121 +1
NUTCH-115 jobtracker.jsp shows too much information
NUTCH-115 +1
NUTCH-108 tasktracker crashs when reconnecting to a new jobtracker.
Andrzej Bialecki wrote:
I'll test it soon - one comment, though. Currently you use a subclass of
RuntimeException to stop the collecting. I think we should come up with
a better mechanism - throwing exceptions is too costly.
I thought about this, but I could not see a simple way to achieve
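
For illustration only, and not the patch under discussion: one common way to make an exception cheaper as a "stop collecting" signal in Java is to skip stack-trace capture and reuse a single instance. The names StopCollecting and LimitedCollector below are hypothetical, and the loop merely stands in for whatever code feeds hits to the collector.

```java
// Hypothetical names; this is a sketch, not the code from the patch above.
final class StopCollecting extends RuntimeException {
    // One shared instance, so nothing is allocated when the signal is thrown.
    static final StopCollecting INSTANCE = new StopCollecting();

    private StopCollecting() {
        super("stop collecting");
    }

    // Capturing the stack trace is the expensive part of creating an
    // exception; returning early here skips it.
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

final class LimitedCollector {
    private final int maxHits;
    private int collected;

    LimitedCollector(int maxHits) {
        this.maxHits = maxHits;
    }

    // Called once per matching document; throws once the limit is reached.
    void collect(int doc, float score) {
        if (collected >= maxHits) {
            throw StopCollecting.INSTANCE;
        }
        collected++;
    }

    // Feeds totalDocs fake hits to a collector and returns how many it kept.
    static int run(int totalDocs, int maxHits) {
        LimitedCollector collector = new LimitedCollector(maxHits);
        try {
            for (int doc = 0; doc < totalDocs; doc++) {
                collector.collect(doc, 1.0f);
            }
        } catch (StopCollecting stop) {
            // Expected: the collector asked us to stop early.
        }
        return collector.collected;
    }

    public static void main(String[] args) {
        System.out.println(run(10000, 1000));
    }
}
```

Whether this is cheap enough in practice would need measuring; an alternative design is a collect method that returns a boolean, which requires the caller to check it on every hit.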
When doing a one-pass crawl, I noticed that when I inject more than
~16000 urls, the fetcher only fetches a subset of the set initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
I tried to inject
in my case).
Now, the results. I collected all test results in a spreadsheet
(OpenDocument or PDF format), you can download it from:
http://www.getopt.org/nutch/20051214/nutchPerf.ods
http://www.getopt.org/nutch/20051214/nutchPerf.pdf
For MAX_HITS=1000 the performance increase was ca. 40
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method
yes, this line is the one you need to change. The other stuff can be
as it is for now.
Do I only need to change the last line to using HashPartitioner.class,
or do I need to modify the other 2 references as well?
AWESOME !! =:)
Stefan Groschupf wrote:
So, with your patch, did you see 100% of urls *attempting* a fetch ?
100% ! :-)
Stefan Groschupf wrote:
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method
yes, this line is the one you need to change. The other stuff can be as
it is for now.
I don't recommend this change. It makes your crawler impolite, since
multiple tasks may reference
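
To make the politeness concern concrete, here is a minimal, self-contained sketch of the two partitioning strategies. It does not use the Hadoop or Nutch classes: PartitionSketch, byHost and byUrlHash are hypothetical stand-ins, and the hash-modulo formula is an assumption about how such partitioners typically work rather than a copy of the real code.

```java
import java.net.URI;

// Hypothetical stand-in for the two strategies; not the Hadoop/Nutch classes.
final class PartitionSketch {

    // Partition by host: every URL on the same host maps to the same
    // partition, so a single task handles all requests to that host.
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Partition by hash of the full URL: URLs from one host may land in
    // different partitions, so several tasks may fetch from it at once.
    static int byUrlHash(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.org/a.html",
            "http://example.org/b/c.html",
            "http://example.org/d.html"
        };
        for (String url : urls) {
            System.out.println(url
                + " byHost=" + byHost(url, 6)
                + " byUrlHash=" + byUrlHash(url, 6));
        }
    }
}
```

In this sketch the byHost column is identical for all three URLs, while the byUrlHash column is free to differ, which is the situation the warning above describes.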