settings and don't run the
DistributedAnalysisTool, then all of the page scores are 1.0. So the
Lucene document boost winds up being ln(e + inbound link count). 0
inbound links == 1.0, 10 links = 2.54, 100 links = 4.63, etc.
-- Ken
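A minimal Java sketch of the boost calculation above; the class and method names are illustrative, not actual Nutch code:

    // Lucene document boost = ln(e + inbound link count), assuming
    // all page scores stay at 1.0 as described above.
    public class OpicBoost {
        static float boost(int inboundLinks) {
            return (float) Math.log(Math.E + inboundLinks);
        }

        public static void main(String[] args) {
            System.out.println(boost(0));   // 1.0
            System.out.println(boost(10));  // ~2.54
            System.out.println(boost(100)); // ~4.63
        }
    }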
to my project and am
plugging in some additional logging to help track down the issue.
-- Ken
the language to adjust the nextScore
value for outlinks to pages that don't currently exist. Then in
FetchListTool use this nextScore value, and provide some topN value
such that the top links are going to be in your target language.
-- Ken
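A rough sketch of that idea; the helper and the damping factor are hypothetical, not existing Nutch code:

    // Damp the score contributed to outlinks when the source page is
    // not in the target language, so a topN fetch list favors
    // target-language links. The 0.1f factor is illustrative.
    public class LanguageScore {
        static float adjustNextScore(float nextScore, String pageLang,
                                     String targetLang) {
            return targetLang.equals(pageLang) ? nextScore
                                               : nextScore * 0.1f;
        }
    }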
, other than boosting the CPU usage to 80%.
More research results to come...
-- Ken
Ken Krugler wrote:
We're only using the html text parsers, so I don't think that's
the problem. Plus we're dumping the thread stack when it hangs, and
it's always in the ChunkedInputStream.exhaustInputStream() process
(see trace below).
The trace did not make it.
Oops - see at the end
URL weights to avoid having
any one domain with a significantly higher percentage of URLs than
any other domain, but so far that hasn't been an issue for us.
--
Ken Krugler
an example:
http://jira.atlassian.com/browse/CONF-2848
Do any Nutch users have experience using file:/dev/random?
Thanks,
- Chris
--
Chris Schneider
--
Ken Krugler
use
[EMAIL PROTECTED] when posting, to keep it spam free, so either a bcc
or [EMAIL PROTECTED] in the cc field would be better.
--
Ken Krugler
making some mods to Nutch to improve our performance,
but it's not debugged yet...getting closer, though.
-- Ken
could focus your crawl on
English-content pages.
-- Ken
[main1/192.168.0.100:8010]. ex=java.net.ConnectException: Connection
refused Retrying...
It seems like the 192.168.0.103 machine doesn't have the right
settings for connecting to the 192.168.0.100 machine. Is there a way
to check this outside of running Nutch?
Thanks,
-- Ken
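One quick check outside of Nutch is a raw TCP connection to the port named in the error; a minimal sketch, using the host and port from the message above:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            // Fails with the same ConnectException if the port isn't
            // reachable from this machine.
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("192.168.0.100", 8010), 5000);
                System.out.println("connected OK");
            }
        }
    }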
/localedata.jar:/usr/java/jre1.5.0_05/lib/ext/sunpkcs11.jar
[snip]
So obviously somebody is using JAVA_HOME to build the path to these .jar files.
But JAVA_HOME (the top-level path, ie /usr/java/jre1.5.0_05) isn't a
member of this classpath.
Any help would be appreciated!
Thanks,
-- Ken
commands
(readdb, segread, etc) don't seem to be working with the new NDFS
setup.
4. Any idea whether 4 hours is a reasonable amount of time for this
test? It seemed long to me, given that I was starting with a single
URL as the seed.
Thanks,
-- Ken
for another run once this one
has had a chance to generate some interesting results.
Thanks,
-- Ken
is only designed to be used via links
from the jobtracker.jsp page.
And thanks to Andrzej for his November post that noted this.
-- Ken
- Original Message
From: Ken Krugler [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Sat 14 Jan 2006 05:50:00 PM EST
Subject: [Nutch-general
:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL
PROTECTED]
-- Ken
060129 13 Zero targets found, forbidden1.size=2, allowSameHostTargets=false, forbidden2.size()=0
--
Ken Krugler
timeout value being too low.
We were getting lots of timeout errors, which was killing our performance.
-- Ken
Is there an HTTPS protocol implementation for nutch?
If you use protocol-httpclient (versus protocol-http) then it should
support https.
-- Ken
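For example, in conf/nutch-site.xml you'd swap protocol-http for protocol-httpclient in plugin.includes; the rest of the value below follows a typical default and varies by version:

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    </property>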
that we can filter?
I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.
So far we've relied on manually detecting these, and then pruning the
results from the crawldb and adding them to the regex-urlfilter file.
-- Ken
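As an example of that last step, a conf/regex-urlfilter.txt entry that excludes a manually-identified link farm (the domain here is made up):

    # prune a known link farm before fetch
    -^http://([a-z0-9-]+\.)*bad-farm\.example\.com/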
NDFS bugs.
Thanks,
-- Ken
and I'm now
stuck :/
Can anyone provide some clues as to where I might start on debugging
this issue?
Regards,
-Shawn
--
Ken Krugler
depends on the #
of docs you want to be serving up from each search server - in our
case, I think it's about 10M or so. Obviously this varies depending
on the amount of RAM/horsepower you have on the server, and your
target query performance.
-- Ken
the crawl of all current pages, to the extent necessary to get
reasonable history/page cash values for OPIC. But that's just a guess
until the actual implementation is at least sketched out.
-- Ken
Andrzej Bialecki [EMAIL PROTECTED] wrote: Ken Krugler wrote:
Eugen Kochuev wrote:
Hello
crawler, and with Nutch 0.8
it's more like 2000+ threads...though you have to reduce the thread
stack size in this type of configuration.
-- Ken
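For reference, the per-thread stack size is set with the JVM's -Xss flag; the 128k value below is illustrative, and how the flag reaches the fetcher JVM depends on your launch scripts:

    java -Xss128k -cp ... org.apache.nutch.fetcher.Fetcher ...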
(Fetcher.java:148)
--
Daniel Varela Santoalla
European Centre for Medium-Range Weather Forecasts (ECMWF)
(http://www.ecmwf.int)
--
Ken Krugler
On 6/28/06, Ken Krugler [EMAIL PROTECTED] wrote:
Hi Doug,
Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
running into a similar problem.
We wound up dramatically increasing the number of threads, which
seemed to help solve the bandwidth utilization problem
--
Ken Krugler
of
Nutch as a better solution, but until then I think it's probably
faster to use Nutch as your starting point, and also if/when that
time comes, you'll have a much better understanding of how best to
slice and dice.
-- Ken
spindle/low seek times (e.g. WD
Raptor SATA disks).
-- Ken
Ken Krugler wrote:
On 8/12/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hello,
Several people reported issues with slow fetcher in 0.8...
I run Nutch on a dual CPU (+HT) box, and have noticed that the
fetch speed didn't increase when I went from using 100 threads,
to 200 threads. Has
--
Ken Krugler
to get at all of the
previously fetched content.
-- Ken
Ken Krugler wrote:
It's really sad news for me. I must spend a lot of time fetching it
again.
If it's only just HTML, then you could do a quick hack in 0.8 to
fetch the pages from your 0.7 crawl, using a modified fetcher. You
to a sequential file. After a fetch
cycle, additional data about the page state gets processed, and the
results are used to update the crawldb, which is a kind of
specialized database for web crawling.
-- Ken
= big5-hkscs, but then you'd have to
rebuild Nutch.
See the resolveEncodingAlias() method here:
http://www.krugle.com/files/svn/svn.apache.org/lucene/nutch/trunk/src/java/org/apache/nutch/util/StringUtil.java
-- Ken
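A self-contained sketch of what that method does; the table entries here are illustrative, not Nutch's actual alias list:

    import java.util.HashMap;
    import java.util.Map;

    public class EncodingAliases {
        private static final Map<String, String> ALIASES = new HashMap<>();
        static {
            // map non-standard charset labels onto names Java's
            // decoder understands (example entry only)
            ALIASES.put("big5_hkscs", "big5-hkscs");
        }

        static String resolveEncodingAlias(String encoding) {
            String canonical = ALIASES.get(encoding.toLowerCase());
            return canonical != null ? canonical : encoding;
        }
    }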
heard about this feature but I am not finding the information
Thanks,
Marco
--
Ken Krugler
/nutch/util/StringUtil.java
--
Ken Krugler
, Mac OS X 10.4.7) says that
GB18030 is supported, so I'm guessing that's not your problem.
-- Ken
Ken Krugler wrote:
Thanks for your reply.
I have found that the method you mentioned looks into the HTTP header from
the web server. It looks for the charset and does the mapping. The apache web
was initially parsed. So it wound up in the
Nutch segments/index with the wrong value.
-- Ken
)
at org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)
--
Ken Krugler
to be valid UTF-8, and from my
experience Nutch works correctly with correctly
identified UTF-8 web pages.
So I'm guessing the '?' characters come about when your
webapp container/server tries to convert the
UTF-8 data to 8859-2.
-- Ken
Ken Krugler ([EMAIL PROTECTED]) wrote:
Hi All,
I would
Angeles based
Internet startup company. For more information please visit
http://www.ilial.com/crawler
[EMAIL PROTECTED])
--
Ken Krugler
DFS?
Thanks
--
Ken Krugler
this is a good way or whether including date range
clauses would have an adverse impact on performance.
Am I missing something? Is there a better way of doing this? Any help would
be much appreciated.
Regards,
Chris
--
Ken Krugler
this isn't the case, even if the
page similarity calculation determines that two pages should be the
same.
-- Ken
] /[segment]
Afterwards I created the crawldb and linkdb using
../bin/nutch crawldb ... and
../bin/nutch invertlinks ...
Then I ran solrindex to put everything into Solr.
Can somebody help?
Thank you very much!
--
Ken Krugler
] ^
[javac] 1 error
BUILD FAILED
/usr/local/nutch-1.0/build.xml:107: Compile failed; see the compiler error
output for details.
--
Ken Krugler
140mb. Are you talking about the topN value when you say I should set
the max URLs per host, or is there another setting I haven't found yet?
http://pastebin.com/m33bb6e6b
Ken Krugler wrote:
That could be true, but is that something I, as a Nutch user, can
configure?
It's interesting that your
.
Is there any issue with the charset? Please help me.
Thanks in advance.
Regards,
Chetan Patel
--
Ken Krugler
center, but crawl
using EC2, the time cost of moving the content
could be excessive. Though Amazon recently
introduced AWS Import/Export to help address this
issue.
-- Ken
2009/6/5 Ken Krugler kkrugler_li...@transpac.com
how long does it take for your 6 million URLs to be
crawled
and mapred.reduce.tasks to 15. Is this correct? HTTP
timeout is 5 seconds, max retries 2, 0.5 seconds between retries.
fetcher.threads.fetch is 300. How can I tweak the performance? What other
options may affect performance? Should I provide some other information
for you to be able to help me?
--
Ken Krugler
(on
a server) * 5, and the number of reducers == the number of cores. Oh,
and the number of threads to 200/# of mappers. But treat that as a
random data point.
-- Ken
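Spelled out as arithmetic (the core count below is an assumption, and this is only the "random data point" above):

    public class TaskMath {
        public static void main(String[] args) {
            int coresPerServer = 4;               // assumption
            int mappers = coresPerServer * 5;     // 20 per server
            int reducers = coresPerServer;        // 4 per server
            int threadsPerMapper = 200 / mappers; // 10
            System.out.println(mappers + " mappers, " + reducers
                + " reducers, " + threadsPerMapper + " threads each");
        }
    }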
Ken Krugler wrote:
See the previous discussion about how having relatively few unique
domains can significantly limit
should see entries like:
-activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
- fetching http://home.swipnet.se/~w-147200/
-- Ken
Ken Krugler wrote:
The real question is how many active fetches you have running
simultaneously. If most fetcher threads are idle, waiting for 30
Ken Krugler wrote:
If this is http.timeout, that's the length of time an HTTP request
will wait for a response before timing out. Which hopefully doesn't
happen very often for you.
Yes, that's it.
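For reference, that property goes in conf/nutch-site.xml and takes milliseconds; 10000 shown below is the stock default:

    <property>
      <name>http.timeout</name>
      <value>10000</value>
    </property>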
Ken Krugler wrote:
Delay between retries - what property name
other Microsoft
language.
See http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help for details.
-- Ken
, AFAIK there's no special weighting given to text pulled from the
body of the HTML.
I believe Nutch does give higher weight to the anchor text found for
links that point to the page, which is a key factor in generating
better search results.
-- Ken
  <name>mapred.system.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
</property>
</configuration>
--
Ken Krugler
Nutch has an auto-detector for character encoding. Does it convert
characters to a standard encoding automatically, after detecting it?
Yes - Nutch converts text to Unicode for all subsequent processing.
-- Ken
. If there is a URL like
http://www.mysite.com/images/1 which returns an image, will Nutch be
able to identify it and avoid downloading it?
I think Nutch will download the file, since URL filtering happens
before fetching and the filter only sees the URL, not the content type.
-- Ken
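If the image URLs do follow a recognizable path, though, a conf/regex-urlfilter.txt rule can exclude them before fetch; the pattern below assumes the /images/ path from the question:

    -^http://www\.mysite\.com/images/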
to our
topic.
7) we loop back to 3 above.
Eventually we end up with a Lucene-style index as usual, which can be
used with the Nutch web app, or Solr, or some other code
Who is interested in this or has done it in the past and can we
chat about it?
Alex
--
Ken Krugler
Hi Paul,
On Aug 19, 2009, at 6:08am, Paul Tomblin wrote:
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it? I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send
.
Thanks
--
Ken Krugler
am also using the parse-js
plugin.
But it does not look like Nutch is able to crawl these URLs.
Am I doing something wrong, or is Nutch not able to crawl URLs built
by a JavaScript function?
Thanks/Regards,
Parvez
--
Ken Krugler
--
Ken Krugler
Can you please just tell us in English what the creativecommons
plugin is for?
I mean, if I include this plugin in my nutch-site.txt, what will
I have as a result?
I think Andrzej is suggesting that you read the code.
If you look at the beginning of the CCParseFilter.java file, you'll see:
--
Ken Krugler
chars end up as '?';
I don't have any special requirement for any special characters, I am
happy with the usual UTF-8.
Any suggestion on the best way to configure this correctly? Everything
seems quite OK looking at the code; not sure what's missing.
Thanks.
--
Ken Krugler
unknown parts (folders) of the url?
Something like...
http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
-- Ken
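Note that the dot in website.com should also be escaped; a quick test of the corrected pattern, where the domain and folder names stand in for the real ones:

    import java.util.regex.Pattern;

    public class UrlPatternTest {
        public static void main(String[] args) {
            Pattern p = Pattern.compile(
                "http://([a-z0-9]*\\.)*website\\.com/[^/]+/known-folder/.*");
            System.out.println(p.matcher(
                "http://www.website.com/abc/known-folder/page.html")
                .matches()); // true
        }
    }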
?
Thanks,
Jason
Ken Krugler
or not with issue no. 1.
Any ideas, guys? I will be very grateful for any help or things that
can point me in the right direction.
Thanks,
Eran
--
-MilleBii-
Ken Krugler
. You'd need to tune the
config parameters to do the re-crawl at the target interval.
Though for pure site archiving, Heritrix is a more optimized solution,
especially when used with some of the add-on admin GUIs.
-- Ken
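For example, the re-fetch interval is controlled by db.fetch.interval.default in conf/nutch-site.xml; in recent versions the value is in seconds (30 days shown):

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
    </property>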
claudio.marte...@tis.bz.it http://www.tis.bz.it
Ken Krugler
to a detailed web page explaining the
purpose of the bot, etc.
But my crawler is still banned by several sites... :(
cheers
iful
http://zipclue.com
Ken Krugler
-mail: withan...@asia-europe.uni-heidelberg.de
Ken Krugler
in there by default?
Should be there by default, once the Tika plug-in gets rolled in.
-- Ken
this issue ?
Thanks
Ken Krugler
that gets rolled into the source then it should
be easier to use the project with Nutch.
-- Ken
sounds very strange - I'd check on the AWS
EC2 forum to see if anybody else has reported this with the AMI that
you're using.
-- Ken
the relevancy of this page
without
changing the page itself?
Ken