to create fetchlists based on a list of
arbitrary URLs. This comes in handy if you want to test various parts of
Nutch with arbitrary URLs, not coming from the DB.
--
Best regards,
Andrzej Bialecki
of April.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
the frame contents. Please download the nightly snapshot and
try it out.
--
Best regards,
Andrzej Bialecki
, translated error codes are recorded in segment data, and a
subset of these translated codes is recorded in WebDB.
--
Best regards,
Andrzej Bialecki
performance penalty. Some disk subsystems
are good with bursty traffic (because of a large cache) but quite bad
with sustained traffic.
--
Best regards,
Andrzej Bialecki
segments
into one, and much more.
--
Best regards,
Andrzej Bialecki
pathSuffix (may be empty) and contentType.
--
Best regards,
Andrzej Bialecki
measure against this short testing period I would leave it disabled by
default.
Please vote +1 if I should commit it before the release, or -1 if after.
--
Best regards,
Andrzej Bialecki
with
-noParsing option. This way we should be able to eliminate problems
related to parsing.
--
Best regards,
Andrzej Bialecki
occurs, and where occasional breakage may happen and may
even last for a longer time, and this is acceptable there.
--
Best regards,
Andrzej Bialecki
/on in the config: if it's off, the unknown content
is skipped and logged; if it's on, a best effort is made to extract
text.
--
Best regards,
Andrzej Bialecki
. The HTML parser may claim that it supports
plaintext, but there is another plugin specifically for plaintext. Which
of them wins?
--
Best regards,
Andrzej Bialecki
this, you need to re-index your segments.
--
Best regards,
Andrzej Bialecki
it first, or to use it as such.
If there are no objections, I will change it in the trunk/ in a couple
of days.
--
Best regards,
Andrzej Bialecki
the content, you just need to re-create
segment indexes to reflect the changes.
--
Best regards,
Andrzej Bialecki
exceed available resources or limits.
--
Best regards,
Andrzej Bialecki
the -noParsing flag to
fetcher for all those experiments. In the past it was common for the
fetcher to be stuck in a buggy parser plugin, so you will need to
eliminate this factor.
--
Best regards,
Andrzej Bialecki
Paul Harrison wrote:
The 250GB is with cached pages.
There is some dependency on your settings for maximum content size - if
you allow content such as PDF, DOC, etc., then the average disk space per
page could increase to 20kB and more.
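The arithmetic behind such an estimate can be sketched as below. This is only an illustration: the page count is a made-up figure that does not come from the thread, and the function name is hypothetical.

```python
def estimate_disk_gb(pages: int, kb_per_page: float) -> float:
    """Estimate storage in GB, using decimal units (1 GB = 1,000,000 kB)."""
    return pages * kb_per_page / 1_000_000


if __name__ == "__main__":
    # 10 million pages is an assumed figure, chosen only for the example.
    print(estimate_disk_gb(10_000_000, 20.0))
```

Doubling the average page size doubles the estimate, which is why the maximum content size setting matters so much.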
--
Best regards,
Andrzej Bialecki
Victor Lee wrote:
How should I get around the problem?
Don't use php-java bridge - use OpenSearch servlet to get RSS with
results, and then parse RSS using PHP; the servlet container will cache
NutchBean for you.
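The "fetch RSS, then parse it" idea can be sketched as follows, here in Python rather than PHP. The sample document is a hand-written, hypothetical RSS 2.0 fragment, not real servlet output, and `parse_results` is an illustrative name; in practice the text would come from an HTTP request to the servlet.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 response; real servlet output will contain more fields.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Search results</title>
    <item>
      <title>Example page</title>
      <link>http://www.example.com/</link>
      <description>A short summary.</description>
    </item>
  </channel>
</rss>
"""


def parse_results(rss_text):
    """Return a list of (title, link) pairs, one per <item> element."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


if __name__ == "__main__":
    for title, link in parse_results(SAMPLE_RSS):
        print(title, link)
```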
--
Best regards,
Andrzej Bialecki
, and some broken
links.
--
Best regards,
Andrzej Bialecki
/apache/nutch/searcher and I imagine I'll eventually figure
it out, but if someone could point me in the right direction, I'd
appreciate it.
You need a query plugin - please see e.g. query-host or query-more plugins.
--
Best regards,
Andrzej Bialecki
and LinkDB).
Also, Luke is always a good tool for browsing a Lucene index.
(Andrzej: it is really cool! :D I use it several times a week)
Thx :) There are some bugs there, of which I'm aware, but I'm holding
back the new release until the official Lucene release.
--
Best regards,
Andrzej
function?
The crawl command is just for those who are too lazy to run all 4
steps by hand... ;-)
There is nothing magical about this. Just follow the standard workflow:
generate, fetch, updatedb, invertlinks, generate, fetch ...
dedup
index
search
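As a rough illustration of the repeated part of this workflow, the sketch below only builds the command lines for one generate/fetch/updatedb round; it does not run anything. The `bin/nutch` location, the `crawl/crawldb` and `crawl/segments` paths and the segment name are hypothetical, and the argument order is an assumption based on the 0.8-era command-line tools, so check each tool's usage message before relying on it.

```python
from typing import List


def crawl_round(segment: str,
                crawldb: str = "crawl/crawldb",
                segments_dir: str = "crawl/segments",
                nutch: str = "bin/nutch") -> List[List[str]]:
    """Return the command lines for one generate/fetch/updatedb round.

    All paths are hypothetical defaults. The segment name is passed in
    because generate creates a new directory under segments_dir.
    """
    segment_path = f"{segments_dir}/{segment}"
    return [
        [nutch, "generate", crawldb, segments_dir],  # write a new fetchlist
        [nutch, "fetch", segment_path],              # fetch that segment
        [nutch, "updatedb", crawldb, segment_path],  # fold results into the DB
    ]


if __name__ == "__main__":
    for command in crawl_round("20060101000000"):
        print(" ".join(command))
```

Each list could be handed to `subprocess.run` once the paths match a real installation.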
--
Best regards,
Andrzej Bialecki
Matt Zytaruk wrote:
Hi all,
Just a quick question for you all. Is the segment slicer tool
compatible with the map reduce version of nutch?
Not yet. Any help is appreciated - it shouldn't be hard to do. Take a look
at the CrawlDBReader or LinkDBReader.
--
Best regards,
Andrzej Bialecki
nutch merge. If you expect that there are some
duplicates, you will need to run dedup. That's all.
--
Best regards,
Andrzej Bialecki
that inside each segment directory you have a
per-segment index, because the nutch merge command will use them to
create the master index.
--
Best regards,
Andrzej Bialecki
the archives)
is to change your Tomcat server.xml, and add
useBodyEncodingForURI='true' in your Connector definition. And then
consistently use UTF-8 in all JSPs.
--
Best regards,
Andrzej Bialecki
/bin$ ./nutch ndfs -put nutch nutch
Could you try the same, but using absolute paths? NDFS client has no
notion of relative or current directory, so the file names must always
be absolute, i.e. starting with the leading / .
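One way to guard against this is to resolve names against an explicit base directory before passing them to the client. The helper below is a minimal sketch; `to_absolute` and the `/user/me` base directory are hypothetical and not part of Nutch.

```python
import posixpath


def to_absolute(path: str, base_dir: str) -> str:
    """Return path unchanged if it is absolute, else resolve it under base_dir."""
    if posixpath.isabs(path):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(base_dir, path))


if __name__ == "__main__":
    # "/user/me" is an assumed base directory, chosen only for the example.
    print(to_absolute("nutch", "/user/me"))
```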
--
Best regards,
Andrzej Bialecki
. Most likely you
encountered either protocol errors or parsing errors, so there was
nothing to index from these entries.
In addition, if you ran the deduplication, some of the entries in your
index may have been deleted because they were considered duplicates.
--
Best regards,
Andrzej Bialecki
to
perform in order to find all occurrences of these words when processing a
phrase query.
--
Best regards,
Andrzej Bialecki
the estimated total hits, and also the first couple of hits.
--
Best regards,
Andrzej Bialecki
investigating?
Do you use the parse-pdf plugin? Please do a thread dump of the stuck
process (Ctrl-\ on Unix or Ctrl-Break on Windows, if I'm not mistaken).
--
Best regards,
Andrzej Bialecki
that are not fetched won't be processed at all, so
those Pages in WebDB won't get updated and you will have to wait another
week (or use -adddays).
--
Best regards,
Andrzej Bialecki
documented. But remember that the 0.7 branch is now in
maintenance mode, so no new features and only small bugfixes will show
up there; most of the effort now goes into developing the 0.8 (trunk)
branch.
--
Best regards,
Andrzej Bialecki
K.A.Hussain Ali wrote:
Hi all,
Does delete duplicates (dedup) work on a single segment?
Dedup works on multiple indexes. Please see the source of Crawl.main()
for an example of its use.
--
Best regards,
Andrzej Bialecki
outlinks point
to the correct page
Is that because the site has to have a base URL value?
Yes. It's enough to add this somewhere in the head element of the HTML.
--
Best regards,
Andrzej Bialecki
(yet?). Please use the Sun
JVM.
--
Best regards,
Andrzej Bialecki
. There are
only a couple of things missing; everything else should already be
context-independent.
--
Best regards,
Andrzej Bialecki
of a JobSubmissionProtocol - but I think there is no way now for
arbitrary code to reference its JobClient... bummer.
Some food for thought, anyway.
--
Best regards,
Andrzej Bialecki
for health-related information?
--
Best regards,
Andrzej Bialecki
will be injected anyway when you
update the DB. Some time ago I added a tool (in JIRA) to create such
fetchlists; it works with 0.7.
--
Best regards,
Andrzej Bialecki
, checking them, etc.
--
Best regards,
Andrzej Bialecki
, but there are similar Linux solutions, or
commercial routers with built-in traffic shaping.
I think that you could also play some tricks with a bandwidth-limiting
proxy server, because protocol-httpclient can use a proxy.
--
Best regards,
Andrzej Bialecki
need no stinking TCP, we route good
ol' IP ;)
Please check your facts before claiming something about all ISPs around
the world.
--
Best regards,
Andrzej Bialecki
newline at the end of the
urllist.txt.
--
Best regards,
Andrzej Bialecki
, second, or both? I suspect only the second
change was really needed, i.e. the change in config files, and not the
change of protocol-httpclient - protocol-http ... It would be very
helpful if you could confirm/deny this.
--
Best regards,
Andrzej Bialecki
Is this possible?
Not yet. This will be added soon to 0.8 (trunk).
--
Best regards,
Andrzej Bialecki
,
and segments.
--
Best regards,
Andrzej Bialecki
you perhaps check which
exception (if any) the JS parser throws when it fails? It could be
emitted into one of the tasktracker logs.
--
Best regards,
Andrzej Bialecki
is a float value, in seconds.
--
Best regards,
Andrzej Bialecki
most of the time... ;-)
--
Best regards,
Andrzej Bialecki
as spam.
This is the purpose of the CrawlDatum metadata patch... coming soon, I
hope :-)
--
Best regards,
Andrzej Bialecki
in the
CrawlDatum metadata. I'm working on this patch and will update it on JIRA soon.
--
Best regards,
Andrzej Bialecki
in that patch.
--
Best regards,
Andrzej Bialecki
Vanderdray, Jacob wrote:
Is there an HTTPS protocol implementation for nutch?
Yes, protocol-httpclient supports https.
--
Best regards,
Andrzej Bialecki
, i.e. don't collect the result.
That's all. :-)
--
Best regards,
Andrzej Bialecki
it otherwise
I'll be happy to correct it.
--
Best regards,
Andrzej Bialecki
Elwin wrote:
Yes, it's true, although it's not the cause of my problem.
Did you try to use the alternative HTML parser (TagSoup) supported by
the plugin? You need to set the property parser.html.impl to tagsoup.
--
Best regards,
Andrzej Bialecki
on truckin'. This is especially true for pages with
multiple html elements, where Neko ignores all elements but the first
one, while TagSoup just treats any html elements inside a document
like any other nested element.
--
Best regards,
Andrzej Bialecki
a suitable front-end.
--
Best regards,
Andrzej Bialecki
carefully, most probably they
differ by only a single character or some whitespace.
--
Best regards,
Andrzej Bialecki
David Odmark wrote:
Hi,
Does Nutch 0.8 support https fetches? If not, are there any active
efforts to support it?
It does, using protocol-httpclient plugin.
--
Best regards,
Andrzej Bialecki
that
implement Configurable? Perhaps it should, using the current JobConf.
--
Best regards,
Andrzej Bialecki
), which removes obsolete
versions of pages from indexes. Pages are still present in segments
until you delete old segments, but they won't appear in the searchable index.
--
Best regards,
Andrzej Bialecki
will apply this fix too.
--
Best regards,
Andrzej Bialecki
a fetch list first based on the seed
urls, then on the links found on that page (for each subsequent
iteration), then on the links on those pages, and so forth and so on
until the entire domain is crawled, if you limit the domains with a
filter.
Yes.
--
Best regards,
Andrzej Bialecki
refetch anyway, and if it doesn't succeed we just increase
the interval by 50%.
Now, fixing this the same way in 0.7 would mean that pages no longer end
up in PAGE_GONE state. Is this a fix of broken behavior or a new
behavior (new feature)? I'm not sure...
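The 50% increase mentioned above amounts to the following. This is a minimal sketch, not the actual Nutch code: the function name is hypothetical and measuring the interval in days is an assumption.

```python
def interval_after_failed_refetch(interval_days: float) -> float:
    """Return the refetch interval after a failed attempt (50% longer)."""
    return interval_days * 1.5


if __name__ == "__main__":
    # A 30-day interval is an assumed starting value for the example.
    print(interval_after_failed_refetch(30.0))
```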
--
Best regards,
Andrzej Bialecki
me know if my inferences are correct and sorry for a bigger mail.
No problem with the size. Yes, your conclusions seem correct.
--
Best regards,
Andrzej Bialecki
Doug Cutting wrote:
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only
properties in any place except the currently
running process. All properties are read anew from the config files
whenever you start any nutch processing.
--
Best regards,
Andrzej Bialecki
... It is not wise to put IP addresses in your emails.
Agreed.
--
Best regards,
Andrzej Bialecki
me to it ...
--
Best regards,
Andrzej Bialecki
is
also good.
--
Best regards,
Andrzej Bialecki
located?
Apparently Nutch doesn't find one of the input directories, so it's
either not there, or the config is wrong, but without more information
it's impossible to tell.
--
Best regards,
Andrzej Bialecki
/GettingNutchRunningWithWindows).
When using Open Source software you should be prepared to do some basic
research on your own.
--
Best regards,
Andrzej Bialecki
- and then you should remove protocol-http from your
config.
--
Best regards,
Andrzej Bialecki
)
at net.nutch.db.WebDBWriter.createWebDB (WebDBWriter.java:1425)
at net.nutch.tools.WebDBAdminTool.main (WebDBAdminTool.java:159)
You are using an incompatible GNU Java. Either upgrade your GCC/GCJ to
4.x.x, or use Sun Java.
Besides, Nutch 0.6 is ancient history, you should use 0.7.1 (or 0.7.2).
--
Best regards,
Andrzej
if it's an option, or revert to revision
388299 .
--
Best regards,
Andrzej Bialecki
to function in the above manner, right? Did I miss
anything?
Yes, this is how it's supposed to work.
--
Best regards,
Andrzej Bialecki
in CrawlDbReducer.java:86 in both versions),
instead it should be initialized with the value from
old.getFetchInterval(), if available. Please fix this in your version,
I'll fix this in the un-patched version.
Thanks for spotting this!
--
Best regards,
Andrzej Bialecki
Andrzej Bialecki wrote:
Mehmet Tan wrote:
Hi,
I want to ask a question about redirections. Correct me if I'm wrong
but if a page is redirected to a page that is already in the webdb,
then the
next updatedb operation will overwrite all previous info about refetch,
because it is a newly
a feeling that not too many
people really reviewed this patch.
So, IMHO these patches need more testing, because the potential for
disruption is rather large.
--
Best regards,
Andrzej Bialecki
bring this patch up to date over the weekend.
--
Best regards,
Andrzej Bialecki
on my
TODO list, but with a low priority.
--
Best regards,
Andrzej Bialecki
is not compatible with 0.7. With (significant)
effort suitable converters could be made, but it would be way less
expensive to just bite the bullet and implement missing functionality in
0.8.
--
Best regards,
Andrzej Bialecki
).
This is technically possible, but simply not implemented (yet).
--
Best regards,
Andrzej Bialecki
, merging indexes) are already
supported if you use individual command-line tools and a single DB. So,
I'm not planning to do anything about it.
--
Best regards,
Andrzej Bialecki
Byron Miller wrote:
Got the following dump at 100% of generate cycle
(.8 svn release)
Just fixed this. Sorry.
--
Best regards,
Andrzej Bialecki
if they come from
redirection or directly from the outlinks. If you make an exception for
such urls, next time you generate a fetchlist or updatedb these urls
will be filtered out anyway.
--
Best regards,
Andrzej Bialecki
in the tutorial
(http://wiki.apache.org/nutch/NutchTutorial). Please follow the tutorial
sections on Step-by-Step or Whole-web Crawling - you will save
yourself (and us) a lot of grief.
--
Best regards,
Andrzej Bialecki
Index: src/java/org/apache/nutch/crawl
as
unlimited <name>http.content.limit</name>
<value>-1</value>. Will that be the reason?
These particular problems happen, among other cases, when you run out of disk
space - please check that you have enough disk space, also on your /tmp
partition.
--
Best regards,
Andrzej Bialecki