I think that is the property for the anchor-text length, but not the
length of a URL.
On 25.08.2006 at 04:28, Lourival Júnior wrote:
Try this one:
<property>
  <name>db.max.anchor.length</name>
  <value>800</value>
  <description>The maximum number of characters permitted in an anchor.</description>
</property>
I don't know if Chris Schneider's patch for HADOOP-406 will prove
to be the
long-term solution, but it certainly works for me.
If you like, please vote for this issue! I also use it in several
projects and wonder why it is not yet part of Hadoop.
Thanks.
Stefan
Hi,
I have some code using a queue-based mechanism and Java NIO.
In my tests it is 4 times faster than the existing fetcher.
But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part, since it is not usable
outside the HTTP protocol yet.
+ the fetcher does not support plug
Check:
http://issues.apache.org/jira/browse/NUTCH-233
and let us know if it helps.
Stefan
On 31.07.2006 at 07:46, Matthew Holt wrote:
Fetcher for one, and the mapreduce takes forever... i.e., the
mapreduce is kind of annoying... is it possible to disable it if
I'm not running on a DFS?
Matt
Dear Nutch Users,
Web spam is a serious issue for Nutch as well, but at the moment we
know only a little about the problem and how to work around it.
Please invest some time to help the research community by building a
collection for future research work.
Details are below.
Thank you.
I'm only a moderately experienced java programmer, so I was hoping I
could get a few pointers about where to begin on a particular problem.
I want to increase the score of a search result if the title contains
the search query and the result comes from a particular site.
Take a look at:
http://find23.net/Web-Site/blog/66A7676A-8C9C-4A93-8B59-A6A100EF8C1B.html
You may need to update that to the latest sources.
On 11.07.2006 at 15:29, Matthew Holt wrote:
Can someone that has Nutch development configured for Eclipse
please paste their .project and .classpath files?
Hi,
Nutch uses Lucene.
So you will find this interesting:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
Besides that, Nutch uses a kind of OPIC:
http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/scoring/opic/OPICScoringFilter.html
Also
Hi,
maybe you can try a much higher depth, something like 20?
However, in general check:
+ the regex URL filter file
+ the robots.txt
+ nofollow tags in the pages
+ the number of outlinks to extract in nutch-default.xml
Stefan
On 06.07.2006, at 19:12, kevin pang wrote:
I set up the nutch
Hi Otis,
the link graph lives in the linkdb.
I suggest writing a small map-reduce tool that reads the existing
linkdb, filters out the pages you want to remove, and writes the result
back to disk (a sketch follows below).
This will be just a couple of lines of code.
The Hadoop package comes with some nice map-reduce examples.
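A minimal sketch of such a tool, against the classic org.apache.hadoop.mapred API. Exact signatures drifted across the early Hadoop releases, and the 0.8-era linkdb actually used UTF8 keys where this uses Text, so treat the types as assumptions; the class name and the filtered host are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.Inlinks;

// Drops every linkdb entry whose URL matches an unwanted pattern and
// writes everything else back out unchanged.
public class LinkDbFilter extends MapReduceBase
    implements Mapper<Text, Inlinks, Text, Inlinks> {

  public void map(Text url, Inlinks inlinks,
                  OutputCollector<Text, Inlinks> output, Reporter reporter)
      throws IOException {
    // keep only entries that are not from the unwanted host
    if (url.toString().indexOf("forbidden.example.com") == -1) {
      output.collect(url, inlinks);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(LinkDbFilter.class);
    job.setJobName("filter-linkdb");
    job.setMapperClass(LinkDbFilter.class);           // map-only, no reduce needed
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);   // the linkdb is a MapFile dir
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Inlinks.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));  // existing linkdb
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // filtered copy
    JobClient.runJob(job);
  }
}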
Hi,
It would be nice to use the features of Nutch instead of my own hacky
stuff. How bound is Nutch to the J2EE container? Would it be a big job
to make it run with an alternative GUI? Or is the container used for
more than the GUI? I.e., do all services (crawler, etc.) run within the
container? Do
Hi,
maybe have a look at the Nutch indexer; it uses a kind of wrapper,
maybe this can help you.
Also please browse the Hadoop developer list archive, since there was
some related discussion.
HTH
Stefan
On 29.06.2006 at 14:41, Dennis Kubes wrote:
All,
Is there a way to get around having to
Kubes wrote:
The indexer uses an ObjectWritable and I am using that trick.
The problem is I need to input an ObjectWritable but output a
different object. I will take a look at the hadoop list.
Dennis
Stefan Groschupf wrote:
Hi,
maybe have a look at the Nutch indexer; it uses a kind of wrapper
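For context, the wrapper trick looks roughly like this: wrap heterogeneous values in ObjectWritable on the map side, then unwrap and dispatch on the concrete type in the reducer. A sketch against the old mapred API; the value types here are placeholders, not what the Indexer really emits:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UnwrapReducer extends MapReduceBase
    implements Reducer<Text, ObjectWritable, Text, Text> {

  public void reduce(Text key, Iterator<ObjectWritable> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Object value = values.next().get(); // unwrap the real object
      if (value instanceof Text) {
        // ... handle this input type ...
      } else {
        // ... handle the other input type ...
      }
    }
    // the collected output type is whatever the job declares; it does
    // not have to match the wrapped input types at all
    output.collect(key, new Text("..."));
  }
}

The map side does the mirror image: output.collect(key, new ObjectWritable(whateverObject)).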
On 12.06.2006 at 19:46, Dennis Kubes wrote:
Is anyone doing large-scale searching, and if so, what kind of
architecture is good? I have a 25G index now (merged) and the
searches are failing due to memory constraints. Is it better to
have multiple smaller indexes across machines?
yes.
Hi Karsten,
Nutch has the limitation of one URL, one document (in the crawldb or index).
The content and metadata for this document are normally available
'behind' the URL. The only exception is anchor text. Anchor texts are
data from the mother URL that are passed along and indexed within the
child
Just recrawl and reindex every day. That was the simple answer.
The more complex answer is that you need to write custom code that
deletes documents from your index and crawldb.
If you do not want to completely learn the internals of Nutch, just
recrawl and reindex. :)
Stefan
On 06.06.2006 at 19:42
Do you use Java 1.4 or 1.5?
In general, have a look at the Hadoop code base: TaskRunner.java,
line 145.
Stefan
On 05.06.2006 at 10:51, Murat Ali Bayir wrote:
Hi everybody, I have a problem running JProfiler on the remote side. I
am using DFS and submitting a crawl job.
I configure the LD library
I just found this:
http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
It's from Dec. 2005, so I am not sure if it will still work.
It still works, you only need to add more plugins. :)
Stefan
desperately want to be able to give Nutch a
list of
documents.
Ben
On 6/8/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Just recrawl and reindex every day. That was the simple answer.
The more complex answer is that you need to write custom code that
deletes documents from your index and crawldb
If the extension point was already available in 0.7 - yes.
On 06.06.2006 at 11:07, Peter Swoboda wrote:
hi,
Is it possible to integrate 0.8 plugins (MS PowerPoint..) into Nutch
0.7?
Thanx
Peter
the Hadoop map-reduce job. We could even contribute this back and
base a small tutorial on this work.
What do you think?
Rgrds, Thomas
On 6/2/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
using the search HTTP interface is a bad idea, since you get many,
but not all, pages.
Just write a Hadoop map
?
Rgrds, Thomas
On 6/2/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
using the search HTTP interface is a bad idea, since you get many,
but not all, pages.
Just write a Hadoop map-reduce job that processes the fetched content
in your segments; that should be easy.
Storing images in a file system will be very slow
Hi,
why not dedup your complete index beforehand instead of at runtime?
There is a dedup tool for that.
Stefan
On 29.05.2006 at 21:20, Stefan Neufeind wrote:
Hi Eugen,
what I've found (and if I'm right) is that the page-calculation is
done
in Lucene. As it is quite expensive (time-consuming)
You can just delete the parse output folders and start the parsing tool.
Parsing a given page again only makes sense for debugging, since the
Hadoop IO system cannot update entries.
If you need to debug, I suggest writing yourself a JUnit test.
HTH
Stefan
On 29.05.2006 at 01:01, Stefan wrote
Hi,
Did you check the regex URL filter?
By default, dynamic URLs are not allowed: all URLs containing a
question mark are excluded (see the default filter line below).
If you configure your URL filter properly, you should be able to fetch
your dynamic pages.
Stefan
On 27.05.2006 at 05:30, Jackey Yang wrote:
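For reference, the stock URL filter file (conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, depending on version) contains a line along these lines; the exact character class is from memory, so verify it against your copy:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Removing the '?' from that character class (or deleting the line) lets dynamic URLs through.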
Hey Guys,
Great Pictures. :)
Thanks for having a camera with you; it was a really nice
event, from my point of view.
We should repeat it. :-)
Cheers,
Stefan
On 23.05.2006 at 02:14, Michael Plax wrote:
Hello,
It was great to see everybody.
You can find photos from meeting on flickr
Hi Fabian,
wow, nutch 0.6 is really old school.. :-)
However, the simplest thing you can do is just write a class that
reads the data from a segment (parsed text and data) and writes it
into an index of your own (a sketch follows below).
This should be simple if you know how to write to a Lucene index.
HTH
Stefan
On
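The Lucene half of such a class could look like the sketch below (Lucene 1.9/2.x-style API; how you iterate the 0.6 segment is version-specific and left as a placeholder):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SegmentToIndex {
  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("myindex", new StandardAnalyzer(), true);
    // for each entry read from the segment (parsed text and data):
    String url = "...";  // taken from the segment entry
    String text = "..."; // the parsed text of that entry
    Document doc = new Document();
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    // end of loop
    writer.optimize();
    writer.close();
  }
}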
I'm not sure what you are planning to do, but you can just switch a
symbolic link on your HDD, driven by a cronjob, to switch between
indexes at a given time.
Maybe you need to touch the web.xml to restart the searcher.
If you try to search different kinds of indexes at the same time, I
general
(there
are variables like dedupField etc.).
Regards,
Stefan
Stefan Groschupf wrote:
Hi,
why not dedup your complete index beforehand instead of at runtime?
There is a dedup tool for that.
Stefan
On 29.05.2006 at 21:20, Stefan Neufeind wrote:
Hi Eugen,
what I've found (and if I'm right
Hi,
no Agenda
see:
http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1
Stefan
Hi there,
since there is such a big interest in the nutch user meeting,
we decided to move to another location.
We will now meet:
Rite-Spot Cafe
(415) 552-6066
2099 Folsom St
San Francisco, CA 94110
It's in a good location for parking too, and it's even reachable by
public transport -- 2 blocks
Hi Nutch Users,
Doug already mentioned it in the developers list (thanks!), but for
those of you that do not subscribe to the developer list...
The next CommerceNet Thursday Tech Talk will be about Extending
Nutch. I'll present a few slides about the plugin system and meta
data 'flow' in
I tried to upload some screenshots to the Jira but wasn't able to
do so. :(
But installing it means downloading it, decompressing it, and starting
bin/nutch gui /aFolder.
Stefan
On 04.05.2006 at 10:07, Jérôme Charron wrote:
is there any URL to see the GUI without installing the bundle?
This
Hi there,
since building the GUI is somehow complicated, I was thinking about
providing a ready-to-use binary.
This maybe would help to get some more beta testers, which we are
currently looking for.
Any thoughts?
However, I am afraid that this would hit my server too hard and I would
have to pay for
Hi Sami, Hi Dawid, Hi All,
yes, if there are enough people interested, I would love to get a
European user meeting organized as well.
A nice time would be the Wizards of OS conference this year
in September.
http://wizards-of-os.org/index.php?id=36&L=3
If people are interested to
(with apologies for multiple postings)
Dear Nutch users, Dear Nutch developers, Dear Hadoop developers,
we would love to invite you to the Nutch user meeting in San Francisco.
Date: Thursday, May 18th, 2006
Time: 7 PM.
Location: Cafe Du Soleil, 200 Fillmore St, San Francisco, CA 94117.
It depends on what you are planning to do; Nutch 0.8 supports metadata
that is very flexible (key-value tuples) and fast.
Also, you can store information in parseData.getMetaData; it will
be available up until indexing as well.
On 12.04.2006 at 04:31, sudhendra seshachala wrote:
Sorry to just
Doug Cutting wrote:
Perhaps we could enhance the logic of the loop at Fetcher.java:320.
Currently this exits the fetcher when all threads exceed a
timeout. Instead it could kill any thread that exceeds the
timeout, and restart a new thread to replace it. So instead of
just keeping a
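The idea in generic Java (this is not the actual Fetcher.java code, just the pattern Doug describes; the timeout value and all names are illustrative):

// Watchdog that replaces any single hung worker instead of exiting
// once all workers have exceeded the timeout.
public class FetcherWatchdog {
  private static final long TIMEOUT = 5 * 60 * 1000L; // illustrative
  private final long[] lastActivity; // last progress per slot
  private final Thread[] workers;    // (synchronization elided for brevity)

  FetcherWatchdog(int threads) {
    lastActivity = new long[threads];
    workers = new Thread[threads];
    for (int i = 0; i < threads; i++) start(i);
  }

  private void start(final int slot) {
    lastActivity[slot] = System.currentTimeMillis();
    workers[slot] = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          // ... fetch one URL ...
          lastActivity[slot] = System.currentTimeMillis(); // report progress
        }
      }
    });
    workers[slot].start();
  }

  // called periodically by the main loop
  void patrol() {
    long now = System.currentTimeMillis();
    for (int i = 0; i < workers.length; i++) {
      if (now - lastActivity[i] > TIMEOUT) {
        workers[i].interrupt(); // best effort; a truly stuck thread may ignore it
        start(i);               // replace it so parallelism is preserved
      }
    }
  }
}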
... a beta will be available soon.
On 11.04.2006 at 22:22, Rida Benjelloun wrote:
Hi Robert,
You can see this page:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But
I don't
have any idea about the progress of this project.
Best regards.
On 4/10/06, Robert Douglass
just 0.8.
On 11.04.2006 at 23:08, carmmello wrote:
Will this interface also cope with Nutch 0.7 or just the new 0.8?
- Original Message - From: Stefan Groschupf [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, April 11, 2006 5:53 PM
Subject: Re: Nutch
I already suggested adding a kind of timeout mechanism here and had
done this for my installation;
however, the patch suggestion was rejected since it was a
'non-reproducible' problem.
:-/
Am 07.04.2006 um 21:55 schrieb Rajesh Munavalli:
Hi Piotr,
Thanks for the help. I think I
On 07.04.2006 at 22:13, Jérôme Charron wrote:
I already suggested adding a kind of timeout mechanism here and had
done this for my installation;
however, the patch suggestion was rejected since it was a
'non-reproducible' problem.
Stefan, do you refer to NUTCH-233?
No:
Hi,
the extension-point plugin needs to be included in the includes as well.
Please note that nutch-site does not extend parameters but overwrites
them, and it is not a good idea to have just the parser plugins installed;
at the least you also need one protocol plugin, a query filter, and an
index filter (see the example below).
Stefan
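The includes are controlled by the plugin.includes property, and a nutch-site.xml entry has to repeat the whole list, not just the additions. Something along these lines; the exact default list differs between versions, so copy yours from nutch-default.xml rather than from here:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regex matching the plugin directories to load.</description>
</property>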
Better to use Nutch 0.8 to run a crawl using several machines.
There is some documentation in the wiki now.
On 08.03.2006 at 17:49, Olive g wrote:
Hi, I am new here.
Could someone please let me know the step-by-step instructions to
set up a
distributed crawl in 0.7.1?
Thank you.
I guess yahoo.com has a robots.txt that blocks crawling the complete site.
Also check the depth level you use.
On 08.03.2006 at 17:53, Olive g wrote:
Hello everyone,
I am also running a distributed crawl on 0.8.0 (some dev version), and
somehow the stats always
returned TOTAL urls as 1, while I
please!
From: Stefan Groschupf [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: how to search data on DSF (0.8)
Date: Wed, 8 Mar 2006 18:46:25 +0100
I just have no login and IP for the box any more.
In case you send me a login, IP, and the path where the sources are, I
can have someone take a look tomorrow.
Stefan
On 08.03.2006 at 19:03, Stefan Groschupf wrote:
Storing the index on a DFS works; just change the conf to use DFS in
nutch.war
Hi,
storing the index on the HDD would be a good idea.
Take a look at the NutchBean init method to get an idea of what you
need to change.
It should be simple: just allow providing a location for the
index that is different from the segments folder.
Stefan
On 06.03.2006 at 12:53,
Hi Thomas,
for this crawl setup we have a test environment of nutch 0.8,
10x AMDs, a custom Linux build, 100 Mbit eth1, 1 Gbit eth0; each box
has a caching DNS server.
Stefan
On 06.03.2006 at 15:59, TDLN wrote:
Stefan.
I know people having a 500-million-page index, and I personally run
crawls
This is very slow!
In my experience, you can expect results in less than a second.
+ Check the memory settings of Tomcat.
+ You do not use NDFS, right?
On 06.03.2006 at 00:23, Insurance Squared Inc. wrote:
Asking again for the patience of the list, we're still working on
speed. I guess what I
Hi,
'http' or 'www' are very good test queries.
Double-check that the nutch-default.xml inside the nutch.war
points to the correct folder in the searcher.dir property.
Stefan
On 06.03.2006 at 02:31, Hasan Diwan wrote:
I've followed the nutch tutorial for crawling and started tomcat from
the
If none are being fetched, something is definitely wrong with
your filter or URL file.
Yes, since it is a blog it may have dynamic pages like foo.com?entry=23;
these are definitely filtered by default.
-
blog: http://www.find23.org
company:
is not with
nutch, but instead with something at the OS or tomcat level, or
with another system process that nutch is using).
Stefan Groschupf wrote:
This is very slow!
In my experience, you can expect results in less than a second.
+ Check the memory settings of Tomcat.
+ You do not use NDFS
Hi Richard,
I told you I was more than willing to help, and I think many users
feel
the same way, but I for one feel that there is a lack of documentation
and support. This isn't meant to offend anyone; if you are offended,
you need to toughen up your skin a little bit.
Here you can find
The crawl command creates a crawldb for each call. So, as Richard
mentioned, try a higher depth.
In case you would like Nutch to go deeper with each iteration, try the
whole-web tutorial but change the URL filter so that it only crawls
your webpage (see the filter example below).
This will go as deep as the number of iterations
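Concretely, the tutorial's conf/crawl-urlfilter.txt approach (MY.DOMAIN.NAME is the placeholder the tutorial itself uses):

# accept everything within your own domain
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.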
Maybe we should organize ourselves a little better on this point.
What do you think?
Just a general note: Jira has voting functionality.
This allows everybody to vote on an issue and can show, in a very
compressed form, what the community is looking for.
However, it is not used that often
Jon,
there is also a hadoop user mailing list.
It is not clear to me what you are planning to do, but in general
Hadoop's tasktrackers and jobtrackers require running with DFS
switched on.
What you can do is write a map task that reads from the local
disk you mentioned, but you will no
I noticed that you work for a German company. Is it possible
to get
some Nutch support from you or your company?
Sure. Please note, here you find a list of all people providing support:
http://wiki.apache.org/nutch/Support
I have some problems getting Nutch running the way I want.
If
On 23.02.2006 at 01:55, sudhendra seshachala wrote:
Is there a way I could use HTTrack for crawling and nutch for just
searching?
Has anybody done this before, and is there a comparison between the
crawlers?
I suggest taking a look at Lucene, since I guess it is more work
changing Nutch to your
On 23.02.2006 at 02:37, Wong Ting Kiong wrote:
hi,
Is there any example Java code with which I can read the data from the
index files
in segments? I tried segmentReader, ArrayfileReader, and
SequenceReader and feel confused. Thanks.
Wong
On 1/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
you can
Hi Daniel,
thanks, we are still working on it.
Actually we have to finish something behind the scenes, and then we
will publish a kind of plugin extension point that will allow other
people to contribute.
Thanks for the offer; maybe the only thing you can do is vote for
this issue, since this
P.S. Now finally I could test Nutch... :)
Puhh, that was a pain! :-) Welcome!
Puhh, that was a pain! :-) Welcome!
Oops, I hit the send button too fast. :-/
Before people misunderstand that: 'welcome' means 'welcome to
Nutch'.
'Welcome' in German means in any case 'welcoming someone to something',
sorry.
This should be possible with the latest version, Nutch 0.8; you may
need to build from sources.
There, NutchConf is not static anymore and you can pass it down the
stack.
Besides that, you may need to store the NutchBean not as a context
attribute but in a hashmap that is stored as a context attribute (a
sketch follows below).
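A sketch of that idea with the plain servlet API; the NutchBean construction is deliberately left open, since the constructor differs between 0.8 revisions:

import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletContext;

// One bean per index, kept in a map that is itself the single
// context attribute.
public class SearchBeans {
  public static synchronized Object getBean(ServletContext ctx, String index) {
    Map beans = (Map) ctx.getAttribute("nutchBeans");
    if (beans == null) {
      beans = new HashMap();
      ctx.setAttribute("nutchBeans", beans);
    }
    Object bean = beans.get(index);
    if (bean == null) {
      bean = createBean(index);
      beans.put(index, bean);
    }
    return bean;
  }

  private static Object createBean(String index) {
    // construct a NutchBean pointed at 'index' here; the exact
    // constructor (NutchConf vs. Configuration) depends on your revision
    throw new UnsupportedOperationException("fill in for your Nutch version");
  }
}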
No.
On 22.02.2006 at 22:29, Saravanaraj Duraisamy wrote:
Hi,
in Nutch 0.7.1, is there a way to stop indexing without corrupting
the index
files in the middle of indexing?
thanks
d.saravanaraj
-
blog: http://www.find23.org
company:
Hi,
- Is there any way to perform form-based authentication? I
know
that this is a common request, but I haven't found a "good-enough"
answer to
it. The only references I've found are about basic auth, which I'd
prefer to
avoid. I ask this because I've noticed that SearchBlox,
Don't worry, I understood what you meant :)
Sorry, my English is too often just terrible; I'm trying to improve it.
I feel people too often misunderstand me.
Anyway, I guess and hope my Java is much better. :-)
But what is the reason for this kind of problem?
Why is Nutch not capable of select
http://cvs.apache.org/dist/lucene/nutch/nightly/
On 24.02.2006 at 01:44, sudhendra seshachala wrote:
The latest version I could see in the SVN is 0.7.1.
Where can I get 0.8? Source code is even better.
Could I just grab it from the nightly builds?
Please let me know..
Thanks
Sudhi
Hi,
maybe this is what you are looking for:
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>
On 22.02.2006 at 15:21, Martin Gutbrod wrote:
Hi,
I'd like to use nutch to index a large number of pdf files in a
Have you done some customizing, e.g. storing the NutchBean more than
once?
I personally never had such problems.
How many segments / indexes do you have?
On 22.02.2006 at 15:21, Insurance Squared Inc. wrote:
We're getting an out of memory error when running a search using
nutch 0.71 on a
I guess it is a historical reason.
I remember a discussion about replacing it, but I don't remember the
details; maybe you can find something in the mail archive (developer list).
On 22.02.2006 at 16:09, Elwin wrote:
Why does the URL filter of Nutch use Perl5 regular expressions? Any
benefits?
--
Does the NutchAnalyser support wildcard queries, ? and * (characters)?
I don't think so.
What are the modifications needed to support this?
A set of things like the Nutch QueryParser, the Nutch Query object,
a basic query filter, etc.
Or am I missing something here?
Rgds
Prabhu
On 2/22/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Does the NutchAnalyser support wildcard queries, ? and * (characters)?
I don't think so.
What are the modifications needed to support this?
A set of things like the Nutch QueryParser, the Nutch Query
I didn't create the segment file myself. It
was created via nutch generate. Please let me know what you mean by
'you have one key two times'.
Best regards,
Keren
Stefan Groschupf [EMAIL PROTECTED] wrote: I'm not sure whether
the key problem is the real source of the
problem. In general I suggest
Maybe this is your problem?
Entities.encode(url)
On 17.02.2006 at 15:13, Fankhauser, Alain wrote:
Hello
I use Nutch 0.8-dev and I'm trying to index a local file system. After
indexing, I start Tomcat and search. If I do this, I find the expected
results, but the links aren't correct. It's
This depends on the query filter plugins you are using.
As far as I know, only the score of a document increases if the word
occurs in the title, but there is no title query filter.
However, writing your own is very easy; check the query-site plugin (a
sketch follows below).
Stefan
On 17.02.2006 at 16:36, Nutch
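If the title were indexed as a raw (untokenized) field, the query-site pattern boils down to a few lines. A sketch assuming the 0.8-era RawFieldQueryFilter base class and the usual plugin.xml wiring; compare it with query-site in your source tree before relying on it:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.RawFieldQueryFilter;

// Makes 'title:foo' clauses searchable, in the style of the
// query-site plugin's SiteQueryFilter.
public class TitleQueryFilter extends RawFieldQueryFilter {
  private Configuration conf;

  public TitleQueryFilter() {
    super("title"); // the index field this filter handles
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}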
I'm not sure whether the key problem is the real source of the
problem. In general I suggest using Nutch 0.8, which fixes a set of issues.
E.g., it writes syncs to the files and creates checksums, since people
had problems with HDDs.
In any case, a Nutch map file requires ordered keys, and in your
Can you provide a full stack trace, maybe by setting the log level higher?
I guess this is a problem of the new OPIC score calculation, but you
are the first to report such a problem.
In general - sorry to repeat that so often - it is a good idea to run
the latest nightly builds of an open-source
No, it is a plugin that runs at search time.
Find some more documentation here:
http://www.carrot2.org/website/xml/index.xml
On 13.02.2006 at 20:02, Raghavendra Prabhu wrote:
Hi
What is the exact use of the clustering (Carrot) plugin?
Say I have Java the coffee as well as Java the language; will
You can remove a document from the index, which is at least the one
storage where manipulation makes sense.
You can also block a URL in general from coming into the segment by
using a URL filter.
Stefan
On 13.02.2006 at 18:18, Raghavendra Prabhu wrote:
Hi
Is there a way if we give a url , can
You can easily add new file formats by writing new content-type
parser plugins.
Just browse the code of one of the existing parsers, like the PDF or
the new SWF parser, to get an idea of what you need to do.
In the end you only need to write a parser for the content and return
some values (see the skeleton below). ... and
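A skeleton of such a plugin (0.8-era interfaces; the ParseData constructor and the metadata types shifted between releases, so treat the exact signatures as assumptions and compare with parse-text in your tree):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.*;
import org.apache.nutch.protocol.Content;

// Minimal content-type parser: turn raw bytes into text, a title,
// and outlinks.
public class MyFormatParser implements Parser {
  private Configuration conf;

  public Parse getParse(Content content) {
    String text = new String(content.getContent()); // real parsers decode properly
    String title = "";                              // extract if the format has one
    Outlink[] outlinks = new Outlink[0];            // links found in the document
    ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS, title,
                                   outlinks, content.getMetadata());
    return new ParseImpl(text, data);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}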
Hey,
there is an interesting discussion on the Lucene mailing list about a
similar theme:
http://www.gossamer-threads.com/lists/lucene/java-user/32629
I'm not sure if Dawid has subscribed to the Nutch user list as well, so
maybe you can catch him in the Carrot mailing list.
Stefan
On
I normally use a simple trick in such situations.
I create a new empty db, inject the URLs, create my segment, and fetch
the segment.
Then I inject the URLs a second time into my original db and update
the db with the segment (commands spelled out below).
Stefan
On 12.02.2006 at 18:11, Chris Schneider wrote:
Nutch
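Spelled out as 0.8-style commands (all paths are placeholders):

bin/nutch inject tmp_db urls                 # new empty db, inject the urls
bin/nutch generate tmp_db segments           # create the segment
bin/nutch fetch segments/SEGMENT             # fetch the segment
bin/nutch inject real_db urls                # inject the urls a second time
bin/nutch updatedb real_db segments/SEGMENT  # update the original db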
in the Apache Hadoop configuration; this is inside the hadoop jar.
On 10.02.2006 at 17:33, carmmello wrote:
After some time, I downloaded the latest nightly version of Nutch
(2006-02-10). Going through nutch-default.xml I could not find the
fs.default or mapred.reduce properties anymore.
Hi Scott,
yes, this makes sense.
I would also create a temp webdb, create the segment, and crawl the segment.
If you don't want to add the pages below the new URLs, then just
index the segment and add this segment to the other searchable
segments; do not update the db.
In general, if you
Bernd Fehling wrote:
Hi list,
I came across nutch while looking for search engines.
Nutch with its NDFS is very interesting to me.
A basic question:
Is it possible to install nutch with NDFS on a single machine
or do I need at least two machines?
I followed the instructions from Stefan Groschupf
Stefan Groschupf wrote:
Hi,
running NDFS in a single-box installation makes not much sense,
except if you plan to use it for research.
However, it is possible to run a namenode and a datanode on the
same box; also, you can run several datanodes on the same box.
Regarding your second question
I guess this is more a question of the configuration than of the
version.
In any case I suggest using the latest nightly build, since - well -
that is an active open source project. :-)
Carefully check your URL regex; also check what your webserver
returns as content type. There is a known
The hadoop.jar is in nutch/lib.
So you may better run under Nutch as it was before; also, some minutes
ago Doug moved some scripts back to nutch/bin, so as far as I know it
should work as before.
On 05.02.2006 at 20:40, Rafit Izhak_Ratzin wrote:
Hi,
I updated my environment to the newest
Instead of using the crawl command, I personally prefer the manual
commands.
Then I use a small script that runs
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
in a never-ending loop, waiting a day for each iteration (see the
sketch below).
This will make sure that you have all links that
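Roughly like this; paths, topN, and the one-day sleep are illustrative:

#!/bin/sh
# never-ending recrawl: one whole-web iteration per day
while true; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/* | tail -1`   # newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
  sleep 86400                            # wait a day for each iteration
done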
It is more a question of query filters than query parsing.
You can somehow remove the standard behavior and add your own query
filter plugin.
Just take a look at the query-basic and query-more query filter plugins.
On 04.02.2006 at 16:04, Albert Chern wrote:
Hello,
I want to search for a
Is the host available in your web browser?
Does this host block your IP, since it understands Nutch as a DoS attack?
Is your bandwidth limited?
On 05.02.2006 at 18:17, Raghavendra Prabhu wrote:
Hi
I am running a crawl using protocol-httpclient
I get a
java.io.IOException:
On 2/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Is the host available in your web browser?
Does this host block your IP, since it understands Nutch as a DoS
attack?
Is your bandwidth limited?
On 05.02.2006 at 18:17, Raghavendra Prabhu wrote:
Hi
I am running a crawl using protocol
not be due to protocol-http, but is there a
chance that
this may also be due to the same reason?
Thanks for the answer.
Rgds
Prabhu
On 2/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
I personally prefer protocol-http.
On 05.02.2006 at 18:26, Raghavendra Prabhu wrote:
Hi Stefan
My
Yes, I had already done this once, but it is not API-conformant any
more; when the porting of NDFS to Hadoop is done I may be able to bring
things in line with the API again and provide a patch.
However, there is a list of other issues on my todo list already, so it
will not happen within the next few days.
Stefan
On
Check the regex URL filter!
Your page contains symbols that are filtered.
On 03.02.2006 at 14:46, [EMAIL PROTECTED] wrote:
Hello,
I have problems indexing a particular internet site:
http://www.gildemeister.com
Nutch only fetches 14 pages, but not the complete site.
I'm using the default
There is already a JavaScript parser; you only need to switch it on.
On 03.02.2006 at 15:55, mos wrote:
The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why Nutch can't find the other pages (the links are
invisible).
Two ideas:
- You need
Also, it makes no sense, since it will come back as soon as the link
is found on a page.
Use a URL filter instead and remove it from the index.
Removing it from the webdb makes no sense.
On 03.02.2006 at 21:27, Keren Yu wrote:
Hi everyone,
It took about 10 minutes to remove a page from WEBDB
What happens if you change your query string to:
String query = "quiewId:a3d32ce0cae0da47677f30cc6182d421 HTTP";
(adding a space and HTTP)?
Does this return any hits?
Stefan
On 31.01.2006 at 12:05, Enrico Triolo wrote:
Hi all, I developed a couple of plugins to add and search a custom
field.
The
Metadata support is actually under development and coming soon. See
Jira for the latest discussion.
In any case, you can already write an index filter plugin; see the cool
fresh wiki documentation for that.
On 31.01.2006 at 23:25, Sunnyvale Fl wrote:
I need to add some meta data to the index