Hi!
Here are my crawl steps.
I started all the Hadoop daemons,
put the URL file into DFS,
then started the crawl.
Here is part of the crawl log.
060306 124851 parsing file:/usr/home/duche/nutch-nightly/conf/mapred-default.xml
060306 124851 parsing
I think it is strange.
OR is supported by Lucene, so it should
be supported by Nutch.
No ?
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Friday, March 3, 2006 18:39
To: nutch-user@lucene.apache.org
Subject: Re: query site
OR is not supported in nutch yet.
On
Hi Jerome!
Would it be possible to generate ngram profiles for the LanguageIdentifier
plugin from crawled content rather than from a file? My idea: the best
source of content in one language could be wikipedia.org. We would
just crawl the Wikipedia in the desired language and then create ngram
I didn't see query-basic/query-more on your list of included plugins. This
is what usually handles most queries. query-url will only handle parts of the
query that look like url:http://www.google.com, and query-site handles
site:www.google.com. Nothing seems to be handling just regular text in
I think that licence is OK.
Using that library in a plugin is really simple. I did some tests some
time ago.
All you have to do is something like this (content is a byte[]):
Metadata metadata = JpegMetadataReader.extractMetadataFromJpegSegmentReader(
    new JpegSegmentReader(content));
And then
I think it's a very good idea. It would be even better if one could
create a separate crawl script just for ngram creation, where one could
add one's own URLs, for example national library URLs. My
thinking is that
bin/nutch ngram
which is similar to crawl one shot intranet searching but
Hi,
Do you have an example of how the query-more plugin works?
For type, is it used to do something like that or not ?
+type:text/html
For type, is it used to do something like that or not ?
+type:text/html
If you just type a query like type:text, type:html or type:text/html
it will return no result.
It is a filter, ie you must associate it to a search term, for instance:
type:html nutch if you want to get nutch related
Elwin wrote:
When I read pages out of a webdb and printed out the url of each page, I
found two urls are just the same.
Is it possible that two pages with the same url?
WebDB should not allow two URLs that are exactly the same (Nutch uses
MD5 signature for that). Please check them
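One possible way to compare two URLs that look identical when printed is to hash them and compare the digests. The sketch below is only an illustration: the class name UrlSignature and the example URLs are hypothetical, and it assumes the signature is an MD5 digest of the URL string itself, which the message above does not spell out. It is not the WebDB code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: hashes a URL string with MD5.
final class UrlSignature {

    /** Returns the MD5 digest of the URL string as 32 lowercase hex characters. */
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "http://example.com/page";
        String b = "http://example.com/page";
        String c = "http://example.com/page ";  // differs only by a trailing space

        System.out.println(md5Hex(a).equals(md5Hex(b)));  // same string, same digest
        System.out.println(md5Hex(a).equals(md5Hex(c)));  // different string, different digest
    }
}
```

If the two digests differ, the URLs are not byte-for-byte identical, even if they print the same.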
It does not work for me.
When I search something, I've got no attribute type in the HitDetails.
I should see it, no ?
The plugin is definitely activated:
2006030612:39:28,377DEBUG.[]plugin: id=query-more name=More
Query Filter version=1.0.0 provider=nutch.orgclass=null
20060306
Hello,
I've just released a modified version of nutch071 and tomcat50 that runs
cross-platform off a CD-ROM or local hard drive:
http://sf.net/projects/vicaya
My ambitions are not 'the whole web' but a small and static collection
of pages. I intend to allow users to use nutch offline with the
Hi,
Storing the index on the hard disk would be a good idea.
Take a look at the NutchBean init method to get an idea of what you
need to change.
It should be simple: just allow a location to be provided for the
index that is different from the segments folder.
Stefan
Written on 06.03.2006 at 12:53 by
When I search something, I've got no attribute type in the HitDetails.
I should see it, no ?
You should see 3 type fields in HitDetails : one for primary type, one for
subtype and one for full content type.
Are you sure your index was built with the index-more plugin
activated?
Jérôme
Thanks, I forgot to activate the index-more plugin.
I only activated query-more.
-Original Message-
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Monday, March 6, 2006 14:17
To: nutch-user@lucene.apache.org
Subject: Re: query-more
When I search something, I've got no
I'd be glad to, but I need to clean them up a bit (and make them more
generic) first. In the meantime, here is a link to an article that I
found helpful:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
Search for 'recrawl'. You can use this script out of the box
You can put the inclusion and exclusion URL regexes on
separate lines or combine them with OR. It does not make much
difference. Make sure that you have this line at the end:
-.
This will make sure all other sites are not crawled.
- Ravi
On 3/3/06, Jack Tang [EMAIL PROTECTED]
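The role of the final "-." line in the URL filter advice above can be illustrated with a small sketch. It assumes rules are tried in order, that the first regex found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The class OrderedUrlRules and the example.com patterns are hypothetical; this is not Nutch's own filter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of ordered +/- regex rules; not the Nutch implementation.
final class OrderedUrlRules {

    /** One parsed rule: a sign (+ include, - exclude) and a regex. */
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines such as "+^http://..." or "-."; blank lines and # comments are skipped. */
    OrderedUrlRules(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        OrderedUrlRules filter = new OrderedUrlRules(Arrays.asList(
            "# two exclusions combined by ORing them in one regex",
            "-\\.(gif|jpg)$",
            "+^http://([a-z0-9]*\\.)*example\\.com/",
            "# catch-all: reject everything not accepted above",
            "-."));

        System.out.println(filter.accepts("http://www.example.com/index.html")); // true
        System.out.println(filter.accepts("http://www.example.com/logo.gif"));   // false
        System.out.println(filter.accepts("http://other.org/"));                 // false
    }
}
```

Because the first matching rule wins in this sketch, the order of the lines matters: the image exclusion has to come before the inclusion rule, and the catch-all has to come last.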
Nutch is not able to find the urls file you have specified on the
command line. The filename you have mentioned is urls.txt and not
urls. Correct this by changing the filename or by specifying urls.txt
on the command line.
- Ravi
On 3/3/06, Pine Cone [EMAIL PROTECTED] wrote:
Hello,
I am
Stefan.
I know people who have a 500 million page index, and I personally run crawls at
~300 pages per second.
Sorry, but I have to ask: what kind of setup do you have (network, hw, nutch
version) that you manage so many pages per second?
Unless this is a company secret, it would be very nice to know
So far I have been using Nutch to learn how it works. I have
been fairly successful in getting it up and running for some sites on
my local machine.
I sincerely thank this vibrant group for helping me and many others.
I have some questions or issues, however
Hello,
Is it possible to have more than one Nutch application on one Nutch
installation?
What I would like to do would be to have several (4-5) indexes
relating to independent websites, searchable independently but with
just one Nutch install (ie, one Tomcat webapp).
On the indexing side, this
Hi Thomas,
for this crawl setup we have a test environment of nutch 0.8,
10xAMD's, custom linux build, 100Mbit eth1, 1Gb eth0, each box has a
'caching' dns server.
Stefan
On 06.03.2006 at 15:59, TDLN wrote:
Stefan.
I know people who have a 500 million page index, and I personally run
crawls
On 3/4/06, Stefan Groschupf:
Just a general note: JIRA has a voting feature.
It lets everybody vote on an issue and shows, in a very
compact form, what the community is looking for.
However, it is not used that often yet. It would be great if more
users used it.
That's a
Hi,
We are using 0.8, and I see a property called db.ignore.internal.links
that is used by LinkDB to, well, ignore internal links. What we need is
a runtime-switchable option that allows the opposite, something like
db.ignore.external.links. This is to say, for a given page, we don't
want to
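The distinction the message draws between internal and external links can be sketched as follows. This assumes "internal" means that the source and target URLs share the same host. The class LinkScope and the ignoreExternal flag are hypothetical, the latter mirroring the option the poster asks for. It is an illustration only, not LinkDB code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration of internal/external link filtering; not LinkDB code.
final class LinkScope {

    /** Returns the lowercased host of a URL, or null if it cannot be determined. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Assumption: a link is internal when both URLs have the same known host. */
    static boolean isInternal(String fromUrl, String toUrl) {
        String from = hostOf(fromUrl);
        return from != null && from.equals(hostOf(toUrl));
    }

    /** Decides whether a link is kept, given the two ignore switches. */
    static boolean keepLink(String fromUrl, String toUrl,
                            boolean ignoreInternal, boolean ignoreExternal) {
        boolean internal = isInternal(fromUrl, toUrl);
        if (internal && ignoreInternal) {
            return false;
        }
        if (!internal && ignoreExternal) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String page = "http://example.com/a";
        // With only ignoreExternal set, same-host links are kept and others dropped.
        System.out.println(keepLink(page, "http://example.com/b", false, true)); // true
        System.out.println(keepLink(page, "http://other.org/", false, true));    // false
    }
}
```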
I've seen it noted that a complete recrawl is necessary to migrate from
0.71 to 0.8. Is this absolutely necessary? Or could a converter be
created to migrate the data? Has anyone created this?
I expect at some point I'll have to move versions and something like
this would be very useful.
I need to index Excel and PowerPoint files in Nutch 0.7.1.
I've seen the plugins in nutch-0.8-dev.
Is there no version of these plugins for Nutch 0.7.1?
And is it possible to index OpenOffice documents? If so, which version
is required?
Thanks
Hi,
Does Nutch 0.8 support https fetches? If not, are there any active
efforts to support it?
TIA,
David Odmark
I need to index Excel and PowerPoint files in Nutch 0.7.1.
I've seen the plugins in nutch-0.8-dev.
Is there no version of these plugins for Nutch 0.7.1?
Originally, these plugins were written for nutch-0.7.1, and then adapted and
committed to nutch-0.8.
You can retrieve the original patches from JIRA.
David Odmark wrote:
Hi,
Does Nutch 0.8 support https fetches? If not, are there any active
efforts to support it?
It does, using protocol-httpclient plugin.
--
Best regards,
Andrzej Bialecki
Ravi:
Just wondering, did you submit your modification to JIRA? I can't
seem to find it.
Thanks
On 3/6/06, Ravi Chintakunta [EMAIL PROTECTED] wrote:
Hi Frank,
Have a look at this thread.
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03014.html
- Ravi
On 3/6/06, Franz
OK, I found them, thanks.
-Original Message-
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Monday, March 6, 2006 18:41
To: nutch-user@lucene.apache.org
Subject: Re: Indexing Excel and Powerpoint
I need to index Excel and PowerPoint files in Nutch 0.7.1.
I've seen the plugins in
Hi Andrzej,
I applied your patch for adaptive refetch. In Indexer.java, the case
statement for STATUS_FETCH_UNMODIFIED is missing in the reduce() method. I
assume a simple break statement should be added there.
Thanks
D.Saravanaraj
Hello Team,
I am having a lot of fun evaluating 0.8-dev, and after following
Stefan's and the doc team's tutorials, have got everything working in
both local and multi-machine modes using hadoop.
In single-machine mode, I have come unstuck, though, trying to expose
nutch server on port 8081
From: Laurent Michenaud [mailto:[EMAIL PROTECTED]
Sent: 2006-3-06 0:13
To: nutch-user@lucene.apache.org
Subject: RE: query site
I think it is strange.
OR is supported by Lucene, so it should
be supported by Nutch.
No ?
No, Nutch doesn't use Lucene's QueryParser. It has its own
Ravi,
Thanks for your answer, and yes, your problem was very similar to mine
-- more complicated, even, since you want to be able to search one or
several indices at a time, and I need to search only one.
Is your solution available online somewhere, as a patch or plugin?
That would be very
Richard Braman wrote:
I really do think nutch is great, but I echo Matthias's comments that the
community needs to come together and contribute more back. And that
comes with the requirement of making sure volunteers are given access to
make their contributions part of the project.
Here's how
David Wallace wrote:
Also, I've lost count of the number of times someone has posted
something to the effect of "I'll pay someone to give me Nutch support",
simply because they find the existing documentation and mailing lists
inadequate. Usually, that person gets told that the best way to get
Florent Gluck wrote:
In hadoop jobtracker's log, I can see several tasks being lost, as follows:
060306 184155 Aborting job job_hyhtho
060306 184156 Task 'task_m_7qgat2' has been lost.
060306 184156 Aborting job job_hyhtho
060306 184156 Task 'task_m_lph5qs' has been lost.
060306 184156 Aborting
Monu Ogbe wrote:
Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
at java.lang.Class.newInstance0(Unknown Source)
at java.lang.Class.newInstance(Unknown Source)
at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav
It
Matthias Jaekle wrote:
Maybe we should move the tutorial to the wiki so it can be commented on.
+1
+1
Doug
On 06/03/06, Howie Wang [EMAIL PROTECTED] wrote:
Is query-basic or query-more included in your nutch-default.xml?
It is indeed included in my nutch-site.xml :-
<property>
<name>plugin.includes</name>
Delete the crawl folder which would have been created in the previous crawl.
On 3/7/06, ilango gurusamy [EMAIL PROTECTED] wrote:
Hi
I am trying to run Nutch by following the instructions given in the
tutorial.
The environment is SUSE Linux 10, JDK 1.4.2 and Nutch 0.71. And of course
Tomcat 5
Hi, Hasan,
Looking more carefully at the query-more plugin, it seems that it
only adds functionality for date queries and type queries. I think
you need to add query-basic to the list also to get it to search
the default content. Can you try adding query-basic and running:
bin/nutch search http
Hi
I successfully ran Nutch. Thanks for the tip. Strangely, I remember deleting
the crawl directory before, but anyway, you worked the magic for me.
By the way, Saravanaraj, are you from TN? What are your research interests
with Nutch?
ilango
D.Saravanaraj [EMAIL PROTECTED] wrote: Delete