suggestion and comment.
Vinci
--
View this message in context:
http://www.nabble.com/Parsed-Text-and-Re-parsing-tp16392741p16392741.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi all,
Is it possible to delete a document from the Nutch index and segments?
Thank you,
Vinci
--
View this message in context:
http://www.nabble.com/Delete-document-from-segment-index-tp16254945p16254945.html
Sent from the Nutch - User mailing list archive at Nabble.com.
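Deleting from a segment itself is not supported by a stock command, but index entries can be removed. A hedged sketch, assuming the PruneIndexTool that ships with Nutch 0.8/0.9; the flag and file names below are from memory and the paths are placeholders, so run the tool with no arguments first to see the exact usage for your version:

```
# Delete index entries matching the Lucene queries listed in queries.txt
# (placeholder paths; the segment data itself is left untouched).
bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -queries queries.txt
```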
n
for the naming convention
Follow-up: after checking more of the feeds I crawled, apart from the broken
characters, I found that not every title is mis-parsed: some text is parsed
correctly and some is not, although both are well-formed...
Thank you,
Vinci
sishen wrote:
>
> I also prefer title
Hi all,
I am trying to dump the content with the segment reader (bin/nutch -dump). The
output text contains two encodings, UTF-8 and a multi-byte character encoding.
When I open the dumped page, I find the multi-byte-encoded text is broken; even
after converting it to the correct encoding, the displayed text is still broken
Hi all,
I found that the RSS parser plugin uses the content of the <description>
element as anchor text rather than the <title>, so it always indexes
the description, while the title text is never indexed or used as anchor
text.
But the title is actually much more valuable and should be used as anchor
text.
Is this
Hi all,
I have begun working with pages fetched by Nutch. When I try to dump segments, I
see there are many status codes, e.g. 67 (linked), 65 (signature),
33 (fetch_success), etc. I googled but found no further clues; can anyone give a
list of those status codes and explain their differences?
Thank you.
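For what it's worth, a hypothetical Python sketch of decoding those dump codes; the three names below are taken from the post itself, and the authoritative list lives in the org.apache.nutch.crawl.CrawlDatum constants:

```python
# Sketch: map CrawlDatum status bytes (as printed by a segment dump) to names.
# Only the codes mentioned in the post are included; check CrawlDatum for the
# full, version-specific list.
STATUS_NAMES = {
    33: "fetch_success",  # 0x21: page fetched successfully
    65: "signature",      # 0x41: entry carries only a content signature
    67: "linked",         # 0x43: URL discovered via a link, not fetched yet
}

def status_name(code: int) -> str:
    """Return the symbolic name for a dump status code."""
    return STATUS_NAMES.get(code, "unknown(%d)" % code)

if __name__ == "__main__":
    for code in (33, 65, 67, 99):
        print(code, status_name(code))
```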
--
View th
Hi,
It seems you need to explain what a "modified document" is. Which case would it
be?
Case 1: you dump the crawled pages from the Nutch segment and do what you like
with them.
If this is the case, you need to decide which action you want:
I. modify the document and then ask Nutch to crawl the modified d
Hi,
congrats :)
By the way, unless you set permissions other than 755, there is not much about
permissions you need to care about if you use Tomcat.
One question: did you change the plugin list? Which plugins are you using? I
wonder how you can get the language of your query...
John Mendenhall wrote:
>
>> please check
Hi,
please check the path of searcher.dir in the property file located in
webapps/nutch_deploy_directory/WEB-INF/classes, and check whether it is
accessible or not.
If you use an absolute path then this will be another problem.
Hope it helps
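For reference, a minimal sketch of the relevant property in nutch-site.xml under WEB-INF/classes; the path is a placeholder and must contain the index/ and segments/ directories produced by the crawl:

```xml
<property>
  <name>searcher.dir</name>
  <value>/path/to/crawl</value>
</property>
```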
John Mendenhall wrote:
>
> I am running nutch 0.9, with tomcat 6.0.1
Hi all,
I have changed the NutchAnalyzer used in the indexing phase via a plugin (a
plugin based on analysis-fr), but I found the query is tokenized the old
way; it looks like the query is not parsed with the same tokenizer that
indexed the documents...
I checked the index; the documents are indexed as I
Hi all,
I found that zh.ngp for the zh locale is missing. I have seen this file in a
screenshot, but googling the filename returned nothing for me... can
anyone provide this file?
Thank you
--
View this message in context:
http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-l
Hi all,
[This is a follow up post]
I found this was my fault, so I need to crawl one more level than I expected.
Thank you
Vinci wrote:
>
> Hi all,
>
> I am confused by the meaning of the keyword depth...
>
> -seed.txt url1 -link1
>
://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html
For CJK, the default CJKAnalyzer can handle most cases (especially if you
convert the documents to Unicode...); just let zh/ja/ko go through the default case.
Vinci wrote:
>
> Hi all,
>
> How can I change the analyzer which is used by the indexer
Hi all,
How can I change the analyzer used by the indexer for a specific
language? Also, can I use all of the analyzers that I see in Luke?
Thank you.
--
View this message in context:
http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16065385.html
Sent from the Nutch
Hi all,
After reading several materials, I found that Nutch indexes based on the
parsed text, so if I don't want something to be indexed, I most likely need
to remove it before the page is parsed to plain text...
Also, where is the cached page's HTML file located? Is it the pre-
Hi all,
I found that the "related topics" links are hurting search quality. Besides
removing the hyperlinks in the parsing stage, can I avoid indexing the words
inside those elements?
--
View this message in context:
http://www.nabble.com/Indexing-problem---not-to-index-some-word-appear-in-link--
Hi all,
I am confused by the meaning of the keyword depth...

seed.txt:  url1 - link1
                - link2
                - link3
                - link4
           url2 - link5
...etc.

However, I found the second-level links (beginning with link) c
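The level counting above can be sketched as a breadth-first frontier. This is my reading of how -depth works (each depth is one generate/fetch round, with the seeds fetched in round 1); url1, link1 and so on are the placeholder names from the diagram:

```python
from collections import deque

def crawl_rounds(seeds, outlinks, depth):
    """Simulate -depth N: N fetch rounds, round 1 fetching the seeds."""
    fetched = []
    seen = set(seeds)
    frontier = deque((url, 1) for url in seeds)
    while frontier:
        url, level = frontier.popleft()
        if level > depth:
            continue  # beyond the requested depth: generated but never fetched
        fetched.append((url, level))
        for link in outlinks.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, level + 1))
    return fetched

# Hypothetical link graph matching the diagram in the post.
graph = {"url1": ["link1", "link2", "link3", "link4"], "url2": ["link5"]}
print(crawl_rounds(["url1", "url2"], graph, depth=1))  # seeds only
print(crawl_rounds(["url1", "url2"], graph, depth=2))  # seeds + their links
```

So to fetch the links found on the seed pages, depth must be 2, not 1: one more level than the diagram suggests at first glance.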
Hi all,
I found there are posts about how to retrieve the parsed text, but how
can I get back the HTML version, especially with bin/nutch (like
readseg or readdb)? If no command is available, which class should I work with?
Also, if I need to modify the HTML structure (add or remove ta
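A command sketch, assuming 0.9's segment reader: the raw fetched content (the original HTML) can be dumped by suppressing every other part of the segment. The segment name and output directory below are placeholders:

```
bin/nutch readseg -dump crawl/segments/20080101000000 htmldump \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```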
> I'm not sure if I understand the question, but you can start the server in
> the background (bin/nutch server 4321 crawl/ &) and use it from Nutch
> search web application on the same or any other machine.
>
> Tomislav
>
> On Tue, 2008-03-11 at 18:39 -0700, Vinci wrote:
>
Hi,
To save resources, I want the crawler not to crawl links outside the domain
of the seed URL, so that it focuses on the current website (the seed URL's
domain as well as its subdomains). What should I do to achieve this?
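A sketch of the usual approach for an intranet-style crawl: in conf/crawl-urlfilter.txt (used by the one-step crawl command) or conf/regex-urlfilter.txt, accept only the seed domain and its subdomains and reject everything else. example.com is a placeholder:

```
# accept the seed domain and any of its subdomains (placeholder domain)
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```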
--
View this message in context:
http://www.nabble.com/Crawling-Domain-lim
Hi,
Thank you so much. I think this will most likely be the final post, as I
will now get my work done.
Enis Soztutar wrote:
>
>> I need to remove the unnecessary html, xslt transformation(which will
>> deal
>> with the encoding issue for me), as well as file generation.
>> For the program I hav
pplication: point searcher.dir to folder
> containing text file: search-servers.txt and in this file put server(s):
>
> server_host server_port
>
> and start/restart servlet container (Tomcat/Jetty/...)
>
>
> Hope this helps,
>
> Tomislav
>
>
> On Tue, 200
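Putting the quoted instructions together, a sketch with placeholder host and port: point searcher.dir at a folder containing a search-servers.txt file with one host/port pair per line, then restart the servlet container:

```
# <searcher.dir>/search-servers.txt
localhost 4321
```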
Hi,
please see below for the follow up question
Enis Soztutar wrote:
>
>> 3. If I need to process the crawled pages in a more flexible way, is it
>> better to dump the documents and process them without writing back, or to
>> write my own plugin for some phase? If I need to write a plugin, which phase is th
How should I use this command to set up a search server that receives queries?
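From the reply quoted earlier in this thread, a minimal sketch (crawl/ is the crawl output directory and 4321 an arbitrary free port):

```
# start the search back-end in the background, then point the web app's
# searcher.dir at a folder whose search-servers.txt lists this host/port
bin/nutch server 4321 crawl/ &
```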
--
View this message in context:
http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p15975737.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi everybody,
I am trying to use Nutch to implement my spider algorithm... I need to get
information from specific resources, then schedule the crawling based on the
links it finds (i.e. Nutch will be a link analyzer as well as a crawler).
Questions:
1. How can I get the links in the linkdb? Is ther
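For question 1, a command sketch (paths are placeholders): the readlinkdb tool dumps the inlink information recorded in the linkdb to plain text, which could then feed an external scheduler:

```
bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump
```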
hi,
answering myself again:
the tika jar is not placed in the Tomcat webapp in 1.0-dev, which causes this
exception
thank you for your attention,
Vinci
Vinci wrote:
>
> Hi all,
>
> I finally got Nutch to crawl and search, but when I click the cached
> page, it throws a
Hi all,
I finally got Nutch to crawl and search, but when I click the cached page link,
it throws an HTTP 500 at me:
screen dump
type Exception report
message
description The server encountered an internal error () that prevented it
from fulfilling this request.
exception
Hi,
I have already added the plugin name to nutch-default.xml, but it still throws
the exception "ParseException: parser not found for
contentType=application/atom+xml", while the RSS feed works fine after I added
parse-rss.
I checked that the feed is an Atom feed with MIME type application/atom+xml;
did I miss
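A hedged sketch of the usual fix, assuming your build of parse-rss can actually parse Atom: activate the plugin via plugin.includes in nutch-site.xml rather than nutch-default.xml (local edits to the defaults file are easy to lose), and map the Atom MIME type to the plugin in conf/parse-plugins.xml:

```xml
<!-- conf/parse-plugins.xml: route Atom feeds to parse-rss -->
<mimeType name="application/atom+xml">
  <plugin id="parse-rss" />
</mimeType>
```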
hi,
I have some trouble with a site that does content redirection: Nutch can't
crawl the site but can crawl its RSS feed; unfortunately, the links in the RSS
redirect back to the site. That is the bad part, but I found the link I want
appears in the link as a GET parameter:
http://site/disallowpar
hi,
I found the answer: this is generated because robots.txt disallowed
crawling of the current URL.
Hope it helps.
Vinci wrote:
>
> hi,
>
> I finally got the crawler running without exceptions by building from trunk,
> but I found the linkdb doesn't pick up anything... and
loaded.
Vinci wrote:
>
> Hi,
>
> Here is the additional information: before the exception appears, Nutch
> prints two messages:
>
> fetching http://cnn.com
> org.apache.tika.mime.MimeUtils load
> INFO loading [mime-types.xml]
> fetch of h
problem. Do I need to configure the file it loads?
Vinci wrote:
>
> Hi All,
>
> I get the same exception when trying the nightly build on a static
> page; can anyone help?
>
>
> Vicious wrote:
>>
>> Hi All,
>>
>> Using the latest nigh
Hi All,
I get the same exception when trying the nightly build on a static
page; can anyone help?
Vicious wrote:
>
> Hi All,
>
> Using the latest nightly build I am trying to run a crawl. I have set the
> agent property and all the relevant plugins. However, as soon as I run the crawl
> I
hi,
I finally got the crawler running without exceptions by building from trunk,
but I found the linkdb doesn't pick up anything... and then I dumped the crawl
db and saw this in the metadata:
_pst_:robots_denied(18)
any idea?
--
View this message in context:
http://www.nabble.com/What-is-that-mean
I ran the 0.9 crawler with the parameters -depth 2 -threads 1, and I get a job
failed message for a dynamic-content site:
Dedup: starting
Dedup: adding indexes in: /var/crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobC
f the invalid code)?
If I want to change the link parser, what do I need to do (I would especially
prefer to change it via a plugin)?
Martin Kuen wrote:
>
> Hi there,
>
> On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi,
>>
>> Than
Hi,
Here are the answers for Q1 and Q3:
1. Tomcat is for the online search interface. If you won't include the
documentation in the released product, you don't need to include it in the
package; just set up Tomcat on the server where the index files are
located, modify the config file, and dep
Hi,
Thank you :)
One more question about reading the fetched pages: I would prefer to dump each
fetched page into a single HTML file. Is there no other way besides inverting
the inverted file?
Martin Kuen wrote:
>
> Hi,
>
> On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:
>
>
Hi,
I am new to Nutch and am trying to run it to fetch content from
specific websites. Currently I am running 0.9.
As I have limited resources, I don't want Nutch to be too aggressive, so I want
to set some delay, but I am confused by the value of http.max.delays: does
it use milliseconds
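A hedged nutch-site.xml sketch of the politeness knobs, with property names as in 0.9's nutch-default.xml: fetcher.server.delay is the per-host delay in seconds (not milliseconds), while http.max.delays is a count of how many times a fetcher thread will wait for a busy host before giving up, not a time value. The values below are illustrative:

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value><!-- seconds between two requests to the same host -->
</property>
<property>
  <name>http.max.delays</name>
  <value>3</value><!-- number of waits, not a duration -->
</property>
```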