Parsed Text and Re-parsing

2008-03-31 Thread Vinci
suggestion and comment. Vinci

Delete document from segment/index

2008-03-24 Thread Vinci
Hi all, Is it possible to delete a document from the Nutch index and segment? Thank you, Vinci

Re: RSS parser plugin bug?

2008-03-24 Thread Vinci
n for the naming convention Follow-up: after checking more of the feeds I crawled, aside from the broken characters, I found that not all titles are mis-parsed: some text is parsed correctly, some is not, but both are well-formed... Thank you, Vinci sishen wrote: > > I also prefer title

Broken crawled content?

2008-03-24 Thread Vinci
Hi all, I am trying to dump the content with the segment reader (bin/nutch readseg -dump). The output text contains two encodings, UTF-8 and a multi-byte character encoding. When I open the dumped page, I find the multi-byte-encoded text is broken - even if I convert it to the correct encoding, the displayed text is broken
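A minimal sketch of dumping a segment and checking its encoding from the command line; the segment name, output paths, and the BIG5 source encoding are hypothetical:

  # dump one segment to plain text for inspection
  bin/nutch readseg -dump crawl/segments/20080324120000 dump_out
  # the dump should appear as a text file named 'dump' inside the output dir
  less dump_out/dump
  # if part of the text is in a known legacy encoding, try re-encoding it
  iconv -f BIG5 -t UTF-8 dump_out/dump > dump_utf8.txt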

RSS parser plugin bug?

2008-03-24 Thread Vinci
Hi all, I found that the RSS parser plugin uses the text in <description> as anchor text but not the <title> - so it always indexes the description, but the title text is never indexed or used as anchor text. Yet the title is much more valuable and should be used as anchor text. Is this

Nutch crawled page status code explanation needed

2008-03-23 Thread Vinci
Hi all, I have begun working with Nutch-fetched pages. When I dump segments, I see many status codes, e.g. 67 (linked), 65 (signature), 33 (fetch_success), etc. I googled but found no clue; can anyone give a list of these status codes and explain their differences? Thank you.
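The codes come from the CrawlDatum status constants. One quick way to see which statuses occur in your own crawl, sketched here assuming a crawl directory at crawl/:

  # print crawldb statistics, including a count per status value
  bin/nutch readdb crawl/crawldb -stats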

RE: Recrawling without deleting crawl directory

2008-03-23 Thread Vinci
Hi, It seems you need to say what a "modified document" is. Which case would it be? Case 1: you dump the crawled pages from a Nutch segment and do what you like with them. If this is the case, you need to think about which action you want: I. modify the document and then ask Nutch to crawl the modified d
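For the plain recrawl case, the usual approach is to re-run the generate/fetch/update cycle against the existing crawldb instead of deleting the crawl directory. A minimal sketch, assuming Nutch 0.9 commands and a crawl directory at crawl/:

  # generate a fetch list from the existing crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  # fetch and fold the results back into the crawldb
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  # rebuild link structure and index the new segment
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/new_indexes crawl/crawldb crawl/linkdb $segment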

Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread Vinci
Hi, congrats :) By the way, unless you set permissions other than 755, there is not much about permissions you need to care about when you use Tomcat. One question: did you change the plugin list? Which plugins are you using? I wonder how you can get the language of your query... John Mendenhall wrote: > >> please check

Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread Vinci
Hi, please check the searcher.dir path in the property file located in webapps/<nutch_deploy_directory>/WEB-INF/classes and check whether it is accessible. If you use an absolute path, that may be a different problem. Hope it helps. John Mendenhall wrote: > > I am running nutch 0.9, with tomcat 6.0.1
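A minimal sketch of pointing the web application at the crawl data; the deployment path and crawl location are assumptions:

  # write searcher.dir into the webapp's own nutch-site.xml
  cat > $CATALINA_HOME/webapps/nutch/WEB-INF/classes/nutch-site.xml <<'EOF'
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>searcher.dir</name>
      <value>/var/crawl</value> <!-- directory holding index/, segments/, ... -->
    </property>
  </configuration>
  EOF
  # restart Tomcat so the change is picked up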

incorrect Query tokenization

2008-03-15 Thread Vinci
Hi all, I have changed the NutchAnalyzer in the indexing phase via a plugin (a plug-in based on analysis-fr), but I found the query is still tokenized the old way - it looks like the query is not parsed with the same tokenizer that indexed the documents... I checked the index; the documents are indexed as I

Missing zh.ngp for zh locale support for the language identifier

2008-03-15 Thread Vinci
Hi all, I found that zh.ngp for the zh locale is missing. I have seen this file in a screenshot, but googling the filename returned nothing... can anyone provide this file for me? Thank you
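If no ready-made profile turns up, the language-identifier plugin can build one from a text corpus. A hedged sketch; the class ships with the language-identifier plugin, but the corpus file is hypothetical and invocation details may vary by version:

  # build an n-gram profile 'zh.ngp' from a UTF-8 Chinese text sample
  bin/nutch org.apache.nutch.analysis.lang.NGramProfile -create zh zh_corpus.txt utf-8
  # the resulting zh.ngp must then be placed on the plugin's classpath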

Re: Confusion of -depth parameter

2008-03-15 Thread Vinci
Hi all, [This is a follow-up post] I found this was my fault, so I need to crawl one more level than I expected. Thank you Vinci wrote: > > Hi all, > > I am confused by the keyword depth... > > -seed.txt url1 -link1 >

Re: Change of analyzer for specific language

2008-03-15 Thread Vinci
://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html *For CJK, the default CJKAnalyzer can handle most cases (especially if you convert the documents to Unicode...); just let zh/ja/ko fall through to the default case. Vinci wrote: > > Hi all, > > How can I change the analyzer which is used by the indexer

Change of analyzer for specific language

2008-03-15 Thread Vinci
Hi all, How can I change the analyzer used by the indexer for a specific language? Also, can I use all the analyzers that I see in Luke? Thank you.
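Language-specific analyzers ship as plugins, so the usual first step is enabling them together with the language identifier. A sketch of the relevant property for conf/nutch-site.xml; the exact plugin list is an assumption and must match the plugins actually present in your installation:

  # fragment for conf/nutch-site.xml (enables analysis-fr/de plus detection):
  #   <property>
  #     <name>plugin.includes</name>
  #     <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(fr|de)|language-identifier|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  #   </property>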

Where is the crawled/cached page html?

2008-03-14 Thread Vinci
Hi all, After looking through several materials, I find that Nutch indexes based on the parsed text - so if I don't want something to be indexed, I most likely need to remove it before the parse to plain text... Also, where is the cached page HTML file located? Is it the pre-

Indexing problem - avoid indexing words that appear in links?

2008-03-14 Thread Vinci
Hi all, I found that the related-topic links are affecting search quality. Besides removing the hyperlinks in the parsing stage, can I avoid indexing the words inside the <a> element?

Confusion of -depth parameter

2008-03-14 Thread Vinci
Hi all, I am confused by the keyword depth...

  seed.txt
    url1
      -link1
      -link2
      -link3
      -link4
    url2
      -link5
  ...etc

However, I found the second-level links (beginning with -link) c
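For reference, depth is the number of generate/fetch rounds: depth 1 fetches only the seed URLs, and depth 2 also fetches the links found on them. A sketch with hypothetical paths:

  # round 1 fetches url1/url2, round 2 fetches link1..link5
  bin/nutch crawl urls -dir crawl -depth 2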

Crawler javascript handling, retrieve crawled HTML and modify the html structure?

2008-03-13 Thread Vinci
Hi all, I found a post about how to retrieve the parsed text, but how can I get back the HTML version, especially with bin/nutch (like readseg or readdb)? If no command is available, which class should I deal with? Also, if I need to modify the HTML structure (add or remove ta
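The raw HTML is kept in the segment's content part, so a segment dump restricted to that part is one way to get it back. A minimal sketch; the segment name and output directory are hypothetical:

  # dump only the raw Content records (the fetched HTML bytes)
  bin/nutch readseg -dump crawl/segments/20080313120000 html_dump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
  # the class behind this command is org.apache.nutch.segment.SegmentReader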

Re: Search server bin/nutch server?

2008-03-12 Thread Vinci
I'm not sure if I understand the question, but you can start server in > the background (bin/nutch server 4321 crawl/ &) and use it from Nutch > search web application on the same or any other machine. > > Tomislav > > On Tue, 2008-03-11 at 18:39 -0700, Vinci wrote: >

Crawling domain limited to the URLs listed in the seed file

2008-03-12 Thread Vinci
Hi, To save resources, I want the crawler not to follow links outside the domain of the seed URL, so that it focuses on the current website (the seed URL's domain as well as its subdomains). What should I do to achieve this?
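The stock regex URL filter can express this. A hedged sketch for conf/crawl-urlfilter.txt (read by the one-step crawl command; the step-by-step tools read regex-urlfilter.txt instead), with a hypothetical domain:

  # accept the seed domain and any of its subdomains
  +^http://([a-z0-9-]+\.)*example\.com/
  # reject everything else
  -.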

Re: About link analysis and filter usage, and Recrawling

2008-03-12 Thread Vinci
Hi, Thank you so much. This will most likely be the final post, as I can now get my work done. Enis Soztutar wrote: > >> I need to remove the unnecessary html, xslt transformation (which will >> deal >> with the encoding issue for me), as well as file generation. >> For the program I hav

Re: Search server bin/nutch server?

2008-03-11 Thread Vinci
pplication: point searcher.dir to folder > containing text file: search-servers.txt and in this file put server(s): > > server_host server_port > > and start/restart servlet container (Tomcat/Jetty/...) > > > Hope this helps, > > Tomislav > > > On Tue, 200

Re: About link analysis and filter usage, and Recrawling

2008-03-11 Thread Vinci
Hi, please see below for the follow-up question. Enis Soztutar wrote: > >> 3. If I need to process the crawled pages in a more flexible way, is it >> better to dump the documents for processing without writing back, or to write my >> plugin for some phase? If I need to write a plugin, which phase is th

Search server bin/nutch server?

2008-03-11 Thread Vinci
How should I use this command to set up a search server that receives queries?
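A minimal sketch of the whole setup, tying together the replies above; the port and paths are hypothetical:

  # 1. start the search server on the machine holding the crawl data
  bin/nutch server 4321 /var/crawl &
  # 2. on the web-app machine, create a directory with a search-servers.txt
  mkdir -p /var/search
  echo "localhost 4321" > /var/search/search-servers.txt
  # 3. point searcher.dir at /var/search in WEB-INF/classes/nutch-site.xml
  #    and restart the servlet container (Tomcat/Jetty/...)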

About link analysis and filter usage, and Recrawling

2008-03-11 Thread Vinci
Hi everybody, I am trying to use Nutch to implement my spider algorithm... I need to get information from specific resources, then schedule the crawling based on the links it finds (i.e. Nutch will be a link analyzer as well as a crawler). Questions: 1. How can I get the links in the linkdb? Is ther
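For question 1, the linkdb can be dumped to text with the stock reader. A sketch assuming a crawl directory at crawl/ and a hypothetical URL:

  # write the inlink structure (each URL plus the anchors pointing to it)
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump
  # look up the inlinks of a single page
  bin/nutch readlinkdb crawl/linkdb -url http://example.com/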

Re: Error when requesting cached page in 1.0-dev

2008-01-31 Thread Vinci
Hi, answering myself again: the Tika jar is not placed in the Tomcat webapp in 1.0-dev, and that causes this exception. Thank you for your attention, Vinci Vinci wrote: > > Hi all, > > finally I got nutch to crawl and search, but when I click the cached > page, it throws a
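A sketch of the fix described above; the jar name and paths are assumptions that depend on the build:

  # copy the Tika jar from the Nutch build into the deployed webapp
  cp $NUTCH_HOME/lib/tika-*.jar $CATALINA_HOME/webapps/nutch/WEB-INF/lib/
  # restart Tomcat so the webapp picks the jar up
  $CATALINA_HOME/bin/shutdown.sh && $CATALINA_HOME/bin/startup.sh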

Error when requesting cached page in 1.0-dev

2008-01-31 Thread Vinci
Hi all, I finally got Nutch to crawl and search, but when I click the cached page, it throws an HTTP 500 at me: screen dump type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception

Cannot parse Atom feed with the feed plugin installed

2008-01-30 Thread Vinci
Hi, I already added the plugin name to nutch-default.xml, but it still throws the exception "ParseException: parser not found for contentType=application/atom+xml", while RSS feeds work fine after I added parse-rss. I checked that the feed supports Atom with mime-type application/atom+xml; did I miss
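Besides plugin.includes, the mime type has to be mapped to a parser in conf/parse-plugins.xml. A hedged sketch; the plugin id is an assumption that must match the feed parser actually installed:

  # fragment to add inside conf/parse-plugins.xml:
  #   <mimeType name="application/atom+xml">
  #     <plugin id="feed" />
  #   </mimeType>
  # and keep the matching plugin (feed or parse-rss) in plugin.includes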

Can Nutch use part of the url found for the next crawling?

2008-01-30 Thread Vinci
Hi, I have some trouble with a site that does content redirection: Nutch can't crawl the site itself but can crawl its RSS; unfortunately the links in the RSS redirect back to the site -- this is the bad part, but I found the link I want appears in the URL as a GET parameter: http://site/disallowpar
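One way to rewrite such URLs before fetching is the regex URL normalizer. A hedged sketch of a rule for conf/regex-normalize.xml; the pattern is hypothetical and must be adapted to the real redirect URL:

  # fragment to add inside conf/regex-normalize.xml:
  #   <regex>
  #     <pattern>^http://site/redirect\?url=(.+)$</pattern>
  #     <substitution>$1</substitution>
  #   </regex>
  # urlnormalizer-regex must be listed in plugin.includes for this to run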

Re: What does that mean? robots_denied(18)

2008-01-30 Thread Vinci
Hi, I found the answer: this is generated because robots.txt disallowed crawling of the current URL. Hope it can help. Vinci wrote: > > hi, > > I finally got the crawler running without exceptions by building from trunk, > but I found the crawl cannot fetch anything...and

Re: Fetch issue with Feeds (SOLVED)

2008-01-30 Thread Vinci
loaded. Vinci wrote: > > Hi, > > Here is the additional information: before the exception appears, nutch > prints 2 messages: > > fetching http://cnn.com > org.apache.tika.mime.MimeUtils load > INFO loading [mime-types.xml] > fetch of h

Re: Fetch issue with Feeds

2008-01-30 Thread Vinci
problem... do I need to configure the file it loads? Vinci wrote: > > Hi All, > > I get the same exception when trying the nightly build on a static > page; can anyone help? > > > Vicious wrote: >> >> Hi All, >> >> Using the latest nigh

Re: Fetch issue with Feeds

2008-01-30 Thread Vinci
Hi All, I get the same exception when trying the nightly build on a static page; can anyone help? Vicious wrote: > > Hi All, > > Using the latest nightly build I am trying to run a crawl. I have set the > agent property and all relevant plugins. However, as soon as I run the crawl > I

What does that mean? robots_denied(18)

2008-01-30 Thread Vinci
Hi, I finally got the crawler running without exceptions by building from trunk, but I found the crawl cannot fetch anything... I then dumped the crawldb and saw this in the metadata: _pst_: robots_denied(18). Any idea?
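To inspect the status of a single URL, the crawldb reader helps; a sketch assuming a crawl directory at crawl/ and a hypothetical URL:

  # print the CrawlDatum for one URL, including protocol status metadata
  # such as _pst_: robots_denied(18)
  bin/nutch readdb crawl/crawldb -url http://example.com/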

Dedup: Job Failed and crawl stopped at depth 1

2008-01-29 Thread Vinci
I ran the 0.9 crawler with the parameters -depth 2 -threads 1, and I get a job-failed message for a dynamic-content site: Dedup: starting Dedup: adding indexes in: /var/crawl/indexes Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobC

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
f the invalid code)? If I want to change the link parser, what do I need to do (I would especially prefer to change it via plugins)? Martin Kuen wrote: > > Hi there, > > On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi, >> >> Than

Re: Tomcat query

2008-01-29 Thread Vinci
Hi, Here is the answer for Q1 and Q3: 1. Tomcat is for the online search interface. If you won't include the documentation in the released product, you don't need to include it in the package; just set up Tomcat on the server where the index files are located, modify the config file, and dep

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
Hi, Thank you :) One more question about reading fetched pages: I would prefer to dump each fetched page into a single HTML file. Is there no other way besides inverting the inverted file? Martin Kuen wrote: > > Hi, > > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: > >

Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci
Hi, I am new to Nutch and I am trying to run it to fetch content from specific websites. Currently I am running 0.9. As I have limited resources, I don't want Nutch to be too aggressive, so I want to set some delay, but I am confused by the value of http.max.delays; does it use millisecond
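For reference, http.max.delays is a count, not a time: it is how many times a fetcher thread will wait for a busy host (each wait lasting fetcher.server.delay seconds) before giving up on the page for now. A hedged sketch of the two properties for conf/nutch-site.xml; the values are only examples:

  # fragment for conf/nutch-site.xml:
  #   <property>
  #     <name>fetcher.server.delay</name>
  #     <value>5.0</value> <!-- seconds between requests to the same host -->
  #   </property>
  #   <property>
  #     <name>http.max.delays</name>
  #     <value>100</value> <!-- number of waits, not milliseconds -->
  #   </property>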