Parsed Text and Re-parsing
Hi all, As suggested in the JIRA discussion, I would like to ask about 2 issues: 1. Can I change the default NekoHTML parser behaviour so that it emits a tab instead of a space for block-level elements? 2. Can I re-parse the crawled pages? Does re-parsing change the segments? Thank you for any suggestions and comments. Vinci -- View this message in context: http://www.nabble.com/Parsed-Text-and-Re-parsing-tp16392741p16392741.html Sent from the Nutch - User mailing list archive at Nabble.com.
Delete document from segment/index
Hi all, Is it possible to delete a document from the Nutch index and segments? Thank you, Vinci -- View this message in context: http://www.nabble.com/Delete-document-from-segment-index-tp16254945p16254945.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: RSS parser plugin bug?
Hi sishen, You should; atom feed support has been broken for quite a long time. If you don't want to replace the original plugin, just use another name, especially since your plugin only works for atom feeds. I think you should use a name like parse-atom or atom-parser; please refer to the rss parser plugin for the naming convention. Follow up: after checking more of the feeds I crawled, apart from the broken characters, I found that not all titles get mis-parsed: some text is parsed correctly, some isn't, but both are well-formed... Thank you, Vinci sishen wrote: > > I also prefer title than description. > > Also, I found there is some problems to parse the atom feed with the lib > "commons-feedparser". > I have implemented a new plugin to fix the problem with > rome<https://rome.dev.java.net/>. > > > But i doubt whether should I submit it to the nutch trunk? > > Best regards. > > sishen > > On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi all, >> I found that the rss parser plugin is using the content text in >> as anchor text but not the - so that it always >> index >> the description, but the title text is always not indexed or used as >> anchor >> text. >> >> But actually the title is much more valuable and should be used as anchor >> text. >> >> Is this a bug or a misunderstanding of RSS? If this is a bug, can anybody >> post in JIRA ? >> >> Thank you for your attention. >> -- >> View this message in context: >> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html Sent from the Nutch - User mailing list archive at Nabble.com.
Broken crawled content?
Hi all, I am trying to dump the content with the segment reader (bin/nutch readseg -dump). The output text contains 2 encodings: UTF-8 and another multi-byte character encoding. When I open the dumped page, I find the multi-byte-encoded text is broken - even after I convert it to the correct encoding, the displayed text is still broken. How can I fix the text? Thank you. -- View this message in context: http://www.nabble.com/Broken-crawled-content--tp16246942p16246942.html Sent from the Nutch - User mailing list archive at Nabble.com.
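One way to check whether the dumped bytes are actually corrupted (rather than merely mislabeled) is to validate them with iconv, which exits non-zero at the first byte sequence that is invalid in the declared encoding. A minimal sketch with standard tools; the readseg path is an example, and sample.txt stands in for the dumped file:

```shell
# Dump the raw content first, e.g. (segment path is an example):
#   bin/nutch readseg -dump crawl/segments/20080324000000 dump_out
# Then validate the bytes against the encoding you expect:
printf 'caf\303\251\n' > sample.txt   # 0xC3 0xA9 is a valid UTF-8 sequence
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
  echo "bytes are valid UTF-8"
else
  echo "bytes are corrupted for this encoding"
fi
```

If iconv already rejects the bytes, the content was damaged at fetch time and no later conversion will fix it; if it passes, the problem is only the encoding label used when displaying the dump.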
RSS parser plugin bug?
Hi all, I found that the rss parser plugin uses the content text in <description> as the anchor text but not the <title> - so it always indexes the description, while the title text is never indexed or used as anchor text. But actually the title is much more valuable and should be used as the anchor text. Is this a bug or a misunderstanding of RSS on my part? If this is a bug, can anybody post it in JIRA? Thank you for your attention. -- View this message in context: http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html Sent from the Nutch - User mailing list archive at Nabble.com.
Nutch crawled page status code explanation needed
Hi all, I am beginning to work with Nutch-fetched pages. When I try to dump segments, I see there are many status codes, e.g. 67 (linked), 65 (signature), 33 (fetch_success), etc. I googled but found no further clues; can anyone give a list of those status codes and explain their differences? Thank you. -- View this message in context: http://www.nabble.com/Nutch-crawled-page-status-code-explanation-needed-tp16237183p16237183.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Recrawling without deleting crawl directory
Hi, It seems you need to explain what a "modified document" means here. Which case is it? Case 1: you dump the crawled pages from a Nutch segment and do whatever you like with them. If this is the case, you need to decide which action you want: I. modify the documents and then ask Nutch to crawl the modified directory? II. modify the documents, write them back to the segment (the crawl DB), then do the indexing? Case 2: keeping track of document updates. In this case, if you keep re-crawling based on the same crawl DB (you just need to tune the re-crawl interval in days), Nutch will do the update for you. Hope it helps :) Jean-Christophe Alleman wrote: > > > > Hi, > > I have nothing said. This works fine ! It's morning and I'm still not woke > up :-D > > I just want to know if it was possible to re index modified documents ? Or > re index documents which are already in database ? > > Thank's in advance ! > > Jisay > > >> >> Hi Susam Pal and thank's for your help ! >> >> The solution you give to me doesn't work... I have still an error with >> Hadoop... And if I download an older version of the API, will this patch >> work ? I have Nutch-0.9 and I don't know if I compile with an oder Hadoop >> API, this patch will work. But if it will work where can I find an older >> version of Hadoop API ? >> >> Thank's in advance for your help, >> >> Jisay >> >> >>> >>> I am not sure but it seems that this is because of an older version of >>> Hadoop. I don't have older versions of Nutch or Hadoop with me to >>> confirm this. Just try omitting the second argument in: >>> fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()) and see if it >>> compiles? >>> >>> I guess, fs.listPaths(indexes) should work since I can find such a >>> method (though it is deprecated now) in the latest Hadoop API. >>> >>> Regards, >>> Susam pal >>> >>> On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman >>> wrote: Thank's for your reply Susam Pal ! I have run ant and I have an error I can't resolve... 
Look at this :

debian:~/nutch-0.9# ant
Buildfile: build.xml
init:
    [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build/hadoop
    [untar] Expanding: /root/nutch-0.9/build/hadoop/bin.tgz into /root/nutch-0.9/bin
    [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build
compile-core:
    [javac] Compiling 133 source files to /root/nutch-0.9/build/classes
    [javac] /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java:150: cannot find symbol
    [javac] symbol : variable HadoopFSUtil
    [javac] location: class org.apache.nutch.crawl.Crawl
    [javac] merger.merge(fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()),
    [javac] ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 1 error

BUILD FAILED
/root/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.

Total time: 8 seconds

I have already corrected 3 errors but I can't correct this one... I don't know what HadoopFSUtil is, so I can't correct the error... Help me please, Thank's for your help ! Jisay > > The patch was generated for Nutch 1.0 development version which is > currently in trunk. So, it is unable to patch your older version > cleanly. > > I also see that you are using NUTCH-601v0.3.patch. However, > NUTCH-601v1.0.patch is the recommended patch. If this patch fails, you > can make the modifications manually. This patch is extremely simple > and if you just open the patch using a text editor, you would find > that 3 lines have been removed from the original source code > (indicated by leading minus signs) and 11 new lines have been added > (indicated by plus signs). You have to make these changes manually to > your Nutch 0.9 source code directory. 
> > Once you make the changes, just build your project again with ant and > you would be ready for recrawl. > > Regards, > Susam Pal > > On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman > wrote: >> >> >> Hi, I'm interested by this patch but I can't patch it. I have some >> problems when I try to patch... >> >> Here is what I do : >> >> debian:~/patch# patch -p0> can't find file to patch at input line 5 >> Perhaps you used the wrong -p or --strip option? >> The text leading up to this was: >> -- >> |Index: src/java/org/apache/nutch/crawl/Crawl.java >> |=== >> |--- s
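Coming back to Case 2 (recrawling against the same crawl DB): the usual recrawl loop is roughly the following. This is a hedged sketch for Nutch 0.9, not a drop-in script; directory names are examples and should match your layout:

```shell
#!/bin/sh
CRAWL=crawl                                          # example crawl directory
bin/nutch generate $CRAWL/crawldb $CRAWL/segments    # select URLs due for refetch
SEG=$CRAWL/segments/`ls $CRAWL/segments | tail -1`   # the newly created segment
bin/nutch fetch $SEG                                 # refetch them
bin/nutch updatedb $CRAWL/crawldb $SEG               # fold new status back into the db
bin/nutch invertlinks $CRAWL/linkdb $SEG             # refresh the link db
bin/nutch index $CRAWL/newindexes $CRAWL/crawldb $CRAWL/linkdb $SEG
```

Pages whose fetch interval has not yet expired are simply not selected by generate, which is what makes repeated runs against the same crawl db behave as an update rather than a full recrawl.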
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, congrats :) btw, unless you set permissions other than 755, there is not much permission handling you need to care about if you use Tomcat. One question: did you change the plugin list? Which plugins are you using? I wonder how you got the language of your query... John Mendenhall wrote: > >> please check the path of the search.dir in property file (nutch-site.xml) >> located in webapps/nutch_depoly_directory/WEB-INF/classes, check it is >> accessable or not. >> >> if you use absolute path then this will be another problem > > Super! Thanks a bunch! That was it. > The property is actually searcher.dir. > We always use absolute paths since it helps tremendously > not having to worry about where one is when the process is > started. > > We had moved it from one machine to another and had > forgotten to make sure the tomcat process owner 'tomcat' > was in the nutch group 'nutch'. Fixed that and it works > like a charm. > > Thanks again! > > JohnM > > -- > john mendenhall > [EMAIL PROTECTED] > surf utopia > internet services > > -- View this message in context: http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075816.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, please check the path in the searcher.dir property in the property file located in webapps/nutch_deploy_directory/WEB-INF/classes, and check whether it is accessible. If you are already using an absolute path, then the problem is something else. Hope it helps John Mendenhall wrote: > > I am running nutch 0.9, with tomcat 6.0.14. > When I use the NutchBean to search the index, > it works fine. I get back results, no errors. > I have used tomcat before and it has worked > fine. > > Now I am getting an error searching through > tomcat. This is the tomcat error I am seeing > in the catalina.out log file: > > - > 2008-03-15 15:38:38,715 INFO NutchBean - query request from > 192.168.245.58 > 2008-03-15 15:38:38,717 INFO NutchBean - query: penasquitos > 2008-03-15 15:38:38,717 INFO NutchBean - lang: en > Mar 15, 2008 3:38:41 PM org.apache.catalina.core.StandardWrapperValve > invoke > SEVERE: Servlet.service() for servlet jsp threw exception > java.lang.NullPointerException > at > org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159) > at > org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177) > - > > When I run a search using the NutchBean, I > see debug log entries in the hadoop.log. > When I run the search using Tomcat, I never > see any hadoop.log entires. > > We have 1.4 million indexed pages, taking > up 31gb for the nutch/crawl directory. > > The search term doesn't matter. > > My guess is it may be a memory error, > but I am not seeing it anywhere. > Is there a place where I can set the memory > footprint for tomcat to use more memory? > > Or, is there another place I should be looking? > > Thanks in advance for any pointers or assistance. > > JohnM > > -- > john mendenhall > [EMAIL PROTECTED] > surf utopia > internet services > > -- View this message in context: http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075186.html Sent from the Nutch - User mailing list archive at Nabble.com.
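For reference, the property in question lives in WEB-INF/classes/nutch-site.xml of the deployed web app; a minimal fragment (the path value is an example):

```xml
<property>
  <name>searcher.dir</name>
  <!-- example absolute path; must be readable by the user running Tomcat -->
  <value>/home/nutch/crawl</value>
</property>
```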
incorrect Query tokenization
Hi all, I have changed the NutchAnalyzer used in the indexing phase via a plugin (a plug-in based on analysis-de or analysis-fr), but I found the query is still tokenized the old way - it looks like the query is not parsed with the same tokenizer that indexed the documents... I checked the index; the documents are indexed as I want. I also checked the hadoop log; all plugins are loaded (including the one that changes the indexer). However, both from the NutchBean and from the webapp, the tokenization is not correct. How can I fix it? (*The fastest solution looks like assigning the language of the query [via the language-identifier plugin], but I don't know where to start...) -- View this message in context: http://www.nabble.com/incorrect-Query-tokenization-tp16070144p16070144.html Sent from the Nutch - User mailing list archive at Nabble.com.
Missing zh.ngp for zh locale support for language identifier
Hi all, I found that zh.ngp for the zh locale is missing. I have seen this file in a screenshot, but googling the filename returns nothing for me... can anyone provide this file? Thank you -- View this message in context: http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-language-Identifier-tp16068532p16068532.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Confusion of -depth parameter
Hi all, [This is a follow-up post] I found this was my own mistake, so I need to crawl one more level than I expected. Thank you Vinci wrote: > > Hi all, > > I have a confusion of the keyword depth... > > -seed.txt url1 -link1 >-link2 >-link3 >-link4 > url2 -link5 > ...etc > > However, I found the second level link (begin with -link) cannot be > crawled unless I set the depth is 3 but not 2, why? Does the depth 1 is > the seed url file? > -- View this message in context: http://www.nabble.com/Confusion-of--depth-parameter-tp16047305p16067808.html Sent from the Nutch - User mailing list archive at Nabble.com.
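As generally documented, the depth parameter counts fetch rounds: round 1 fetches the seed urls themselves, round 2 fetches the links found on those pages, and so on. So reaching links-of-links takes a command like the following (directory names and topN value are examples):

```shell
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```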
Re: Change of analyzer for specific language
Hi all, [Follow-up post] I found the method by myself. 1. Write a plugin for your own language. For the method, you can refer to analysis-de and analysis-fr to see how to wrap a Lucene analyzer in your plugin. 2. Then add it to the plugin.includes list in nutch-site.xml. You also need to add the language-identifier plugin. 3. [For languages not supported by the language identifier, or if you think the language identifier is too slow] There is a 50% chance you will fail if you are writing for a European language, and a 100% chance if you are writing for an East Asian language. The reason is that when language identification fails - your language is not supported - the default indexer does the indexing for you. There are 2 methods: A. Hack the language-identifier plugin. i. Hack all the classes except LanguageIdentifier.java. The details are not given here because there are too many steps and I am writing in a rush, but the 2 principles are: a. remove every reference to a LanguageIdentifier object, including the declaration and the calls made through that reference (this is much easier with an IDE like NetBeans or Eclipse); b. remember to change the language variable in the inner class of HTMLLanguageParser, or change the default language returned when all cases fail. ii. Change langmappings.properties to the actual encoding of your language - include all possible combinations, in lower case, e.g. za = za, zah, utf, utf8. For the full list you can refer to the iconv supported-encodings list - most systems support almost everything, and you will see the variants of your encoding (utf-8 can be written utf-8, utf_8 or utf8!). You may also need to include the first part if the target encoding contains - or _, like utf and utf8 for utf-8 in the example. Then build language-identifier again. *For XML you need to create your own parser based on HTMLLanguageParser. 
But you will fall into the default case quite quickly if the XML is written badly enough to use UTF-8 as the encoding but have no lang element. B. Hack Indexer.java, as mentioned in this post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html *For CJK, the default CJKAnalyzer can handle most cases (especially if you convert the documents to Unicode...); just let zh/ja/kr go through the default case. Vinci wrote: > > Hi all, > > How can I change the analyzer which is used by the indexer for specific > language? Also, can I use all the analyzer that I see in luke? > > Thank you. > -- View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16067807.html Sent from the Nutch - User mailing list archive at Nabble.com.
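Step 2 above amounts to overriding plugin.includes in nutch-site.xml. A hedged fragment - analysis-xx stands in for your own analyzer plugin's id, and the rest of the list should be taken from your nutch-default.xml rather than copied from here:

```xml
<property>
  <name>plugin.includes</name>
  <!-- keep your existing plugins; add language-identifier plus your analyzer -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier|analysis-xx</value>
</property>
```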
Change of analyzer for specific language
Hi all, How can I change the analyzer used by the indexer for a specific language? Also, can I use all of the analyzers that I see in Luke? Thank you. -- View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16065385.html Sent from the Nutch - User mailing list archive at Nabble.com.
Where is the crawled/cached page html?
Hi all, After looking through several materials, I find that Nutch indexes based on the parsed text - so if I don't want something to be indexed, I most likely need to remove it before the page is parsed to plain text... Also, where is the cached page HTML file located? Is it the pre-parse HTML or another HTML file stored somewhere? Thank you for any answers or discussion -- View this message in context: http://www.nabble.com/Where-is-the-crawled-cached-page-html--tp16048280p16048280.html Sent from the Nutch - User mailing list archive at Nabble.com.
Indexing problem - not to index some word appear in link?
Hi all, I found that the related-topics links are hurting search performance. Besides removing the hyperlinks in the parsing stage, can I avoid indexing the words inside the <a> element? -- View this message in context: http://www.nabble.com/Indexing-problem---not-to-index-some-word-appear-in-link--tp16047313p16047313.html Sent from the Nutch - User mailing list archive at Nabble.com.
Confusion of -depth parameter
Hi all, I have a confusion about the keyword depth...

seed.txt
  url1
    - link1
    - link2
    - link3
    - link4
  url2
    - link5
  ...etc

However, I found the second-level links (beginning with -link) cannot be crawled unless I set the depth to 3 rather than 2 - why? Does depth 1 correspond to the seed url file? -- View this message in context: http://www.nabble.com/Confusion-of--depth-parameter-tp16047305p16047305.html Sent from the Nutch - User mailing list archive at Nabble.com.
Crawler javascript handling, retrieve crawled HTML and modify the html structure?
Hi all, I found there are posts about how to retrieve the parsed text, but how can I get back the HTML version, especially with bin/nutch (like readseg or readdb)? If no command is available, which class should I deal with? Also, if I need to modify the HTML structure (adding or removing tags), is it better for me to transform the dumped HTML and then ask Nutch to crawl it back, or to ask another tool to do the indexing for me? *Does Nutch turn on javascript parsing by default? If so, how can I turn it off? -- View this message in context: http://www.nabble.com/Crawler-javascript-handling%2C-retrieve-crawled-HTML-and-modify-the-html-structure--tp16023197p16023197.html Sent from the Nutch - User mailing list archive at Nabble.com.
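On the javascript question: in Nutch 0.9 the parse-js plugin is pulled in through the default plugin.includes, so javascript link extraction can be switched off by overriding that property in nutch-site.xml with parse-js left out. A hedged sketch - check your own nutch-default.xml for the exact default list and trim accordingly:

```xml
<property>
  <name>plugin.includes</name>
  <!-- same as the default list but without parse-js -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|more|site|url)|summary-basic|scoring-opic</value>
</property>
```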
Re: Search server bin/nutch server?
Hi, I see your point and I understand the usage - after I start the server I still need the Nutch webapp to receive the query for me. So is there no way other than using the webapp for query processing, or calling the searcher on the command line? Thank you Tomislav Poljak wrote: > > Hi, > I'm not sure if I understand the question, but you can start server in > the background (bin/nutch server 4321 crawl/ &) and use it from Nutch > search web application on the same or any other machine. > > Tomislav > > On Tue, 2008-03-11 at 18:39 -0700, Vinci wrote: >> Hi, >> >> Thank you for the usage of this. >> One more question: If I started a search server to background, can I use >> it >> for receiving direct query from other webpage? >> Thank you >> >> >> Tomislav Poljak wrote: >> > >> > Hi, >> > this is used for Distributed Search, so if you want to use it start >> > server(s): >> > >> > bin/nutch server >> > >> > on the machine(s) where you have index(es) (you can put any free port >> > and crawl dir should point to your crawl folder). Then you should >> > configure Nutch search web app to use this server(s): you have to edit >> > nutch-site.xml in Nutch web application: point searcher.dir to folder >> > containing text file: search-servers.txt and in this file put >> server(s): >> > >> > server_host server_port >> > >> > and start/restart servlet container (Tomcat/Jetty/...) >> > >> > >> > Hope this helps, >> > >> > Tomislav >> > >> > >> > On Tue, 2008-03-11 at 03:06 -0700, Vinci wrote: >> >> How should I use this command to set up a search server to receive >> query? >> > >> > >> > >> > > > -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p16002053.html Sent from the Nutch - User mailing list archive at Nabble.com.
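One option for taking queries directly from another page, without writing against the searcher API, is the OpenSearch servlet that ships with the Nutch web app, which returns search results as RSS. A hedged example - host and port are assumptions about your Tomcat setup:

```shell
curl 'http://localhost:8080/opensearch?query=nutch'
```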
Crawling Domain limited the url listed in seed file
Hi, To save resources, I want the crawler not to crawl links outside the domains of the urls in the seed file, so that it focuses on the current websites (each seed url's domain as well as its subdomains). What should I do to achieve this? -- View this message in context: http://www.nabble.com/Crawling-Domain-limited-the-url-listed-in-seed-file-tp16001433p16001433.html Sent from the Nutch - User mailing list archive at Nabble.com.
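With the one-step crawl command this is what conf/crawl-urlfilter.txt is for; a minimal sketch, where example.com stands in for your seed domain (add one accept rule per seed domain):

```
# accept example.com and any of its subdomains
+^http://([a-z0-9-]+\.)*example\.com/
# skip everything else
-.
```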
Re: About link analysis and filter usage, and Recrawling
Hi, Thank you so much. Most likely this will be the final post, and then my work will be done. Enis Soztutar wrote: > >> I need to remove the unnecessary html, xslt transformation(which will >> deal >> with the encoding issue for me), as well as file generation. >> For the program I have, dump out everything and not write back is much >> preferred, but look like If I do so I will lose some information of the >> page >> crawled? >> >> > You can write a parse-html plugin for this, or you can manually > manipulate the parse data by writing a > mapreduce program. > I see... As I understand it, the parsing phase also serves link analysis. If I do processing at this point, will I slow down the crawling? *MapReduce looks interesting, but I don't have enough time to go into depth. Enis Soztutar wrote: > >> Enis Soztutar wrote: >> >>> With the adaptive crawl, after several cycles the fetch frequency of a >>> page will be >>> automatically adjusted. >>> >>> >> so If I keep on crawling based on same crawldb, I will get this effect? >> > yes, exactly. > I see your point... One more question on this: after looking at some of the config files, I found the default recrawl interval is 15 days. However, I want only the urls in the seed file to be recrawled, not the urls found while crawling (because what I crawl are static pages whose main content will not be updated; once they are crawled, recrawling is unnecessary). Is it possible to do this with updatedb, without starting a new crawl and merging? And which part of Nutch is related to the url recrawl schedule - the linkdb or injection? Also, what will Nutch do if a similar/identical url is found while crawling? Thank you. A little off topic: can Nutch use Luke for index management, like Solr? -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p16001325.html Sent from the Nutch - User mailing list archive at Nabble.com.
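On the interval question: the refetch schedule is kept per url in the crawl db (set at inject/update time), not in the linkdb, and the baseline interval can be raised in nutch-site.xml so that already-fetched pages effectively drop out of later generate rounds. A hedged fragment for Nutch 0.9 - the value is in days, and the exact property name may differ between versions, so verify it against your nutch-default.xml:

```xml
<property>
  <name>db.default.fetch.interval</name>
  <!-- example: refetch a page at most once a year -->
  <value>365</value>
</property>
```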
Re: Search server bin/nutch server?
Hi, Thank you for explaining the usage. One more question: if I start a search server in the background, can I use it to receive queries directly from another webpage? Thank you Tomislav Poljak wrote: > > Hi, > this is used for Distributed Search, so if you want to use it start > server(s): > > bin/nutch server > > on the machine(s) where you have index(es) (you can put any free port > and crawl dir should point to your crawl folder). Then you should > configure Nutch search web app to use this server(s): you have to edit > nutch-site.xml in Nutch web application: point searcher.dir to folder > containing text file: search-servers.txt and in this file put server(s): > > server_host server_port > > and start/restart servlet container (Tomcat/Jetty/...) > > > Hope this helps, > > Tomislav > > > On Tue, 2008-03-11 at 03:06 -0700, Vinci wrote: >> How should I use this command to set up a search server to receive query? > > > -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p15996260.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: About link analysis and filter usage, and Recrawling
Hi, please see below for the follow-up questions. Enis Soztutar wrote: > >> 3. If I need to processing the crawled page in more flexible way, Is it >> better I dump the document to process but not write back, or I write my >> plugin on the some phase? If I need to write plugin, which pharse is the >> best point for me to implement my own extension? >> > This depends on what you want to do with want kind of data. You should > be more specific. > I need to remove unnecessary HTML, do an XSLT transformation (which will handle the encoding issue for me), and generate files. For the program I have, dumping everything out and not writing back is much preferred, but it looks like if I do so I will lose some information about the crawled pages? Enis Soztutar wrote: > >> 4. If I set the crawl depth = 1, is linkdb be meaningless in the rest of >> the >> crawling? >> > no, linkdb is used in the indexing phase. > So if I use another indexer like Solr, I need to do additional processing on the pages in order to keep the source link information? (like adding the source link information) Enis Soztutar wrote: > >> 5. Is there any method to avoid nutch recrawl a page in recrawling >> script? >> (e.g. not to crawl a page since no update from last time) Any information >> can provided me to implement this? >> > With the adaptive crawl, after several cycles the fetch frequency of a > page will be > automatically adjusted. > so if I keep crawling based on the same crawldb, I will get this effect? Enis Soztutar wrote: > >> >> Thank you for reading this long post, and any answer or suggestion >> > You're welcome. > > Enis > Thank you for your kind help, it really helps a lot :) -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p15996240.html Sent from the Nutch - User mailing list archive at Nabble.com.
Search server bin/nutch server?
How should I use this command to set up a search server that receives queries? -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p15975737.html Sent from the Nutch - User mailing list archive at Nabble.com.
About link analysis and filter usage, and Recrawling
Hi everybody, I am trying to use Nutch to implement my spider algorithm... I need to get information from specific resources, then schedule the crawling based on the links found (i.e. Nutch will be a link analyzer as well as a crawler). Questions: 1. How can I get the links in the linkdb? Is there any method other than bin/nutch readlinkdb -dump? 2. I want all of my pages crawled but not updated; however, I know I will do the recrawling based on those crawled pages. Is there any method other than dumping the crawldb every time? 3. If I need to process the crawled pages in a more flexible way, is it better to dump the documents and process them without writing back, or to write my own plugin for one of the phases? If I need to write a plugin, which phase is the best point to implement my own extension? 4. If I set the crawl depth = 1, is the linkdb meaningless for the rest of the crawl? 5. Is there any method to stop Nutch from recrawling a page in a recrawl script (e.g. not crawling a page that has not been updated since last time)? Any information to help me implement this? Thank you for reading this long post, and for any answers or suggestions -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p15975729.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Error when request cache page in 1.0-dev
hi, answered by myself again: the tika jar is not placed in the Tomcat webapp in 1.0-dev, and that causes this exception thank you for your attention, Vinci Vinci wrote: > > Hi all, > > finally I make nutch can crawl and search, but when I click the cache > page, it throw a http 500 to me: > > > screen dump > > type Exception report > > message > > description The server encountered an internal error () that prevented it > from fulfilling this request. > > exception > > org.apache.jasper.JasperException: Exception in JSP: /cached.jsp:63 > > 60: } > 61: } > 62: else > 63: content = new String(bean.getContent(details)); > 64: } > 65: %> > 66: > > > thing I found in log > --- > 2008-01-31 19:04:46,324 INFO NutchBean - cache request from 127.0.0.1 > 2008-01-31 19:04:46,358 ERROR [jsp] - Servlet.service() for servlet jsp > threw exception > java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:247) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524) > at > org.apache.hadoop.io.WritableName.getClass(WritableName.java:72) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1405) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1360) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1349) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1344) > at org.apache.hadoop.io.MapFile$Reader.(MapFile.java:254) > at org.apache.hadoop.io.MapFile$Reader.(MapFile.java:242) > at > org.apache.hadoop.mapred.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:91) > at > org.apache.nutch.searcher.FetchedSegments$Segment.getReaders(FetchedSegments.java:90) > at > org.apache.nutch.searcher.FetchedSegments$Segment.getContent(FetchedSegments.java:68) > at > org.apache.nutch.searcher.FetchedSegments.getContent(FetchedSegments.java:139) > at > 
org.apache.nutch.searcher.NutchBean.getContent(NutchBean.java:347) > at org.apache.jsp.cached_jsp._jspService(cached_jsp.java:107) > at > org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) > at > org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) > at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874) > at > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) > at > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) > at > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) > at java.lang.Thread.run(Thread.java:619) > -- View this message in context: http://www.nabble.com/Error-when-request-cache-page-in-1.0-dev-tp15202557p15205147.html Sent from the Nutch - User mailing list 
archive at Nabble.com.
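For readers hitting the same NoClassDefFoundError: the fix described above amounts to making the Tika jar visible to the deployed webapp's classloader. A sketch only -- NUTCH_HOME and CATALINA_HOME are placeholders for your Nutch checkout and Tomcat install, and the jar name depends on your build:

```shell
# Sketch, not a definitive recipe: copy the Tika jar into the webapp's lib dir.
cp $NUTCH_HOME/lib/tika-*.jar \
   $CATALINA_HOME/webapps/ROOT/WEB-INF/lib/
# restart Tomcat afterwards so the webapp classloader picks up the new jar
```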
Error when request cache page in 1.0-dev
Hi all, I finally got nutch to crawl and search, but when I click the cached page, it throws an http 500 at me: screen dump type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception org.apache.jasper.JasperException: Exception in JSP: /cached.jsp:63 60: } 61: } 62: else 63: content = new String(bean.getContent(details)); 64: } 65: %> 66:
Cannot parse atom feed with plugin feed installed
Hi, I already added the plugin name to nutch-default.xml, but it still throws the exception "ParseException: parser not found for contentType=application/atom+xml", while rss feeds work fine after I added parse-rss. I checked that the feed plugin supports atom feeds with mime-type application/atom+xml; did I miss any setting? -- View this message in context: http://www.nabble.com/Cannot-parse-atom-feed-with-plugin-feed-installed-tp15191469p15191469.html Sent from the Nutch - User mailing list archive at Nabble.com.
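In case it helps later readers: "parser not found for contentType" usually means no parse plugin is both enabled (via plugin.includes) and mapped to that content type (via conf/parse-plugins.xml). Local overrides belong in conf/nutch-site.xml rather than nutch-default.xml, which gets replaced on upgrade. A sketch of the override -- the plugin list here is only illustrative, keep whatever plugins you already use:

```xml
<property>
  <name>plugin.includes</name>
  <!-- "feed" is the plugin that claims application/atom+xml; the rest of
       this list is an example, not a recommendation -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|feed|index-basic|query-(basic|site|url)</value>
</property>
```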
Can Nutch use part of the url found for the next crawling?
hi, I have some trouble with a site doing content redirection: nutch can't crawl the site but can crawl its rss. Unfortunately the links in the rss redirect back to the site -- that is the bad part -- but the link I want appears in the url as a get parameter: http://site/disallowpart?url=the_link_i_want I see there is something called url-filter and regex-filter; which one can help me extract the_link_i_want? Thank you. -- View this message in context: http://www.nabble.com/Can-Nutch-use-part-of-the-url-found-for-the-next-crawling--tp15190975p15190975.html Sent from the Nutch - User mailing list archive at Nabble.com.
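For reference: urlfilter-regex can only accept or reject URLs; rewriting one URL into another is the job of urlnormalizer-regex, configured in conf/regex-normalize.xml. A sketch of a rule for the example URL above -- the hostname, path and parameter name are taken from the post, and the exact pattern is an assumption about the site's URL shape:

```xml
<!-- rewrite http://site/disallowpart?url=X into X itself -->
<regex>
  <pattern>^http://site/disallowpart\?url=(.+)$</pattern>
  <substitution>$1</substitution>
</regex>
```

One caveat: if the url= value is percent-encoded, the normalizer will not decode it for you.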
Re: What is that mean? robots_denied(18)
hi, I found the answer: this is generated because robots.txt disallowed crawling of the current url. Hope it can help. Vinci wrote: > > hi, > > I finally got the crawler running without exceptions by building from trunk, > but I found the crawler cannot fetch anything... and then I dumped the crawl > db and saw this in the metadata: > > _pst_:robots_denied(18) > > any idea? > -- View this message in context: http://www.nabble.com/What-is-that-mean--robots_denied%2818%29-tp15188811p15189990.html Sent from the Nutch - User mailing list archive at Nabble.com.
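For context, a robots.txt like the following (a made-up minimal example) is enough to produce robots_denied for every page on a host, since Nutch honors robots.txt by default:

```
User-agent: *
Disallow: /
```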
Re: Fetch issue with Feeds (SOLVED)
Hi, I finally figured out the solution: go to conf/, rename the old mime-types.xml to anything else, then copy tika-mimetypes.xml into the same directory under the name mime-types.xml. The crawler should work now. In short, this is because 1.0-dev uses tika, but the old mime detection config file was still being loaded. Vinci wrote: > > Hi, > > Here is some additional information: before the exception appears, nutch > prints 2 messages: > > fetching http://cnn.com > org.apache.tika.mime.MimeUtils load > INFO loading [mime-types.xml] > fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException > Fetcher: done > > Seems the mime-type detection has a problem... do I need to configure the file it loads? > > > > Vinci wrote: >> >> Hi All, >> >> I get the same exception when trying the nightly build on a static >> page, can anyone help? >> >> >> Vicious wrote: >>> >>> Hi All, >>> >>> Using the latest nightly build I am trying to run a crawl. I have set >>> the agent property and all relevant plugins. However as soon as I run the >>> crawl I get the following error in hadoop.log. I read all the posts here >>> and the only suggestion was that the http.agent property should not be empty. >>> Well in my case it isn't, and yet I see the error. Any help will be >>> appreciated. >>> >>> Thanks- >>> >>> fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed >>> with: java.lang.NullPointerE >>> http.Http - java.lang.NullPointerException >>> http.Http - at >>> org.apache.nutch.protocol.Content.getContentType(Content.java:327) >>> http.Http - at >>> org.apache.nutch.protocol.Content.(Content.java:95) >>> http.Http - at >>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) >>> http.Http - at >>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) >>> >> >> > > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189897.html Sent from the Nutch - User mailing list archive at Nabble.com.
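The steps above can be sketched as shell commands. NUTCH_HOME is set here to a throwaway directory with dummy files so the snippet can be tried safely -- point it at your real Nutch 1.0-dev directory instead:

```shell
# Build a scratch conf/ dir standing in for a real Nutch install.
NUTCH_HOME=$(mktemp -d)
mkdir -p "$NUTCH_HOME/conf"
echo '<mime-info/>' > "$NUTCH_HOME/conf/mime-types.xml"       # the old file
echo '<mime-info/>' > "$NUTCH_HOME/conf/tika-mimetypes.xml"   # shipped for Tika

# The actual fix: park the old file, put the Tika definitions in its place.
cd "$NUTCH_HOME/conf"
mv mime-types.xml mime-types.xml.bak
cp tika-mimetypes.xml mime-types.xml
```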
Re: Fetch issue with Feeds
Hi, here is some additional information: before the exception appears, nutch prints 2 messages: fetching http://cnn.com org.apache.tika.mime.MimeUtils load INFO loading [mime-types.xml] fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException Fetcher: done Seems the mime-type detection has a problem... do I need to configure the file it loads? Vinci wrote: > > Hi All, > > I get the same exception when trying the nightly build on a static > page, can anyone help? > > > Vicious wrote: >> >> Hi All, >> >> Using the latest nightly build I am trying to run a crawl. I have set the >> agent property and all relevant plugins. However as soon as I run the >> crawl I get the following error in hadoop.log. I read all the posts here >> and the only suggestion was that the http.agent property should not be empty. >> Well in my case it isn't, and yet I see the error. Any help will be >> appreciated. >> >> Thanks- >> >> fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed with: >> java.lang.NullPointerE >> http.Http - java.lang.NullPointerException >> http.Http - at >> org.apache.nutch.protocol.Content.getContentType(Content.java:327) >> http.Http - at org.apache.nutch.protocol.Content.(Content.java:95) >> http.Http - at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) >> http.Http - at >> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) >> > > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189590.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetch issue with Feeds
Hi All, I get the same exception when trying the nightly build on a static page; can anyone help? Vicious wrote: > > Hi All, > > Using the latest nightly build I am trying to run a crawl. I have set the > agent property and all relevant plugins. However as soon as I run the crawl > I get the following error in hadoop.log. I read all the posts here and the > only suggestion was that the http.agent property should not be empty. Well in > my case it isn't, and yet I see the error. Any help will be appreciated. > > Thanks- > > fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed with: > java.lang.NullPointerE > http.Http - java.lang.NullPointerException > http.Http - at > org.apache.nutch.protocol.Content.getContentType(Content.java:327) > http.Http - at org.apache.nutch.protocol.Content.(Content.java:95) > http.Http - at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > http.Http - at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189123.html Sent from the Nutch - User mailing list archive at Nabble.com.
What is that mean? robots_denied(18)
hi, I finally got the crawler running without exceptions by building from trunk, but I found the crawler cannot fetch anything... and then I dumped the crawl db and saw this in the metadata: _pst_:robots_denied(18) any idea? -- View this message in context: http://www.nabble.com/What-is-that-mean--robots_denied%2818%29-tp15188811p15188811.html Sent from the Nutch - User mailing list archive at Nabble.com.
Dedup: Job Failed and crawl stopped at depth 1
I ran the 0.9 crawler with parameters -depth 2 -threads 1, and I got a Job Failed message for a dynamic-content site: Dedup: starting Dedup: adding indexes in: /var/crawl/indexes Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) in the hadoop.log: 2008-01-30 15:08:12,402 INFO indexer.Indexer - Optimizing index. 2008-01-30 15:08:12,601 INFO indexer.Indexer - Indexer: done 2008-01-30 15:08:12,602 INFO indexer.DeleteDuplicates - Dedup: starting 2008-01-30 15:08:12,622 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: /var/crawl/indexes 2008-01-30 15:08:12,882 WARN mapred.LocalJobRunner - job_b5nenb java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113) at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176) at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126) Also, the crawling stopped at depth=1: 2008-01-30 15:08:10,083 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ... 2008-01-30 15:08:10,084 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch. I checked the index in Luke and it works; it only fetched the pages of the urls in the list. I tried searching in Luke and it seems to work well, but the nutch searcher returns nothing to me... did I miss some setting, or is this a problem of the aborted indexing? -- View this message in context: http://www.nabble.com/Dedup%3A-Job-Failed-and-crawl-stopped-at-depth-1-tp15176806p15176806.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi, thank you. :) Seems I need to write a Java program to write out the files and do the transformation. Another question about the dumped linkdb: I find escaped html appearing at the end of the links; is it the fault of the parser (the html is most likely not valid, but I really don't need that chunk of invalid code)? If I want to change the link parser, what do I need to do (in particular, I would prefer to change it via plugins)? Martin Kuen wrote: > > Hi there, > > On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi, >> >> Thank you :) >> One more question for the fetched page reading: I prefer I can dump the >> fetched page into a single html file. > > You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to > create a seperate file for each downloaded file. > You could modify the SegmentReader class ( > org.apache.nutch.segment.SegmentReader) if you want to do that. > > No other way besides invert the >> inverted file? >> > The index is not inverted if you use the "readseg" command. The fetched > content (e.g html pages) is stored in the "crawl/segments" folder. The > lucene index is stored in "crawl/indexes". This (lucene) index is created > after all crawling has finished. The readseg command (SegmentReader class) > only accesses "crawl/segments", so the lucene index is not touched. lucene > index --> the inverted index > > Best Regards, > > Martin > > >> >> Martin Kuen wrote: >> > >> > Hi, >> > >> > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: >> > >> >> >> >> Hi, >> >> >> >> I am new to nutch and I am trying to run a nutch to fetch something >> from >> >> specific websites. Currently I am running 0.9. >> >> >> >> As I have limited resources, I don't want nutch be too aggressive, so >> I >> >> want >> >> to set some delay, but I am confused with the value of >> http.max.delays, >> >> does >> >> it use milliseconds insteads of seconds? 
(Some people said it is in 3 >> >> second >> >> by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) >> >> >> > >> > "http.max.delays" doesn't specify a timespan - read the description >> more >> > carefully. I think "fetcher.server.delay" is what you are looking for. >> It >> > is >> > the amount of time the fetcher will at least wait until it fetches >> another >> > url from the same host. Keep in mind that the fetcher obeys robots.txt >> > files >> > (by default) - so if a robots.txt file is present the crawling will >> occur >> > "polite enough". >> > >> > >> >> Also, I need to read the fetched page so that I can do some >> modification >> >> on >> >> the html structure for future parsing, where is the files located? Are >> >> they >> >> store in pure html or they are breaken down into multiple file? if >> this >> >> is >> >> not html file, how can I read the fetched page? >> >> >> > >> > If you are looking for a way to programmatically read the fetched >> content >> > ( >> > e.g. html pages) have a look at the IndexReader class. >> > If you are looking for a way to dump the whole downloaded content to a >> > Text >> > file or want to see some statistical information about it, try the >> > "readseg" >> > command. >> > Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions >> > >> >> >> >> And will the cached page losing all the original html attribute when >> it >> >> viewed in cached page? >> >> >> > The page will be stored character by character, including html tags. >> > >> >> >> >> Also, how can I read the link that nutch found and how can I control >> the >> >> crawling sequence? (change it to breadth-first search at the top >> level, >> >> then >> >> depth-first one by one) >> >> >> > Crawling always occurs breadth-first. If you want fine-grained control >> > over >> > the crawling sequence you should follow the procedure in the nutch >> > tutorial >> > for "whole internet crawling". 
Nevertheless the crawling occurs >> > breath-first. >> > >> >> >> >> Sorry for many questions. >> > >> > >> > HTH, >> > >> > Martin >> > >> > PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . >> > (nice semester abroad . . . hehe ;) >> > >> > >> >> -- >> >> View this message in context: >> >> >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html >> >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15175746.html Sent from the Nutch - User mailing list archive at Nabble.com.
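For completeness, the "readseg" dump Martin refers to is run from the Nutch install directory; the segment timestamp below is a placeholder for whatever segment your crawl produced:

```shell
# Dump one fetched segment (content, parse text, etc.) to plain text files.
bin/nutch readseg -dump crawl/segments/20080129101112 dump_out
# List a segment's basic statistics instead of dumping it.
bin/nutch readseg -list crawl/segments/20080129101112
```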
Re: Tomcat query
Hi, here is the answer for q1 and q3: 1. the tomcat is for the online search interface. If you won't include the documentation in the released product, you don't need to include it in the package; just set up tomcat on the server where the index files are located, modify the config file and deploy the war file in the tomcat manager. Also, make sure the html files are there and everybody can read them via a url starting with http. 3. set up the tomcat server mentioned above so everybody can reach the search page and submit queries from their web browser, and find a suitable place in the shipped package to tell the user the url of the online help :) Jaya Ghosh wrote: > > Hello, > > > > I have a query. > > > > I have created an index of our online documentation files (htmls). > Therefore > it is more like an intranet search, that is, the search will be performed > on > static documents only. Now I need to test it. My machine does not have > Tomcat installed. The IT department has informed me that as a Tomcat user > I > need to have root permissions and they need permission from the higher > authority to assign me the same. > > > > My queries are: > > > > 1. If I succeed implementing Nutch in our tool will we have to ship > Tomcat/provide URL to the end-users? > > 2. Is there an alternative to above? > > 3. Am I right in assuming that in static documents the index is built only > once and that is what we would be shipping with the tool? Therefore, the > end > user will not need any permissions as such to perform the search? > > > > As mentioned earlier, I am a writer and hence not technical. > > > > Thanks in advance for any help/response. > > > > Regards, > > Ms.Jaya > > > -- View this message in context: http://www.nabble.com/Tomcat-query-tp15131352p15164964.html Sent from the Nutch - User mailing list archive at Nabble.com.
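A sketch of the deployment step from point 1 -- the war file name and paths are assumptions for a 0.9-style install, not exact instructions:

```shell
# Deploy the Nutch search webapp into Tomcat (war name depends on your release).
cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war
# After Tomcat unpacks it, point the webapp at your crawl directory by
# setting the "searcher.dir" property in
# $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
```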
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi, thank you :) One more question about reading the fetched pages: I would prefer to dump each fetched page into a single html file. Is there no other way besides inverting the inverted file? Martin Kuen wrote: > > Hi, > > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi, >> >> I am new to nutch and I am trying to run a nutch to fetch something from >> specific websites. Currently I am running 0.9. >> >> As I have limited resources, I don't want nutch be too aggressive, so I >> want >> to set some delay, but I am confused with the value of http.max.delays, >> does >> it use milliseconds insteads of seconds? (Some people said it is in 3 >> second >> by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) >> > > "http.max.delays" doesn't specify a timespan - read the description more > carefully. I think "fetcher.server.delay" is what you are looking for. It > is > the amount of time the fetcher will at least wait until it fetches another > url from the same host. Keep in mind that the fetcher obeys robots.txt > files > (by default) - so if a robots.txt file is present the crawling will occur > "polite enough". > > >> Also, I need to read the fetched page so that I can do some modification >> on >> the html structure for future parsing, where is the files located? Are >> they >> store in pure html or they are breaken down into multiple file? if this >> is >> not html file, how can I read the fetched page? >> > > If you are looking for a way to programmatically read the fetched content > ( > e.g. html pages) have a look at the IndexReader class. > If you are looking for a way to dump the whole downloaded content to a > Text > file or want to see some statistical information about it, try the > "readseg" > command. > Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions > >> >> And will the cached page losing all the original html attribute when it >> viewed in cached page? 
>> > The page will be stored character by character, including html tags. > >> >> Also, how can I read the link that nutch found and how can I control the >> crawling sequence? (change it to breadth-first search at the top level, >> then >> depth-first one by one) >> > Crawling always occurs breadth-first. If you want fine-grained control > over > the crawling sequence you should follow the procedure in the nutch > tutorial > for "whole internet crawling". Nevertheless the crawling occurs > breath-first. > >> >> Sorry for many questions. > > > HTH, > > Martin > > PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . > (nice semester abroad . . . hehe ;) > > >> -- >> View this message in context: >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html Sent from the Nutch - User mailing list archive at Nabble.com.
Newbie Questions: http.max.delays, view fetched page, view link db
Hi, I am new to nutch and I am trying to get nutch to fetch content from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I am confused about the value of http.max.delays: does it use milliseconds instead of seconds? (Some people said it is 3 seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) Also, I need to read the fetched pages so that I can do some modification to the html structure for future parsing; where are the files located? Are they stored as pure html, or are they broken down into multiple files? If they are not html files, how can I read the fetched pages? And will the cached page lose all the original html attributes when it is viewed as a cached page? Also, how can I read the links that nutch found, and how can I control the crawling sequence? (change it to breadth-first search at the top level, then depth-first one by one) Sorry for the many questions. -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html Sent from the Nutch - User mailing list archive at Nabble.com.
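As the reply quoted earlier in this thread notes, the politeness knob is fetcher.server.delay (seconds between two requests to the same host), not http.max.delays. A minimal conf/nutch-site.xml override -- the value 5.0 is only an example:

```xml
<property>
  <name>fetcher.server.delay</name>
  <!-- seconds the fetcher waits between two fetches from the same host;
       5.0 is an example value, not a recommendation -->
  <value>5.0</value>
</property>
```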