Re: Luke and Indexes

2005-12-08 Thread Andrzej Bialecki

Bryan Woliner wrote:


I have a couple very basic questions about Luke and indexes in
general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does "Index version" refer to?

It's the time (as in System.currentTimeMillis()) when the index was last 
modified.
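As a quick illustration, that raw version value can be converted to a readable date with plain Java; the version number below is a made-up example, not taken from any real index:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class IndexVersionDate {
    public static void main(String[] args) {
        // Made-up "Index version" value as it might appear in Luke's overview tab
        long indexVersion = 1134000000000L;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(fmt.format(new Date(indexVersion)));
        // prints: 2005-12-08 00:00:00 (interpreted as UTC)
    }
}
```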



2. Also in the overview tab, if "Has Deletions?" is equal to "yes",
where are the possible sources of deletions? Dedup? Manual deletions
through Luke?

Either. Both.


3. Is there any way (with Luke or otherwise) to get a file listing all
of the docs in an index? Basically, is there an index equivalent of
this command (which outputs all the URLs in a segment)?

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

You can browse through documents on the Document tab, but there is no 
option to dump all documents to a file. Besides, fields that are not 
stored are no longer accessible, so you cannot retrieve them from 
the index (you may be able to reconstruct them, but it's a lossy operation).



4. Finally, my last question is the one I'm most perplexed by:

I called bin/nutch segread -list -dir for a particular segments
directory and found out that one directory had 93 entries. BUT, when I
opened up the index of that segment in Luke, there were only 23
documents (and 3 deletions)! Where did the rest of the URLs go??

Do a segread -dump and check what the protocol status and parse 
status are for the pages that didn't make it into the index. Most likely you 
encountered either protocol errors or parsing errors, so there was 
nothing to index from these entries.


In addition, if you ran the deduplication, some of the entries in your 
index may have been deleted because they were considered duplicates.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




how to

2005-12-08 Thread Riku | http://kukusky.8800.org
How can I make Nutch support Chinese?

--

My Web: http://kukusky.8800.org
MSN: [EMAIL PROTECTED]
E-mail: [EMAIL PROTECTED]



Plugin path in Nutch web

2005-12-08 Thread Nguyen Ngoc Giang
  Hi everyone,

  I'm writing a JSP program to allow crawling via the web. My JSP script
follows nutch.tools.CrawlTool, which tries to create the database, inject
URLs, fetch and index.

  I have difficulty identifying the plugins. Creating the database is fine,
because it doesn't require any plugins. But when it comes to injection, I got
this error:

  java.lang.ExceptionInInitializerError
org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
org.apache.jsp.crawl_jsp._jspService(crawl_jsp.java:151)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)


   From the ExceptionInInitializerError, I guess the problem is in loading
the plugins (since loading is static). Could anyone suggest how to set the
plugin path in nutch-default.xml? Thanks a lot.

  Regards,
  Giang
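For what it's worth, that guess matches how the JVM behaves: if a class's static initializer throws, the first active use of that class surfaces as an ExceptionInInitializerError wrapping the original exception. A minimal, self-contained sketch (the class names and message below are invented for illustration, not actual Nutch code):

```java
public class StaticInitDemo {
    // Toy stand-in for a class whose static initialization fails,
    // e.g. a plugin registry that cannot locate its plugin directory.
    static class PluginRegistry {
        static final String PLUGIN_DIR;
        static {
            PLUGIN_DIR = locatePluginDir();
        }
        private static String locatePluginDir() {
            // Simulate a missing or misconfigured plugin path
            throw new RuntimeException("plugin.folders not found");
        }
    }

    public static void main(String[] args) {
        try {
            // First active use of PluginRegistry triggers its static initializer
            System.out.println(PluginRegistry.PLUGIN_DIR);
        } catch (ExceptionInInitializerError e) {
            // The original exception is wrapped as the cause
            System.out.println("caught: " + e.getCause().getMessage());
        }
    }
}
// prints: caught: plugin.folders not found
```

So inspecting the cause of the ExceptionInInitializerError (e.getCause()) should reveal what actually failed during plugin loading.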


Crawling - dynamically generate web pages with paginations

2005-12-08 Thread K.A.Hussain Ali
Hi all,

I need to crawl a page that has pagination deeper than the depth I specify.
Could I crawl all the pages listed in the pagination,
or is there any way in Nutch to change the depth when a page has pagination?

Any ideas would help greatly..
Thanks in advance
regards
-Hussain.

Too many open file error -while searching using Nutch

2005-12-08 Thread K.A.Hussain Ali
HI all,

I get a "Too many open files" error when I try to search my 
segments, which number in the hundreds.
Is there any way to solve this issue?
Does Nutch not close the segments after the search?

kindly send your suggestion for the above issue ..

Thanks in advance.
Regards
-Hussain.

Crawling listing (pagination) pages.

2005-12-08 Thread K.A.Hussain Ali
HI all,

  Does Nutch crawl pages in listing pages (pages with pagination, as in search 
engines)?

While crawling through Nutch I need to get the pages that are displayed by 
the pagination, without increasing the depth of the whole crawl.
Does Nutch provide any plugin for the above issue?
Is there any way to solve the above issue?

Any help is greatly appreciated
Thanks in advance
regards
-Hussain

Re: Crawling listing (pagination) pages.

2005-12-08 Thread Jack Tang
Hi

I am facing the same problem. However, my crawl only focuses on certain
websites, so I recognize the pagination URLs using regexps and inject
them in every fetch cycle.
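A minimal sketch of that kind of pagination-URL check with java.util.regex; the pattern and parameter names (page, start, offset) are just common conventions chosen for illustration, not anything Nutch detects out of the box:

```java
import java.util.regex.Pattern;

public class PaginationUrlFilter {
    // Hypothetical pattern: query strings like ?page=2, &start=40, &offset=100
    private static final Pattern PAGINATION =
            Pattern.compile(".*[?&](page|start|offset)=\\d+.*");

    static boolean isPaginationUrl(String url) {
        return PAGINATION.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPaginationUrl("http://example.com/list?page=2"));  // true
        System.out.println(isPaginationUrl("http://example.com/about.html"));   // false
    }
}
```

URLs matched this way could then be fed back into the inject step of each fetch cycle, as described above.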

/Jack

On 12/8/05, K.A.Hussain Ali [EMAIL PROTECTED] wrote:
 HI all,

   Do Nutch crawl pages in any listing pages( pages with pagination as in 
 search engines)

 While crawling through nutch i need to get the pages that gets displayed 
 by the pagination unless i increase the depth of the whole crawling.
 Do nutch provide any plugin for the above issue ?
 Is there anyway to solve the above issue ?

 Any help is greatly appreciated
 Thanks in advance
 regards
 -Hussain



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


How to refresh the application context - to use the merged index

2005-12-08 Thread K.A.Hussain Ali
HI all,

While crawling using Nutch, I do segment merging and indexing, but the search 
doesn't look into the new merged segment unless I restart the server.

Is there any way to refresh the application context to use the new 
merged index without stopping the server?
Does Nutch provide any option to solve this?

Any help would greatly help.
Thanks in advance.
regards
-Hussain.

After mergesegs

2005-12-08 Thread Goldschmidt, Dave
Hi, just wanted to be sure - after I merge segments via the mergesegs
tool, I need to use the updatedb tool before dropping the new indexes
in, correct?

And, as just posted, I need to shut down and restart Tomcat, too, yes?
Thanks,

DaveG



Re: Luke and Indexes

2005-12-08 Thread Bryan Woliner
Thank you very much for the helpful answers. Most of the pages that
didn't make it into the index were indeed due to protocol errors
(mostly exceeding http.max.delays).

One quick side note. When I was looking at the Nutch wiki page for
bin/nutch segread, I noticed an error on the page and wasn't sure how
to go about fixing it, or alerting someone who can. The page currently
reads:

...

-nocontent

  ignore content data

-noparsedata

  ignore parse_data data

-nocontent

  ignore parse_text data

...

The 2nd -nocontent should probably be -noparsetext, right?

Thanks again for the help,
Bryan

On 12/8/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Bryan Woliner wrote:

 I have a couple very basic questions about Luke and indexes in
 general. Answers to any of these questions are much appreciated:
 
 1. In the Luke overview tab, what does Index version refer to?
 
 

 It's the time (as in System.currentTimeMillis()) when the index was last
 modified.

 2. Also in the overview tab, if Has Deletions? is equal to yes,
 where are the possible sources of deletions? Dedup? Manual deletions
 through luke?
 
 
 

 Either. Both.

 3. Is there any way (w/ Luke or otherwise) to get a file listing all
 of the docs in an index. Basically is there an index equivalent of
 this command (which outputs all the URLs in a segment):
 
 bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir
 
 

 You can browse through documents on the Document tab. But there is no
 option to dump all documents to a file. Besides, some fields which are
 not stored are no longer accessible, so you cannot retrieve them from
 the index (you may be able to reconstruct them, but it's a lossy operation).

 4. Finally, my last question is the one I'm most perplexed by:
 
 I called bin/nutch segread -list -dir for a particular segments
 directory and found out that one directory had 93 entries. BUT, when I
 opened up the index of that segment in Luke, there were only 23
 documents (and 3 deletions)! Where did the rest of the URLs go??
 
 

 Do a segread -dump and check what is the protocol status and parse
 status for the pages that didn't make it to the index. Most likely you
 encountered either protocol errors or parsing errors, so there was
 nothing to index from these entries.

 In addition, if you ran the deduplication, some of the entries in your
 index may have been deleted because they were considered duplicates.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





Problem with fetching segment

2005-12-08 Thread Håvard W. Kongsgård
I have followed the media-style.com quick tutorial, but when I try to 
fetch my segment the fetch is killed!

I have tried setting the system time forward 30 days; no anti-virus is running 
on the systems.

System SUSE 9.2 and SUSE 10

# bin/nutch fetch segments/20060109014654/
060109 014714 parsing 
file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml

060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/index-more

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/creativecommons

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.http.Http
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient

060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 1
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 
bytes, 8309 ms

060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page



Re: Plugin path in Nutch web

2005-12-08 Thread Arun Kaundal
Make sure that you have specified the correct name/path for the plugins
directory (plugin.folders). Also, both the plugins and the Nutch source must
be from the same version.
Another solution is to download the latest trunk version of Nutch from SVN.
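If it helps, the usual place to set this is conf/nutch-site.xml, which overrides nutch-default.xml; a sketch, where the path value is only an example for your own installation:

```xml
<!-- conf/nutch-site.xml: override the plugin path (example value) -->
<property>
  <name>plugin.folders</name>
  <value>/home/user/nutch-0.7.1/plugins</value>
  <description>Directories where Nutch looks for plugins.</description>
</property>
```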


On 12/8/05, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:

 Hi everyone,

 I'm writing an JSP program to allow crawling via web. My JSP script
 follows nutch.tools.CrawlTool, which try to create database, inject
 database, fecth and index.

 I have difficulty of identifying the plugins. Creating database is fine,
 because it doesn't require any plugin. But when it comes to injection, I
 got
 this error:

 java.lang.ExceptionInInitializerError
org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java
 :378)
org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
org.apache.jsp.crawl_jsp._jspService(crawl_jsp.java:151)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(
 JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java
 :292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)


   From the ExceptionInInitializerError, I guess the problem is in loading
 the plugin (since loading is static). Could anyone suggest me how to set
 the
 path to plugin in nutch-default.xml? Thanks a lot.

 Regards,
 Giang