Re: Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs

Hey, thanks.  My problem was that I also wanted the nutch conf out of
the nutch install dir. So, I did set the NUTCH_CONF_DIR variable in my
.bashrc and couldn't understand why it was never being picked up.  Well,
as it happens, that was the one variable I forgot to export!  Doh!

So, it wasn't hard at all. Though, I did need to replace
hadoop-12.whatever.jar with the latest one from the nutch build.  It
seems to be working. Yay.


Thanks.




On 7/11/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Briggs wrote:
> I am currently trying to figure out how to deploy Nutch and Hadoop
> separately.  I want to configure Hadoop outside of Nutch and have
> Nutch use that service, rather than configuring hadoop within nutch.
> I would think all that Nutch should need to know is the urls to
> connect to Hadoop, but can't figure out how to get this to work.
>
> Is this possible?  If so, is there some sort of document, or archive
> of another list post for this?
>
> Sorry for the ignorance.

If you have a clean hadoop installation up and running (made e.g. from
one of the official Hadoop builds), it should be enough to put the
nutch*.job file in ${hadoop.dir}, and copy bin/nutch (possibly with some
minor modifications - my memory is a little vague on this ...).


--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
"Conscious decisions by conscious minds are what make reality real"


Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs

I am currently trying to figure out how to deploy Nutch and Hadoop
separately.  I want to configure Hadoop outside of Nutch and have
Nutch use that service, rather than configuring hadoop within nutch.
I would think all that Nutch should need to know is the urls to
connect to Hadoop, but can't figure out how to get this to work.

Is this possible?  If so, is there some sort of document, or archive
of another list post for this?

Sorry for the ignorance.


--
"Conscious decisions by conscious minds are what make reality real"


Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-09 Thread Briggs

Thanks for the answer. That was helpful.

I was sooo wrong.

On 7/7/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Briggs wrote:
> Please keep this thread going as I am also curious to know why this
> has been 'forked'.   I am sure that most of this lies within the
> original OPIC filter but I still can't understand why straightforward
> lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters
(which were added much later).

The decision to use a different query syntax than the one from Lucene
was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations,
which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's
possible to use Nutch with different back-end searcher implementations.
This started to materialize only now, with the ongoing effort to use
Solr as a possible backend.

* to limit the query syntax to those queries that provide the best tradeoff
between functionality and performance, in a large-scale search engine.


> On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

>> Ok, so I guess what I don't understand is what is the "Nutch query
>> syntax"?

Query syntax is defined in an informal way on the Help page in
nutch.war, or here:

http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from
org.apache.nutch.analysis.NutchAnalysis.jj.



>>
>> The main discussion I found on nutch-user is this:
>> http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
>> I was wondering why the query syntax is so limited.
>> There are no OR queries, there are no fielded queries,
>> or fuzzy, or approximate... Why? The underlying index
>> supports all these operations.


Actually, it's possible to configure Nutch to allow raw field queries -
you need to add a raw field query plugin for this. Please see
RawFieldQueryFilter class, and existing plugins that use fielded
queries: query-site, and query-more. Query-more / DateQueryFilter is
especially interesting, because it shows how to use raw token values
from a parsed query to build complex Lucene queries.


>>
>> I notice by looking at the or.patch file
>> (https://issues.apache.org/jira/secure/attachment/12360659/or.patch)
>> that one of the programs under consideration is:
>> nutch/searcher/Query.java
>> The code for this is distinct from
>> lucene/search/Query.java


See above - they are completely different classes, with completely
different purpose. The use of the same class name is unfortunate and
misleading.

Nutch Query class is intended to express queries entered by search
engine users, in a tokenized and parsed way, so that the rest of Nutch
may deal with Clauses, Terms and Phrases instead of plain String-s.

On the other hand, Lucene Query is intended to express arbitrarily
complex Lucene queries - many of these queries would be prohibitively
expensive for a large search engine (e.g. wildcard queries).
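
For illustration, here is a minimal sketch (using the Lucene 2.x-era API) of
the kind of arbitrarily complex query the Lucene Query class can express - an
OR of a plain term and a wildcard - which the deliberately restricted Nutch
syntax does not expose:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class LuceneQuerySketch {
    public static void main(String[] args) {
        // OR ("SHOULD") combination of a term query and a wildcard query.
        // Wildcards expand into many terms at search time, which is exactly
        // the kind of cost a large-scale engine wants to keep off its query path.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("content", "nutch")), BooleanClause.Occur.SHOULD);
        query.add(new WildcardQuery(new Term("content", "hadoo*")), BooleanClause.Occur.SHOULD);
        System.out.println(query);   // prints: content:nutch content:hadoo*
    }
}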


>>
>> It looks like this is an architecture issue that I don't understand.
>> If nutch is an "extension" of lucene, why does it define a different
>> Query class?

Nutch is NOT an extension of Lucene. It's an application that uses
Lucene as a library.


>>  Why don't we just use the Lucene code to query the
>> indexes?  Does this have something to do with the nutch webapp
>> (nutch.war)?  What is the historical genesis of this issue (or is that
>> even relevant)?

Nutch webapp doesn't have anything to do with it. The limitations in the
query syntax have different roots (see above).

--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
"Conscious decisions by conscious minds are what make reality real"


Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-07 Thread Briggs

Please keep this thread going as I am also curious to know why this
has been 'forked'.   I am sure that most of this lies within the
original OPIC filter but I still can't understand why straightforward
lucene queries have not been used within the application.



On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

I've been reading up on NUTCH-479 "Support for OR queries" but I must be 
missing something obvious because I don't understand what the JIRA is about:

https://issues.apache.org/jira/browse/NUTCH-479

   Description:
   There have been many requests from users to extend Nutch query syntax
   to add support for OR queries, in addition to the implicit AND and NOT
   queries supported now.

Ok, so I guess what I don't understand is what is the "Nutch query syntax"?

The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.

I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

It looks like this is an architecture issue that I don't understand.  If nutch is an 
"extension" of lucene, why does it define a different Query class?  Why don't 
we just use the Lucene code to query the indexes?  Does this have something to do with 
the nutch webapp (nutch.war)?  What is the historical genesis of this issue (or is that 
even relevant)?










--
"Conscious decisions by conscious minds are what make reality real"


Re: Reload index

2007-06-20 Thread Briggs

Strange... Here is the quoted, unedited, partially incorrect post... ;-)


"I would say that the best thing to do is to create a new nutch bean.

I never cared much for the nutch bean containing logic to store itself
in a servlet context.  I do not believe that this is the place for
such logic.  It should be up to the user to place the nutch bean into
the servlet context and not the bean.  My implementation of a "nutch
bean" has no knowledge of a servlet context and I believe this
dependency should be removed.  Why should nutch care about such
details?

Anyway, enough with my tiny rant.

You could just create a 'reload.jsp' (or any servlet, or whatever you
want that can get ahold of the servlet context) and do the work...

The current way nutch finds an instance of the search bean is within
the static method
get(ServletContext, Configuration) within the NutchBean class.

So, in your java class, jsp or whatever, just replace the instance
with something like:

servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration));

Hope that gets you on your way.

You could always edit, or subclass the nutch bean with a
'reload/reinit' method too that could just do the same thing."

On 6/20/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:

Thanks, Briggs.

I will try to create a new NutchBean to see if that solves the reloading
issue.

By the way, your former mail does not seem to have reached the
mailing list. I can't seem to find it anyway.

-Ronny

-Opprinnelig melding-
Fra: Briggs [mailto:[EMAIL PROTECTED]
Sendt: 20. juni 2007 01:22
Til: nutch-user@lucene.apache.org
Emne: Re: Reload index

By the way, I was wrong about one thing, you can't override the 'get'
method of nutch bean because it's static. Doh, that was a silly
oversight.

But again, if you are using nutch and you need to 'reload' the index,
you need only to create a new NutchBean (that is if the NutchBean is
what you are using).

On 6/19/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> This will reload the application, isn't this correct? This is
> something I do not want, as specified below.
>
> Is it possible to maybe manipulate the IndexReader part of the nutch
> web client to read whenever I tell it to, or something like that?
>
> Or do I have to write my own client bottom up?
>
> Regards,
> Ronny
>
> -Opprinnelig melding-
> Fra: Susam Pal [mailto:[EMAIL PROTECTED]
> Sendt: 18. juni 2007 17:33
> Til: nutch-user@lucene.apache.org
> Emne: Re: Reload index
>
> touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml
>
> $CATALINA_HOME is the top level directory of Tomcat. It works for most

> cases.
>
> Regards,
> Susam Pal
> http://susam.in/
>
> On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> >
> > Is there a way to reload index without restarting application server

> > or reloading application?
> >
> > I have integrated Nutch into our app but we can not restart or
> > reload the app everytime we have created a new index.
> >
> >
> > Regards,
> > Ronny
> >
>
>
>
>


--
"Conscious decisions by conscious minds are what make reality real"






--
"Conscious decisions by conscious minds are what make reality real"


Re: Reload index

2007-06-19 Thread Briggs

By the way, I was wrong about one thing, you can't override the 'get'
method of nutch bean because it's static. Doh, that was a silly
oversight.

But again, if you are using nutch and you need to 'reload' the index,
you need only to create a new NutchBean (that is if the NutchBean is
what you are using).

On 6/19/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:

This will reload the application, isn't this correct? This is something
I do not want, as specified below.

Is it possible to maybe manipulate the IndexReader part of the nutch web
client to read whenever I tell it to, or something like that?

Or do I have to write my own client bottom up?

Regards,
Ronny

-Opprinnelig melding-
Fra: Susam Pal [mailto:[EMAIL PROTECTED]
Sendt: 18. juni 2007 17:33
Til: nutch-user@lucene.apache.org
Emne: Re: Reload index

touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml

$CATALINA_HOME is the top level directory of Tomcat. It works for most
cases.

Regards,
Susam Pal
http://susam.in/

On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
>
> Is there a way to reload index without restarting application server
> or reloading application?
>
> I have integrated Nutch into our app but we can not restart or reload
> the app everytime we have created a new index.
>
>
> Regards,
> Ronny
>






--
"Conscious decisions by conscious minds are what make reality real"


Re: Reload index

2007-06-18 Thread Briggs

I would say that the best thing to do is to create a new nutch bean.

I never cared much for the nutch bean containing logic to store itself
in a servlet context.  I do not believe that this is the place for
such logic.  It should be up to the user to place the nutch bean into
the servlet context and not the bean.  My implementation of a "nutch
bean" has no knowledge of a servlet context and I believe this
dependency should be removed.  Why should nutch care about such
details?

Anyway, enough with my tiny rant.

You could just create a 'reload.jsp' (or any servlet, or whatever you
want that can get ahold of the servlet context) and do the work...

The current way nutch finds an instance of the search bean is within
the static method
get(ServletContext, Configuration) within the NutchBean class.

So, in your java class, jsp or whatever, just replace the instance
with something like:

servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration));

Hope that gets you on your way.

You could always edit, or subclass the nutch bean with a
'reload/reinit' method too that could just do the same thing.
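
To make that concrete, here is a rough sketch of such a reload servlet,
following the approach above. The attribute name "nutchBean" and the servlet
class itself are illustrative assumptions, not stock Nutch; NutchBean(conf)
and NutchConfiguration.create() are the regular Nutch/Hadoop calls.

import java.io.IOException;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class ReloadIndexServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        ServletContext context = getServletContext();
        Configuration conf = NutchConfiguration.create(); // re-reads nutch-default.xml / nutch-site.xml
        NutchBean freshBean = new NutchBean(conf);        // opens the newly built index
        context.setAttribute("nutchBean", freshBean);     // searches now go against the new bean
        response.getWriter().println("index reloaded");
    }
}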

On 6/18/07, Susam Pal <[EMAIL PROTECTED]> wrote:

touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml

$CATALINA_HOME is the top level directory of Tomcat. It works for most cases.

Regards,

Susam Pal
http://susam.in/

On 6/18/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
>
> Is there a way to reload index without restarting application server or
> reloading application?
>
> I have integrated Nutch into our app but we can not restart or reload
> the app everytime we have created a new index.
>
>
> Regards,
> Ronny
>




--
"Conscious decisions by conscious minds are what make reality real"


Re: fetch failing while crawling

2007-06-15 Thread Briggs

Oh and as for the web interface, take a look at the wiki page:

http://wiki.apache.org/nutch/NutchTutorial

The bottom of the page has a section on searching.

On 6/15/07, Briggs <[EMAIL PROTECTED]> wrote:

Yeah, you still don't have the agent configured.  All your values for
the agent properties (each <value> needs a value) are blank.  So, you need
to at least configure an agent name.



On 6/15/07, karan thakral <[EMAIL PROTECTED]> wrote:
> I am using crawl on Cygwin while working on Windows,
>
> but the crawl output is not proper.
>
> During fetch it says: fetch: the document could not be fetched - java runtime
> exception: agent not configured
>
> my nutch-site.xml is as follows:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value></value>
>   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
>   please set this to a single word uniquely related to your organization.
>
>   NOTE: You should also check other related properties:
>
>   http.robots.agents
>   http.agent.description
>   http.agent.url
>   http.agent.email
>   http.agent.version
>
>   and set their values appropriately.
>
>   </description>
> </property>
>
> <property>
>   <name>http.agent.description</name>
>   <value></value>
>   <description>Further description of our bot- this text is used in
>   the User-Agent header.  It appears in parenthesis after the agent name.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.url</name>
>   <value></value>
>   <description>A URL to advertise in the User-Agent header.  This will
>    appear in parenthesis after the agent name. Custom dictates that this
>    should be a URL of a page explaining the purpose and behavior of this
>    crawler.
>   </description>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value></value>
>   <description>An email address to advertise in the HTTP 'From' request
>    header and User-Agent header. A good practice is to mangle this
>    address (e.g. 'info at example dot com') to avoid spamming.
>   </description>
> </property>
> </configuration>
>
>   but still there's an error
>
> also please throw some light on the searching of info through the web
> interface after the crawl is made successful
> --
> With Regards
> Karan Thakral
>


--
"Conscious decisions by conscious minds are what make reality real"




--
"Conscious decisions by conscious minds are what make reality real"


Re: fetch failing while crawling

2007-06-15 Thread Briggs

Yeah, you still don't have the agent configured.  All your values for
the agent properties (each <value> needs a value) are blank.  So, you need
to at least configure an agent name.



On 6/15/07, karan thakral <[EMAIL PROTECTED]> wrote:

I am using crawl on Cygwin while working on Windows,

but the crawl output is not proper.

During fetch it says: fetch: the document could not be fetched - java runtime
exception: agent not configured

my nutch-site.xml is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>

  but still there's an error

also please throw some light on the searching of info through the web
interface after the crawl is made successful
--
With Regards
Karan Thakral




--
"Conscious decisions by conscious minds are what make reality real"


Re: Explanation of topN

2007-06-08 Thread Briggs

Well, the quick/simple explanation is:

If you have 5 urls with their associated nutch scores:

http://a.com/something1 = 5.0
http://b.com/something2 = 4.0
http://c.com/something3 = 3.0
http://d.com/something4 = 2.0
http://e.com/something5 = 1.0

If you then set nutch to crawl with topN = 3, then a, b, and c will be fetched
and d and e will not.  It just means "give me the 3 best-ranking URLs"
from the current crawl database.
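
A small standalone sketch of that selection logic (the scores are the made-up
example values above, not anything Nutch actually computes here):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopNDemo {
    public static void main(String[] args) {
        // The example scores from above (purely illustrative values).
        Map<String, Float> scores = new LinkedHashMap<String, Float>();
        scores.put("http://a.com/something1", 5.0f);
        scores.put("http://b.com/something2", 4.0f);
        scores.put("http://c.com/something3", 3.0f);
        scores.put("http://d.com/something4", 2.0f);
        scores.put("http://e.com/something5", 1.0f);

        int topN = 3;
        List<Map.Entry<String, Float>> entries =
                new ArrayList<Map.Entry<String, Float>>(scores.entrySet());
        // Highest score first: "give me the N best-ranking URLs".
        Collections.sort(entries, new Comparator<Map.Entry<String, Float>>() {
            public int compare(Map.Entry<String, Float> a, Map.Entry<String, Float> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        for (int i = 0; i < topN && i < entries.size(); i++) {
            System.out.println(entries.get(i).getKey()); // prints a, b, c; d and e are skipped
        }
    }
}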

On 6/8/07, monkeynuts84 <[EMAIL PROTECTED]> wrote:


Can someone give me an explanation of what topN does? I've seen various
pieces of info but some of them seem to be conflicting. I've noticed in my
crawls that certain sites are crawled more than others in each iteration of a
fetch. Is this caused by topN?

Thanks.
--
View this message in context: 
http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11033441
Sent from the Nutch - User mailing list archive at Nabble.com.





--
"Conscious decisions by conscious minds are what make reality real"


Re: indexing only special documents

2007-06-07 Thread Briggs

Ronny, your way is probably better.  See, I was only dealing with the
fetched properties.  But, in your case, you don't fetch it, which gets rid
of all that wasted bandwidth.

For dealing with types that can be dealt with via the file extension, this
would probably work better.


On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:



Hi.

Configure crawl-urlfilter.txt.
Thus you want to add something like +\.pdf$. I guess another way would be
to exclude all the others.

Try expanding the line below with html, doc, xls, ppt, etc
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|r
pm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

Or try including
+\.pdf$
#
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|r
pm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
Followed by
-.

Haven't tried it myself, but experiment some and I guess you'll figure it
out pretty soon.
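
A tiny self-contained illustration (hypothetical URLs, abbreviated extension
list, plain java.util.regex) of how rules like these behave when applied in
order, first match winning:

import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Same ordering as the "Or try including" variant above:
        // +\.pdf$ accepts PDFs, the extension list rejects images/binaries,
        // and the final "-." rule rejects everything left over.
        Pattern acceptPdf = Pattern.compile("\\.pdf$");
        Pattern rejectExt = Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG|zip|ppt|xls|exe|js|JS)$");
        String[] urls = {
            "http://example.com/paper.pdf",
            "http://example.com/logo.gif",
            "http://example.com/index.html"
        };
        for (String url : urls) {
            String verdict;
            if (acceptPdf.matcher(url).find()) {
                verdict = "accepted by +\\.pdf$";
            } else if (rejectExt.matcher(url).find()) {
                verdict = "rejected by the extension rule";
            } else {
                verdict = "rejected by the final -. rule";
            }
            System.out.println(url + " -> " + verdict);
        }
    }
}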

Regards,
Ronny

-Opprinnelig melding-
Fra: Martin Kammerlander [mailto:[EMAIL PROTECTED]

Sendt: 6. juni 2007 20:30
Til: nutch-user@lucene.apache.org
Emne: indexing only special documents



hi!

I have a question. If I have for example the seed urls and do a crawl
based o that seeds. If I want to index then only pages that contain for
example pdf documents, how can I do that?

cheers
martin








--
"Conscious decisions by conscious minds are what make reality real"


Re: indexing only special documents

2007-06-06 Thread Briggs

All the plugins are in the nutch source distribution and are found in:
src/plugin

There is nothing that really provides near real-time statistics other
than the logging.  I am planning on writing a few analysis plugins,
perhaps just using aspects, to allow a JMX client to monitor the process
(while trying not to be so invasive that it hurts performance).   I haven't
done it yet, but I don't see plugin creation as "too difficult" (if you
are comfortable with parsing).

There are some processes that you can run to dump metadata and
other useful info for looking into your segments and url databases.
Just run:

bin/nutch

It will show you the options to run for reading the data.  You can
find out how many urls were successfully fetched, how many failed, and the
total number of urls, etc.  Look at the nutch 0.8 wiki entry
http://wiki.apache.org/nutch/08CommandLineOptions .  It just shows the
shell output for the nutch options to run.  It will give you an idea
of what is available.

For finding how many documents were fetched of specific types, you
would be better off just using the search bean and, basically, using
Lucene to find out those things.  Otherwise you would have to write
your own implementation to read the data.
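
As a sketch of that last idea - querying the Lucene index directly to count
documents of one content type - something along these lines should work,
assuming the index-more plugin has stored a "type" field and the merged index
lives under crawl/index (both the field name and the path are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class CountByType {
    public static void main(String[] args) throws Exception {
        // Open the merged index produced by the crawl (path is an assumption).
        IndexSearcher searcher = new IndexSearcher("crawl/index");
        // "type" is assumed to be the field index-more writes the content type into.
        Hits hits = searcher.search(new TermQuery(new Term("type", "application/pdf")));
        System.out.println("pdf documents in the index: " + hits.length());
        searcher.close();
    }
}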

I am learning more about nutch every day, so I can't claim everything I
have said is 100% correct.


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:

Wow, thanks Briggs, that's pretty cool and it looks easy :) Great!! I will try this
out tomorrow... bit late here now.

Two additional questions:

1. Those "parse" plugins - where do I find them in the nutch source code? Is it
possible and easy to write my own parser plugin? I think I'm going to
need some additional non-standard parser plugin(s).

2. When I do a crawl, is it possible to activate or see some statistics
in nutch for that? I mean that at the end of the indexing process it shows me how
many urls nutch parsed, how many of them contained e.g. pdfs, and
additionally how long the crawling and indexing process took, and so on?

thx for support
martin



Zitat von Briggs <[EMAIL PROTECTED]>:

> You set that up in your nutch-site.xml file. Open the
> nutch-default.xml file (located in the conf directory of your nutch install). Look
> for this element:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with
>   the underlying commons-httpclient library.
>   </description>
> </property>
>
>
> You'll notice the "parse" plugins that use the regex
> "parse-(text|html|pdf|msword|rss)".  You remove/add the available
> parsers here. So, if you only wanted pdfs, you only use the pdf
> parser, "parse-(pdf)" or just "parse-pdf".
>
> Don't edit the nutch-default file. Create a new nutch-site.xml file
> for your customizations.  So, basically copy the nutch-default.xml
> file, remove everything you don't need to override, and there ya go.
>
> I believe that is the correct way.
>
>
> On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]>
> wrote:
> >
> >
> > hi!
> >
> > I have a question. If I have for example the seed urls and do a crawl based
> o
> > that seeds. If I want to index then only pages that contain for example pdf
> > documents, how can I do that?
> >
> > cheers
> > martin
> >
> >
> >
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>







--
"Conscious decisions by conscious minds are what make reality real"


Re: indexing only special documents

2007-06-06 Thread Briggs

You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in the conf directory of your nutch install). Look
for this element:


<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


You'll notice the "parse" plugins that use the regex
"parse-(text|html|pdf|msword|rss)".  You remove/add the available
parsers here. So, if you only wanted pdfs, you only use the pdf
parser, "parse-(pdf)" or just "parse-pdf".

Don't edit the nutch-default file. Create a new nutch-site.xml file
for your customizations.  So, basically copy the nutch-default.xml
file, remove everything you don't need to override, and there ya go.

I believe that is the correct way.


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:



hi!

I have a question. If I have for example the seed urls and do a crawl based o
that seeds. If I want to index then only pages that contain for example pdf
documents, how can I do that?

cheers
martin






--
"Conscious decisions by conscious minds are what make reality real"


Re: urls/nutch in local is invalid

2007-06-06 Thread Briggs

I haven't heard of an IRC channel for it, but that would be cool.


On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]>
wrote:


I see now what's causing the error. /urls/nutch is a file... but you have to
give as input only the urls folder, not the file, as I did ;)

ps: is there an irc channel for nutch or 'only' mailing list?

thx
martin

Zitat von Briggs <[EMAIL PROTECTED]>:

> is urls/nutch a file or directory?
>
> On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]>
> wrote:
> > Hi
> >
> > I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
> > Unfortunately I get the following error:
> >
> > [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir 
crawl.test -depth 10
> > crawl started in: crawl.test
> > rootUrlDir = urls/nutch
> > threads = 10
> > depth = 10
> > Injector: starting
> > Injector: crawlDb: crawl.test/crawldb
> > Injector: urlDir: urls/nutch
> > Injector: Converting injected urls to crawl db entries.
> > Exception in thread "main" java.io.IOException: Input directory
> > /scratch/nutch-0.8.1/urls/nutch in local is invalid.
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
> > at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
> >
> > Any ideas what is causing that?
> >
> > regards
> > martin
> >
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>







--
"Conscious decisions by conscious minds are what make reality real"


Re: urls/nutch in local is invalid

2007-06-06 Thread Briggs

is urls/nutch a file or directory?

On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:

Hi

I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
Unfortunately I get the following error:

[EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test 
-depth 10
crawl started in: crawl.test
rootUrlDir = urls/nutch
threads = 10
depth = 10
Injector: starting
Injector: crawlDb: crawl.test/crawldb
Injector: urlDir: urls/nutch
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory
/scratch/nutch-0.8.1/urls/nutch in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Any ideas what is causing that?

regards
martin




--
"Conscious decisions by conscious minds are what make reality real"


Re: Loading mechnism of plugin classes and singleton objects

2007-06-06 Thread Briggs

This is all I did (and from what I have read, double-checked locking
works correctly in JDK 5):

private static volatile IndexingFilters INSTANCE;

public static IndexingFilters getInstance(final Configuration configuration) {
  // Double-checked locking: only synchronize on the rare first call.
  if (INSTANCE == null) {
    synchronized (IndexingFilters.class) {
      if (INSTANCE == null) {
        INSTANCE = new IndexingFilters(configuration);
      }
    }
  }
  return INSTANCE;
}

So, I just updated all the code that calls "new IndexingFilters(..)" to call
IndexingFilters.getInstance(...).  This works for me, perhaps not for everyone.
I think that the filter interface should be refitted to allow the
configuration instance to be passed along to the filters too, or to allow a way
for the thread to obtain its current configuration, rather than
instantiating these things over and over again.  If a filter is designed to
be thread-safe, there is no need for all this unnecessary object creation.
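
For reference, a call site under that modification would look roughly like the
following (getInstance is the patched accessor shown above, not a stock Nutch
method; NutchConfiguration.create() is the usual way to obtain a Configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingFilters;
import org.apache.nutch.util.NutchConfiguration;

public class FilterCaller {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Before the patch: new IndexingFilters(conf) at every call site,
        // which re-ran plugin lookup each time.
        IndexingFilters filters = IndexingFilters.getInstance(conf);
        System.out.println("shared IndexingFilters instance: " + filters);
    }
}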


On 6/6/07, Briggs <[EMAIL PROTECTED]> wrote:


FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes.  So, if you
ever dump the stats from your app (use 'jmap -histo') you can see all the
classes that have been loaded. You will notice, if you have been running
nutch for a while,  classes being loaded thousands of times and never
unloaded. My quick fix was to just edit all the main plugin points (
URLFilters.java, IndexingFilters.java, etc.) and made them all singletons.  I
haven't had time to look into the classloading facility.  There is a bit of
a bug in there (IMHO), but some people may not want singletons.  But, there
needs to be a way of just instantiating a new plugin, and not instantiating
a new classloader every time a plugin is requested.  These seem to never get
garbage collected.

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney <[EMAIL PROTECTED] > wrote:
>
> Hi,
>
> It seems that plugin-loading code is somehow broken. There is some
> discussion going on about this on
> http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .
>
> On 6/5/07, Enzo Michelangeli < [EMAIL PROTECTED]> wrote:
> > I have a question about the loading mechanism of plugin classes. I'm
> working
> > with a custom URLFilter, and I need a singleton object loaded and
> > initialized by the first instance of the URLFilter, and shared by
> other
> > instances (e.g., instantiated by other threads). I was assuming that
> the
> > URLFilter class was being loaded only once even when the filter is
> used by
> > multiple threads, so I tried to use a static member variable of my
> URLFilter
> > class to hold a reference to the object to be shared: but it appears
> that
> > the supposed singleton, actually, isn't, because the method
> responsible for
> > its instantiation finds the static field initialized to null. So: are
> > URLFilter classes loaded multiple times by their classloader in Nutch?
> The
> > wiki page at
> >
> 
http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
> > seems to suggest otherwise:
> >
> > Until Nutch runtime, only one instance of such a plugin
> > class is alive in the Java virtual machine.
> >
> > (By the way, what does "Until Nutch runtime" mean here? Before Nutch
> > runtime, no class whatsoever is supposed to be alive in the JVM, is
> it?)
> >
> > Enzo
> >
> >
>
> --
> Doğacan Güney
>



--
"Conscious decisions by conscious minds are what make reality real"





--
"Conscious decisions by conscious minds are what make reality real"


Re: Loading mechnism of plugin classes and singleton objects

2007-06-06 Thread Briggs

FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes.  So, if you
ever dump the stats from your app (use 'jmap -histo') you can see all the
classes that have been loaded. You will notice, if you have been running
nutch for a while,  classes being loaded thousands of times and never
unloaded. My quick fix was to just edit all the main plugin points (
URLFilters.java, IndexingFilters.java, etc.) and made them all singletons.  I
haven't had time to look into the classloading facility.  There is a bit of
a bug in there (IMHO), but some people may not want singletons.  But, there
needs to be a way of just instantiating a new plugin, and not instantiating
a new classloader every time a plugin is requested.  These seem to never get
garbage collected.

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:


Hi,

It seems that plugin-loading code is somehow broken. There is some
discussion going on about this on
http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .

On 6/5/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
> I have a question about the loading mechanism of plugin classes. I'm
working
> with a custom URLFilter, and I need a singleton object loaded and
> initialized by the first instance of the URLFilter, and shared by other
> instances (e.g., instantiated by other threads). I was assuming that the
> URLFilter class was being loaded only once even when the filter is used
by
> multiple threads, so I tried to use a static member variable of my
URLFilter
> class to hold a reference to the object to be shared: but it appears
that
> the supposed singleton, actually, isn't, because the method responsible
for
> its instantiation finds the static field initialized to null. So: are
> URLFilter classes loaded multiple times by their classloader in Nutch?
The
> wiki page at
>
http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
> seems to suggest otherwise:
>
> Until Nutch runtime, only one instance of such a plugin
> class is alive in the Java virtual machine.
>
> (By the way, what does "Until Nutch runtime" mean here? Before Nutch
> runtime, no class whatsoever is supposed to be alive in the JVM, is it?)
>
> Enzo
>
>

--
Doğacan Güney





--
"Conscious decisions by conscious minds are what make reality real"


Re: Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

Doh!  Again, I missed that.  Thanks... Just wish it had a better
explanation.


On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:


Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> Here is another example that keeps saying it can't parse it...
>
> SegmentReader: get '
> http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'
> Content::
> Version: 2
> url:
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> base:
> http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
> contentType:
> metadata: nutch.segment.name=20070601050840
nutch.crawl.score=3.5455807E-5
> Content:
>
> These are the headers:
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 15:38:15 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Window-Target: _top
> X-Highwire-SessionId: nh2ukcdpv1.JS1
> Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
>
>
> So, that's it. any ideas?

In both examples nutch wasn't able to fetch the page. When a url can't
be fetched, the fetcher creates empty content for it. That's why you
can't parse them: there is nothing to parse :).

You can't fetch http://hea.sagepub.com/cgi/alerts and
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
because both hosts have robots.txt files that disallow access to your
urls.
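
A quick way to confirm that by hand from Java (nothing Nutch-specific, just
fetching robots.txt and printing the relevant lines; the host is taken from
the example above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the host's robots.txt and print the agent/disallow rules.
        URL robots = new URL("http://hea.sagepub.com/robots.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("User-agent") || line.startsWith("Disallow")) {
                System.out.println(line);
            }
        }
        in.close();
    }
}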

>
>
>
> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> >
> >
> > So, here is one:
> >
> > http://hea.sagepub.com/cgi/alerts
> >
> > Segment Reader reports:
> >
> > Content::
> > Version: 2
> > url: http://hea.sagepub.com/cgi/alerts
> > base: http://hea.sagepub.com/cgi/alerts
> > contentType:
> > metadata: nutch.segment.name=20070601045920
nutch.crawl.score=0.04168
> > Content:
> >
> > So, I notice when I try to crawl that url specifically, I get a job
failed
> > (array index out of bounds -1 exception).
> >
> > But if I use curl like:
> >
> > curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt
> >
> > I get content and the headers are:
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 15:03:28 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: xlz2cgcww1.JS1
> > Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > So, I'm lost.
> >
> >
> > On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > > > So, I have been having huge problems with parsing.  It seems that
many
> > > > urls are being ignored because the parser plugins throw and
exception
> > > > saying there is no parser found for, what is reportedly, and
> > > > unresolved contentType.  So, if you look at the exception:
> > > >
> > > >   org.apache.nutch.parse.ParseException: parser not found for
> > > > contentType= url=
> > > http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> > > >
> > > > You can see that it says the contentType is "".  But, if you look
at
> > > > the headers for this request you can see that the Content-Type
header
> > > > is set at "text/html":
> > > >
> > > > HTTP/1.1 200 OK
> > > > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > > > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > > > Cache-Control: no-store
> > > > X-Highwire-SessionId: y1851mbb91.JS1
> > > > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > > > Transfer-Encoding: chunked
> > > > Content-Type: text/html
> > > >
> > > > Is there something that I have set up wrong?  This happens on a
LOT of
> > >
> > > > pages/sites.  My current plugins are set at:
> > > >
> > > >
> > >
"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> > >
> > > >
> > > >
> > > > Here is another URL:
> > > >
> > > > http://www.bionews.org.uk/
> > > >
> > > >
> > Same issue with parsing (parser not found for contentType=

Re: Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

Here is another example that keeps saying it can't parse it...

SegmentReader: get '
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'
Content::
Version: 2
url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
base:
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
contentType:
metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
Content:

These are the headers:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:38:15 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Window-Target: _top
X-Highwire-SessionId: nh2ukcdpv1.JS1
Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html



So, that's it. any ideas?



On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:



So, here is one:

http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168
Content:

So, I notice when I try to crawl that url specifically, I get a job failed
(array index out of bounds -1 exception).

But if I use curl like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.


On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > So, I have been having huge problems with parsing.  It seems that many
> > urls are being ignored because the parser plugins throw and exception
> > saying there is no parser found for, what is reportedly, and
> > unresolved contentType.  So, if you look at the exception:
> >
> >   org.apache.nutch.parse.ParseException: parser not found for
> > contentType= url=
> http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> >
> > You can see that it says the contentType is "".  But, if you look at
> > the headers for this request you can see that the Content-Type header
> > is set at "text/html":
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: y1851mbb91.JS1
> > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > Is there something that I have set up wrong?  This happens on a LOT of
>
> > pages/sites.  My current plugins are set at:
> >
> >
> 
"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> >
> >
> > Here is another URL:
> >
> > http://www.bionews.org.uk/
> >
> >
> > Same issue with parsing (parser not found for contentType=
> > url= http://www.bionews.org.uk/), but the header says:
> >
> > HTTP/1.0 200 OK
> > Server: Lasso/3.6.5 ID/ACGI
> > MIME-Version: 1.0
> > Content-type: text/html
> > Content-length: 69417
> >
> >
> > Any clues?  Does nutch look at the headers or not?
>
> Can you do a
> bin/nutch readseg -get <segment dir> <url> -noparse -noparsetext
> -noparsedata -nofetch -nogenerate
>
> And send the result? This should show us what nutch fetched as content.
>
>
> >
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
> >
>
>
> --
> Doğacan Güney
>



--
"Conscious decisions by conscious minds are what make reality real"





--
"Conscious decisions by conscious minds are what make reality real"


Re: Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

So, here is one:

http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168
Content:

So, I notice when I try to crawl that url specifically, I get a job failed
(array index out of bounds -1 exception).

But if I use curl like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.


On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:


Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, I have been having huge problems with parsing.  It seems that many
> urls are being ignored because the parser plugins throw and exception
> saying there is no parser found for, what is reportedly, and
> unresolved contentType.  So, if you look at the exception:
>
>   org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=
http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
>
> You can see that it says the contentType is "".  But, if you look at
> the headers for this request you can see that the Content-Type header
> is set at "text/html":
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 13:54:19 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Cache-Control: no-store
> X-Highwire-SessionId: y1851mbb91.JS1
> Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> Is there something that I have set up wrong?  This happens on a LOT of
> pages/sites.  My current plugins are set at:
>
>
"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
>
> Here is another URL:
>
> http://www.bionews.org.uk/
>
>
> Same issue with parsing (parser not found for contentType=
> url=http://www.bionews.org.uk/), but the header says:
>
> HTTP/1.0 200 OK
> Server: Lasso/3.6.5 ID/ACGI
> MIME-Version: 1.0
> Content-type: text/html
> Content-length: 69417
>
>
> Any clues?  Does nutch look at the headers or not?

Can you do a
bin/nutch readseg -get <segment dir> <url> -noparse -noparsetext
-noparsedata -nofetch -nogenerate

And send the result? This should show us what nutch fetched as content.

>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


--
Doğacan Güney





--
"Conscious decisions by conscious minds are what make reality real"


Re: Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

Looking into the first URL... Don't look at the second, I screwed up on
that.  It's a bad Disallow example... But I'm working on finding the segment
for the first. Thanks for your quick response, I'll be getting right back
to you.


On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:


Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, I have been having huge problems with parsing.  It seems that many
> urls are being ignored because the parser plugins throw and exception
> saying there is no parser found for, what is reportedly, and
> unresolved contentType.  So, if you look at the exception:
>
>   org.apache.nutch.parse.ParseException: parser not found for
> contentType= url=
http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
>
> You can see that it says the contentType is "".  But, if you look at
> the headers for this request you can see that the Content-Type header
> is set at "text/html":
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 13:54:19 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Cache-Control: no-store
> X-Highwire-SessionId: y1851mbb91.JS1
> Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> Is there something that I have set up wrong?  This happens on a LOT of
> pages/sites.  My current plugins are set at:
>
>
"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
>
> Here is another URL:
>
> http://www.bionews.org.uk/
>
>
> Same issue with parsing (parser not found for contentType=
> url=http://www.bionews.org.uk/), but the header says:
>
> HTTP/1.0 200 OK
> Server: Lasso/3.6.5 ID/ACGI
> MIME-Version: 1.0
> Content-type: text/html
> Content-length: 69417
>
>
> Any clues?  Does nutch look at the headers or not?

Can you do a
bin/nutch readseg -get <segment dir> <url> -noparse -noparsetext
-noparsedata -nofetch -nogenerate

And send the result? This should show us what nutch fetched as content.

>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


--
Doğacan Güney





--
"Conscious decisions by conscious minds are what make reality real"


Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

So, I have been having huge problems with parsing.  It seems that many
urls are being ignored because the parser plugins throw an exception
saying there is no parser found for what is, reportedly, an
unresolved contentType.  So, if you look at the exception:

 org.apache.nutch.parse.ParseException: parser not found for
contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

You can see that it says the contentType is "".  But, if you look at
the headers for this request you can see that the Content-Type header
is set at "text/html":

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

Is there something that I have set up wrong?  This happens on a LOT of
pages/sites.  My current plugins are set at:

"protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)


Here is another URL:

http://www.bionews.org.uk/


Same issue with parsing (parser not found for contentType=
url=http://www.bionews.org.uk/), but the header says:

HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417


Any clues?  Does nutch look at the headers or not?


--
"Conscious decisions by conscious minds are what make reality real"


Speed up indexing....

2007-05-30 Thread Briggs

Anyone have any good configuration ideas for indexing/merging with 0.9
using hadoop on a local fs?  Our segment merging is taking an
extremely long time compared with nutch 0.7.  Currently, I am trying
to merge 300 segments, which amounts to about 1gig of data.  It has
taken hours to merge, and it's still not done. This box has dual Xeon
2.8GHz processors with 4 gigs of RAM.

So, I figure there must be a better setup in the mapred-default.xml
for a single machine.  Do I increase the file size for I/O buffers,
sort buffers, etc.?  Do I reduce the number of tasks or increase them?
I'm at a loss.

Any advice would be greatly appreciated.


--
"Conscious decisions by conscious minds are what make reality real"


Re: Nutch on Windows. ssh: command not found

2007-05-30 Thread Briggs

So, when in Cygwin, if you type 'ssh' (without the quotes), do you get
the same error? If so, then you need to go back into the Cygwin setup
and install ssh.


On 5/30/07, Ilya Vishnevsky <[EMAIL PROTECTED]> wrote:

Hello. I am trying to run the shell scripts that start Nutch. I use Windows XP, so I
installed Cygwin. When I execute bin/start-all.sh, I get the following
messages:

localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh:
command not found

localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh:
command not found

Could you help me with this problem?




--
"Conscious decisions by conscious minds are what make reality real"


Re: Problem crawling in Nutch 0.9

2007-05-14 Thread Briggs

Just curious, did you happen to limit the number of urls using the
"topN" switch?

On 5/14/07, Annona Keene <[EMAIL PROTECTED]> wrote:

I recently upgraded to 0.9, and I've started encountering a problem. I began 
with a single url and crawled with a depth of 10, assuming I would get every 
page on my site. This same configuration worked for me in 0.8.  However, I 
noticed a particular url that I was especially interested in was not in the 
index. So I added the url explicitly and crawled again. And it still was not in 
the index. So I checked the logs, and it is being fetched. So I tried a lower 
depth, and it worked. With a depth of 6, the url does appear in the index. Any 
ideas on what would be causing this? I'm very confused.

Thanks,
Ann








--
"Conscious decisions by conscious minds are what make reality real"


Re: Nutch Indexer

2007-05-01 Thread Briggs

Man, I should proofread this stuff before I send them. That is all I
have to say.

On 5/1/07, Briggs <[EMAIL PROTECTED]> wrote:

I would assume that it needs these for handling the indexing of the
link scores.  Lucene puts no scoring weight on things such as urls,
page rank and such. Since lucene only indexes documents, and
calculates its keyword/query relevancy based only on term vectors (or
whatever), nutch needs to add the url scoring and such to the index.



On 5/1/07, hzhong <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> In Indexer.java,  index(Path indexDir, Path crawlDb, Path linkDb, Path[]
> segments), can someone explain to me why crawlDB and linkDB is needed for
> indexing?
>
> In Lucene, there's no crawlDB and linkDB for indexing.
>
> Thank you very much
>
> Hanna
> --
> View this message in context: 
http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


--
"Conscious decisions by conscious minds are what make reality real"




--
"Conscious decisions by conscious minds are what make reality real"


Re: Nutch Indexer

2007-05-01 Thread Briggs

I would assume that it needs these for handling the indexing of the
link scores.  Lucene puts no scoring weight on things such as urls,
page rank and such. Since lucene only indexes documents, and
calculates its keyword/query relevancy based only on term vectors (or
whatever), nutch needs to add the url scoring and such to the index.
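
As a rough illustration of that point (not the actual Indexer code): the score
read from the crawldb/linkdb ends up influencing ranking as a Lucene document
boost, since Lucene itself knows nothing about links or page scores. The
classes below are stock Lucene 2.x; the score value is made up.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new Field("url", "http://example.com/",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        float crawlDbScore = 2.5f;   // hypothetical score carried over from the crawldb
        doc.setBoost(crawlDbScore);  // ranking now reflects link analysis, not just term statistics
        System.out.println("document boost = " + doc.getBoost());
    }
}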



On 5/1/07, hzhong <[EMAIL PROTECTED]> wrote:


Hello,

In Indexer.java,  index(Path indexDir, Path crawlDb, Path linkDb, Path[]
segments), can someone explain to me why crawlDB and linkDB is needed for
indexing?

In Lucene, there's no crawlDB and linkDB for indexing.

Thank you very much

Hanna
--
View this message in context: 
http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625
Sent from the Nutch - User mailing list archive at Nabble.com.





--
"Conscious decisions by conscious minds are what make reality real"


Re: Nutch and running crawls within a container.

2007-04-30 Thread Briggs

I'll look around the code to make sure I am creating only one instance
of Configuration in my classes, and will play around with the
maxpermgen settings.

Any other input from people that have attempted this sort of setup
would be appreciated.

On 4/30/07, Briggs <[EMAIL PROTECTED]> wrote:

Well, in nutch 0.7 it was all due to NGramEntry instances held within
hashmaps that never get cleaned up. This code was in the language
plugin, but it has been moved into the nutch codebase.

That wasn't the only problem, but that was a big one.  I thought
removing it would solve the problem, but then another crept up.

On 4/30/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Briggs wrote:
> > Version:  Nutch 0.9 (but this applies to just about all versions)
> >
> > I'm really in a bind.
> >
> > Is anyone crawling from within a web application, or is everyone
> > running Nutch using the shell scripts provided?  I am trying to write
> > a web application around the Nutch crawling facilities, but it seems
> > that there is are huge memory issues when trying to do this.   The
> > container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
> > on the stack) runs out of memory in less that an hour.  When profiling
> > version 0.7.2 we can see that there is a constant pool of objects that
> > grow, but never get garbage collected.  So, even when the crawl is
> > finished, these objects tend to just hang around forever, until we get
> > the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
> > the application to use Nutch 0.9 and the problem got about 80x worse
>
> Have you analyzed in any level of detail what is causing this memory
> wasting?  Have you tried tweaking jvms XX:MaxPermSize?
>
> I believe that all the classes required by plugins need to be loaded
> multiple times (every time you execute a command where Configuration
> object is created) because of the design of plugin system where every
> plugin has it's own class loader (per configuration).
>
> > So, the current design is/was to have an event happen within the
> > system, which would fire off a crawler (currently just calls
> > org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
> > but grief.  We need to have several crawlers running concurrently. We
>
> You should perhaps use and call the classes directly and take control of
> managing the Configuration object, this way PermGen size is not wasted
> by loading same classes over and over again.
>
> --
>  Sami Siren
>


--
"Conscious decisions by conscious minds are what make reality real"




--
"Conscious decisions by conscious minds are what make reality real"


Re: Nutch and running crawls within a container.

2007-04-30 Thread Briggs

Well, in nutch 0.7 it was all due to NGramEntry instances held within
hashmaps that never get cleaned up. This code was in the language
plugin, but it has been moved into the nutch codebase.

That wasn't the only problem, but it was a big one.  I thought
removing it would solve the problem, but then another crept up.

On 4/30/07, Sami Siren <[EMAIL PROTECTED]> wrote:

Briggs wrote:
> Version:  Nutch 0.9 (but this applies to just about all versions)
>
> I'm really in a bind.
>
> Is anyone crawling from within a web application, or is everyone
> running Nutch using the shell scripts provided?  I am trying to write
> a web application around the Nutch crawling facilities, but it seems
> that there are huge memory issues when trying to do this.   The
> container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
> on the stack) runs out of memory in less than an hour.  When profiling
> version 0.7.2 we can see that there is a constant pool of objects that
> grow, but never get garbage collected.  So, even when the crawl is
> finished, these objects tend to just hang around forever, until we get
> the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
> the application to use Nutch 0.9 and the problem got about 80x worse

Have you analyzed in any level of detail what is causing this memory
wasting?  Have you tried tweaking jvms XX:MaxPermSize?

I believe that all the classes required by plugins need to be loaded
multiple times (every time you execute a command where Configuration
object is created) because of the design of plugin system where every
plugin has it's own class loader (per configuration).

> So, the current design is/was to have an event happen within the
> system, which would fire off a crawler (currently just calls
> org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
> but grief.  We need to have several crawlers running concurrently. We

You should perhaps use and call the classes directly and take control of
managing the Configuration object, this way PermGen size is not wasted
by loading same classes over and over again.

--
 Sami Siren
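
A minimal sketch of Sami's suggestion (assuming the 0.8/0.9
NutchConfiguration helper; the class name here is made up): build the
Configuration once and share it, instead of letting each call to
Crawl.main() construct its own.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Build the Configuration (and therefore the plugin class loaders) exactly once
// per JVM, and hand the same instance to every crawl the application starts.
public class SharedNutchConfig {

  private static final Configuration CONF = NutchConfiguration.create();

  public static Configuration get() {
    return CONF;
  }
}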




--
"Conscious decisions by conscious minds are what make reality real"


Nutch and running crawls within a container.

2007-04-30 Thread Briggs

Version:  Nutch 0.9 (but this applies to just about all versions)

I'm really in a bind.

Is anyone crawling from within a web application, or is everyone
running Nutch using the shell scripts provided?  I am trying to write
a web application around the Nutch crawling facilities, but it seems
that there are huge memory issues when trying to do this.   The
container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
on the stack) runs out of memory in less than an hour.  When profiling
version 0.7.2 we can see that there is a constant pool of objects that
grow, but never get garbage collected.  So, even when the crawl is
finished, these objects tend to just hang around forever, until we get
the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
the application to use Nutch 0.9 and the problem got about 80x worse
(it used to run for about 16 hours, now it runs out of memory in 20
minutes).  We were using 5 concurrent crawlers, meaning we have
Crawl.main() running 5 times within the application.

So, the current design is/was to have an event happen within the
system, which would fire off a crawler (currently just calls
org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
but grief.  We need to have several crawlers running concurrently. We
didn't want large 'batch' jobs.  The requirement is to crawl a domain
as it comes into the system and not wait for days or hours to run the
job.

Has anyone else attempted to run the crawl in this manner?  Have you
run into the same problems?  Does controlling the fetcher and all the
other instances needed for crawling solve this issue?  There is
nothing in the org.apache.nutch.crawl.Crawl instance, from what I had
seen in the past, that would cause such a memory leak.  This must be
way down somewhere else in the code.

Since Nutch handles so much of its threading, could this be causing the problem?

I am not sure if I should x-post this to the dev group or not.

Anyway, thanks.

Briggs



--
"Conscious decisions by conscious minds are what make reality real"


Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs

Well, it looks like the link I sent you goes to the 0.9 version of the
nutch api.  There is a link error on the nutch project site because
the 0.7.2 doc link points to the 0.9 docs.



On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:

Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html

You would then need to create a filter of 'pruned' urls to ignore if
they are discovered again.  This list can get quite large, but I
really don't know how else to do it.  It would be cool if we could
hack the crawldb (or webdb I believe in your version) to include a
flag of 'good/bad' or something.


On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> Isn't this what you are looking for?
>
> org.apache.nutch.tools.PruneIndexTool.
>
>
>
> On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
> >
> > hi Enis,
> > This is franklin ..currently i m using nutch 0.7.2 for my crawling and
> > indexing for my search engine...
> > i read from ur message that u can delete a particular index directly?if so
> > how its possible..i m desperately searching for a clue to do this one...
> > my requirement is to delete the porn site's index from my crawled data...
> > ur help is highly needed
> >
> > expecting u to help me in this regards ..
> >
> > Thanks in advance..
> > Franklin.S
> >
> >
> > ogjunk-nutch wrote:
> > >
> > > Hi Enis,
> > >
> > > Right, I can easily delete the page from the Lucene index, though I'd
> > > prefer to follow the Nutch protocol and avoid messing something up by
> > > touching the index directly.  However, I don't want that page to re-appear
> > > in one of the subsequent fetches.  Well, it won't re-appear, because it
> > > will remain missing, but it would be great to be able to tell Nutch to
> > > "forget it" "from everywhere".  Is that doable?
> > > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > > get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> > > flags a given page as "forget this page as soon as possible" and it just
> > > happens later on.
> > >
> > > Thanks,
> > > Otis
> > >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > >
> > > - Original Message 
> > > From: Enis Soztutar <[EMAIL PROTECTED]>
> > > To: nutch-user@lucene.apache.org
> > > Sent: Thursday, April 5, 2007 3:29:55 AM
> > > Subject: Re: [Nutch-general] Removing pages from index immediately
> > >
> > > Since hadoop's map files are write once, it is not possible to delete
> > > some urls from the crawldb and linkdb. The only thing you can do is to
> > > create the map files once again without the deleted urls. But running
> > > the crawl once more as you suggested seems more appropriate. Deleting
> > > documents from the index is just lucene stuff.
> > >
> > > In your case it seems that every once in a while, you crawl the whole
> > > site, and create the indexes and db's and then just throw the old one
> > > out. And between two crawls you can delete the urls from the index.
> > >
> > > [EMAIL PROTECTED] wrote:
> > >> Hi,
> > >>
> > >> I'd like to be able to immediately remove certain pages from Nutch
> > >> (index, crawldb, linkdb...).
> > >> The scenario is that I'm using Nutch to index a single site or a set of
> > >> internal sites.  Once in a while editors of the site remove a page from
> > >> the site.  When that happens, I want to update at least the index and
> > >> ideally crawldb, linkdb, so that people searching the index don't get the
> > >> missing page in results and end up going there, hitting the 404.
> > >>
> > >> I don't think there is a "direct" way to do this with Nutch, is there?
> > >> If there really is no direct way to do this, I was thinking I'd just put
> > >> the URL of the recently removed page into the first next fetchlist and
> > >> then somehow get Nutch to immediately remove that page/URL once it hits a
> > >> 404.  How does that sound?
> > >>
> > >> Is there a way to configure Nutch to delete the page after it gets a 404
> > >> for it even just once?  I thought I saw the setting for that somewhere

Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs

Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html

You would then need to create a filter of 'pruned' urls to ignore if
they are discovered again.  This list can get quite large, but I
really don't know how else to do it.  It would be cool if we could
hack the crawldb (or webdb I believe in your version) to include a
flag of 'good/bad' or something.
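
Something like this is what I have in mind for the 'pruned urls'
filter (a rough sketch only -- the class name is made up, the pruned
list would be loaded from a file, and the URLFilter interface is as I
remember it in 0.7.x):

import java.util.HashSet;
import java.util.Set;
import org.apache.nutch.net.URLFilter;

// Rejects any url that has already been pruned from the index, so it is
// never fetched or indexed again.
public class PrunedURLFilter implements URLFilter {

  // in a real plugin this would be filled from a "pruned-urls" text file
  private final Set prunedUrls = new HashSet();

  public String filter(String urlString) {
    // returning null tells the crawler to drop the url
    return prunedUrls.contains(urlString) ? null : urlString;
  }
}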


On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:

Isn't this what you are looking for?

org.apache.nutch.tools.PruneIndexTool.



On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
>
> hi Enis,
> This is franklin ..currently i m using nutch 0.7.2 for my crawling and
> indexing for my search engine...
> i read from ur message that u can delete a particular index directly?if so
> how its possible..i m desperately searching for a clue to do this one...
> my requirement is to delete the porn site's index from my crawled data...
> ur help is highly needed
>
> expecting u to help me in this regards ..
>
> Thanks in advance..
> Franklin.S
>
>
> ogjunk-nutch wrote:
> >
> > Hi Enis,
> >
> > Right, I can easily delete the page from the Lucene index, though I'd
> > prefer to follow the Nutch protocol and avoid messing something up by
> > touching the index directly.  However, I don't want that page to re-appear
> > in one of the subsequent fetches.  Well, it won't re-appear, because it
> > will remain missing, but it would be great to be able to tell Nutch to
> > "forget it" "from everywhere".  Is that doable?
> > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> > flags a given page as "forget this page as soon as possible" and it just
> > happens later on.
> >
> > Thanks,
> > Otis
> >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >
> > - Original Message 
> > From: Enis Soztutar <[EMAIL PROTECTED]>
> > To: nutch-user@lucene.apache.org
> > Sent: Thursday, April 5, 2007 3:29:55 AM
> > Subject: Re: [Nutch-general] Removing pages from index immediately
> >
> > Since hadoop's map files are write once, it is not possible to delete
> > some urls from the crawldb and linkdb. The only thing you can do is to
> > create the map files once again without the deleted urls. But running
> > the crawl once more as you suggested seems more appropriate. Deleting
> > documents from the index is just lucene stuff.
> >
> > In your case it seems that every once in a while, you crawl the whole
> > site, and create the indexes and db's and then just throw the old one
> > out. And between two crawls you can delete the urls from the index.
> >
> > [EMAIL PROTECTED] wrote:
> >> Hi,
> >>
> >> I'd like to be able to immediately remove certain pages from Nutch
> >> (index, crawldb, linkdb...).
> >> The scenario is that I'm using Nutch to index a single site or a set of
> >> internal sites.  Once in a while editors of the site remove a page from
> >> the site.  When that happens, I want to update at least the index and
> >> ideally crawldb, linkdb, so that people searching the index don't get the
> >> missing page in results and end up going there, hitting the 404.
> >>
> >> I don't think there is a "direct" way to do this with Nutch, is there?
> >> If there really is no direct way to do this, I was thinking I'd just put
> >> the URL of the recently removed page into the first next fetchlist and
> >> then somehow get Nutch to immediately remove that page/URL once it hits a
> >> 404.  How does that sound?
> >>
> >> Is there a way to configure Nutch to delete the page after it gets a 404
> >> for it even just once?  I thought I saw the setting for that somewhere a
> >> few weeks ago, but now I can't find it.
> >>
> >> Thanks,
> >> Otis
> >>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
> >
>
> --
> View this message in context: 
http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


--
"Conscious decisions by conscious minds are what make reality real"




--
"Conscious decisions by conscious minds are what make reality real"


Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs

Isn't this what you are looking for?

org.apache.nutch.tools.PruneIndexTool.
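
(If the tool doesn't fit your case, the Lucene call behind it is
roughly the following -- just a sketch, with a made-up index path and
url.  Nutch 0.7.2 bundles Lucene 1.4.x, where the method is
IndexReader.delete(Term); later Lucene versions rename it
deleteDocuments(Term).  Also note that the "url" field is tokenized in
Nutch's index, so an exact Term match may not behave as you expect --
you may need to find the document with a query and delete it by
document number instead.)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteFromIndex {
  public static void main(String[] args) throws Exception {
    // open the index that the indexer produced
    IndexReader reader = IndexReader.open("crawl/index");
    // delete every document whose "url" field contains this exact term
    reader.delete(new Term("url", "http://example.com/removed-page.html"));
    reader.close();
  }
}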



On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:


hi Enis,
This is franklin ..currently i m using nutch 0.7.2 for my crawling and
indexing for my search engine...
i read from ur message that u can delete a particular index directly?if so
how its possible..i m desperately searching for a clue to do this one...
my requirement is to delete the porn site's index from my crawled data...
ur help is highly needed

expecting u to help me in this regards ..

Thanks in advance..
Franklin.S


ogjunk-nutch wrote:
>
> Hi Enis,
>
> Right, I can easily delete the page from the Lucene index, though I'd
> prefer to follow the Nutch protocol and avoid messing something up by
> touching the index directly.  However, I don't want that page to re-appear
> in one of the subsequent fetches.  Well, it won't re-appear, because it
> will remain missing, but it would be great to be able to tell Nutch to
> "forget it" "from everywhere".  Is that doable?
> I could read and re-write the *Db Maps, but that's a lot of IO... just to
> get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> flags a given page as "forget this page as soon as possible" and it just
> happens later on.
>
> Thanks,
> Otis
>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> - Original Message 
> From: Enis Soztutar <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, April 5, 2007 3:29:55 AM
> Subject: Re: [Nutch-general] Removing pages from index immediately
>
> Since hadoop's map files are write once, it is not possible to delete
> some urls from the crawldb and linkdb. The only thing you can do is to
> create the map files once again without the deleted urls. But running
> the crawl once more as you suggested seems more appropriate. Deleting
> documents from the index is just lucene stuff.
>
> In your case it seems that every once in a while, you crawl the whole
> site, and create the indexes and db's and then just throw the old one
> out. And between two crawls you can delete the urls from the index.
>
> [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> I'd like to be able to immediately remove certain pages from Nutch
>> (index, crawldb, linkdb...).
>> The scenario is that I'm using Nutch to index a single site or a set of
>> internal sites.  Once in a while editors of the site remove a page from
>> the site.  When that happens, I want to update at least the index and
>> ideally crawldb, linkdb, so that people searching the index don't get the
>> missing page in results and end up going there, hitting the 404.
>>
>> I don't think there is a "direct" way to do this with Nutch, is there?
>> If there really is no direct way to do this, I was thinking I'd just put
>> the URL of the recently removed page into the first next fetchlist and
>> then somehow get Nutch to immediately remove that page/URL once it hits a
>> 404.  How does that sound?
>>
>> Is there a way to configure Nutch to delete the page after it gets a 404
>> for it even just once?  I thought I saw the setting for that somewhere a
>> few weeks ago, but now I can't find it.
>>
>> Thanks,
>> Otis
>>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>>
>>
>>
>
>
>
>
>
>
>

--
View this message in context: 
http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
Sent from the Nutch - User mailing list archive at Nabble.com.





--
"Conscious decisions by conscious minds are what make reality real"


Re: Case Sensitive

2007-04-26 Thread Briggs

I am not 100% sure, but I am 99.99% sure that case does matter.  In
regard to the domain name, I would say no, but anything after it
should be case sensitive.  If not, then there is a bug.
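
If you want a pattern to ignore case, you can spell that out in the
regex itself, e.g. a line like this in crawl-urlfilter.txt (just the
usual character-class trick, so it works no matter which regex library
nutch is built with):

# skip urls containing "test" or "Test"
-[tT]est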

On 4/26/07, karthik085 <[EMAIL PROTECTED]> wrote:


Hi,

Does the URL case sensitivity matter? In my crawl-urlfilter.txt, I want to
'skip special urls'
-Test

Does that mean it will ignore URLs that contain Test or test?

Thanks.

--
View this message in context: 
http://www.nabble.com/Case-Sensitive-tf3654858.html#a10210667
Sent from the Nutch - User mailing list archive at Nabble.com.





--
"Conscious decisions by conscious minds are what make reality real"


Re: Using nutch just for the crawler/fetcher

2007-04-25 Thread Briggs

If you are just looking to have a seed list of domains, and would like
to mirror their content for indexing, why not just use the unix tool
'wget'?  It will mirror the site on your system and then you can just
index that.
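
For example, something along these lines per domain (the paths here
are only placeholders):

wget --mirror --no-parent -P /data/mirrors/example.com http://www.example.com/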




On 4/25/07, John Kleven <[EMAIL PROTECTED]> wrote:

Hello,

I am hoping crawl about 3000 domains using the nutch crawler +
PrefixURLFilter, however, I have no need to actually index the html.
Ideally, I would just like each domain's raw html pages saved into separate
directories.  We already have a parser that converts the HTML into indexes
for our particular application.

Is there a clean way to accomplish this?

My current idea is to create a python script (similar to the one already on
the wiki) that essentially loops through the fetch, update cycles until
depth is reached, and then simply never actually does the real lucene
indexing and merging.  Now, here's the "there must be a better way" part ...
I would then simply execute the "bin/nutch readseg -dump" tool via python to
extract all the html and headers (for each segment) and then, via a regex,
save each html output back into an html file, and store it in a directory
according to the domain it came from.

How stupid/slow is this?  Any better ideas?  I saw someone previously
mentioned something like what I want to do, and someone responded that it
was better to just roll your own crawler or something?  I doubt that for
some reason.  Also, in the future we'd like to take advantage of the
word/pdf downloading/parsing as well.

Thanks for what appears to be a great crawler!

Sincerely,
John




--
"Conscious decisions by conscious minds are what make reality real"


Re: Index

2007-04-24 Thread Briggs

Perhaps someone else can chime in on this.  I am not sure exactly
what you are asking.  The indexing is based on Lucene, so if you need
to understand how the indexing works you will need to look into the
Lucene documentation.   If you are only looking to add custom fields
and such to the index, you could look into the indexing filters of
Nutch.  There are examples on the wiki for that too.



On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote:

Thanks for your help but i think there is a misunderstanding. I was talking
about creating a new index class in java based on specific parameters that i
will defined.

Do you if there is any web page which can give me more information in order
to implement in Java this index ?

E

> On the nutch wiki there is this tutorial:
>
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> There is also (it is for version 0.8, but can still work with 0.9):
>
> http://lucene.apache.org/nutch/tutorial8.html
>
>
> On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote:
>> Hi Guys,
>>
>> I would like to create a new custom index.
>> Do you know if there is any tutorial, document or web page which can
>> help me
>> ?
>>
>> Thanks,
>> E
>>
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>




--
"Conscious decisions by conscious minds are what make reality real"


Re: Index

2007-04-24 Thread Briggs

On the nutch wiki there is this tutorial:

http://wiki.apache.org/nutch/NutchHadoopTutorial

There is also (it is for version 0.8, but can still work with 0.9):

http://lucene.apache.org/nutch/tutorial8.html


On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote:

Hi Guys,

I would like to create a new custom index.
Do you know if there is any tutorial, document or web page which can help me
?

Thanks,
E




--
"Conscious decisions by conscious minds are what make reality real"


Re: How to dump all the valid links which has been crawled?

2007-04-20 Thread Briggs

That one is a bit more complicated because it has to do with
complexities of the underlying scoring algorithm(s).  But, basically,
that means "give me the top 35 links within the crawl db and put them
in the file called 'test'".  Top links are ranked by how many other
links, from other pages/sites, point to them.

Basically, when the crawler crawls, it stores all discovered links
within the db.  If the crawler finds the same link from multiple
resources (other pages), then that link's score goes up.

That is just a simple explanation, but I think it is close enough.

You may want to look more into the OPIC filter and how that algorithm
works, if you really want to get into the nitty-gritty of the code.  You can
see how scoring is calculated by running the nutch example web
application and clicking on the 'explain' link on a result.
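
If I remember the 0.9 readdb options right, you can also inspect the
scores directly, e.g.:

# overall score statistics for the crawldb
bin/nutch readdb crawl/crawldb -stats
# the stored entry (including its score) for a single url
bin/nutch readdb crawl/crawldb -url http://example.com/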




On 4/19/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:

Can you please tell me what is the meaning of this command? what is
the top 35 links? how  nutch rank the top 35 links?

"bin/nutch readdb crawl/crawldb -topN 35 test"

On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote:
> Those links are links that were discovered. It does not mean that they
> were fetched, they weren't.
>
> On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > I think I find out the answer to my previous question by doing this:
> >
> >  bin/nutch readlinkdb crawl/linkdb/ -dump test
> >
> >
> > But my next question is why the result shows URLs with 'gif', 'js', etc,etc
> >
> > I have this line in my craw-urlfilter.txt, so i don't except I will
> > crawl things like images, javascript files,
> >
> > # skip image and other suffixes we can't yet parse
> > 
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$
> >
> >
> > Can you please tell me how to fix my problem?
> >
> > Thank you.
> >
> > On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I read this article about nutch crawling:
> > > http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> > >
> > > How can I dumped out the valid links which has been crawled?
> > > This command described in the article does not work in nutch 0.9. What
> > > should I use instead?
> > >
> > > bin/nutch readdb crawl-tinysite/db -dumplinks
> > >
> > > Thank you for any help.
> > >
> >
>
>
> --
> "Conscious decisions by concious minds are what make reality real"
>




--
"Conscious decisions by concious minds are what make reality real"


Re: How to delete already stored indexed fields???

2007-04-20 Thread Briggs

If you look into the BasicIndexingFilter.java plugin source you will
see that this is where those default fields get indexed.  So, you can
either create a new plugin that is configurable for the properties you
want to index, or remove this plugin.   Here is the snippet of code
that is in the filter:


  if (host != null) {
    // add host as un-stored, indexed and tokenized
    doc.add(new Field("host", host, Field.Store.NO, Field.Index.TOKENIZED));
    // add site as un-stored, indexed and un-tokenized
    doc.add(new Field("site", host, Field.Store.NO, Field.Index.UN_TOKENIZED));
  }

  // url is both stored and indexed, so it's both searchable and returned
  doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));

  // content is indexed, so that it's searchable, but not stored in index
  doc.add(new Field("content", parse.getText(), Field.Store.NO, Field.Index.TOKENIZED));

  // anchors are indexed, so they're searchable, but not stored in index
  try {
    String[] anchors = (inlinks != null ? inlinks.getAnchors() : new String[0]);
    for (int i = 0; i < anchors.length; i++) {
      doc.add(new Field("anchor", anchors[i], Field.Store.NO, Field.Index.TOKENIZED));
    }
  } catch (IOException ioe) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("BasicIndexingFilter: can't get anchors for " + url.toString());
    }
  }
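
If you go the "remove this plugin" route, that just means leaving
index-basic out of plugin.includes in nutch-site.xml and listing your
own filter instead -- roughly like this (the plugin list below is only
an example, not the full default value):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-custom|query-(basic|site|url)</value>
  <description>Example only: index-basic removed, a hypothetical
  index-custom plugin added.</description>
</property>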


On 4/3/07, Ratnesh,V2Solutions India
<[EMAIL PROTECTED]> wrote:


exactly offcourse ,

I want this only, Do you have any solution for this??

looking forwards for your reply

Thnx


Siddharth Jonathan wrote:
>
> Do you mean how do you get rid of some of the fields that are indexed by
> default? eg. content, anchor text etc.
>
> Jonathan
> On 4/2/07, Ratnesh,V2Solutions India
> <[EMAIL PROTECTED]>
> wrote:
>>
>>
>> Hi,
>> I have written a plugin , which finds no. of Object tags in a html and
>> corresponding urls.
>> I am storing "objects" as fields and page url as values.
>>
>> And finally interested in seeing the search realted with "objects"
>> indexed
>> fields not those which is already stored as indexed fields.
>>
>> So how shall I delete those index fields which is already stored
>>
>> Looking forward towards your reply(Valuable
>> inputs).
>>
>> Thnx to Nutch Community
>> --
>> View this message in context:
>> 
http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9786377
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>

--
View this message in context: 
http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9803792
Sent from the Nutch - User mailing list archive at Nabble.com.





--
"Conscious decisions by concious minds are what make reality real"


Re: How to dump all the valid links which has been crawled?

2007-04-19 Thread Briggs

Those links are links that were discovered. It does not mean that they
were fetched, they weren't.

On 4/12/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:

I think I find out the answer to my previous question by doing this:

 bin/nutch readlinkdb crawl/linkdb/ -dump test


But my next question is why the result shows URLs with 'gif', 'js', etc,etc

I have this line in my craw-urlfilter.txt, so i don't except I will
crawl things like images, javascript files,

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$


Can you please tell me how to fix my problem?

Thank you.

On 4/11/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I read this article about nutch crawling:
> http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
>
> How can I dumped out the valid links which has been crawled?
> This command described in the article does not work in nutch 0.9. What
> should I use instead?
>
> bin/nutch readdb crawl-tinysite/db -dumplinks
>
> Thank you for any help.
>




--
"Conscious decisions by concious minds are what make reality real"


Re: Forcing update of some URLs

2007-04-19 Thread Briggs

From what I have gathered, you may want to keep multiple crawldbs for
your crawls.  So, you could have a crawldb for the more frequently
crawled sites and fire off nutch against that db with the appropriate
configs for that job.  I was hoping for a built-in mechanism, but it
looks like we need to write this for ourselves.
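
In practice that just means giving each refresh class its own crawldb
and driving it with its own config, e.g. (0.8/0.9 command syntax, with
made-up directory names):

# sites that need frequent refetching
bin/nutch inject crawl-news/crawldb seeds/news
# everything else
bin/nutch inject crawl-slow/crawldb seeds/rest

and then running generate/fetch/updatedb against each crawldb on its
own schedule.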


On 4/12/07, Arie Karhendana <[EMAIL PROTECTED]> wrote:

Hi all,

I'm a new user of Nutch. I use Nutch primarily to crawl blog and news
sites. But I noticed that Nutch fetches pages only on some refresh
interval (30 days default).

Blog and news sites have unique characteristic that some of their
pages are updated very frequently (e.g. the main page) so they have to
be refetched often, while other pages don't need to be refreshed /
refetched at all (e.g. the news article pages, which eventually will
become 'obsolete').

Is there any way to force update some URLs? Can I just 're-inject' the
URLs to set the next fetch date to 'immediately'?

Thank you,
--
Arie Karhendana




--
"Conscious decisions by concious minds are what make reality real"


Re: Nutch and Crawl Frequency

2007-04-19 Thread Briggs

Cool, cool.  Thanks!

On 4/19/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:

As it is right now... You answered the question yourself :-) ...

Separate db's and the whole ceremony...


> -Original Message-
> From: Briggs [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 19, 2007 10:02 PM
> To: nutch-user@lucene.apache.org
> Subject: Nutch and Crawl Frequency
>
> Nutch 0.9
>
> Anyone know if it is possible to be more granular regarding crawl
> frequency?  Meaning, that I would like some sites to be crawled more
> often than others. Like, a news site should be crawled every day, but
> your average business website should be crawled every 30 days.  So, is
> it possible to specify a crawl frequency for specific urls, or is it
> only global for within the crawl db?  I suppose I could have several
> crawldbs or something like that, and deal with it.. but, just curious.
>
> Thanks
> --
> "Conscious decisions by conscious minds are what make reality real"






--
"Conscious decisions by concious minds are what make reality real"


Nutch and Crawl Frequency

2007-04-19 Thread Briggs

Nutch 0.9

Anyone know if it is possible to be more granular regarding crawl
frequency?  Meaning that I would like some sites to be crawled more
often than others. Like, a news site should be crawled every day, but
your average business website should be crawled every 30 days.  So, is
it possible to specify a crawl frequency for specific urls, or is it
only global for within the crawl db?  I suppose I could have several
crawldbs or something like that, and deal with it.. but, just curious.

Thanks
--
"Conscious decisions by conscious minds are what make reality real"


Re: Classpath and plugins question

2007-04-19 Thread Briggs

I'll add that the PluginRepository is the class that recurses through
your plugins directory, loads each plugin's descriptor file, and then
loads each plugin's dependencies within its own classloader.
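
If you just want to poke at it, the entry point is roughly this (0.8/0.9
API from memory, so treat the exact call as an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

// The repository is cached per Configuration, which is why every new
// Configuration instance rebuilds the plugin class loaders.
public class PluginRepoDemo {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    PluginRepository repo = PluginRepository.get(conf);
    System.out.println(repo);
  }
}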

On 4/19/07, Briggs <[EMAIL PROTECTED]> wrote:

Look into org.apache.nutch.plugin.  The custom plugin classloader and
the resource loader reside in there.

On 4/18/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:
> I'm looking to use the Nutch parsing framework in a separate Lucene project.
> I'd like to be able to use the existing plugins directory structure as-is, so
> wondered Nutch sets up the class loading environment to find all the jar files
> in the plugins directories.
>
> Any pointers to the Nutch class(es) that do the work?
>
> Thanks
> Antony
>
>
>
>


--
"Conscious decisions by concious minds are what make reality real"




--
"Conscious decisions by concious minds are what make reality real"


Re: Classpath and plugins question

2007-04-19 Thread Briggs

Look into org.apache.nutch.plugin.  The custom plugin classloader and
the resource loader reside in there.

On 4/18/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:

I'm looking to use the Nutch parsing framework in a separate Lucene project.
I'd like to be able to use the existing plugins directory structure as-is, so
wondered Nutch sets up the class loading environment to find all the jar files
in the plugins directories.

Any pointers to the Nutch class(es) that do the work?

Thanks
Antony







--
"Conscious decisions by concious minds are what make reality real"


Re: Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs

I am adding more info to my post from what I have been looking into...

So, I have found the LinkDbReader and it seems to be able to dump text
out to a file. But, unfortunately, I then need to parse that file (or
I might have missed something).  So, if this is the correct class,
that will have to work... Here is a snippet of the reader's output
from a page that I crawled on one of my test machines, which has the
Apache documentation installed:


http://httpd.apache.org/        Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server

http://httpd.apache.org/docs-project/   Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Documentation
fromUrl: http://nutchdev-1/manual/ anchor:

http://www.apache.org/  Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Apache

http://www.apache.org/foundation/preFAQ.html    Inlinks:
fromUrl: http://nutchdev-1/ anchor: Apache web server

http://www.apache.org/licenses/LICENSE-2.0  Inlinks:
fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0



So, am I to assume that the format shows outlinks first, then the
Inlinks are where the links were found?  I'll just have to figure out
the format here so I can parse it.  I'll probably write a wrapper that
exports to xml or something to make transformation of this easier.
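
Or, instead of parsing the dump, maybe the reader can be used
in-process -- something like this (0.9 API from memory, so the
constructor and method names are an assumption on my part):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

// Looks up the inlinks recorded in the linkdb for one url, without a text dump.
public class InlinkLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    LinkDbReader reader = new LinkDbReader(conf, new Path("crawl/linkdb"));
    Inlinks inlinks = reader.getInlinks(new Text("http://www.apache.org/"));
    System.out.println(inlinks);
  }
}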

Anyway, am I on the right track?

Briggs.




On 4/18/07, Briggs <[EMAIL PROTECTED]> wrote:

Is it possible to determine from which domain(s) an outlink was
located?  The only way I know how is to limit the crawl to a single
domain (so, I would know where the outlink came from). Also, I am
having difficultly trying to figure out how in 0.9 (probably the same
in 0.8) to easily get the outlinks for my segments.  In nutch 0.7.* we
use to do something like:



segmentReader = createSegmentReader(segment);

final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content   = new Content();
final ParseData indexParseData   = new ParseData();
final ParseText parseText = new ParseText();

while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
extractOutlinksFromParseData(indexParseData, outlinks);
}




private void extractOutlinksFromParseData(final ParseData indexParseData,
        final Set<String> outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}


I am finally making the plunge and attempting to get this thing (my
application) up to date with the latest and greatest!

Thanks for your time!  And once I really get through this code I
promise to start posting answers.

Briggs.

--
"Conscious decisions by conscious minds are what make reality real"




--
"Conscious decisions by concious minds are what make reality real"


Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs

Is it possible to determine from which domain(s) an outlink was
located?  The only way I know how is to limit the crawl to a single
domain (so, I would know where the outlink came from). Also, I am
having difficulty trying to figure out how in 0.9 (probably the same
in 0.8) to easily get the outlinks for my segments.  In nutch 0.7.* we
use to do something like:



segmentReader = createSegmentReader(segment);

final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content   = new Content();
final ParseData indexParseData   = new ParseData();
final ParseText parseText = new ParseText();

while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
   extractOutlinksFromParseData(indexParseData, outlinks);
}




private void extractOutlinksFromParseData(final ParseData indexParseData,
        final Set<String> outlinks) {
    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}


I am finally making the plunge and attempting to get this thing (my
application) up to date with the latest and greatest!

Thanks for your time!  And once I really get through this code I
promise to start posting answers.

Briggs.

--
"Conscious decisions by conscious minds are what make reality real"


Re: Wildly different crawl results depending on environment...

2007-04-02 Thread Briggs

Thanks, I'll look into it. Though, I have never really tried that
level of granularity.  So, I'll have to figure out what you just told
me to do!  Hah.



On 4/2/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:

Briggs wrote:
> nutch 0.7.2
>
> I have 2 scenarios (both using the exact same configurations):
>
> 1) Running the crawl tool from the command line:
>
>./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5
>
> 2) Running the crawl tool from a web app somewhere in code like:
>
>final String[] args = new String[]{
>"-local", "/tmp/urlfile.txt",
>"-dir", "/tmp/somedir",
>"-depth", "5"};
>
>CrawlTool.main(args);
>
>
> When I run the first scenario, I may get thousands of pages, but when
> I run the second scenario my results vary wildly.  I mean, I get
> perhaps 0,1,10+, 100+.  But, I rarely ever get a good crawl from
> within a web application.  So, there are many things that could be
> going wrong here
>
> 1) Is there some sort of parsing issue?  An xml parser, regex,
> timeouts... something?  Not sure.  But, it just won't crawl as well as
> the 'standalone mode'.
>
> 2) Is it a bad idea to use many concurrent CrawlTools, or even reusing
> a crawl tool (more than once) within a instance of a JVM?  It seems to
> have problems doing this. I am thinking there are some static
> references that don't really like handling such use. But this is just
> a wild accusation that I am not sure of.
>
>
>
Checking out the logs might help in this case. From my experience, I
can say that there can be some classloading problems with the crawl
running in a servlet container. I suggest you also try running the
crawl stepwise, by first running inject, generate, fetch, etc.







--
"Concious decisions by concious minds are what make reality real"


Wildly different crawl results depending on environment...

2007-03-31 Thread Briggs

nutch 0.7.2

I have 2 scenarios (both using the exact same configurations):

1) Running the crawl tool from the command line:

   ./bin/nutch crawl -local urlfile.txt -dir /tmp/somedir -depth 5

2) Running the crawl tool from a web app somewhere in code like:

   final String[] args = new String[]{
   "-local", "/tmp/urlfile.txt",
   "-dir", "/tmp/somedir",
   "-depth", "5"};

   CrawlTool.main(args);


When I run the first scenario, I may get thousands of pages, but when
I run the second scenario my results vary wildly.  I mean, I get
perhaps 0,1,10+, 100+.  But, I rarely ever get a good crawl from
within a web application.  So, there are many things that could be
going wrong here

1) Is there some sort of parsing issue?  An xml parser, regex,
timeouts... something?  Not sure.  But, it just won't crawl as well as
the 'standalone mode'.

2) Is it a bad idea to use many concurrent CrawlTools, or even reusing
a crawl tool (more than once) within an instance of a JVM?  It seems to
have problems doing this. I am thinking there are some static
references that don't really like handling such use. But this is just
a wild accusation that I am not sure of.



--
"Conscious decisions by conscious minds are what make reality real"


Re: Logger duplicates entries by the thousands

2007-03-23 Thread Briggs

Status update...
So, I have the logging 'fixed', removed appenders and such. But I can
see that the logging issue was just a result of something else
happening underneath.  The memory consumption of the application still
grows until an OutOfMemoryError (Java heap space) is thrown.

So, still trying to find where that is happening...  It's either Nutch
or ActiveMQ stuff.

Anyway,

Have fun and Cheers!

On 3/23/07, Briggs <[EMAIL PROTECTED]> wrote:

Currently using 0.7.2.

We have a process that runs crawltool from within an application,
perhaps hundreds of times during the course of the day.  The problem I
am seeing is that over time the log statements from my application (I
am using commons logging and Log4j) are also being logged within the
nutch log.  But, the real problem is that over time each log statement
gets repeated by some factor that increases over time/calls.  So,
currently, if I have a debug statement after I call CrawlTool.main(),
I will get 7500 entries in the log for that one statement.  I see a
'memory leak' in the application as this happens, because I
eventually run out of memory (1.5GB).  Has anyone else seen this
problem?  I have to
keep shutting down the app so I can continue.

Any clues?  Does nutch create log appenders in the crawler code, and
is this causing the problem?





--
"Concious decisions by concious minds are what make reality real"




--
"Concious decisions by concious minds are what make reality real"


Logger duplicates entries by the thousands

2007-03-23 Thread Briggs

Currently using 0.7.2.

We have a process that runs crawltool from within an application,
perhaps hundreds of times during the course of the day.  The problem I
am seeing is that over time the log statements from my application (I
am using commons logging and Log4j) are also being logged within the
nutch log.  But, the real problem is that over time each log statement
gets repeated by some factor that increases over time/calls.  So,
currently, if I have a debug statement after I call CrawlTool.main(),
I will get 7500 entries in the log for that one statement.  I see a
'memory leak' in the application as this happens, because I
eventually run out of memory (1.5GB).  Has anyone else seen this
problem?  I have to
keep shutting down the app so I can continue.

Any clues?  Does nutch create log appenders in the crawler code, and
is this causing the problem?





--
"Concious decisions by concious minds are what make reality real"


Re: Plugin ClassLoader issues...

2007-01-31 Thread Briggs

Well, I found this:

http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading

Arrrgh.  Well, it looks like I am going to use JMX to have my plugin
talk to my application.  That way I won't have to have several copies
of my "business" jars around.



On 1/31/07, Briggs <[EMAIL PROTECTED]> wrote:

So, I am having ClassLoader issues with plugins. It seems that the
PluginRepository does some weird class loading (PluginClassLoader)
when it starts up. Does this mean that my plugin will not inherit the
classpath of my web application that it is loaded within?

A simple example is that my webapp contains spring-2.0.jar. But when I
try to call a spring class from within my plugin, I get a
"NoClassDefFound" error.  So

But the real issue is that I need to have my plugins to have access to
some business classes that are deployed within my web application.
How does one go about this in a nice way?

--
"Concious decisions by concious minds are what make reality real"




--
"Concious decisions by concious minds are what make reality real"


Plugin ClassLoader issues...

2007-01-31 Thread Briggs

So, I am having ClassLoader issues with plugins. It seems that the
PluginRepository does some weird class loading (PluginClassLoader)
when it starts up. Does this mean that my plugin will not inherit the
classpath of my web application that it is loaded within?

A simple example is that my webapp contains spring-2.0.jar. But when I
try to call a spring class from within my plugin, I get a
"NoClassDefFound" error.  So

But the real issue is that I need to have my plugins to have access to
some business classes that are deployed within my web application.
How does one go about this in a nice way?

--
"Concious decisions by concious minds are what make reality real"


List Domains and adding Boost Values for Custom Fields

2007-01-31 Thread Briggs

So,

(nutch 0.7.2)

Does anyone know if there is some query in nutch with which I could
somehow return a full list of all unique domains that have been
crawled?  I was originally storing each domain's segment separately,
but that ended up being a nightmare when it came to creating search
beans, since the bean opens up each segment on init. So, I am working
on an incremental segment merge tool to handle the thousands of
segments I have and get em down to a few.

Also... What I really need is a pointer at how to do the following:

I have several custom attributes/fields, say "business" and
"confidential", added to a document when it was indexed.  I want to
assign a boost value to the custom fields and have nutch use those
values when it is searching.  Where might I look to find such a thing?
I do not want to search by those fields, I only want them as part of
nutch's scoring so that if  there are high boost values for those
fields, they will be pushed to the top.

Thanks again!

Briggs




--
"Concious decisions by concious minds are what make reality real"


Re: Merging large sets of segments, help.

2007-01-24 Thread Briggs

Cool, thanks for your responses!

Next time I should probably mention that we are using 0.7.2.  Not
quite sure if we can even think about moving to something 'more
current' as I don't really know the reasons to.




Most of this information is already available on the Nutch Wiki. All I
can say is that there is certainly a limit to what you can do using the
"local" mode - if you need to handle large numbers of pages you will
need to migrate to the distributed setup.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






--
"Concious decisions by concious minds are what make reality real"


Re: Merging large sets of segments, help.

2007-01-24 Thread Briggs

Are you running this in a distributed setup, or in "local" mode? Local
mode is not designed to cope with such large datasets, so it's likely
that you will be getting OOM errors during sorting ... I can only
recommend that you use a distributed setup with several machines, and
adjust RAM consumption with the number of reduce tasks.


Currently we are running in local mode.  We do not have the setup for
distributing. That is why I want to merge these segments.  Would that
not help?  Insteand of having potentially tens of thousands of
segments, I want to create several large segments and index those.

Sorry for my ignorance, but I'm not really sure how to scale nutch
correctly.  Do you know of a document, or some pointers as to how
segment/index data should be stored?



"Concious decisions by concious minds are what make reality real"


Merging large sets of segments, help.

2007-01-24 Thread Briggs

Has anyone written an API that can merge thousands of segments?  The current
segment merge tool cannot handle this much data as there just isn't enough
RAM available on the box. So, I was wondering if there was a better,
incremental way to handle this.

Currently I have 1 segment for each domain that was crawled and I want to
merge them all into several large segments.  So, if anyone has any pointers
I would appreciate it.  Has anyone else attempted to keep segments at this
granularity?  This doesn't seem to work so well.




"Concious decisions by concious minds are what make reality real"