Metadata and feeds with nutch

2010-08-17 Thread Israel
Hello, does anyone know whether nutch supports different feed formats (RSS, Atom) or different metadata schemes (DC, IEEE-LOM)?
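For the feed side, Nutch 1.x ships a feed parsing plugin that is switched on through the plugin.includes property (normally edited in conf/nutch-site.xml); the Java sketch below only illustrates the same setting programmatically. The plugin id "feed" is an assumption to verify against the plugins/ directory of your build, and metadata schemes such as DC or IEEE-LOM would most likely need a custom parse/indexing plugin.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class EnableFeedParsing {
  public static void main(String[] args) {
    // plugin.includes is a regular expression over plugin ids; appending
    // "|feed" activates the RSS/Atom parsing plugin (id assumed, see above).
    Configuration conf = NutchConfiguration.create();
    String includes = conf.get("plugin.includes");
    conf.set("plugin.includes", includes + "|feed");
    System.out.println(conf.get("plugin.includes"));
  }
}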

Metadata and feeds with nutch

2010-08-17 Thread Israel
Hello, does anyone know whether nutch supports the different feed formats (RSS, Atom) or the different metadata schemes (DC, IEEE-LOM)?

"Open" a nutchbean after it's been closed

2010-08-17 Thread Roger Marin
Hello, is it possible to reset or reopen a NutchBean after calling its close() method? I'm trying to avoid creating and closing NutchBeans for every request in my application, and instead keep only one bean per index and update it whenever there is a new index after a recrawl. Thanks.
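As far as I know a closed NutchBean cannot be reopened, so the usual workaround for the one-bean-per-index setup described above is to open a new bean against the updated index and then close the old one. A minimal sketch, assuming the Nutch 1.x NutchBean(Configuration, Path) constructor; draining in-flight searches before closing the old bean is left to the caller.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

/** Holds one live NutchBean and swaps it for a fresh one after a recrawl. */
public class NutchBeanHolder {
  private final Configuration conf = NutchConfiguration.create();
  private final AtomicReference<NutchBean> current = new AtomicReference<NutchBean>();

  public NutchBeanHolder(String crawlDir) throws IOException {
    current.set(new NutchBean(conf, new Path(crawlDir)));
  }

  /** Request handlers call this; they must never close the bean they get. */
  public NutchBean bean() {
    return current.get();
  }

  /** Call this once a recrawl has published a new index directory. */
  public void refresh(String newCrawlDir) throws IOException {
    NutchBean fresh = new NutchBean(conf, new Path(newCrawlDir));
    NutchBean old = current.getAndSet(fresh);
    if (old != null) {
      old.close();   // searches still running on the old bean may fail at this point
    }
  }
}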

Re: performance for small cluster

2010-08-17 Thread AJ Chen
Hi Andrzej, during updatedb the reduce tasks (as seen in the log) take most of the time. There are lots of messages (below) indicating some problem, but I'm not sure what. How can I prevent these slowdowns? 2010-08-17 09:31:54,564 INFO mapred.ReduceTask - attempt_201008141418_0023_r_04_0 Scheduled 1 output

Re: performance for small cluster

2010-08-17 Thread Andrzej Bialecki
On 2010-08-17 23:16, AJ Chen wrote: Scott, thanks again for your insights. My 4 cheap Linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M urls, lots of time is spent in generating the segment (2-3 hours) and update craw

tool for domain stats from crawldb or segments

2010-08-17 Thread AJ Chen
For vertical crawling (e.g. crawling a large number of selected sites), it's important to get quick stats on url structures and fetched page counts per domain and subdomain. Does nutch have tools to help with this? For a large crawldb, the tool should also work fast on the whole crawldb or all se
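I'm not aware of a stock tool that breaks counts down per domain (readdb -stats prints global status counts, as far as I recall), so a common approach is to read the CrawlDb directly. A minimal single-process sketch, assuming the Nutch 1.x on-disk layout crawldb/current/part-*/data; for a CrawlDb with tens of millions of URLs the same counting would be better done as a MapReduce job.

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Prints a rough URL count per host from a CrawlDb. Usage: HostStats <crawldb> */
public class HostStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<String, Integer> counts = new HashMap<String, Integer>();

    for (FileStatus part : fs.listStatus(new Path(args[0], "current"))) {
      if (!part.isDir()) continue;                        // skip stray files
      Path data = new Path(part.getPath(), "data");       // MapFile data file
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        String host = new URL(url.toString()).getHost();  // assumes well-formed URLs
        Integer n = counts.get(host);
        counts.put(host, n == null ? 1 : n + 1);
      }
      reader.close();
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}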

Re: performance for small cluster

2010-08-17 Thread AJ Chen
Scott, thanks again for your insights. My 4 cheap Linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M urls, a lot of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment).

Re: Indexing Tika xmpDM properties

2010-08-17 Thread André Ricardo
Hello Julien, Thank you for your help; using an IndexingFilter I am now indexing the Tika properties :) But now I can't get Nutch's search.jsp to query the indexed fields with something like "album:dirty". I've followed both methods to search data in http://wiki.apache.org/nutch/HowToMakeCustomSearch#Now.2C_how_do_I

how to get a map from nutch crawled result?

2010-08-17 Thread Alex Luya
Hello: I want to use a web extractor (webharvest) to extract content from HTML pages based on a map, so first: 1. How can I read the link graph database out? 2. How can I convert the result of a nutch crawl back to HTML? 3. How can I link them together to construct the map?
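The link graph can be dumped with bin/nutch readlinkdb <linkdb> -dump and a segment with bin/nutch readseg -dump, but those produce text rather than the original HTML. Below is a hedged sketch of reading a segment's content directory directly and writing the fetched bytes back out as files, assuming the Nutch 1.x layout <segment>/content/part-*/data and the org.apache.nutch.protocol.Content class; the output file names are arbitrary.

import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

/** Writes every fetched page in a segment back out as a local .html file. */
public class DumpSegmentHtml {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int i = 0;
    // args[0] = a segment directory, e.g. crawl/segments/20100817123456
    for (FileStatus part : fs.listStatus(new Path(args[0], "content"))) {
      if (!part.isDir()) continue;
      Path data = new Path(part.getPath(), "data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        String name = "page-" + (i++) + ".html";
        FileOutputStream out = new FileOutputStream(name);
        out.write(content.getContent());   // the bytes exactly as fetched
        out.close();
        System.out.println(url + "\t" + name);
      }
      reader.close();
    }
  }
}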

Re: Not getting all documents

2010-08-17 Thread Markus Jelsma
Check logs/hadoop.log for connection timeout errors. On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote: > There are 128 entries in url/nutch formatted like so: > http://server.example.com/docs/DF-09/ > http://server.example.com/docs/DF-10/ > http://server.example.com/docs/EG-02/ > http://server

Re: Not getting all documents

2010-08-17 Thread Bill Arduino
There are 128 entries in url/nutch formatted like so: http://server.example.com/docs/DF-09/ http://server.example.com/docs/DF-10/ http://server.example.com/docs/EG-02/ http://server.example.com/docs/EG-03/ http://server.example.com/docs/EG-04/ There are 428 directories in http://server.example.com/d

Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
On Tuesday 17 August 2010 13:47:32 Jeroen van Vianen wrote: > > Yes. I have lots of similar results because of these URLs occurring many > times for the same original URL. You can use deduplication [1]. It generates signatures for (near) exact content depending on configuration. It can then opt
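If I remember correctly, the signature the deduplication keys on is chosen by the db.signature.class property, normally set in conf/nutch-site.xml: MD5Signature hashes the raw content (exact duplicates only) while TextProfileSignature profiles the extracted text (near duplicates). The snippet below only shows the same choice made programmatically.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class SignatureChoice {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // Exact duplicates: hash of the raw page bytes.
    // conf.set("db.signature.class", "org.apache.nutch.crawl.MD5Signature");

    // Near duplicates: profile of the extracted text, so pages differing only
    // in markup or boilerplate tend to get the same signature.
    conf.set("db.signature.class", "org.apache.nutch.crawl.TextProfileSignature");
    System.out.println(conf.get("db.signature.class"));
  }
}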

Re: Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
On 17-8-2010 13:35, Markus Jelsma wrote: I assume it's about your Solr index again (for which you should mail the Solr mailing list). It features deleteById and deleteByQuery methods, but in your case it's going to be rather hard. Your URL field is, using the stock schema, analyzed and has a tok

Re: Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
On 17-8-2010 13:35, Alex McLintock wrote: I happen to have accumulated a lot of URLs in my index with the following layout: http://www.company.com/directory1;if(T.getElementsByClassName( http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case Hmmm, This may be thinkin

Re: Removing URLs from index

2010-08-17 Thread Alex McLintock
On 17 August 2010 12:04, Jeroen van Vianen wrote: > Hi, > > I happen to have accumulated a lot of URLs in my index with the following > layout: > > http://www.company.com/directory1;if(T.getElementsByClassName( > http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case Hmm

Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
Hi, I assume it's about your Solr index again (for which you should mail the Solr mailing list). It features deleteById and deleteByQuery methods, but in your case it's going to be rather hard. Your URL field is, using the stock schema, analyzed and has a tokenizer that strips characters such
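A minimal SolrJ sketch of the two delete paths mentioned above, using the 2010-era CommonsHttpSolrServer client; the Solr URL, the example id and the url: query are placeholders, and it assumes the stock Nutch schema where the unique key is the page URL. Whether deleteByQuery can hit the broken URLs at all depends on how the url field is tokenized, which is exactly the difficulty Markus points out.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

/** Removes bogus JavaScript-suffixed URLs from a Solr index fed by Nutch. */
public class DeleteBadUrls {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // deleteById matches the stored unique key exactly, so field analysis
    // does not get in the way; paste the offending URL verbatim.
    solr.deleteById("http://www.company.com/directory1;if(T.getElementsByClassName(");

    // deleteByQuery goes through the query analyzer of the url field, so the
    // tokenizer decides what can be matched; test the query in the admin UI first.
    solr.deleteByQuery("url:getElementsByClassName");

    solr.commit();
  }
}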

Re: Querying case-sensitive fields

2010-08-17 Thread Markus Jelsma
This makes little sense. Can you post the output of both the query and index analyzers from analysis.jsp? Set the field type to text (or the field name to title) and both the index and query values to Jobs. No need for verbose output. It should output the following: Index Analyzer Jobs Jobs Jobs jobs job job

Removing URLs from index

2010-08-17 Thread Jeroen van Vianen
Hi, I happen to have accumulated a lot of URLs in my index with the following layout: http://www.company.com/directory1;if(T.getElementsByClassName( http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case There seem to be errors in the discovery of links from one page

Re: Querying case-sensitive fields

2010-08-17 Thread Jeroen van Vianen
On 17-8-2010 11:41, Markus Jelsma wrote: We would need to see the declaration of the field type of your title field. However, if you use the shipped schema.xml it doesn't make any sense, because it declares a field type that lowercases at both query time and index time. You should use the analy

Re: Querying case-sensitive fields

2010-08-17 Thread Markus Jelsma
Hi, we would need to see the declaration of the field type of your title field. However, if you use the shipped schema.xml it doesn't make any sense, because it declares a field type that lowercases at both query time and index time. You should use the analysis.jsp in your Solr admin. Then,

Querying case-sensitive fields

2010-08-17 Thread Jeroen van Vianen
Hi, I have an index with the following solr mapping: id I now noticed that if I

Re: Not getting all documents

2010-08-17 Thread Markus Jelsma
Well, the CrawlDB tells us you only got ~9000 URLs in total. Perhaps the seeding didn't go too well? Make sure that all your Apache directory listings are injected into the CrawlDB. If you then generate, fetch, parse and update the DB, you should have all URLs in your DB. How many directory l
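The cycle described here (inject, generate, fetch, parse, updatedb) is normally run through the bin/nutch commands; the sketch below drives the same Nutch 1.x tools from Java via ToolRunner, assuming those classes implement Hadoop's Tool interface in your version. Treat it as an outline of one crawl round rather than a drop-in driver.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

/** One inject/generate/fetch/parse/updatedb round driven from Java. */
public class OneCrawlRound {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String crawldb = "crawl/crawldb", segments = "crawl/segments", seeds = "urls";

    ToolRunner.run(conf, new Injector(), new String[] { crawldb, seeds });
    ToolRunner.run(conf, new Generator(), new String[] { crawldb, segments });

    // The generator creates a timestamped directory under segments/; pick the newest.
    FileSystem fs = FileSystem.get(conf);
    String segment = null;
    for (FileStatus s : fs.listStatus(new Path(segments))) {
      String p = s.getPath().toString();
      if (segment == null || p.compareTo(segment) > 0) segment = p;
    }

    ToolRunner.run(conf, new Fetcher(), new String[] { segment });
    ToolRunner.run(conf, new ParseSegment(), new String[] { segment });
    ToolRunner.run(conf, new CrawlDb(), new String[] { crawldb, segment });
  }
}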