Hello, does anyone know if Nutch supports different feed formats (RSS, Atom) or
different metadata schemes (DC, IEEE-LOM)?
Hello,
Is it possible to reset or reopen a NutchBean after calling its close()
method? I'm trying to avoid creating and closing NutchBeans for every request
in my application by having only one bean per index and updating it
whenever there's a new index available after a recrawl.
Thanks.
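As far as I know a NutchBean cannot be reopened once close() has been called, so the usual
approach is to keep one live bean per index and swap in a fresh instance when a recrawl
publishes a new index, closing the old one afterwards. A minimal sketch, assuming Nutch 1.x
defaults (the holder class and its method names are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

/** Hypothetical holder: one long-lived NutchBean, replaced after a recrawl. */
public class SearcherHolder {
  private volatile NutchBean bean;

  public SearcherHolder() throws IOException {
    Configuration conf = NutchConfiguration.create();
    bean = new NutchBean(conf);   // reads the index under searcher.dir
  }

  /** Request handlers call this instead of constructing a bean per request. */
  public NutchBean get() {
    return bean;
  }

  /** Call once a recrawl has published a new index under searcher.dir. */
  public synchronized void refresh() throws IOException {
    NutchBean fresh = new NutchBean(NutchConfiguration.create());
    NutchBean old = bean;
    bean = fresh;
    if (old != null) {
      old.close();                // simplified: does not wait for in-flight searches
    }
  }
}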
Hi Andrzej,
During updatedb, the reduce tasks (as seen in the log) take most of the time. There
are lots of messages (below) that seem to indicate a problem, but I'm not sure.
How can I prevent these slowdowns?
2010-08-17 09:31:54,564 INFO mapred.ReduceTask -
attempt_201008141418_0023_r_04_0 Scheduled 1 output
On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap Linux boxes are now
crawling selected sites at about 1M pages per day. The fetch itself is
reasonably fast. But when the crawl db has >10M URLs, a lot of time is spent
generating segments (2-3 hours) and updating the crawldb
For vertical crawling (e.g. crawling a large number of selected sites), it's
important to get quick stats on URL structures and fetched page counts per
domain and subdomain. Does Nutch have tools to help with this? For a large
crawldb, the tool should also work fast on the whole crawldb or all
segments.
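bin/nutch readdb <crawldb> -stats gives overall counts; for a per-host or per-domain
breakdown, a small scan or MapReduce job over the crawldb is one option. A rough sketch,
assuming the usual layout of <Text url, CrawlDatum> SequenceFiles under current/part-*/data
(for a crawldb with >10M URLs you would want this as a MapReduce job rather than a
single-process scan):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Rough sketch: count crawldb entries per host. Usage: HostStats <crawldb> */
public class HostStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<String, Integer> perHost = new HashMap<String, Integer>();
    for (FileStatus part : fs.globStatus(new Path(args[0], "current/part-*/data"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        // restrict to datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED
        // if only fetched pages should be counted
        String host = new URL(url.toString()).getHost();
        Integer n = perHost.get(host);
        perHost.put(host, n == null ? 1 : n + 1);
      }
      reader.close();
    }
    for (Map.Entry<String, Integer> e : perHost.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}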
Scott, thanks again for your insights. My 4 cheap Linux boxes are now
crawling selected sites at about 1M pages per day. The fetch itself is
reasonably fast. But when the crawl db has >10M URLs, a lot of time is spent
generating segments (2-3 hours) and updating the crawldb (4-5 hours after each
segment).
Hello Julien,
Thank you for your help; using an IndexingFilter I am now indexing the Tika
properties :)
But now I can't get Nutch's search.jsp to query the indexed fields with
something like "album:dirty". I've followed both methods for searching data in
http://wiki.apache.org/nutch/HowToMakeCustomSearch#Now.2C_how_do_I
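For search.jsp to understand "album:dirty" an indexing filter alone is not enough; a query
plugin for that field is also needed, which is what the wiki page describes. A minimal sketch
following the pattern of the query-site plugin (the package and class name are made up; the
plugin also needs a plugin.xml declaring the org.apache.nutch.searcher.QueryFilter extension
point with fields="album", and must be listed in plugin.includes):

package org.example.nutch.searcher;            // hypothetical package

import org.apache.nutch.searcher.FieldQueryFilter;

/** Hypothetical query filter: translates "album:dirty" into a clause
 *  on the "album" field written by the indexing filter. */
public class AlbumQueryFilter extends FieldQueryFilter {
  public AlbumQueryFilter() {
    super("album");    // field name; a second argument could set a boost
  }
}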
Hello:
I want to use a web extractor (WebHarvest) to extract content from HTML
pages based on a map. So first:
1. How can I read the link graph database out?
2. How do I convert the results of a Nutch crawl back to HTML?
3. How do I link them to construct the map?
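For 1, the link graph lives in the linkdb: bin/nutch readlinkdb <linkdb> -dump <output> writes
it out as text, or it can be read directly as <Text url, Inlinks> SequenceFiles. For 2, the raw
fetched pages are stored in each segment's content directory and can be dumped with
bin/nutch readseg -dump <segment> <output>. A rough sketch of reading the linkdb
programmatically, assuming the standard current/part-*/data layout:

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;

/** Rough sketch: print "source -> target" edges from a Nutch linkdb. */
public class LinkDbEdges {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus part : fs.globStatus(new Path(args[0], "current/part-*/data"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      Text url = new Text();
      Inlinks inlinks = new Inlinks();
      while (reader.next(url, inlinks)) {
        for (Iterator<Inlink> it = inlinks.iterator(); it.hasNext();) {
          System.out.println(it.next().getFromUrl() + " -> " + url);
        }
      }
      reader.close();
    }
  }
}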
Check logs/hadoop.log for connection timeout errors.
On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote:
> There are 128 entries in url/nutch formatted like so:
> http://server.example.com/docs/DF-09/
> http://server.example.com/docs/DF-10/
> http://server.example.com/docs/EG-02/
> http://server
There are 128 entries in url/nutch formatted like so:
http://server.example.com/docs/DF-09/
http://server.example.com/docs/DF-10/
http://server.example.com/docs/EG-02/
http://server.example.com/docs/EG-03/
http://server.example.com/docs/EG-04/
There are 428 directories in http://server.example.com/d
On Tuesday 17 August 2010 13:47:32 Jeroen van Vianen wrote:
>
> Yes. I have lots of similar results because of these URLs occurring many
> times for the same original URL.
You can use deduplication [1]. It generates signatures for (near) exact
content depending on configuration. It can then opt
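Which signature is used is controlled by the db.signature.class property: MD5Signature
collapses only byte-identical content, while TextProfileSignature is intended for
near-duplicates. A small illustration of what the MD5 variant boils down to, using plain
Hadoop classes rather than the actual Nutch Signature API:

import org.apache.hadoop.io.MD5Hash;

/** Illustration only: an MD5 over the raw bytes, which is essentially what
 *  MD5Signature produces, so only byte-identical pages share a signature. */
public class SignatureDemo {
  public static void main(String[] args) {
    byte[] pageA = "<html><body>Hello world</body></html>".getBytes();
    byte[] pageB = "<html><body>Hello world!</body></html>".getBytes();
    System.out.println("A: " + MD5Hash.digest(pageA));
    System.out.println("B: " + MD5Hash.digest(pageB));  // one extra byte, different signature
  }
}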
On 17-8-2010 13:35, Markus Jelsma wrote:
I assume it's about your Solr index again (for which you should mail the
Solr mailing list). It features deleteById and deleteByQuery methods, but in
your case it's going to be rather hard. Your URL field is, using the stock
schema, analyzed and has a tokenizer
On 17-8-2010 13:35, Alex McLintock wrote:
I happen to have accumulated a lot of URLs in my index with the following
layout:
http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
Hmmm,
This may be thinkin
On 17 August 2010 12:04, Jeroen van Vianen wrote:
> Hi,
>
> I happen to have accumulated a lot of URLs in my index with the following
> layout:
>
> http://www.company.com/directory1;if(T.getElementsByClassName(
> http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
Hmm
Hi,
I assume it's about your Solr index again (for which you should mail the
Solr mailing list). It features deleteById and deleteByQuery methods, but in
your case it's going to be rather hard. Your URL field is, using the stock
schema, analyzed and has a tokenizer that strips characters such
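If the analyzed url field makes deleteByQuery unreliable, one workaround with SolrJ
(Solr 1.4 era) is to query for candidates, check the stored values, and delete by id. A hedged
sketch, assuming the default Nutch mapping where the document id is the URL; the server URL
and the query term are guesses and should be verified (e.g. in analysis.jsp) first:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

/** Sketch: remove the junk "...;if(T.getElementsByClassName(" documents by id. */
public class DeleteJunkUrls {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");  // assumed URL
    SolrQuery q = new SolrQuery("url:getElementsByClassName");    // guessed query term
    q.setRows(1000);
    for (SolrDocument doc : server.query(q).getResults()) {
      String id = (String) doc.getFieldValue("id");
      if (id != null && id.contains(";")) {   // the broken URLs carry JavaScript after ';'
        server.deleteById(id);
      }
    }
    server.commit();
  }
}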
This makes little sense. Can you post the output of both the query and index
analyzers from analysis.jsp? Set the field type to "text" (or the field name to "title") and
set both the index and query values to "Jobs". No need for verbose output. It should output
the following:
Index Analyzer
Jobs
Jobs
Jobs
jobs
job
job
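A rough way to see the same behaviour outside Solr is with plain Lucene classes mimicking the
lowercase-then-stem part of the stock "text" type; the real chain is whatever schema.xml
declares, so analysis.jsp stays the authoritative check:

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Mimics lowercasing + stemming: "Jobs" should come out as "job". */
public class TitleAnalysisDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new PorterStemFilter(
        new LowerCaseTokenizer(new StringReader("Jobs")));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());   // prints "job"
    }
    ts.close();
  }
}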
Hi,
I happen to have accumulated a lot of URLs in my index with the
following layout:
http://www.company.com/directory1;if(T.getElementsByClassName(
http://www.company.com/directory2;this.bottomContainer.appendChild(u);break;case
There seem to be errors in the discovery of links from one page
On 17-8-2010 11:41, Markus Jelsma wrote:
Well, we would need to see the declaration of the field type of your title field.
However, if you use the shipped schema.xml it doesn't make any sense, because
it declares a field type that lowercases at both query time and index time.
You should use the analysis.jsp in your Solr admin.
Hi,
Well, we would need to see the declaration of the field type of your title field.
However, if you use the shipped schema.xml it doesn't make any sense, because
it declares a field type that lowercases at both query time and index time.
You should use the analysis.jsp in your Solr admin.
Then,
Hi,
I have an index with the following solr mapping:
id
I now noticed that if I
Well, the CrawlDB tells us you've only got ~9000 URLs in total. Perhaps the
seeding didn't go too well? Make sure that all your Apache directory listings
are injected into the CrawlDB. If you then generate, fetch, parse and update
the DB, you should have all URLs in your DB.
How many directory l