-Original message-
From: Brian Tingle brian.tin...@ucop.edu
Sent: Tue 25-05-2010 20:47
To: user@nutch.apache.org; Markus Jelsma markus.jel...@buyways.nl;
Subject: RE: Solr integration in nutch-1.1dev
Update the solr schema.xml so that it allows multiple values for that field?
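For example, a minimal schema.xml sketch (field name and type are hypothetical;
adjust to the field named in the error):
<field name="somefield" type="string" stored="true" indexed="true" multiValued="true"/>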
Confirmed! It was the old schema.xml file. Next time i'd better check for
differences :)
On Tuesday 25 May 2010 21:38:45 Markus Jelsma wrote:
Hi Brian,
Again, thanks for the help. I have looked up the schema file from the trunk
and 1.0 tag using web svn. It seems you are right
Hi list,
Fields created by the subcollection plugin end up with a space prefixed in my
Solr index, but the name and id fields in my subcollection.xml don't have that
same space prefixed. I checked it three times just to be certain I didn't mess
up the configuration. I am unsure where the
Hi,
Wonderful, I'll check it out! But I could only find it through your
announcement. It cannot be found at the old Lucene release URL - probably
because Nutch is a TLP now - but I cannot find it on the Nutch download page
either [1]; that's a 404!
[1]: http://nutch.apache.org/release/
Well, where is it now? The parse-plugins.xml still refers to it, but it's not
present in the plugins/ directory.
Hi,
I sent my first update command to Solr with 1.1 and an earlier problem persists:
SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values
encountered for non multiValued copy field id:
http://HOST/index.php/2009/December/30/
Well, I didn't load Nutch's shipped
Hi,
I'm wondering why this value never exceeds 500. While watching the fetch log, I
cannot determine the number of remaining fetches because, as long as there are
more than 500 due, the threads just wiggle between 490 and 500.
Is there a way to configure this? I haven't found a setting
Hi,
This is what you're looking for:
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
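With this property in nutch-site.xml, the generator puts at most 100 URLs per
host into each fetch list. A hedged usage sketch (paths hypothetical):
# the cap is applied while generating the fetch list
bin/nutch generate crawl/crawldb crawl/segments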
Cheers
-Original message-
From: brad b...@bcs-mail.net
Sent: Thu 08-07-2010 02:24
To: user@nutch.apache.org;
Subject: Host or domain www.abc123.com has more
crawl/crawldb $SEGMENT -filter -normalize
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Jeroen
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
is, how do I remove these documents from the index?
Regards,
Jeroen
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
then optionally overwrite (delete)
duplicates.
[1]: http://wiki.apache.org/solr/Deduplication
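For reference, a hedged solrconfig.xml sketch along the lines of [1] (the
signature field must exist in your schema; field names here are examples):
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- overwriteDupes=true deletes older documents with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>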
Thanks and best regards,
Jeroen
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
and just taking whatever it gets before moving on.
Maybe I should increase my wait times...
On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma
markus.jel...@buyways.nlwrote:
Well, the CrawlDB tells us you only got ~9000 URL's in total. Perhaps the
seeding didn't go too well? Make sure
In small crawls, you could parse the document right away. For large crawls,
however, there may not be enough resources to fetch and parse at the same time.
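A hedged sketch of the split approach (segment name hypothetical; assumes
fetcher.parse is set to false in nutch-site.xml):
# fetch first, parse as a separate job afterwards
bin/nutch fetch crawl/segments/20100902123456
bin/nutch parse crawl/segments/20100902123456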
-Original message-
From: Nayanish Hinge nayanish.hi...@gmail.com
Sent: Thu 02-09-2010 07:39
To: user@nutch.apache.org;
Subject: Why
the crawling right from where we left off?
I mean, starting with only the unfetched urls.
Thanks
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
https://issues.apache.org/jira/browse/NUTCH-716
Cheers
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Thanks. I would prefer that you don't reopen an already resolved issue. Just
file a new issue and link it back to this one.
Thanks for the heads up!
Cheers,
Chris
On 9/6/10 4:57 AM, Markus Jelsma markus.jel...@buyways.nl wrote:
Hi,
It seems the NUTCH-716 [1] patch does not really produce a multi
the warning. Should I create
a new ticket? At least I couldn't find a corresponding issue as of yet.
Cheers,
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
to find out the url which is
causing the problem so that we can reproduce the issue. Could be another
case of a file trimmed to the max size allowed during the fetching which
puts the parser in trouble. We'll see.
Best,
Julien
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
of a file trimmed to the max size allowed during the
fetching which puts the parser in trouble. We'll see.
Best,
Julien
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Issue has been solved in the current branch-1.2. I was using an outdated
nightly build.
Thanks!
On Monday 06 September 2010 20:00:52 Mattmann, Chris A (388J) wrote:
Thanks!
On 9/6/10 9:44 AM, Markus Jelsma markus.jel...@buyways.nl wrote:
Done
https://issues.apache.org/jira/browse
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
the
tokenization myself and I don't want my index to be polluted with this
non-information =)
Does anyone know how to configure the index-more plug-in? The wiki isn't very
helpful.
Cheers,
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620
) improvement? Just adding the option to
disable the split? Or also adding an option that spits out up to three distinct
fields?
Cheers
M.
Cheers,
Chris
On 9/8/10 2:27 AM, Markus Jelsma markus.jel...@buyways.nl wrote:
Hi,
I'm testing the index-more plug-in but, to my surprise
90089 USA
++
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
for unclear
reasons. Madness!
Can anyone try to explain what's really going on and why so many users suffer
from this issue?
FYI: I'm still running Nutch locally. A Hadoop cluster isn't set up yet.
Cheers,
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
And now there's also a PDF giving this kind of trouble:
http://gemeente.groningen.nl/assets/pdf/adviesrapport-wijkkranten-15012007.pdf
-Original message-
From: Markus Jelsma markus.jel...@buyways.nl
Sent: Thu 09-09-2010 18:06
To: user@nutch.apache.org;
Subject: multiple values
more than one value.
For example the creative commons field has a lot of values for the same
document (by, nc, us, etc...)
I have
<field name="cc" multiValued="true" type="string" stored="true" indexed="true"/>
Hope this helps,
André Ricardo
On Thu, Sep 9, 2010 at 5:28 PM, Markus Jelsma markus.jel
processes' tmp data.
So, don't run multiple jobs on the local machine using the same hadoop.tmp.dir
setting.
Cheers,
-Original message-
From: Markus Jelsma markus.jel...@buyways.nl
Sent: Fri 10-09-2010 15:52
To: user@nutch.apache.org;
Subject: RE: Input path does not exist revisited
--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
This doesn't look good at all. Anyone got a suggestion or some pointer?
-Original message-
From: Markus Jelsma markus.jel...@buyways.nl
Sent: Wed 22-09-2010 12:12
To: user@nutch.apache.org;
Subject: Funky duplicate url's
Hi
To: user@nutch.apache.org;
Subject: Re: Funky duplicate url's
the conf/regex-urlfilter.txt file has an exclusion rule that should skip
these viral urls.
# skip URLs with slash-delimited segment that repeats 3+ times, to break
loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
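For instance, this rule would exclude a hypothetical looping URL such as
http://www.example.com/foo/a/foo/b/foo/c/, because the backreference \1 matches
the same path segment (/foo) recurring three times.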
-aj
On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma
* doing there? It shouldn't.
Thanks for all your help
Raj
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent: Thursday, September 23, 2010 4:52 PM
To: user@nutch.apache.org
Subject: RE: Duplicate URLs
bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>
Is there something else I need to do? Some change to the Solr or Tomcat
config I have missed.
Config:
Nutch Release 1.2 - 08/07/2010
CentOS Linux 5.5
Linux 2.6.18-194.3.1.el5 on x86_64
Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
8gb of ram
Thanks
Brad
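As an aside, the command expects the Solr URL as its argument; a hedged sketch
(URL hypothetical):
bin/nutch solrdedup http://localhost:8983/solr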
Markus Jelsma - Technisch Architect - Buyways BV
not as thorough as the regular dedup process (URL, Content, highest
score, shortest URL), but I think it will work.
Brad
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent: Friday, September 24, 2010 5:27 AM
To: user@nutch.apache.org
Subject: Re: Nutch 1.2
by an exact hashing
algorithm such as MD5, it won't allow you to use the TextProfileSignature
algorithm in Solr for fuzzy matching.
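As an aside, Nutch itself can also compute a fuzzy signature instead of MD5; a
hedged nutch-site.xml sketch (verify the class name against your version):
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>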
-Original message-
From: Nemani, Raj raj.nem...@turner.com
Sent: Fri 24-09-2010 23:18
To: user@nutch.apache.org; Markus Jelsma markus.jel...@buyways.nl
to
try running Nutch on a Hadoop cluster (which I don't have), or try to let
Hadoop take advantage of my multiple cores?
Cheers,
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
? It used to be:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Thanks
Dennis
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
very slow) it becomes a
rocket...
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
, will it then still make use of
multiple cores?
Cheers,
On Tuesday 28 September 2010 14:20:02 Andrzej Bialecki wrote:
On 2010-09-28 14:02, Markus Jelsma wrote:
Hi,
My test setup (only local) now has just over 20 million URL's; I fetched
3M already and the rest needs to be fetched. It's
I see, complex indeed. I'll manage for now. Thanks for your answer.
On Tuesday 28 September 2010 14:18:06 Andrzej Bialecki wrote:
On 2010-09-28 13:55, Markus Jelsma wrote:
Thanks. Could we modify the code so it will only output the info before
the tasks are initialized? If so, how to proceed
understand. How do I update your DB's? What should I
do about crawl-urlfilter.txt? Thanks
Dennis
--- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote:
From: Markus Jelsma markus.jel...@buyways.nl
Subject: Re: crawl www
To: user@nutch.apache.org
Date: Tuesday, September
Sorry for interrupting, Markus,
But I don't quite understand. How do I update your DB's? What should I
do about crawl-urlfilter.txt? Thanks
Dennis
--- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote:
From: Markus Jelsma markus.jel...@buyways.nl
Subject: Re: crawl
September 2010 15:51:11 Dennis wrote:
Thanks, Markus,
Another question, the script will stop, right? I mean, I am not going to
crawl for 100 days; I need it to finish its job. Dennis
--- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote:
From: Markus Jelsma markus.jel
Thanks for your comments. I'll consult this thread later when I've got the
time to test the distributed mode, and possibly set up HDFS immediately as I'm
going to need it anyway.
On Tuesday 28 September 2010 16:26:48 Andrzej Bialecki wrote:
On 2010-09-28 14:27, Markus Jelsma wrote:
Thanks
http://digitalpebble.blogspot.com/2010/09/similarpages-is-out.html and of
course http://www.similarpages.com/ itself.
Best,
Julien Nioche
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
/overijssel
fetching
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
-Original message-
From: Markus Jelsma markus.jel...@buyways.nl
Sent: Wed 22-09-2010 20:47
To: user@nutch.apache.org;
Subject: RE: Re: Funky duplicate url's
Thanks! I've
troubles with honoring the page's BASE tag when
resolving relative outlinks.
However, I don't see this BASE tag being used in the HTML pages you provided,
so that might not be it.
Mathijs
On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
Anyone? Is there a proper solution for this issue
that going through a single huge file
J.
On 29 September 2010 10:11, Markus Jelsma markus.jel...@buyways.nl
wrote:
Yes, but I need a little more testing. Does anyone know how I can test only
that
class? I currently use ant -v test -l logfile and need to dig through
It seems you're trying to fetch 0 url's. Inject correct url's or adjust your
url filters so as not to filter out your injected url's.
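A hedged sketch of (re)injecting seeds (paths hypothetical; urls/ holds plain
text files with one URL per line):
bin/nutch inject crawl/crawldb urls/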
-Original message-
From: Yavuz Selim YILMAZ yvzslmyilm...@gmail.com
Sent: Tue 05-10-2010 13:16
To: user user@nutch.apache.org;
Subject: Nutch-Eclipse
I
Hi,
I've finally fetched the latest trunk, added Gora as described in
NUTCH-873, but I'm getting the following exception:
Exception in thread "main" java.lang.ClassNotFoundException:
org.gora.sql.store.SqlStore
It can't find the class configured in storage.data.store.class. Is it
perhaps the
Storing content will take up about as much disk space as the content
you are fetching. If you don't store, there is nothing to parse.
On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977
webdev1...@gmail.com wrote:
Could someone please clarify the relationship between these two
properties?
I
I suppose you would create a URL filter. It, as I understand it, filters
URL's that are about to enter the CrawlDB (during UpdateDB) as well as those
read from the CrawlDB (by the generator). The LinkDB just holds a list of
anchors for URL's that are in the CrawlDB.
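A hedged conf/regex-urlfilter.txt sketch (domain hypothetical):
# accept only URLs on example.com
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.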
Be sure to have a local DNS cache
On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977
webdev1...@gmail.com wrote:
So how is it that one is able to crawl huge websites with the crawl
script
and not use the parse = false? You would have to have enormous
amounts of
disk space to run the parse later.
You can run smaller batches
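A hedged sketch (number hypothetical): bound each fetch list with -topN so the
resulting segments stay small enough to parse as you go.
bin/nutch generate crawl/crawldb crawl/segments -topN 10000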
Where are you seeing this ClassNotFoundException? When you look at it in an
IDE (e.g., Eclipse), or at runtime? Or building using Ant/Ivy? It seems
like it built OK, so just trying to figure out how you are running Nutch.
Cheers,
Chris
On 10/11/10 4:24 AM, Markus Jelsma markus.jel
to keep the old project as is for now.
Of course for production I will have it on two different servers, as you
cannot run multiple instances of Nutch on the same server/cluster.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
Cheers!
Hi Folks,
A while back I nominated Markus Jelsma for Nutch committership and PMC
membership. The VOTE tallies in Nutch PMC-ville have occurred and I'm happy
to announce that Markus is now a Nutch committer!
Markus, feel free to say a little bit about yourself, and, welcome aboard
Fetch and parse the feeds and store the newly discovered URL's in the CrawlDB.
Then generate a new fetch list, fetch and parse and index the most recent
item.
The remaining problem is how to know which is the most recent. Maybe you
should create a plugin that will only add the most recent URL
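A hedged sketch of one such cycle (paths and segment name hypothetical):
# fetch/parse the feeds so newly discovered item URLs enter the CrawlDB
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/20101101000000
bin/nutch parse crawl/segments/20101101000000
bin/nutch updatedb crawl/crawldb crawl/segments/20101101000000
# then generate/fetch again and push the new items to Solr
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/20101101000000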
Well, you could set a fake user agent.
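For example, a hedged nutch-site.xml sketch (the UA string is made up; use
responsibly and respect robots.txt):
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MyCrawler)</value>
</property>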
As I crawl more websites I find I'm encountering more and more websites
that reject the crawl by basically redirecting it to an HTML page
that states something along the lines of:
HTTP 602 Unsupported Browser The browser you are using
wrong somewhere so I'm going over the whole set-up again, slowly.
Joe
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 26, 2010 3:04 PM
To: user@nutch.apache.org
Subject: RE: Any changes to setting up solr with nutch 1.2?
I
and solrindex-mapping.xml/schema.xml file in conf is so that nutch can
index the solr database, while the separate solr instance is used to
search the result?
Thanks,
Steve
On Tue, Oct 26, 2010 at 3:43 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
Hi,
You'll need a 1.3
Hi.
I'm using nutch 1.2 to crawl a site and after that I want to index to solr
using solrIndexer command.
The problem is that the solr server needs Digest authentication. Is there
a way to authenticate from nutch?
Never tried the authentication part, but I guess
help me to know the
following:
1) What types of things you would want explained in a book / videos on
Nutch?
2) What are the biggest problems you face using Nutch?
3) Anything special you would like answered or explained?
Thanks in advance for any responses.
Dennis
--
Markus Jelsma
but what about
the document itself? I believe not, but need to make sure.
Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
sets 0.0f
boosts on some documents, or something else is going wrong.
On Thursday 04 November 2010 14:26:17 Markus Jelsma wrote:
Hi,
Quick question: does Nutch set document boosts on documents that I send to
Solr? I've got some trouble with fieldNorms which are calculated from
document/field
What version of Solr are you indexing to and what is the Solr log telling you?
Hello I am trying to get nutch to work after upgrading from nutch 1.0 to
1.2:
solrindex map is working but as soon as I hit the reduce stage I start
getting errors. I fixed a couple of the errors but I don't
Found the problem. The boost field was removed but it seems the dedup job needs
it. I haven't tested it but since I recently removed the field it makes sense.
Why would I need the boost field anyway, and why does the dedup job need it?
Hi all,
For some reason I get an exception on this job.
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, November 04, 2010 11:41 AM
To: user@nutch.apache.org
Cc: Eric Martin
Subject: Re: Stop Nutch
How did you start it? Are you running it on Hadoop at all?
I can't find a way to stop Nutch 1.2 via command line. I use
Regards
Alexander Aristov
On 4 November 2010 22:00, Markus Jelsma markus.jel...@openindex.io wrote:
Kill it! I guess it just runs standalone, just like executed jobs from
the command line which you can just terminate with CTRL+C.
I don't recommend stopping Nutch while executing jobs
Regards
Alexander Aristov
On 4 November 2010 21:40, Markus Jelsma markus.jel...@openindex.io wrote:
What version of Solr are you indexing to and what is the Solr log telling
you?
Hello I am trying to get nutch to work after upgrading from nutch 1.0
to 1.2:
solrindex map is working
Yes!
But I wouldn't recommend it if you're using Solr as your search server, as it
can index e-mail boxes [1] via its data import handler [2].
However, using Nutch is possible too, but it depends on your setup whether it's
easy or not. Nutch can crawl and index your file system and the mail is
elements afterwards is only a temporary work-around in this
case.
Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
headers
value be correct. To do this, I modified the nutch-default.xml file:
<name>http.agent.name</name>
<value>Mozilla/4.0</value>
Is it enough?
Thanks
2010/11/16 Markus Jelsma-2 [via Lucene]
Perhaps something for the Tika list?
On Monday 15 November 2010 17:57:13 Markus Jelsma wrote:
Hi,
A quite awful issue just occurred and I traced it back down the line.
Apparently the parser seems to translate HTML entities back to their
original form: &lt; to < and &gt; to >, etc.
how to run Solr as the
search engine along with Nutch. I've downloaded the latest stable
release of Nutch (1.2) and I see that Solr is already integrated with it
out-of-the-box. Question is: how are we supposed to use Solr within
Nutch?
Thanks
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
are appreciated.
Thanks
Guido
By the way:
Where does nutch/conf/schema.xml come into play? I assume it is just a
template to replace solr/conf/schema.xml.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
As reference to other readers:
https://issues.apache.org/jira/browse/NUTCH-939
On Friday 26 November 2010 11:59:26 Claudio Martella wrote:
Hello list,
I'm porting the recrawl script to use hadoop (on an already existing hadoop
cluster). I attach my version.
What I found out is that Indexer
2.0 and if possible 1.3, although the latter might not see daylight.
Thanks for the patch!
On Friday 26 November 2010 16:19:58 Claudio Martella wrote:
Markus, with trunk you mean 1.3 or 2.0? The patches should apply to all
1.x.
On 11/26/10 3:15 PM, Markus Jelsma wrote:
As reference
invertlinks
fails because of missing directories in my merged segment.
Is there anything I can do?
mehdi
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
help me?
Klaus.
--
e-Mail : kl...@tachtler.net
Homepage: http://www.tachtler.net
DokuWiki: http://www.dokuwiki.tachtler.net
--
Markus Jelsma - CTO - Openindex
Seems to be a carriage return issue. Remove them first.
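A hedged sketch (filename hypothetical); under Cygwin either of these works:
dos2unix crawl.sh
# or, without dos2unix installed:
tr -d '\r' < crawl.sh > crawl-unix.sh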
Hello List
Using windows XP
Cygwin to execute
Nutch-1.2
Currently trying out various crawl scripts and have hit a problem using the
one located here
http://wiki.apache.org/nutch/Crawl
I made some minor adjustments, however
Hi,
Check out the readseg command.
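For example, a hedged sketch that dumps only the raw fetched content of one
segment (segment name hypothetical):
bin/nutch readseg -dump crawl/segments/20101201000000 dumpdir -nofetch -nogenerate -noparse -noparsedata -noparsetext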
Cheers,
Hi
I am new to Nutch. I just started to use Nutch to crawl an intranet and
extract a certain field from the html pages. The first step I would like
to do is to dump all the html pages to a directory. I guess I should add a
filter class to do it,
One is being used by the crawl command.
Their contents are very similar; are they being used by two different
plugins? Why are there two files?
Nobin Mathew
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Use the hadoop.tmp.dir setting in nutch-site.xml to point to a disk where
plenty of space is available.
Other users have previously reported similar problems which were due to a
lack of disk space, as suggested by this:
*Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException:
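A hedged nutch-site.xml sketch (path hypothetical):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/bigdisk/hadoop-tmp</value>
</property>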
on Twitested Lib.
And I want to learn more :-).
I didn't know about different url filters for fetching, updating etc.
Where can I change those filters?
Thank you,
2011/1/12 Markus Jelsma markus.jel...@openindex.io:
Hi,
This is rather tricky. You can crawl a lot but index a little if you
the SolrWriter.java class and placed my mysql connector
there, but nothing
happens, so can you please explain a little more, with example code,
exactly which
part of the SolrWriter class is going to be replaced by the mysql connector?
-Thank you very much
On 1/13/11, Markus Jelsma markus.jel
IOException ioe = new IOException();
ioe.initCause(e);
return ioe;
}
}
-Thanks you very much
On 1/13/11, Markus Jelsma markus.jel...@openindex.io wrote:
public void write gets called for each NutchDocument and collects them in
inputDocs. You could, after line 60, call a customer
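A hedged, self-contained Java sketch of that idea; the names below are made up
and this is not the actual Nutch source:
import java.util.ArrayList;
import java.util.List;

public class CustomWriterSketch {
  // stands in for SolrWriter's buffer of converted documents
  private final List<Object> inputDocs = new ArrayList<Object>();

  // stands in for SolrWriter.write(NutchDocument doc)
  public void write(Object doc) {
    inputDocs.add(doc);   // existing behavior: buffer the document for Solr
    storeElsewhere(doc);  // hypothetical hook: also hand it to your own store
  }

  private void storeElsewhere(Object doc) {
    // hypothetical: e.g. a JDBC INSERT into MySQL would go here
  }
}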
It seems this is the root of the problem.
Caused by: java.lang.OutOfMemoryError: Java heap space
Nutch can detect 404's by recrawling existing URL's. The mutation, however, is
not pushed to Solr at the moment.
As far as I know, Nutch can only discover new URLs to crawl and send the
parsed content to Solr. But what about maintaining the index? Say that
you have a daily Nutch script that
byte STATUS_DB_GONE = 0x03;
http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
Where is that information stored? It could then easily be used to issue
deletes on solr.
On 1/23/11 10:32 PM, Markus Jelsma wrote:
Nutch can detect
point where it has been interrupted. Is there any
way that I can resume the crawl after interruption from the same
point?
Regards
Amna Waqar
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
These values come from the CrawlDB and have the following meaning.
db_unfetched
This is the number of URL's that are to be crawled when the next batch is
started. This number is usually limited with the generate.max.per.host
setting. So, if there are 5000 unfetched and generate.max.per.host is
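You can inspect these counts yourself; a hedged sketch (path hypothetical):
bin/nutch readdb crawl/crawldb -stats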
Reading a URL from the DB returns the HTTP response of that URL, some header
information and body. Crawling a URL with an HTTP redirect won't result in the
HTTP response of the redirection target for that redirecting URL.
Hi,
My application needs to crawl a set of urls which I give to the
and it
issues a delete to solr for those entries.
Is that what you guys have in mind? Should I file a JIRA?
On 1/24/11 10:26 AM, Markus Jelsma wrote:
Each item in the CrawlDB carries a status field. Reading the CrawlDB will
return this information as well; the same goes for a complete dump, with
which you could create
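A hedged sketch of that dump approach (paths hypothetical):
# dump the CrawlDB as plain text, then pick out the gone entries
bin/nutch readdb crawl/crawldb -dump crawldump
grep -B1 db_gone crawldump/part-00000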