Hi all,
Does anyone have experience with designing web information extraction systems
such as shopbots/pricebots? I'm currently doing research on this topic and want
to integrate Nutch. A few guidelines from anyone who has designed this type
of system would be really helpful to me.
Regards,
In the list of public Nutch servers you will find the following, which might
be interesting:
http://www.betherebesquare.com/
Matthias
Hi
Does anyone know of a way to get the real
number of documents showing/returned that are displayed
to the user for a particular search when the per-site
variable is active (not 0), as opposed to the total number
of documents returned?
Does anyone understand what I mean?
Hi Matthias.
The website is interesting, but is there any documentation about the implementation available?
Cuong.
I noticed that a lot of papers mention HMMs are great for information
extraction, but I cannot find a single open-source demo :(
What are your thoughts?
Regards
/Jack
On 7/26/05, Matthias Jaekle [EMAIL
Hi,
the author of this system announced he would like to contribute some of
his modifications. Here is his post to the list from 2005-06-10:
Hello,
I'd like to announce the launch of a new search engine that uses the
Nutch engine.
http://betherebesquare.com is an Event Search Engine for the San
Thanks for the very useful information, Matthias. I just wrote him an email.
Regards,
Cuong Hoang
-Original Message-
From: Matthias Jaekle [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 26 July 2005 8:34 PM
To: nutch-user@lucene.apache.org
Subject: Re: Information extraction
Hi,
the author of
Jack,
So far, I found two demos online:
http://eso.vse.cz/~labsky/cgi-bin/client/
http://iit.demokritos.gr/skel/crossmarc/
On these websites there are several documents that may be useful. I don't
think they will release the source code.
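Since those systems apparently don't ship source, here is a tiny self-contained sketch of the HMM idea behind HMM-based information extraction: Viterbi decoding that labels each token as OTHER or PRICE. All probabilities below are invented illustration values, and the state/symbol encoding is my own, not taken from any of the tools above.

```java
// Minimal HMM sketch for information extraction: label each token as
// OTHER (0) or PRICE (1) using the Viterbi algorithm. The probabilities
// are made-up illustration values, not trained parameters.
public class HmmSketch {
    static final double[] START = {0.8, 0.2};          // P(state at t=0)
    static final double[][] TRANS = {{0.9, 0.1},       // from OTHER
                                     {0.4, 0.6}};      // from PRICE
    // Observation symbols: 0 = plain word, 1 = number-like, 2 = currency sign
    static final double[][] EMIT = {{0.8, 0.15, 0.05}, // emitted from OTHER
                                    {0.1, 0.5, 0.4}};  // emitted from PRICE

    /** Returns the most likely state sequence for the observations. */
    public static int[] viterbi(int[] obs) {
        int n = obs.length, k = START.length;
        double[][] v = new double[n][k];   // best path score ending in state s at t
        int[][] back = new int[n][k];      // backpointers
        for (int s = 0; s < k; s++) {
            v[0][s] = START[s] * EMIT[s][obs[0]];
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                double best = -1.0;
                for (int p = 0; p < k; p++) {
                    double score = v[t - 1][p] * TRANS[p][s];
                    if (score > best) { best = score; back[t][s] = p; }
                }
                v[t][s] = best * EMIT[s][obs[t]];
            }
        }
        int[] path = new int[n];
        for (int s = 1; s < k; s++) {
            if (v[n - 1][s] > v[n - 1][path[n - 1]]) path[n - 1] = s;
        }
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // "tickets" "$" "20"  ->  expected labels: OTHER PRICE PRICE
        int[] labels = viterbi(new int[]{0, 2, 1});
        System.out.println(java.util.Arrays.toString(labels)); // prints [0, 1, 1]
    }
}
```

A real system would of course train the transition/emission tables from labeled pages instead of hard-coding them.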
Regards,
Cuong Hoang
-Original Message-
From:
Yes, thanks, you seem to be right. If I use more
threads on the same host, although the process seems to
go faster, I get a lot more HTTP errors, so it ends up
being slower (and probably more disruptive to the
site).
--- EM [EMAIL PROTECTED] wrote:
Go with 1 thread per host.
For my small area
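For reference, the per-host thread cap is a Nutch configuration property; assuming a 0.7-era setup, the override would go in conf/nutch-site.xml along these lines:

```
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
```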
hi,
I checked my log file and found that the crawler generates errors
when it meets a page with Word and PDF files inside.
Is there any configuration file I have to change so the crawler
can fetch links to Word and PDF files?
thanks,
Michael
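If the errors come from the parser rather than the fetcher (an assumption — the logs would confirm it), the usual fix in a 0.7-era setup is to enable the Word and PDF parse plugins in the plugin.includes property of conf/nutch-site.xml, for example:

```
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
```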
Michael,
Error logs help. Please post them in the email. Thanks.
/Jack
On 7/27/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote:
hi,
I checked my log file and found that the crawler generates errors
when it meets a page with Word and PDF files inside.
Is there any configuration file I have to change so the crawler
Thanks all!
It is running again and seems to be doing a lot more.
On 7/26/05, Howie Wang [EMAIL PROTECTED] wrote:
I think Praveen is right. Another thing that you might have to
look out for is that most of the links on theserverside seem to
have query strings in them with a '?'. So you
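For context, the stock crawl-urlfilter.txt contains a rule that drops any URL with query-string characters; to crawl such links that rule has to be relaxed (original rule shown commented out below):

```
# skip URLs containing characters used in queries
# -[?*!@=]
# to allow '?' but still skip the rest, for example:
-[*!@]
```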
Hi Cuong
Thanks for the demos.
I agree that
Information Extraction = Segmentation + Classification + Clustering +
Association.
I am going to extend HtmlParseFilter and do text mining on
parse.getText(). Is that a good way?
Thoughts?
And I'd like to share some resources I am reading now
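On the text-mining side, here is a self-contained sketch of the kind of pass one could run over the plain text that parse.getText() returns; the PriceExtractor class and its pattern are my own illustration for the shopbot/pricebot use case, not part of the Nutch API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing step: pull price-like strings out of the
// plain text a parser (e.g. parse.getText() in Nutch) would return.
public class PriceExtractor {
    // Matches a dollar sign, optional space, digits, optional cents.
    private static final Pattern PRICE =
        Pattern.compile("\\$\\s?\\d+(?:\\.\\d{2})?");

    public static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("Tickets from $19.99, VIP $120 each"));
        // prints [$19.99, $120]
    }
}
```

Running it inside an HtmlParseFilter would just mean calling extract() on the parsed text and storing the hits as metadata.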
One other thing which I think might be the case (I'm not sure though):
if you are fetching a segment with, let's say, 1000 links, and 50% are errors
when you finish the segment, these pages won't be placed in the next
segments for fetching, but will instead wait for the next refetch date
(default 30 days).
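For reference, that refetch interval is configurable; assuming a 0.7-era setup, the property (value in days) can be overridden in conf/nutch-site.xml:

```
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
</property>
```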