Information extraction

2005-07-26 Thread Cuong Hoang
Hi all, does anyone have experience with designing web information extraction systems such as shopbots/pricebots? I'm currently doing research on this topic and want to integrate Nutch. A few guidelines from anyone who has designed this type of system would really be helpful to me. Regards,

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
In the list of public Nutch servers you will find the following, which might be interesting: http://www.betherebesquare.com/ Matthias

Search Script

2005-07-26 Thread quovadis
Hi, does anyone know of a way to get the real number of documents showing/returned (i.e. those actually displayed to the user) for a particular search when the per-site variable is active (not 0), as opposed to the total number of documents returned? Does anyone understand what I mean?

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Matthias. The website is interesting, but is any document about the implementation available? Cuong, I notice many papers mention that HMMs are great for information extraction, but I cannot find an open-source demo :( What are your thoughts? Regards /Jack On 7/26/05, Matthias Jaekle [EMAIL

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
Hi, the author of this system announced that he would like to contribute some of his modifications. Here is his post to the list from 2005-06-10: Hello, I'd like to announce the launch of a new search engine that uses the Nutch engine. http://betherebesquare.com is an Event Search Engine for the San

RE: Information extraction

2005-07-26 Thread Cuong Hoang
Thanks for the very useful information, Matthias. I just wrote an email. Regards, Cuong Hoang -Original Message- From: Matthias Jaekle [mailto:[EMAIL PROTECTED] Sent: Tuesday, 26 July 2005 8:34 PM To: nutch-user@lucene.apache.org Subject: Re: Information extraction Hi, the author of

RE: Information extraction

2005-07-26 Thread Cuong Hoang
Jack, so far I have found two demos online: http://eso.vse.cz/~labsky/cgi-bin/client/ http://iit.demokritos.gr/skel/crossmarc/ On these websites there are several documents that may be useful. I don't think they will release the source code. Regards, Cuong Hoang -Original Message- From:

RE: fetch bandwidth settings

2005-07-26 Thread Raymond Creel
Yes, thanks, you seem to be right. If I use more threads on the same host, although the process seems to go faster, I get a lot more HTTP errors, so it ends up being slower (and probably more disruptive to the site). --- EM [EMAIL PROTECTED] wrote: Go with 1 thread per host. For my small area
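
For reference, a minimal sketch of the politeness settings this thread is talking about, assuming the 0.7-era property names in conf/nutch-site.xml (fetcher.threads.per.host in particular may not exist in every release, so check your nutch-default.xml):

    <!-- assumed 0.7-era names: limit fetching to one thread per host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <!-- seconds to wait between successive requests to the same host -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>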

crawling Doc and Pdf

2005-07-26 Thread Feng (Michael) Ji
hi, I checked my log file and found that the crawler generates errors when it meets a page with Word and PDF files inside. Is there any configuration file I have to change so the crawler can fetch links to Word and PDF files? thanks, Michael
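
A likely fix, assuming a Nutch 0.7-era setup: the Word and PDF parsers ship as separate plugins (parse-msword, parse-pdf) that are not enabled by default, so they have to be added to plugin.includes in conf/nutch-site.xml. A minimal sketch (the root element and the default plugin list are written from memory of the 0.7-era defaults, so copy the exact value from your nutch-default.xml and extend it):

    <?xml version="1.0"?>
    <nutch-conf>
      <property>
        <name>plugin.includes</name>
        <!-- default plugin list extended with parse-msword and parse-pdf -->
        <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
      </property>
    </nutch-conf>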

Re: crawling Doc and Pdf

2005-07-26 Thread Jack Tang
Michael, error logs help. Please post them in the email. Thanks /Jack On 7/27/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote: hi, I checked my log file, found crawler generates error when met a page with word file and pdf file inside. Any configuration file I have to change to let crawler

Re: [Nutch-general] query returns no results

2005-07-26 Thread blackwater dev
Thanks all! It is running again and seems to be doing a lot more. On 7/26/05, Howie Wang [EMAIL PROTECTED] wrote: I think Praveen is right. Another thing that you might have to look out for is that most of the links on theserverside seem to have query strings in them with a '?'. So you

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Cuong, thanks for the demo. I agree that Information Extraction = Segmentation + Classification + Clustering + Association. I am going to extend HtmlParseFilter and do text mining on parse.getText(). Is that a good way? Thoughts? And I'd like to share some resources I am reading now
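
A minimal sketch of such a filter, assuming the Nutch 0.7-era HtmlParseFilter extension point; the package names and the filter() signature are written from memory and may differ in your checkout, and the keyword check is only a hypothetical placeholder for the real mining step:

    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class ExtractionFilter implements HtmlParseFilter {

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Plain text of the page as produced by the HTML parser.
        String text = parse.getText();

        // Hypothetical placeholder for the mining step: segmentation,
        // classification, clustering and association would run here.
        boolean looksLikeProductPage = text.toLowerCase().indexOf("price") >= 0;

        // A real filter would attach its extracted fields to the parse
        // metadata so an IndexingFilter can index them; that API varies
        // between Nutch versions, so it is left out of this sketch.
        return parse;
      }
    }

Such a plugin also needs a plugin.xml declaring it as an implementation of the HtmlParseFilter extension point, plus an entry in plugin.includes, just like the parser plugins mentioned above.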

RE: fetch bandwidth settings

2005-07-26 Thread EM
One other thing which I think might be the case (I'm not sure though): if you are fetching a segment with, say, 1000 links and 50% are errors when you finish the segment, those pages won't be placed in the next segment for fetching, but will instead wait for the next refetch date (default 30
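
For completeness, the refetch interval mentioned here is configurable; a one-property sketch for conf/nutch-site.xml, assuming the 0.7-era name db.default.fetch.interval (the value is in days):

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
    </property>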