Information extraction

2005-07-26 Thread Cuong Hoang
Hi all, does anyone have experience with designing web information extraction systems such as shopbots/pricebots? I'm currently doing research on this topic and want to integrate Nutch. A few guidelines from anyone who has designed this type of system would really be helpful to me. Regards,

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
In the list of public Nutch servers you will find the following, which might be interesting: http://www.betherebesquare.com/ Matthias

Search Script

2005-07-26 Thread quovadis
Hi, does anyone know of a way to get the real number of documents showing/returned (i.e. those actually displayed to the user) for a particular search when the per-site variable is active (not 0), as opposed to the total number of documents returned? Does anyone understand what I mean?

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Matthias. The website is interesting, but is any document about the implementation available? Cuong, I notice many papers mention that HMMs are great for information extraction, but I cannot find an open-source demo :( What are your thoughts? Regards /Jack On 7/26/05, Matthias Jaekle [EMAIL

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
Hi, the author of this system announced that he would like to contribute some of his modifications. Here is his post to the list from 2005-06-10: Hello, I'd like to announce the launch of a new search engine that uses the Nutch engine. http://betherebesquare.com is an Event Search Engine for the San

RE: Information extraction

2005-07-26 Thread Cuong Hoang
Thanks for the very useful information, Matthias. I just wrote an email. Regards, Cuong Hoang -Original Message- From: Matthias Jaekle [mailto:[EMAIL PROTECTED] Sent: Tuesday, 26 July 2005 8:34 PM To: nutch-user@lucene.apache.org Subject: Re: Information extraction Hi, the author of

RE: Information extraction

2005-07-26 Thread Cuong Hoang
Jack, so far I have found two demos online: http://eso.vse.cz/~labsky/cgi-bin/client/ http://iit.demokritos.gr/skel/crossmarc/ On these websites there are several documents that may be useful. I don't think they will release the source code. Regards, Cuong Hoang -Original Message- From:

RE: fetch bandwidth settings

2005-07-26 Thread Raymond Creel
Yes, thanks, you seem to be right. If I use more threads on the same host, although the process seems to go faster, I get a lot more HTTP errors, so it ends up being slower (and probably more disruptive to the site). --- EM [EMAIL PROTECTED] wrote: Go with 1 thread per host. For my small area
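
For reference, a minimal sketch of the politeness settings this thread is talking about, assuming the 0.7-era property names in conf/nutch-site.xml (fetcher.threads.per.host in particular may not exist in every release, so check your nutch-default.xml):

    <!-- assumed 0.7-era names: limit fetching to one thread per host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <!-- seconds to wait between successive requests to the same host -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>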

crawling Doc and Pdf

2005-07-26 Thread Feng (Michael) Ji
hi, I checked my log file and found that the crawler generates errors when it meets a page with Word and PDF files inside. Is there any configuration file I have to change so the crawler can fetch links to Word and PDF files? thanks, Michael
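
A likely fix, assuming a Nutch 0.7-era setup: the Word and PDF parsers ship as separate plugins (parse-msword, parse-pdf) that are not enabled by default, so they have to be added to plugin.includes in conf/nutch-site.xml. A minimal sketch (the root element and the default plugin list are written from memory of the 0.7-era defaults, so copy the exact value from your nutch-default.xml and extend it):

    <?xml version="1.0"?>
    <nutch-conf>
      <property>
        <name>plugin.includes</name>
        <!-- default plugin list extended with parse-msword and parse-pdf -->
        <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
      </property>
    </nutch-conf>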

Re: crawling Doc and Pdf

2005-07-26 Thread Jack Tang
Michael, error logs help. Please post them in the email. Thanks /Jack On 7/27/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote: hi, I checked my log file, found crawler generates error when met a page with word file and pdf file inside. Any configuration file I have to change to let crawler

Re: [Nutch-general] query returns no results

2005-07-26 Thread blackwater dev
Thanks all! It is running again and seems to be doing a lot more. On 7/26/05, Howie Wang [EMAIL PROTECTED] wrote: I think Praveen is right. Another thing that you might have to look out for is that most of the links on theserverside seem to have query strings in them with a '?'. So you

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Cuong, thanks for the demo. I agree that Information Extraction = Segmentation + Classification + Clustering + Association. I am going to extend HtmlParseFilter and do text mining on parse.getText(). Is that a good way? Thoughts? And I'd like to share some resources I am reading now
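
A minimal sketch of such a filter, assuming the Nutch 0.7-era HtmlParseFilter extension point; the package names and the filter() signature are written from memory and may differ in your checkout, and the keyword check is only a hypothetical placeholder for the real mining step:

    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class ExtractionFilter implements HtmlParseFilter {

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Plain text of the page as produced by the HTML parser.
        String text = parse.getText();

        // Hypothetical placeholder for the mining step: segmentation,
        // classification, clustering and association would run here.
        boolean looksLikeProductPage = text.toLowerCase().indexOf("price") >= 0;

        // A real filter would attach its extracted fields to the parse
        // metadata so an IndexingFilter can index them; that API varies
        // between Nutch versions, so it is left out of this sketch.
        return parse;
      }
    }

Such a plugin also needs a plugin.xml declaring it as an implementation of the HtmlParseFilter extension point, plus an entry in plugin.includes, just like the parser plugins mentioned above.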

RE: fetch bandwidth settings

2005-07-26 Thread EM
One other thing which I think might be the case (I'm not sure though): if you are fetching a segment with, say, 1000 links and 50% are errors when you finish the segment, those pages won't be placed in the next segment for fetching, but will instead wait for the next refetch date (default 30
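
For completeness, the refetch interval mentioned here is configurable; a one-property sketch for conf/nutch-site.xml, assuming the 0.7-era name db.default.fetch.interval (the value is in days):

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
    </property>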