Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Jack Tang
Maybe you can filter out JavaScript files (*.js) using the URL filter. /Jack On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote: Hello, sorry for my limited English. I use nutch-0.7.1 and have an issue with the HTML parser: I get JavaScript code in the summary and don't know how to remove it. For example

Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Jack Tang
remove it :) -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 15, 2006 11:14 AM To: nutch-user@lucene.apache.org Subject: Re: javascript in summaries [nutch-0.7.1] On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote: This script present in html

Re: Buggy fetchlist' urls

2006-03-15 Thread Jack Tang
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote: I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Can we build a RhinoInterpreter first, and then evaluate the JavaScript function to get the result, rather than extracting pure text as we do now?

Re: Buggy fetchlist' urls

2006-03-14 Thread Jack Tang
Hi Andrzej. In my previous projects, I bound JavaScript functions to a center URL, and I know that idea does not fit Nutch. I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Can we build a RhinoInterpreter first, and then evaluate the

Re: Language Profiling Problem

2006-03-13 Thread Jack Tang
Please put hadoop-0.1-dev.jar on your classpath. On 3/14/06, Tolga Erkal [EMAIL PROTECTED] wrote: I am trying to use NGramProfile to create a profile and am getting the following error. It is probably related to the classpath setting, but I could not figure out how to make it work. Any help?

Re: Language Profiling Problem

2006-03-13 Thread Jack Tang
PROTECTED] mailto:[EMAIL PROTECTED] Magnotia.com | www.magnotia.com http://www.magnotia.com/ My Blog | x.magnotia.com http://x.magnotia.com/ 917 495 1938 -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2006 11:59 PM To: nutch-user

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: Mr Tang: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: Weird! You are running nutch on local file system or distributed file system? Local file system And can you find the same query hasan via luke? Nope ok. As stepan said, can you get

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: ok. As stepan said, can you get any hit when you try to search http or www? No Hey, can you zip the index and send it to me directly? -- Cheers, Hasan Diwan [EMAIL PROTECTED] -- Keep

Re: NullPointerException

2006-03-05 Thread Jack Tang
Tang [EMAIL PROTECTED] wrote: On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: ok. As stepan said, can you get any hit when you try to search http or www? No Hey, can you zip the index and send it to me directly? -- Cheers, Hasan

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: You can still build it on local file system:) Build, yes, but what of deployment? Can I use it in the same way? Of course yes. At present, I don't have enough resources to run a distributed

Re: limit fetching by using crawl-urlfilter.txt

2006-03-03 Thread Jack Tang
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote: hi, I tried this, actually in my case, one site ends with .net and the other is .org so I modified it to +^http://([a-z0-9]*\.)*(abc.net|def.org)/ I guess '.' is a metacharacter in regexp, so please try +^http://([a-z0-9]*\.)*(abc\.net|def\.org)/ Good
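The effect of the unescaped dot can be checked outside Nutch with plain java.util.regex. This is only a sketch: the class name DotEscapeDemo and the sample URLs are made up for illustration, and the leading + of the filter line is left out because it is the filter file's accept marker, not part of the regular expression.

```java
import java.util.regex.Pattern;

final class DotEscapeDemo {
    // Pattern from the thread with the dots unescaped: '.' matches any character.
    static final Pattern LOOSE =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");

    // Same pattern with the dots escaped, so they only match a literal '.'.
    static final Pattern STRICT =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");

    // True when the pattern is found at the start of the URL.
    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String intended = "http://www.abc.net/index.html";
        String lookalike = "http://www.abcxnet/index.html"; // hypothetical host
        System.out.println("loose/intended:   " + accepts(LOOSE, intended));
        System.out.println("loose/lookalike:  " + accepts(LOOSE, lookalike));
        System.out.println("strict/intended:  " + accepts(STRICT, intended));
        System.out.println("strict/lookalike: " + accepts(STRICT, lookalike));
    }
}
```

Both patterns accept the intended host; only the unescaped one also lets the lookalike host through.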

Re: About regex in the crawl-urlfilter.txt config file

2006-02-23 Thread Jack Tang
Hi, I think the URL filter checks whether the URL contains a match for the pattern rather than whether the whole URL matches it. /Jack On 2/23/06, Elwin [EMAIL PROTECTED] wrote: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ Will this pattern accept a URL like this: http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/? I think it will not, but in
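The difference between "contain" and "match" can be seen with plain java.util.regex, where find() looks for the pattern anywhere it is allowed to start and matches() requires the whole input to fit. A small sketch follows; the host example.org and the class name are placeholders, and whether the Nutch filter really calls find() is the assumption made in the message above, not something this code verifies.

```java
import java.util.regex.Pattern;

final class FindVersusMatchDemo {
    // Host-prefix pattern in the style of crawl-urlfilter.txt (example.org is a placeholder).
    static final Pattern HOST =
        Pattern.compile("^http://([a-z0-9]*\\.)*example\\.org/");

    // "Contain" semantics: the pattern only has to match a prefix of the URL.
    static boolean containsMatch(String url) {
        return HOST.matcher(url).find();
    }

    // "Match" semantics: the pattern has to cover the entire URL.
    static boolean fullMatch(String url) {
        return HOST.matcher(url).matches();
    }

    public static void main(String[] args) {
        String deep = "http://www.example.org/docs/page.html";
        System.out.println("find:    " + containsMatch(deep));
        System.out.println("matches: " + fullMatch(deep));
    }
}
```

With "contain" semantics a host-prefix pattern accepts every page below that host; with full-match semantics it would accept only the bare host URL.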

Re: Search Particulars

2006-02-23 Thread Jack Tang
Hey, the simplest way is to copy the BasicQueryFilter class and rename it, then modify FIELDS/FIELD_BOOSTS by replacing them with your meta tags from the Nutch config. And don't forget the configuration in your query filter's plugin.xml. Good luck! /Jack On 2/24/06, Vanderdray, Jacob [EMAIL PROTECTED]

Re: Admin GUI

2006-02-23 Thread Jack Tang
Hi Stefan The GUI looks great! My idea is to add ajax tech. to reduce the page reload and show the job progress in realtime. If contribution is welcome and no one is working on this, I'd like to take this. Regards /Jack On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Daniel, thanks

Re: Admin GUI

2006-02-23 Thread Jack Tang
On 2/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Jack, The GUI looks great! I will forward this to Frank Henze he had done the design and sample. :) Thanks. I'll prepare some utility and debug javascript classes from now on:) My idea is to add ajax tech. to reduce the page reload

Re: Nutch 0.8 version required..

2006-02-23 Thread Jack Tang
On 2/24/06, sudhendra seshachala [EMAIL PROTECTED] wrote: Thanks Stefan. But when I compiled, the jar size was just 318 KB for 0.8-dev, whereas the 0.7.1 release was 718 KB. Am I missing something? I guess not. All the MapReduce classes were separated from Nutch and are now hosted in the Hadoop project.

Re: retrieve data from index file

2006-02-23 Thread Jack Tang
Exception in thread main java.lang.NoClassDefFoundError: org/apache/lucene/st ore/FSDirectory ^^^ Why one blank here? On 2/24/06, Wong Ting Kiong [EMAIL PROTECTED] wrote: hi, I had tried some java codes calling lucene lib lucene-1.9-rc1-dev.jar, but got error, my

Single NutchBean and multiple indices support

2006-02-15 Thread Jack Tang
Hi there. I am facing the same question and looking for the same solution. Your solution seems easy :) My question is: which file system does the application run on, LocalFileSystem or DistributedFileSystem? Thanks /Jack On 2/9/06, Ravi Chintakunta [EMAIL PROTECTED] wrote: Hi David, Thanks for your

Re: Plugins: directory not found: plugins

2006-02-07 Thread Jack Tang
Please set plugin.folders (in nutch-default.xml or nutch-site.xml) to the directory where the plugins are actually built. Of course, you can use an absolute path. /Jack On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi, do you mean I should create a dir called build and move the plugins dir into it? It seems that doesn't work either

Re: Categorizing content

2006-02-07 Thread Jack Tang
Hi Byron I am thinking will it be faster to do this offline? I mean you can re-visit webdb and link db and generate the index. /Jack On 2/8/06, Byron Miller [EMAIL PROTECTED] wrote: Is there an easy way to categorize content on parse? I have an extensive list of adult terms and i would like

Re: MD5Hash

2006-01-18 Thread Jack Tang
Hi Thomas, I suppose the only unique key for content in the web db is the page's URL. So why not retrieve the content by URL directly? /Jack On 1/8/06, Thomas Delnoij [EMAIL PROTECTED] wrote: I am working with Nutch 0.7.1. As far as I understand the current implementation (please correct me if I am

Re: Background color searched word

2006-01-11 Thread Jack Tang
Hi Jérôme, would it be better to add an id to the hit's span so that we can switch the highlight on or off in JavaScript? /** Returns an HTML representation of this fragment. */ public String toString() { return "<span id=nutch-hl class=\"highlight\">" + super.toString() + "</span>"; } Thanks /Jack On 1/12/06,

Re: document markup to control indexing

2005-12-28 Thread Jack Tang
Hi, I am sorry, it should be the getTextHelper() method. Say I want to index the content in this block: <!--indexware--> This is not Ads <!--/indexware--> The code may look like this: boolean contentStart; boolean contentEnd; if (node.getNodeType() == Node.COMMENT_NODE) { // you can move the value
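The comment-marker idea can be sketched as a standalone DOM walk. This is not the code of DOMContentUtils or getTextHelper(); it only assumes the indexware and /indexware comment markers from the message, parses well-formed XHTML with the JDK's own parser, and uses a made-up class name and sample page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class IndexwareTextDemo {
    private boolean inside = false;                  // true between the two markers
    private final StringBuilder text = new StringBuilder();

    // Depth-first walk: comments toggle the flag, text is kept only while it is set.
    private void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String marker = node.getNodeValue().trim();
            if (marker.equals("indexware")) {
                inside = true;
            } else if (marker.equals("/indexware")) {
                inside = false;
            }
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE && inside) {
            String value = node.getNodeValue().trim();
            if (!value.isEmpty()) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i));
        }
    }

    // Returns the text found between the markers of a well-formed XHTML string.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            IndexwareTextDemo demo = new IndexwareTextDemo();
            demo.walk(doc.getDocumentElement());
            return demo.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Ads here</p><!--indexware-->"
            + "<p>This is not Ads</p><!--/indexware--><p>More ads</p></body></html>";
        System.out.println(extract(page));
    }
}
```

Keeping the flag as state of the walker, rather than as a local variable, lets the markers sit at any depth of the tree.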

Re: document markup to control indexing

2005-12-27 Thread Jack Tang
Hi Jeff Pls refer to getText() method in org.apache.nutch.parse.html.DOMContentUtils class (of course parse-html plugin). You can add your filter easily;) /Jack On 12/27/05, Jeff Breidenbach jeff@jab.org wrote: Hi all, Another open source search engine, HtDig, allows web page authors to

Re: Live updating an intranet index

2005-12-16 Thread Jack Tang
Hi Doug proposed the solution before. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200511.mbox/[EMAIL PROTECTED] Hope it helps. /Jack On 12/16/05, Bostjan [EMAIL PROTECTED] wrote: Hi, I'm using nutch 0.7. If I understend correctly, touching the web.xml of my webapp helps my

Re: How to get page content given URL only?

2005-12-09 Thread Jack Tang
Hi Nguyen, I am going to face this problem too. Here are my thoughts. One field, say uid, will be added to the index, and its value will be generated from the URL. Say the URL is http://www.a.com/x/y/z.html: uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html")); Is that OK? When I query
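The uid scheme proposed above can be sketched with the JDK's MessageDigest. The class and method names are made up for illustration, the hash is rendered as lowercase hex, and the query string is ignored; none of these choices come from the thread.

```java
import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class UrlUidDemo {
    // MD5 of a string as 32 lowercase hex characters.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // uid = md5(scheme://host) followed by md5(path), as proposed in the message.
    static String uid(String url) {
        URI uri = URI.create(url);
        String site = uri.getScheme() + "://" + uri.getAuthority();
        String path = uri.getRawPath();
        return md5Hex(site) + md5Hex(path);
    }

    public static void main(String[] args) {
        System.out.println(uid("http://www.a.com/x/y/z.html"));
    }
}
```

Because the first 32 characters depend only on the site, all pages of one host share a uid prefix.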

Re: Crawling listing (pagination) pages.

2005-12-08 Thread Jack Tang
Hi, I am facing the same problem. However, my crawl only focuses on certain websites, so I recognize the pagination URLs using a regexp and inject them in every fetch cycle. /Jack On 12/8/05, K.A.Hussain Ali [EMAIL PROTECTED] wrote: Hi all, does Nutch crawl pages in any listing pages (pages with

Re: Class Not Found

2005-12-01 Thread Jack Tang
Hi Could you pls post your ant logging? Thanks /Jack On 12/1/05, Vanderdray, Jacob [EMAIL PROTECTED] wrote: I'm able to get things to build if I run just 'ant' or 'ant jar'. I only get the error when I do 'ant war'. I've written a plugin that extends the HTMLParseFilter,

Re: Error: searching for 20 raw hits with stack trace

2005-11-30 Thread Jack Tang
Hi, I have no idea whether something is wrong with NutchBean or with the web application container. I suggest running NutchBean outside the web container first, that is, as a standalone application. Regards /Jack On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi, The following is the exception

Re: Help require in local hard-disk crawling with Nutch

2005-11-29 Thread Jack Tang
Hi I hope this helps http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch /Jack On 11/30/05, Arun Kumar Sharma [EMAIL PROTECTED] wrote: Nutch Geeks- I want to do local hard-disk crawling. I want to know what I need to do for this.I find this article helpful

Re: Error: searching for 20 raw hits

2005-11-29 Thread Jack Tang
Hi More logging info and exceptions will help dealing with the problem quickly;) /Jack On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi all, I am getting the following error, while i use the nutch to search the crawled nutch database, what would be the problem, Error:

Re: Images

2005-11-22 Thread Jack Tang
On 11/23/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Stefan Groschupf wrote: Can it do an image search like google? No. ;-/ Yes ;-) That is, if you have the image parser... which indeed is not so difficult, what with JAI and other libraries. You could index the image metadata.

Re: SessionIDs and forums are killing my fetch

2005-09-28 Thread Jack Tang
Hi Jon I think you can revise the URL by discarding sid param before putting it into fetchlist. Regards /Jack On 9/28/05, Jon Shoberg [EMAIL PROTECTED] wrote: Gal Nitzan wrote: Jon Shoberg wrote: I'm getting a ton of duplicate content from a forum with sessionIDs. Its a phpBB which
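Dropping the session parameter can be sketched in plain Java as below. Where this would hook into Nutch (a URL normalizer, a filter, or the fetchlist generator) is not stated in the message, so the class and method names are hypothetical and the sample forum URLs are invented.

```java
final class SessionIdStripper {
    // Returns the URL without the named query parameter; everything else is kept in order.
    static String stripParam(String url, String name) {
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {                       // set the fragment aside first
            fragment = url.substring(hash);
            url = url.substring(0, hash);
        }
        int q = url.indexOf('?');
        if (q < 0) {
            return url + fragment;             // no query string, nothing to strip
        }
        String base = url.substring(0, q);
        StringBuilder kept = new StringBuilder();
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            if (key.equals(name)) {
                continue;                      // drop the session parameter
            }
            if (kept.length() > 0) {
                kept.append('&');
            }
            kept.append(pair);
        }
        return base + (kept.length() > 0 ? "?" + kept : "") + fragment;
    }

    public static void main(String[] args) {
        System.out.println(
            stripParam("http://forum.example.org/viewtopic.php?t=42&sid=abc123", "sid"));
    }
}
```

Two URLs that differ only in their sid value collapse to the same string after this step, so they are no longer treated as distinct pages.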

Re: SessionIDs and forums are killing my fetch

2005-09-28 Thread Jack Tang
[EMAIL PROTECTED] wrote: Hi Jack, How can you discard URL from fetchlist? Regards, Gal Jack Tang wrote: Hi Jon I think you can revise the URL by discarding sid param before putting it into fetchlist. Regards /Jack On 9/28/05, Jon Shoberg [EMAIL PROTECTED] wrote: Gal

Re: Map Reduce

2005-09-27 Thread Jack Tang
Hi Gal, you can get the original paper from Google Labs http://labs.google.com/papers/mapreduce.html and some presentations on the Nutch wiki http://wiki.apache.org/nutch/Presentations Hope these resources help. Regards /Jack On 9/27/05, Gal Nitzan [EMAIL PROTECTED]

Re: JavaScript Urls

2005-09-07 Thread Jack Tang
Hi Andrzej, I think a JavaScript-function-to-URL mapping is a good solution. Say domainName.javascript:go = http://www.a.com/b.jsp?id={0} Here go is the JavaScript function and it takes one param, http://www.a.com/b.jsp?id={0} is the URL template for the go function, and {0} is the actual param, it
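The {0} placeholder happens to be the syntax of java.text.MessageFormat, so the mapping can be sketched as follows. The class name, the single-argument restriction, and the regular expression for recognising javascript:go('123') links are all assumptions made for this illustration; the thread does not say how the call would be parsed.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsUrlTemplateDemo {
    // Recognises links such as javascript:go('123') or javascript:go(123), one argument only.
    private static final Pattern CALL =
        Pattern.compile("^javascript:\\s*(\\w+)\\(\\s*'?([^')]*)'?\\s*\\)");

    // Function name -> URL template with a {0} placeholder.
    private final Map<String, String> templates = new HashMap<>();

    void register(String function, String template) {
        templates.put(function, template);
    }

    // Returns the expanded URL, or null when the link or the function is unknown.
    String resolve(String href) {
        Matcher m = CALL.matcher(href);
        if (!m.find()) {
            return null;
        }
        String template = templates.get(m.group(1));
        if (template == null) {
            return null;
        }
        // The argument is passed as a String so MessageFormat inserts it unchanged.
        return MessageFormat.format(template, m.group(2).trim());
    }

    public static void main(String[] args) {
        JsUrlTemplateDemo mapping = new JsUrlTemplateDemo();
        mapping.register("go", "http://www.a.com/b.jsp?id={0}");
        System.out.println(mapping.resolve("javascript:go('123')"));
    }
}
```

A template containing single quotes would need them doubled, since MessageFormat treats a single quote as an escape character.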

Re: Recrawling

2005-09-07 Thread Jack Tang
Hi Jake, a basic but pretty hard issue. Right now, we re-crawl a website by running the crawl command and putting the index into a temp dir. I think the core issue is how to swap the index on the fly. Some indexes may be referenced by NutchBean. Should we shut it down? Will MapReduce solve the problem? I mean can we

Re: Problem Starting Nutch (Tutorial like)

2005-07-30 Thread Jack Tang
Hi Nils, make sure the adapter configuration is right on your Linux box. You can also search for the thread nutch and linux box in the Nutch mailing list; I think I posted about the problem before. Regards /Jack On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote: Hi my Problem is: I ve done everything

Re: Text extraction from HTML

2005-07-29 Thread Jack Tang
Hi Novelli Do you insist on HtmlParser in Nutch? Or some alternatives are available, maybe, you can try htmlparser hosted on sf.net http://htmlparser.sourceforge.net/ Regards /Jack On 7/29/05, Giovanni Novelli [EMAIL PROTECTED] wrote: Hello, I'm working to the development of a multi-agents

Re: Problem Starting Nutch (Tutorial like)

2005-07-28 Thread Jack Tang
Hi Nils, make sure the adapter configuration is right on your Linux box. You can also search for the thread nutch and linux box in the Nutch mailing list; I think I posted about the problem before. Regards /Jack On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote: Hi my Problem is: I ve done everything as

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Matthias. The website is interesting, but is any document about the implementation available? Cuong, I notice a lot of papers mention that HMM is great for information extraction, but I cannot find a single open-source demo :( What are your thoughts? Regards /Jack On 7/26/05, Matthias Jaekle [EMAIL

Re: crawling Doc and Pdf

2005-07-26 Thread Jack Tang
Michael, error logs help. Please post them in your email. Thanks /Jack On 7/27/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote: hi, I checked my log file and found the crawler generates errors when it meets a page with a Word file and a PDF file inside. Is there any configuration file I have to change to let the crawler

Re: Information extraction

2005-07-26 Thread Jack Tang
, there are several documents that maybe useful. I don't think they will release the source code. Regards, Cuong Hoang -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Tuesday, 26 July 2005 8:29 PM To: nutch-user@lucene.apache.org Subject: Re: Information extraction Hi

Re: where to pull indexed files?

2005-07-24 Thread Jack Tang
Hi, please search the mailing list. I remember one thread some days ago that talked about dynamically deleting an index. Maybe it helps. Index hot-swap is not supported in Nutch by default right now. Regards /Jack On 7/25/05, blackwater dev [EMAIL PROTECTED] wrote: So there is no way to set up different

Re: Newbie questions

2005-07-05 Thread Jack Tang
Hi Vacuum I hope nutch wiki will help you much:) http://wiki.apache.org/nutch/ Regards /Jack On 7/6/05, Vacuum Joe [EMAIL PROTECTED] wrote: Hello Nutch-gurus, I have some very straightforward and yet totally newbie questions which I hope some kind person would answer. First of all,