Distributed Matrix Computing on Hadoop

2006-07-21 Thread Jack Tang
Hi list, I am now facing a problem in scientific computing. We have more than 5 GB of data (mainly matrices/vectors) that we collected for some surveys, and we now plan to do some data mining on it. Honestly, I do not know Hadoop/MapReduce very well. The question seems quite simple to you experts

Much faster RegExp lib needed in nutch?

2006-03-11 Thread Jack Tang
Hi all, RegExp is widely used in nutch, and I am now wondering whether the jdk/jakarta classes are fast enough. Here are the benchmarks I found on the web: http://tusker.org/regex/regex_benchmark.html It seems dk.brics.automaton.RegExp is the fastest among the libs. /Jack -- Keep Discovering ... ... http://www.jr

Duplicate Content Issues

2006-02-28 Thread Jack Tang
Hi, how can duplicate content be avoided? 1. Mirror sites: 1 website, 2 domains. 2. Confusing the bot: dynamic URLs. As robots find dynamic content, the site may be returning a different URL with the same content… 3. Print-friendly pages? Will nutch enhance the dedup code? /Jack -- Keep Discovering ..
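For the mirror-site and dynamic-URL cases above, one common approach is to key pages by a digest of their content rather than by URL. The following is only a minimal sketch of that idea, not the nutch dedup code: the class name ContentDedup and its methods are hypothetical, and it catches exact duplicates only (a print-friendly page with different markup would need near-duplicate detection instead).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of nutch: detects exact duplicate content.
class ContentDedup {
    // Maps a content digest to the first URL seen with that content.
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    // Hex-encoded MD5 of the page content.
    static String digest(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the URL that already holds this content, or null if the page is new.
    String register(String url, String content) {
        String key = digest(content);
        String existing = firstUrlByDigest.get(key);
        if (existing == null) {
            firstUrlByDigest.put(key, url);
        }
        return existing;
    }
}
```

Calling register for two URLs with the same body returns null for the first and the first URL for the second, so the caller can decide which copy to keep.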

Re: Summarizer threads in nutch

2006-02-23 Thread Jack Tang
Yes, you're right :) I found the answer. Thanks. On 2/24/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote: > Isn't HitDetails.length == hitsPerPage? > This happens in search.jsp. > > > Am 24.02.2006 um 03:09 schrieb Jack Tang: > > > I don't think s

Re: Summarizer threads in nutch

2006-02-23 Thread Jack Tang
r result page. > Does this answer the question? > > Am 24.02.2006 um 02:51 schrieb Jack Tang: > > > Hi Stefan > > > > Can you explain a little more? I mean I cannot find any evidence in > > the source code... > > Thanks > > > > /Jack > > > >

Re: Summarizer threads in nutch

2006-02-23 Thread Jack Tang
06 um 02:45 schrieb Jack Tang: > > > On 2/23/06, Doug Cutting <[EMAIL PROTECTED]> wrote: > >> Jack Tang wrote: > >>> In FetchedSegments class, below code shows how to get the hit > >>> summaries. > >>> > >>>

Re: Summarizer threads in nutch

2006-02-22 Thread Jack Tang
On 2/23/06, Doug Cutting <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > In FetchedSegments class, below code shows how to get the hit summaries. > > > > public String[] getSummary(HitDetails[] details, Query query) > > throws IOException { >

Thread in nutch

2006-02-20 Thread Jack Tang
Hi All, I don't know whether nutch will support only JDK 1.5 or both JDK 1.4 and 1.5 in the future. If the former, would it be better to adopt the JDK 1.5 concurrency framework for threads (say, the fetcher and summarizer threads)? And here is the IBM tutorial on the new classes in Tiger. /Jack -- Keep Discovering ... ... http:

Re: Summarizer threads in nutch

2006-02-20 Thread Jack Tang
Hi, can someone explain the original design? I suggest refactoring the API (FetchedSegments.class) to public String[] getSummary(HitDetails[] details, int hitStart, int hitEnd, Query query) { } Does this make sense? /Jack On 2/20/06, Jack Tang <[EMAIL PROTECTED]> wrote: >
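To make the proposed signature concrete, here is a rough sketch of a range-limited getSummary that uses a fixed pool from java.util.concurrent instead of one thread per hit. It is not the FetchedSegments code: plain String stands in for HitDetails and Query, the Summarizer interface and the poolSize parameter are hypothetical, and hitEnd is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; String stands in for nutch's HitDetails and Query.
class RangeSummaries {
    // Stand-in for whatever produces the summary of a single hit.
    interface Summarizer {
        String summarize(String hitDetail, String query);
    }

    // Summarizes only details[hitStart .. hitEnd), assuming hitEnd is exclusive.
    static String[] getSummary(String[] details, int hitStart, int hitEnd,
                               final String query, final Summarizer summarizer,
                               int poolSize) {
        if (hitStart < 0 || hitEnd > details.length || hitStart > hitEnd) {
            throw new IllegalArgumentException("invalid hit range");
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = hitStart; i < hitEnd; i++) {
                final String detail = details[i];
                Callable<String> task = () -> summarizer.summarize(detail, query);
                futures.add(pool.submit(task));
            }
            // Collect results in hit order, waiting for each task to finish.
            String[] result = new String[futures.size()];
            for (int i = 0; i < result.length; i++) {
                result[i] = futures.get(i).get();
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a range, the caller pays only for the hits actually shown on the current result page, and the pool size bounds the number of threads regardless of how many hits are requested.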

Summarizer threads in nutch

2006-02-19 Thread Jack Tang
Hi Guys, in the FetchedSegments class, the code below shows how to get the hit summaries. public String[] getSummary(HitDetails[] details, Query query) throws IOException { SummaryThread[] threads = new SummaryThread[details.length]; for (int i = 0; i < threads.length; i++) { threads[i

How to support multi-field highlighting?

2006-02-16 Thread Jack Tang
Hi All, nutch currently supports highlighting only in the "content" field. Any suggestions for enabling multi-field highlighting, say for hits in anchor text and URL (like Google), etc.? I know the simplest but stupid way is to get the hit details first and then invoke the summarizer threads. Any smarter ideas? Thanks. /Jac

Re: process/create/hand over: crawl meta data

2006-02-08 Thread Jack Tang
On 2/9/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote: > Hi Folks, > > I hope and it looks like we are close to get meta data support for > crawlDatum (CrawlDB) into the sources soon. > At this point we can store and read but not 'process' (means creation > or inheritance etc. [some one knows a bet

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jack Tang
Hi, is it reasonable to guess the language from the target server's geographical information? /Jack On 1/23/06, Jérôme Charron <[EMAIL PROTECTED]> wrote: > > Any plan to implement this ? I mean move LanguageIdentifier class > > into nutch core. > > As I already suggested it on this list, I really would l

Re: lang identifier and nutch analyzer in trunk

2006-01-21 Thread Jack Tang
Hi Jérôme On 1/21/06, Jérôme Charron <[EMAIL PROTECTED]> wrote: > > I am wondering whether the Analyzer of nutch in svn trunk is chosen by > > the languageidentifier plugin or not. (I know in nutch 0.7.1-dev it was.) > > It's not really chosen by the languageidentifier, but chosen according to the > value of the lan

Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jack Tang
On 1/21/06, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi All > > I am wondering whether the Analyzer of nutch in svn trunk is chosen by > the languageidentifier plugin or not. (I know in nutch 0.7.1-dev it was.) > > In org.apache.nutch.indexer.Indexer.class line 104 > > writer.addDocum

lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jack Tang
Hi All, I am wondering whether the Analyzer of nutch in svn trunk is chosen by the languageidentifier plugin or not. (I know in nutch 0.7.1-dev it was.) In org.apache.nutch.indexer.Indexer.class line 104 writer.addDocument((Document)((ObjectWritable)value).get()); It should be NutchAnalyzer analyzer = AnalyzerF

Where is org.apache.nutch.protocol.http.api.HttpBase?

2006-01-12 Thread Jack Tang
Hi Guys, I have just updated the source code to the svn head version. However, I cannot find the org.apache.nutch.protocol.http.api.HttpBase class. Did you miss it? Thanks /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

PluginManifestParser should be NutchConfigurable

2006-01-11 Thread Jack Tang
Hi, I think it is reasonable that PluginManifestParser should implement the NutchConfigurable interface. As the NutchConfigurable interface describes, PluginManifestParser needs a NutchConf. /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

XmlInputFormat?

2006-01-10 Thread Jack Tang
Hi, I am going to feed the nutch-0.8-dev crawler with seeds in xml format, and I have read nutch's TextInputFormat/InputFormatBase. It seems nutch currently breaks plain text files into chars and parses them. My question is how to support an XmlInputFormat; in my view, xml format is not character-based but

Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej, the idea brings vertical search into nutch and it is definitely great :) I think nutch should add an information retrieval layer to the whole architecture and export some abstract interfaces, say UrlBasedInformationRetrieve (you can implement your url grouping idea here?), TextBasedInformat

Re: Per-page crawling policy

2006-01-05 Thread Jack Tang
BTW: if nutch is going to support vertical searching, I think page urls should be grouped into three types: fetchable url (just fetch it), extractable url (fetch it and extract information from the page) and pagination url. /Jack On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote: > H
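The three-way grouping suggested above could be sketched as an enum plus a classifier. Everything here is hypothetical: nutch has no such class, and the two regular expressions describe an imaginary site whose pagination links carry a page= parameter and whose detail pages end in /item/<number>.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the three url groups; not a nutch API.
class UrlGrouping {
    enum UrlType { FETCHABLE, EXTRACTABLE, PAGINATION }

    // Assumed site-specific patterns; a real deployment would configure these per site.
    private static final Pattern PAGINATION = Pattern.compile("[?&]page=\\d+");
    private static final Pattern EXTRACTABLE = Pattern.compile("/item/\\d+$");

    static UrlType classify(String url) {
        if (PAGINATION.matcher(url).find()) {
            return UrlType.PAGINATION;   // follow it to reach further listing pages
        }
        if (EXTRACTABLE.matcher(url).find()) {
            return UrlType.EXTRACTABLE;  // fetch it and run information extraction
        }
        return UrlType.FETCHABLE;        // just fetch it
    }
}
```

Checking the pagination pattern first is a design choice for this sketch: a listing url that also looks like a detail page is treated as navigation rather than as an extraction target.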

Re: nutch and google suggestion

2005-12-20 Thread Jack Tang
> (sorted for query frequency). > > Am 20.12.2005 um 10:29 schrieb Jack Tang: > > > Hi Guys > > > > Is it possible to dump suggestion list from nutch index in order to > > implement ajax auto-complete? > > > > Google suggestion: http://www.google.co

nutch and google suggestion

2005-12-20 Thread Jack Tang
Hi Guys, is it possible to dump a suggestion list from the nutch index in order to implement ajax auto-complete? Google suggestion: http://www.google.com/webhp?complete=1&hl=en Regards /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

Re: Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

2005-12-12 Thread Jack Tang
ce the current live index with > the new copy. > > Good luck, > Fredrik > > > On 12/12/05, Jack Tang <[EMAIL PROTECTED]> wrote: > > Hi > > > > The approach is great for one single query field. How about multiple fields? > > Say I want to do some recomme

Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

2005-12-11 Thread Jack Tang
, and pick the most frequent > query for suggestion. > > Fredrik > > On 9/29/05, Jack Tang <[EMAIL PROTECTED]> wrote: > > Hi > > > > I really like Google's "Did you mean" and I notice that nutch currently > > does not provide this func

Re: [jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Jack Tang
Stefan, it seems your patch is missing the org.apache.nutch.protocol.ContentProperties class, right? /Jack On 12/10/05, Stefan Groschupf (JIRA) <[EMAIL PROTECTED]> wrote: > [ > http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ] > > Stefan Groschupf commented on NUTCH-13

parse.getData().getMetadata().get("propName") is NULL?

2005-12-09 Thread Jack Tang
Hi, I am going to standardize some fields which I store in my parser plugin. But I found that sometimes parse.getData().getMetadata().get("propertyName") is NULL. In fact, when I stepped through the source code, the value of propertyName was not NULL. So can someone explain this? Thanks /Jack -- Keep

Re: Nutch 0.8 update issue

2005-12-07 Thread Jack Tang
Guys, my fault! I forgot to copy the segments dir. Sorry for that. Please ignore this message. /Jack On 12/8/05, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi All > > I have just updated my nutch from 0.7 to 0.8-dev (svn version) and come > across one question on searcher. > > I

Nutch 0.8 update issue

2005-12-07 Thread Jack Tang
Hi All, I have just updated my nutch from 0.7 to 0.8-dev (svn version) and have come across one question on the searcher. I wrote my own indexer and searcher based on nutch-0.7 and they both worked fine. Unfortunately, the searcher fails in nutch-0.8-dev. Here are the exceptions: Total hits: 26 Ex

Re: NDFS Connection reset

2005-12-06 Thread Jack Tang
aning? The id of the DataNode? Why would the socket connection be reset? Thanks /Jack On 12/6/05, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi > > I checked out latest source code from svn, and played NDFS according > the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem).

NDFS Connection reset

2005-12-05 Thread Jack Tang
Hi, I checked out the latest source code from svn and tried NDFS according to the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem). I tested my NDFS using TestClient. Oddly, whenever I entered a command, the NameNode threw an exception: 051206 003714 Server connection on

Re: incremental crawling

2005-12-01 Thread Jack Tang
Hi Doug 1. How should "dead urls" be dealt with? If a url is removed after nutch's first crawl, should nutch keep the "dead url" and never fetch it again? 2. Should nutch export dedup as an extension point? In my project, we add an information extraction layer to nutch, and I think it is a good idea to export d

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Doug On 11/10/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > Below is google architecture in my brain: > > > > DataNode A > > Master DataNode B GoogleCrawler > > DataNode C > >

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Thanks for your explanation, Andrzej. I am going to read some of the NFS source code and ask smarter questions later. Thanks again. Regards /Jack On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > >Hi Andrzej > > > >In document, Michael sa

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
ied to DataNode B and C. Comments? Regards /Jack On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > >Hi Stefan > > > >Deleting is totally OK if there are NO references to the chunks (segments). > >Also, will the master balance th

Re: Index update and Google Dance

2005-11-08 Thread Jack Tang
ch new segments. > Stefan > > Am 08.11.2005 um 18:38 schrieb Jack Tang: > > > Hi > > > > I read the GFS document and the NFS document on the wiki. One interesting > > question here: does NFS support updating the index on the fly? > > > > As you know, google updat

Index update and Google Dance

2005-11-08 Thread Jack Tang
Hi, I read the GFS document and the NFS document on the wiki. One interesting question here: does NFS support updating the index on the fly? As you know, Google updates its index via the Google dance. It is said that the replicator in GFS places three copies of each chunk on different datanodes. During index updating, the

[jira] Created: (NUTCH-104) Nutch query parser does not support CJK bi-gram segmentation.

2005-10-05 Thread Jack Tang (JIRA)
Environment: all Reporter: Jack Tang Priority: Minor I customized a query filter using "test" as my field. When I try to search "test:(c1)(c2)(c3)", the query object generated by NutchAnalysis is wrong. Now the result is test:(c1)(c2) [DEFAUL

[jira] Commented: (NUTCH-36) Chinese in Nutch

2005-10-05 Thread Jack Tang (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ] Jack Tang commented on NUTCH-36: Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Say here is one Chinese string (c1)(c2)(c

Re: Nutch Suggestion? (Google like "did you mean")

2005-09-29 Thread Jack Tang
oach would be to keep a Lucene index with |query,frequency| > tuples (updated nightly, weekly, or whatever), and simply search this index > with a FuzzyQuery with some defined similarity, and pick the most frequent > query for suggestion. > > Fredrik > > On 9/29/05, Jack Tang
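The idea quoted above (keep query/frequency tuples, find entries similar to the user's input, and pick the most frequent one) can be sketched without Lucene. This is only an illustration under stated assumptions: an in-memory map stands in for the Lucene index, a plain Levenshtein distance with a maxEdits threshold stands in for FuzzyQuery and its similarity setting, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a Lucene index of (query, frequency) tuples.
class QuerySuggester {
    private final Map<String, Integer> frequencyByQuery = new LinkedHashMap<>();

    // Adds to the logged frequency of a query.
    void record(String query, int frequency) {
        frequencyByQuery.merge(query, frequency, Integer::sum);
    }

    // Classic Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Most frequent logged query within maxEdits of the input (the input itself
    // is skipped), or null when nothing is close enough.
    String suggest(String input, int maxEdits) {
        String best = null;
        int bestFrequency = -1;
        for (Map.Entry<String, Integer> e : frequencyByQuery.entrySet()) {
            if (e.getKey().equals(input)) {
                continue;
            }
            if (editDistance(input, e.getKey()) <= maxEdits
                    && e.getValue() > bestFrequency) {
                best = e.getKey();
                bestFrequency = e.getValue();
            }
        }
        return best;
    }
}
```

A linear scan like this only suits a small query log; the point of keeping the tuples in a Lucene index, as suggested in the thread, is to avoid comparing the input against every logged query.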

Nutch Suggestion? (Google like "did you mean")

2005-09-29 Thread Jack Tang
Hi, I really like Google's "Did you mean" and I notice that nutch currently does not provide this function. In this article, http://today.java.net/lpt/a/211, author Tim White implemented suggestions using n-grams to generate a suggestion index. Do you think it is good for nutch? I mean index in nutch will

Re: what contributes to fetch slowing down

2005-09-28 Thread Jack Tang
Hi AJ, my guess is the growing number of threads. You can show the thread id in the log. I think that makes sense. Regards /Jack On 9/29/05, AJ Chen <[EMAIL PROTECTED]> wrote: > I started the crawler with about 2000 sites. The fetcher could achieve > 7 pages/sec initially, but the performance gradually dropped

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

2005-09-27 Thread Jack Tang
input_stream.backup(1); > + } > +} > + > +if(cjkToken == null || cjkToken.termText().equals("")) { > + cjkTokenizer = null; > + cjkStartOffset = 0; > +} > + } > > > > Chinese in Nutch > > > >

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread Jack Tang
Hi EM On 9/26/05, EM <[EMAIL PROTECTED]> wrote: > > >> > >>I know that if you are big user (several dedicated machines in a data > >>center with fast connection...) you probably don't care about this, your > >>crawler will run over any website, with 50-500 threads the default three > >>retry times

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread Jack Tang
itions like you meet. > > > >I think crawling a dynamic page is a black hole for a crawler. > > > >We cannot get all the necessary parameters that need to be posted to a form, > > > >and to fetch dynamic pages, we need to identify duplicate pages. > > > >2

Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread Jack Tang
Hi Guys, I know it is a difficult question for a crawler, and I just want to know whether it is possible with nutch's crawler. The page structure of the website I want to crawl is like this -> Page 1

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

2005-09-22 Thread Jack Tang
Hi Kerang, I have tested the query, and there is no problem in summary highlighting. It is really amazing. It's the solution for Chinese bi-gram segmentation. Regards /Jack On 9/22/05, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi Kerang > > Pretty nice hack! > I will test highlight in query s

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

2005-09-22 Thread Jack Tang
> > Key: NUTCH-36 > > URL: http://issues.apache.org/jira/browse/NUTCH-36 > > Project: Nutch > > Type: Improvement > > Components: indexer, searcher > > Environment: all > > Reporter: Jack Tang > > Priority: Mino

hyperbolic browser api (I missed)

2005-09-21 Thread Jack Tang
Hi Nutchers, I hope this email is noise in this community. I am now working on something like a hyperbolic browser ( http://www.acm.org/sigchi/chi96/proceedings/videos/Lamping/hb-video.html ). I remember that there were some APIs written in Java. I found them by clicking the blog address in email

Incremental Crawling / Revisiting Pages

2005-09-13 Thread Jack Tang
Hi, there is a wonderful discussion on the Heritrix mailing list. I cannot help forwarding some of it here, and I hope it helps nutch - Dennis Hotson wrote: > I'm just wondering whether anyone has wri

Re: crawling protected pages

2005-09-12 Thread Jack Tang
Hi Andrzej, there is an HttpAuthenticationFactory class in the protocol-httpclient plugin, but I doubt whether RFC 2617 basic authentication works. I cannot see any reference to the HttpAuthenticationFactory class. Did I miss something? Regards /Jack On 9/13/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Re: RSS Parser Bug!?

2005-09-08 Thread Jack Tang
de of each plugin for > instance. Because that way, I believe you can customize whatever plugin to do > whatever your need is, * without * having to recompile the code just to add > another accepted content type to a plugin so it doesn't throw an error > message. > > What

RSS Parser Bug!?

2005-09-07 Thread Jack Tang
Hi Guys, has anyone installed parse-rss and tried to fetch rss feeds? It failed on my side. I enabled the plugin and it fetched, but the rss parser did not work. My feed is http://www.craigslist.org/evs/index.rss Here is the error: org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but can

Re: "db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread Jack Tang
Thanks Chen, I will try that:) On 9/8/05, AJ Chen <[EMAIL PROTECTED]> wrote: > Jack, > Set the max to 100, but run 10 cycles (i.e., depth=10) with the > CrawlTool. You may see all the outlinks are collected toward the end. 3 > cycles is usually not enough. > -AJ > >

Re: "db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread Jack Tang
> Am 07.09.2005 um 18:43 schrieb Jack Tang: > > > Hi All > > > > Here is the "db.max.outlinks.per.page" property and its description in > > nutch-default.xml > > > > db.max.outlinks.per.page > > 100 > > The maximum n

Re: "db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread Jack Tang
l of the outlinks are processed for a page, the > db.max.outlinks.per.page must be set to a number that is larger than the > number of outlinks on the page. If this is true, then the max number > has to be determined in real time since the number of outlinks varies > from page to page.

"db.max.outlinks.per.page" is misunderstood?

2005-09-07 Thread Jack Tang
Hi All Here is the "db.max.outlinks.per.page" property and its description in nutch-default.xml db.max.outlinks.per.page 100 The maximum number of outlinks that we'll process for a page. I don't think the description is right. Say, my cra

Re: Nutch crawler is breadth-first ?

2005-09-07 Thread Jack Tang
Hi, I found the reason. The maximum number of outlinks that nutch will process for a page is 100, and the website contains more than 300 URLs in the page. Now everything is ok. /Jack On 9/7/05, Jack Tang <[EMAIL PROTECTED]> wrote: > Hi Andrzej > > First of all, thanks
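The effect described above can be pictured with a small sketch: with a limit of 100 and a page carrying 300 links, 200 links never reach the next crawl round. This is not nutch code; the class name is hypothetical, and keeping the first links in document order is an assumption about how the limit is applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a per-page outlink limit such as db.max.outlinks.per.page.
class OutlinkLimit {
    // Keeps at most maxOutlinksPerPage links, assuming the first ones in page order win.
    static List<String> limitOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        int kept = Math.max(0, Math.min(outlinks.size(), maxOutlinksPerPage));
        return new ArrayList<>(outlinks.subList(0, kept));
    }

    // Number of links silently dropped by the limit.
    static int dropped(List<String> outlinks, int maxOutlinksPerPage) {
        return outlinks.size() - limitOutlinks(outlinks, maxOutlinksPerPage).size();
    }
}
```

Raising the limit above the largest expected number of links per page, as was done in this thread, makes the dropped count zero for such pages.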

Re: Nutch crawler is breadth-first ?

2005-09-07 Thread Jack Tang
Hi Andrzej First of all, thanks for your quick response. On 9/7/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > Hi All > > > > Is nutch crawler breadth-first one? It seems a lot of URLs are lost > > while I try do breadth-first crawli

Nutch crawler is breadth-first ?

2005-09-07 Thread Jack Tang
Hi All, is the nutch crawler a breadth-first one? It seems a lot of URLs are lost when I try to do breadth-first crawling with the depth set to 3. Any comments? Regards /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

HttpAuthentication in protocol-httpclient plugin

2005-08-29 Thread Jack Tang
Hi Nutchers, I am going to add an HttpFormAuthentication class to the protocol-httpclient plugin. I hope I do not duplicate your work. Before starting, I read the code of HttpBasicAuthentication, and it is clear. But there is something I cannot understand. HttpAuthenticationFactory is the factory which provides

Re: dump nutch index

2005-08-21 Thread Jack Tang
p/file"(the protocol) keywords using NutchBean, I guess it will dump all index;), right? > thanks, > > Michael Ji > Regards /Jack > --- Jack Tang <[EMAIL PROTECTED]> wrote: > > > Hi Michael > > > > Is "segread" nutch command what you wanna? >

Re: dump nutch index

2005-08-21 Thread Jack Tang
ry powerful tool. > > But I wonder if I can output the content of the > individual files in index dir to a text format, means, > I can see the each text saved in index files without > interpreting by Lukeall. > > thanks, > > Michael Ji > > --- Jack Tang <[EM

Re: dump nutch index

2005-08-21 Thread Jack Tang
Hi Michael Hope luke helps you. http://www.getopt.org/luke/ Regards /Jack On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote: > hi there, > > Is there a easy way that I could dump nutch index to a > human-readable format? > > thanks, > > Michael Ji > > > > ___

Re: Parse-html should be enhanced!

2005-08-18 Thread Jack Tang
g process, first to index HTML tags and > find similarities (like as usual header, footer, Options, Menu, (c), > etc.), then to use second parsing and second indexing - to index only > unique text. > > -Fuad > > > -Original Message- > From: Jack Tang [mailto:

Re: Parse-html should be enhanced!

2005-08-18 Thread Jack Tang
similar HTML, and I need only > subset. > > Also, I need to find a point in Nutch where I can replace Analyzer with > my own "non-analyzer"; I don't need to remove stop-words etc. > > I'd like to use Lucene as a database too... To perform a lot of queries, >

Parse-html should be enhanced!

2005-08-18 Thread Jack Tang
Hi Nutchers I think parse-html parse should be enhanced. In some of my projects(Intranet search engine), we only need the content in the specified detectors and filter the junk, say the content between and or some detectors like XPath. Any thoughts on this enhancement? Regards /Jack -- Keep D

Re: Information extraction

2005-07-26 Thread Jack Tang
smarc/ > > On these websites, there are several documents that maybe useful. I don't > think they will release the source code. > > > Regards, > > Cuong Hoang > -Original Message- > From: Jack Tang [mailto:[EMAIL PROTECTED] > Sent: Tuesday, 26

Re: Information extraction

2005-07-26 Thread Jack Tang
Hi Cuong. I am going to build a private book search engine, and I am facing the same problem. Could you describe more about the information you want to extract and the website? Regards /Jack On 7/26/05, Cuong Hoang <[EMAIL PROTECTED]> wrote: > Hi all, > > > > Does anyone have experience with desi

Re: Nutch's intranet VS internet crawling

2005-07-24 Thread Jack Tang
Hi Michael, I think some information in intranet crawling is private, while internet content is public. /Jack On 7/24/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote: > > I wonder if there is any difference between these two? > Or intranet crawling must indicate an intranet site > explicitly in crawl-url

Re: NutchAnalysis and CJK

2005-07-21 Thread Jack Tang
( my server: 4G > mem and 4 cpu and the index file size about: 2.2G ) > > >so , recent days, I am strive to solve the above 2 questions. > >good luck > if you are chinese , we could use chinese for further exchange.. > > 2005/7/19, Jack

Re: NutchAnalysis and CJK

2005-07-19 Thread Jack Tang
Hi Transbuerg Could you please describe your solution in detail? Appreciate your time. Regards /Jack On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote: > hi, > Jack Tang > > I have the same condition with u , could you share your total >

Re: NutchAnalysis and CJK

2005-07-19 Thread Jack Tang
Hi ShiBin, thanks for your post. I have known weblucene since 2003. It was said that weblucene used FMM (dictionary-based) segmentation, but I have found nothing in weblucene cvs until now. I hope you can understand my mail: it is not difficult to make a cjk-plugin available, and I wanna Nutc

Re: Nutch and cluster search result

2005-07-18 Thread Jack Tang
experiences more? Thanks Regards /Jack On 7/17/05, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: > On 7/17/05, Jack Tang <[EMAIL PROTECTED]> wrote: > > Hi nutch guys > > > > Is it impossible cluster nutch search result and other search engine's > > on t

Nutch and cluster search result

2005-07-16 Thread Jack Tang
Hi nutch guys, is it impossible to cluster nutch search results and other search engines' results on the fly? You can find some info here: http://blogs.msdn.com/msnsearch/archive/2005/04/13/407939.aspx. I appreciate your time and comments. Regards /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

NutchAnalysis and CJK

2005-07-14 Thread Jack Tang
Hi All, I have spent a long time thinking about embedding an improved CJKAnalysis into NutchAnalysis. I got nothing but some failures, which I share with you; maybe you can hack it (well, I am not going to give up). I have written several Chinese word segmenters, some are dictionary based,

Re: hi all

2005-07-11 Thread Jack Tang
Hi Bin, the simplest way is to create cjk-index-basic and cjk-query-basic plugins and replace index-basic and query-basic with them. Creating them is quite simple: you can use CJKTokenizer and CJKAnalyzer from the Lucene project. And take care of the query syntax characters in nutch. // query syntax c
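For readers unfamiliar with bi-gram segmentation, which is the approach discussed in the CJK threads above, here is a minimal sketch of the idea: every run of CJK characters is split into overlapping two-character tokens. It is an illustration only, not Lucene's CJKTokenizer; the class name is hypothetical, only a few Unicode blocks are treated as CJK, and non-CJK text is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of overlapping bi-gram segmentation for CJK runs.
class BigramSketch {
    // Assumption: only these Unicode blocks count as CJK in this sketch.
    static boolean isCjk(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // Returns the bi-gram tokens of every CJK run in the text; other text is ignored.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int runStart = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean cjk = i < text.length() && isCjk(text.charAt(i));
            if (cjk && runStart < 0) {
                runStart = i;                      // a new CJK run begins
            }
            if (!cjk && runStart >= 0) {
                emitRun(text, runStart, i, tokens); // the run ended just before i
                runStart = -1;
            }
        }
        return tokens;
    }

    // A single character becomes one token; longer runs become overlapping pairs.
    private static void emitRun(String text, int start, int end, List<String> out) {
        if (end - start == 1) {
            out.add(text.substring(start, end));
            return;
        }
        for (int i = start; i + 1 < end; i++) {
            out.add(text.substring(i, i + 2));
        }
    }
}
```

A run of three characters (c1)(c2)(c3) yields the tokens (c1)(c2) and (c2)(c3), so a query for any adjacent pair can match without a dictionary, at the cost of a larger index.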