Hi list,
I am now facing a problem in scientific computing.
There are more than 5 GB of data (mainly matrices/vectors) that we collected
for some surveys, and now we plan to do some data mining on them. And
honestly, I do not know Hadoop/MapReduce very well. The question
probably seems quite simple to you experts.
Hi all
Regular expressions are widely used in Nutch, and I am now wondering
whether the JDK/Jakarta regex classes are fast enough.
Here is a benchmark I found on the web:
http://tusker.org/regex/regex_benchmark.html
It seems dk.brics.automaton.RegExp is the fastest among those libraries.
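For what it's worth, whichever library wins the benchmark, compiling the pattern once and reusing it usually matters just as much. A minimal java.util.regex sketch; the URL-filter style pattern here is invented for illustration, not taken from Nutch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: precompile a java.util.regex Pattern once and reuse it,
// which often matters as much as which regex library is chosen.
public class RegexDemo {
    // Hypothetical URL-filter style pattern: spot common image suffixes.
    private static final Pattern IMAGE_SUFFIX =
        Pattern.compile("\\.(gif|jpg|png)$", Pattern.CASE_INSENSITIVE);

    public static boolean isImageUrl(String url) {
        Matcher m = IMAGE_SUFFIX.matcher(url);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(isImageUrl("http://example.com/logo.PNG")); // true
        System.out.println(isImageUrl("http://example.com/index.html")); // false
    }
}
```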
/Jack
--
Keep Discovering ... ...
http://www.jr
Hi
How do we avoid duplicate content?
1. Mirror sites: one website, two domains.
2. Confusing the bot: dynamic URLs. As robots crawl dynamic content,
the site may return different URLs for the same content...
3. Print-friendly pages?
Will Nutch enhance the dedup code?
/Jack
--
Keep Discovering ..
Yes, you're right :) I found the answer.
Thanks.
On 2/24/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Isn't HitDetails.length == hitsPerPage?
> This happens in search.jsp.
>
>
> Am 24.02.2006 um 03:09 schrieb Jack Tang:
>
> > I dont think s
r result page.
> Does this answer the question?
>
> Am 24.02.2006 um 02:51 schrieb Jack Tang:
>
> > Hi Stefan
> >
> > Can you explain a little more? I mean I cannot find some evidence in
> > the source code...
> > Thanks
> >
> > /Jack
> >
> &g
06 um 02:45 schrieb Jack Tang:
>
> > On 2/23/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
> >> Jack Tang wrote:
> >>> In FetchedSegments class, below code shows how to get the hit
> >>> summaries.
> >>>
> >>&
On 2/23/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > In FetchedSegments class, below code shows how to get the hit summaries.
> >
> > public String[] getSummary(HitDetails[] details, Query query)
> > throws IOException {
>
Hi All
I don't know whether Nutch will support only JDK 1.5, or both JDK 1.4 and
1.5, in the future. If the former, would it be better to adopt the JDK 1.5
concurrency framework for threads (say, the fetcher and summary threads)?
And here is an IBM tutorial on the new classes in Tiger.
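To make the idea concrete, here is a minimal sketch of what per-hit summary work could look like on top of java.util.concurrent instead of one hand-rolled Thread per hit. All class and method names here are illustrative stand-ins, not Nutch APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: run per-hit "summary" tasks on a fixed thread pool instead of
// creating one Thread per hit. Names are invented for illustration.
public class SummaryPool {
    public static String[] summarize(String[] details, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String d : details) {
                futures.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return "summary of " + d; // stand-in for real summarizing
                    }
                }));
            }
            String[] out = new String[details.length];
            for (int i = 0; i < out.length; i++) {
                out[i] = futures.get(i).get(); // propagates task exceptions
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] s = summarize(new String[] {"a", "b"}, 2);
        System.out.println(s[0]); // "summary of a"
    }
}
```

The pool bounds the number of live threads, which a Thread-per-hit design does not.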
/Jack
--
Keep Discovering ... ...
http:
Hi
Can someone explain the original design?
I suggest refactoring the API (FetchedSegments.class) to:
public String[] getSummary(HitDetails[] details, int hitStart, int
hitEnd, Query query) {
}
Does this make sense?
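A minimal sketch of the subrange idea behind that signature, with a plain String[] standing in for HitDetails[] and a dummy summary step:

```java
// Sketch of the subrange idea behind a hypothetical
// getSummary(details, hitStart, hitEnd, query): summarize only the hits
// on the current result page instead of all of them.
public class SummaryRange {
    public static String[] summaryRange(String[] details, int hitStart, int hitEnd) {
        // clamp the requested window to the available hits
        int end = Math.min(hitEnd, details.length);
        int start = Math.max(0, Math.min(hitStart, end));
        String[] out = new String[end - start];
        for (int i = start; i < end; i++) {
            out[i - start] = "summary of " + details[i]; // stand-in work
        }
        return out;
    }
}
```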
/Jack
On 2/20/06, Jack Tang <[EMAIL PROTECTED]> wrote:
>
Hi Guys
In the FetchedSegments class, the code below shows how the hit summaries are obtained.
public String[] getSummary(HitDetails[] details, Query query)
throws IOException {
SummaryThread[] threads = new SummaryThread[details.length];
for (int i = 0; i < threads.length; i++) {
threads[i
Hi All
Now Nutch only supports highlighting in the "content" field. Any suggestions
for enabling multi-field highlighting? Say, some hits in anchor text and URLs
(like Google), etc. The simplest (but naive) way I know is to get the
HitDetails first and then invoke the summarizer threads; any smarter ideas?
Thanks.
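To illustrate the naive per-field approach, a toy sketch: wrap the query term in `<b>` tags in each stored field. This is not the Lucene/Nutch highlighter, only the loop over fields; real highlighting would work from analyzed tokens:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Naive sketch of multi-field highlighting: wrap a query term in <b> tags
// in each stored field (content, anchor, url). For illustration only.
public class MultiFieldHighlight {
    public static Map<String, String> highlight(Map<String, String> fields, String term) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // literal replacement; a real highlighter matches analyzed tokens
            out.put(e.getKey(), e.getValue().replace(term, "<b>" + term + "</b>"));
        }
        return out;
    }
}
```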
/Jac
On 2/9/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Folks,
>
> I hope and it looks like we are close to get meta data support for
> crawlDatum (CrawlDB) into the sources soon.
> At this point we can store and read but not 'process' (means creation
> or inheritance etc. [some one knows a bet
Hi
Is it reasonable to guess language information from the target server's geographical information?
/Jack
On 1/23/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > Any plan to implement this ? I mean move LanguageIdentifier class
> > intto nutch core.
>
> As I already suggested it on this list, I really would l
Hi Jérôme
On 1/21/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > I am wondering Analyzer of nutch in svn trunk is chosen by
> > languageidentifer plugin or not? (I knew in nutch 0.7.1-dev it did).
>
> It's not really chosen by the languageidentifier, but chosen regarding the
> value of the lan
On 1/21/06, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi All
>
> I am wondering Analyzer of nutch in svn trunk is chosen by
> languageidentifer plugin or not? (I knew in nutch 0.7.1-dev it did).
>
> In org.apache.nutch.indexer.Indexer.class line 104
>
> writer.addDocum
Hi All
I am wondering whether the Nutch Analyzer in svn trunk is chosen by the
languageidentifier plugin or not. (I know it was in nutch 0.7.1-dev.)
In org.apache.nutch.indexer.Indexer.class line 104
writer.addDocument((Document)((ObjectWritable)value).get());
It should be
NutchAnalyzer analyzer = AnalyzerF
Hi Guys
I updated the source code from svn head just now. However, I cannot
find the org.apache.nutch.protocol.http.api.HttpBase class. Is it
missing?
Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi
I think it is reasonable for PluginManifestParser to implement the
NutchConfigurable interface. As the NutchConfigurable interface
describes, PluginManifestParser needs a NutchConf.
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi
I am going to feed the nutch-0.8-dev crawler with seeds in XML format, and
I have read Nutch's TextInputFormat/InputFormatBase. It seems Nutch
currently breaks plain-text files into characters and parses them. My
question is how to support an XmlInputFormat; in my view, the XML format is
not character-based but
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information-retrieval layer into the whole
architecture and export some abstract interfaces, say
UrlBasedInformationRetrieve (you could implement your URL-grouping idea
here?), TextBasedInformat
BTW: if Nutch is going to support vertical search, I think page
URLs should be grouped into three types: fetchable URLs (just fetch them),
extractable URLs (fetch them and extract information from the page), and
pagination URLs.
/Jack
On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote:
> H
gt; (sorted for query frequency).
>
> Am 20.12.2005 um 10:29 schrieb Jack Tang:
>
> > Hi Guys
> >
> > Is it possible to dump suggestion list from nutch index in order to
> > implement ajax auto-complete?
> >
> > Google suggestion: http://www.google.co
Hi Guys
Is it possible to dump a suggestion list from the Nutch index in order to
implement AJAX auto-complete?
Google suggestion: http://www.google.com/webhp?complete=1&hl=en
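One way such a dumped list could serve prefix lookups is a sorted query-to-frequency map. This sketch assumes the (query, frequency) dump already exists, which is not a current Nutch feature:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: prefix lookup over a sorted query->frequency map for
// auto-complete. The map contents are assumed to come from an offline
// dump of popular queries.
public class Suggest {
    public static List<String> suggest(TreeMap<String, Integer> freq, String prefix, int max) {
        List<String> out = new ArrayList<String>();
        // tailMap starts at the first key >= prefix; stop once keys no
        // longer share the prefix (TreeMap keeps keys sorted).
        for (Map.Entry<String, Integer> e : freq.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix) || out.size() >= max) break;
            out.add(e.getKey());
        }
        return out;
    }
}
```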
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
ce the current live index with
> the new copy.
>
> Good luck,
> Fredrik
>
>
> On 12/12/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > The approach is great for one sigle query field. How about multi-fields?
> > Say I want do some recomme
, and pick the most frequent
> query for suggestion.
>
> Fredrik
>
> On 9/29/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > I am very like Google's "Did you mean" and I notice that nutch now
> > does not provider this func
Stefan
It seems your patch is missing the
org.apache.nutch.protocol.ContentProperties class, right?
/Jack
On 12/10/05, Stefan Groschupf (JIRA) <[EMAIL PROTECTED]> wrote:
> [
> http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ]
>
> Stefan Groschupf commented on NUTCH-13
Hi
I am going to standardize some fields that I store in my parser
plugin. But I found that sometimes
parse.getData().getMetadata().get("propertyName") is null, even though
when I stepped through the source code, the value of propertyName was not
null.
Can someone explain this? Thanks
/Jack
--
Keep
Guys
My fault! I missed copying the segments dir. Sorry for that. Please ignore
this message.
/Jack
On 12/8/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi All
>
> Currently I update my nutch from 0.7 to 0.8-dev (svn version) and come
> across one question on searcher.
>
> I
Hi All
I recently updated my Nutch from 0.7 to 0.8-dev (svn version) and came
across a question on the searcher.
I wrote my own indexer and searcher based on nutch-0.7 and they both
worked fine. However, without luck, the searcher fails in
nutch-0.8-dev. Here are the exceptions:
Total hits: 26
Ex
aning? the ID of the DataNode? Why does the socket connection get reset?
Thanks
/Jack
On 12/6/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi
>
> I checked out latest source code from svn, and played NDFS according
> the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem).
Hi
I checked out the latest source code from svn and played with NDFS according
to the tutorial (http://wiki.apache.org/nutch/NutchDistributedFileSystem).
I tested my NDFS using TestClient. It was odd that whenever I entered
a command, the NameNode would throw an exception:
051206 003714 Server connection on
Hi Doug
1. How should we deal with "dead URLs"? If I remove a URL after Nutch's
first crawl, should Nutch keep the "dead URLs" and never fetch them
again?
2. Should Nutch export dedup as an extension point? In my project, we
add an information-extraction layer to Nutch, and I think it is a good idea
to export d
Hi Doug
On 11/10/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > Below is google architecture in my brain:
> >
> > DataNode A
> > Master DataNode B GoogleCrawler
> > DataNode C
> >
Thanks for your explanation, Andrzej.
I am going to read some NFS source code and ask smarter questions later.
Thanks again.
Regards
/Jack
On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
>
> >Hi Andrzej
> >
> >In document, Michael sa
ied to DataNode B and C.
Comments?
Regards
/Jack
On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
>
> >Hi Stefan
> >
> >Deleting is totally OK if there is NO references to the chunks(segments).
> >Also, Will master balance th
ch new segments.
> Stefan
>
> Am 08.11.2005 um 18:38 schrieb Jack Tang:
>
> > Hi
> >
> > I read GFS document and NFS document on the wiki. One interesting
> > question here: does NFS support updating index on the fly?
> >
> > As you known, google updat
Hi
I read the GFS document and the NFS document on the wiki. One interesting
question: does NFS support updating the index on the fly?
As you know, Google updates its index via the Google dance. It is said
that the replicator in GFS places three copies of chunks on different
datanodes. During index updating, the
Environment: all
Reporter: Jack Tang
Priority: Minor
I customized one query filter using "test" as my field. When I try to
search "test:(c1)(c2)(c3)", the query object generated by
NutchAnalysis is wrong. Now the result is
test:(c1)(c2) [DEFAUL
[
http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ]
Jack Tang commented on NUTCH-36:
Kerang Lv's solution did well in NutchAnalysis, but there are still some bugs in
Summarizer. Say here is one Chinese string (c1)(c2)(c
oach would be to keep a Lucene index with |query,frequency|
> tuples (updated nightly, weekly, or whatever), and simply search this index
> with a FuzzyQuery with some defined similarity, and pick the most frequent
> query for suggestion.
>
> Fredrik
>
> On 9/29/05, Jack Tang
Hi
I really like Google's "Did you mean" feature, and I notice that Nutch
does not provide this function.
In this article, http://today.java.net/lpt/a/211, author Tim White
implemented suggestions using n-grams to generate a suggestion index. Do
you think this is good for Nutch? I mean, the index in Nutch will
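A toy sketch of the n-gram step behind that approach (gram extraction plus a crude overlap score; this is my illustration, not Tim White's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the n-gram trick behind "Did you mean": index each dictionary
// word by its character n-grams, so a misspelling still shares most grams
// with the intended word. Only gram extraction and a crude score are shown.
public class NGrams {
    public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // crude similarity: fraction of a's grams also found among b's grams
    public static double overlap(String a, String b, int n) {
        List<String> ga = ngrams(a, n);
        List<String> gb = ngrams(b, n);
        if (ga.isEmpty()) return 0.0;
        int hits = 0;
        for (String g : ga) {
            if (gb.contains(g)) hits++;
        }
        return hits / (double) ga.size();
    }
}
```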
Hi AJ
My guess is that the number of threads is growing.
You could show the thread ID in the log; I think that makes sense.
Regards
/Jack
On 9/29/05, AJ Chen <[EMAIL PROTECTED]> wrote:
> I started the crawler with about 2000 sites. The fetcher could achieve
> 7 pages/sec initially, but the performance gradually dropped
input_stream.backup(1);
> + }
> +}
> +
> +if(cjkToken == null || cjkToken.termText().equals("")) {
> + cjkTokenizer = null;
> + cjkStartOffset = 0;
> +}
> + }
>
>
> > Chinese in Nutch
> >
> >
Hi EM
On 9/26/05, EM <[EMAIL PROTECTED]> wrote:
>
> >>
> >>I know that if you are big user (several dedicated machines in a data
> >>center with fast connection...) you probably don't care about this, your
> >>crawler will run over any website, with 50-500 threads the default three
> >>retry times
itions like you meet.
> >
> >I think to crawle a dynamic page is black hole for crawler.
> >
> >we could not get all necessary parameters which need to post to a form .
> >
> >and to fetch dynamic page , we need to identify the duplicate page.
> >
> >2
Hi Guys
I know it is a difficult question for a crawler, and I just want to
know whether it is possible with Nutch's crawler.
The page structure of the website I want to crawl is like this:
-> Page 1
Hi Kerang
I have tested the query; no problem with the summary highlight. It is really
amazing. It's the solution for Chinese bi-gram segmentation.
Regards
/Jack
On 9/22/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Kerang
>
> Pretty nice hack!
> I will test highlight in query s
> > Key: NUTCH-36
> > URL: http://issues.apache.org/jira/browse/NUTCH-36
> > Project: Nutch
> > Type: Improvement
> > Components: indexer, searcher
> > Environment: all
> > Reporter: Jack Tang
> > Priority: Mino
Hi Nutchers
I hope this email is not noise in this community. I am now working on
something like a hyperbolic browser (
http://www.acm.org/sigchi/chi96/proceedings/videos/Lamping/hb-video.html
). I remembered that there were some APIs written in Java; I found
them by clicking the blog address in an email
Hi
There is a wonderful discussion on the Heritrix mailing list. I cannot help
forwarding some of it here, and I hope it helps Nutch.
-
Dennis Hotson wrote:
> I'm just wondering whether anyone has wri
Hi Andrzej
There is an HttpAuthenticationFactory class in the protocol-httpclient
plugin, but I doubt whether RFC 2617 basic authentication works:
I cannot see any reference to the HttpAuthenticationFactory class. Did I
miss something?
Regards
/Jack
On 9/13/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
de of each plugin for
> instance. Because that way, I believe you can customize whatever plugin to do
> whatever your need is, * without * having to recompile the code just to add
> another accepted content type to a plugin so it doesn't throw an error
> message.
>
> What
Hi Guys
Has anyone installed parse-rss and tried to fetch RSS feeds?
It failed on my side: I enabled the plugin and it fetched, but the RSS
parser did not work.
My feed is http://www.craigslist.org/evs/index.rss
Here is the error:
org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
can
Thanks Chen, I will try that:)
On 9/8/05, AJ Chen <[EMAIL PROTECTED]> wrote:
> Jack,
> Set the max to 100, but run 10 cycles (i.e., depth=10) with the
> CrawlTool. You may see all the outlinks are collected toward the end. 3
> cycles is usually not enough.
> -AJ
>
>
; Am 07.09.2005 um 18:43 schrieb Jack Tang:
>
> > Hi All
> >
> > Here is the "db.max.outlinks.per.page" property and its description in
> > nutch-default.xml
> >
> > db.max.outlinks.per.page
> > 100
> > The maximum n
l of the outlinks are processed for a page, the
> db.max.outlinks.per.page must be set to a number that is larger than the
> number of outlinks on the page. If these is true, then the max number
> has to be determined in real time since the number of outlinks varies
> from page to page.
Hi All
Here is the "db.max.outlinks.per.page" property and its description in
nutch-default.xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a
  page.</description>
</property>
I don't think the description is right.
Say, my cra
Hi
I found the reason: the maximum number of outlinks that Nutch
will process for a page is 100, and the website contains more than
300 URLs on the page.
Now, everything is ok.
/Jack
On 9/7/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Andrzej
>
> First of all, thanks
Hi Andrzej
First of all, thanks for your quick response.
On 9/7/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > Hi All
> >
> > Is nutch crawler breadth-first one? It seems a lot of URLs are lost
> > while I try do breadth-first crawli
Hi All
Is Nutch's crawler a breadth-first one? It seems a lot of URLs are lost
when I try breadth-first crawling with the depth set to 3.
Any comments?
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi Nutcher
Now I am going to add an HttpFormAuthentication class to the
protocol-httpclient plugin; I hope I am not duplicating your work.
Before starting, I read the code of HttpBasicAuthentication, which is
clear, but I still cannot fully understand it. HttpAuthenticationFactory
is the factory which provides
p/file" (the protocol) keywords using
NutchBean, I guess it will dump the whole index ;), right?
> thanks,
>
> Michael Ji
>
Regards
/Jack
> --- Jack Tang <[EMAIL PROTECTED]> wrote:
>
> > Hi Michael
> >
> > Is "segread" nutch command what you wanna?
&g
ry powerful tool.
>
> But I wonder if I can output the content of the
> individual files in index dir to a text format, means,
> I can see the each text saved in index files without
> interpreting by Lukeall.
>
> thanks,
>
> Michael Ji
>
> --- Jack Tang <[EM
Hi Michael
Hope luke helps you.
http://www.getopt.org/luke/
Regards
/Jack
On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> hi there,
>
> Is there a easy way that I could dump nutch index to a
> human-readable format?
>
> thanks,
>
> Michael Ji
>
>
>
> ___
g process, first to index HTML tags and
> find similarities (like as usual header, footer, Options, Menu, (c),
> etc.), then to use second parsing and second indexing - to index only
> unique text.
>
> -Fuad
>
>
> -Original Message-
> From: Jack Tang [mailto:
similar HTML, and I need only
> subset.
>
> Also, I need to find a point in Nutch where I can replace Analyzer with
> my own "non-analyzer"; I don't need to remove stop-words etc.
>
> I'd like to use Lucene as a database too... To perform a lot of queries,
&
Hi Nutchers
I think the parse-html parser should be enhanced. In some of my
projects (an intranet search engine), we only need the content matched by
specified detectors and want to filter out the junk, say the content between
specified tags, or selected by detectors like XPath. Any
thoughts on this enhancement?
Regards
/Jack
--
Keep D
smarc/
>
> On these websites, there are several documents that maybe useful. I don't
> think they will release the source code.
>
>
> Regards,
>
> Cuong Hoang
> -Original Message-
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 26
Hi Cuong.
I am going to build a private book search engine, and I am facing the same problem.
Could you describe more about the information you want to extract and
the website?
Regards
/Jack
On 7/26/05, Cuong Hoang <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>
>
> Does anyone have experience with desi
Hi Michael
I think some information in intranet crawling is private, while
internet content is public.
/Jack
On 7/24/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
>
> I wonder if there is any difference between these two?
> Or intranet crawling must indicate an intranet site
> explicitly in crawl-url
( my server: 4G
> mem and 4 cpu and the index file size about: 2.2G )
>
>
>so , recent days, I am strive to solve the above 2 questions.
>
>good luck
> if you are chinese , we could use chinese for further exchange..
>
> 2005/7/19, Jack
Hi Transbuerg
Could you please describe your solution in detail? Appreciate your time.
Regards
/Jack
On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> hi,
> Jack Tang
>
> I have the same condition with u , could you share your total
>
Hi ShiBin
Thanks for your post.
I have known about weblucene since 2003. It was said that weblucene used
FMM (dictionary-based) segmentation, but I have found nothing in the
weblucene CVS until now.
I hope you can understand my mail: there is no difficulty in making the
cjk-plugin available, and I want Nutc
experiences more?
Thanks
Regards
/Jack
On 7/17/05, Stanislaw Osinski <[EMAIL PROTECTED]> wrote:
> On 7/17/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> > Hi nutch guys
> >
> > Is it impossible cluster nutch search result and other search engine's
> > on t
Hi nutch guys
Is it possible to cluster Nutch search results and other search engines'
results on the fly?
You can find some info here:
http://blogs.msdn.com/msnsearch/archive/2005/04/13/407939.aspx.
I appreciate your time and comments.
Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Hi All
It has taken me a long time to think about embedding an improved
CJKAnalysis into NutchAnalysis. I got nothing but some failure
experiences, which I share with you; maybe you can hack it (well, I am not
going to give up).
I have written several Chinese word segmenters, some dictionary
based,
Hi Bin
The simplest way is to invent cjk-index-basic and cjk-query-basic plugins
and replace index-basic and query-basic with them. The invention
is quite simple: you can use CJKTokenizer and CJKAnalyzer from the Lucene
project. And take care with the query syntax characters in Nutch.
// query syntax c
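For reference, a toy sketch of the bi-gram segmentation that Lucene's CJKTokenizer performs on a run of CJK characters, without the real Tokenizer plumbing (offsets, ASCII handling, stream input are all omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of bi-gram segmentation: a CJK run "c1c2c3" becomes the tokens
// "c1c2", "c2c3". This illustrates the idea only; Lucene's CJKTokenizer
// also tracks offsets and handles mixed ASCII/CJK input.
public class Bigram {
    public static List<String> bigrams(String cjkRun) {
        List<String> tokens = new ArrayList<String>();
        if (cjkRun.length() == 1) {
            tokens.add(cjkRun); // a lone character is emitted as-is
            return tokens;
        }
        for (int i = 0; i + 2 <= cjkRun.length(); i++) {
            tokens.add(cjkRun.substring(i, i + 2));
        }
        return tokens;
    }
}
```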