Maybe you can filter out JavaScript files (*.js) using the URL filter.
/Jack
On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
Hello
Sorry for my poor English.
I use nutch-0.7.1 and have an issue with the HTML parser.
I get JavaScript code in summaries and don't know how to remove it. For
example
remove it :)
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 11:14 AM
To: nutch-user@lucene.apache.org
Subject: Re: javascript in summaries [nutch-0.7.1]
On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
This script is present in the HTML
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote:
I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it
as its embedded JavaScript engine. Could we build a RhinoInterpreter first
and then evaluate the JavaScript function to get the result, rather than
extracting pure text as we do now?
Hi Andrzej.
In my previous projects, I bound JavaScript functions to a central URL,
but I know that idea does not fit Nutch.
I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it
as its embedded JavaScript engine. Could we build a RhinoInterpreter first
and then evaluate the
Please put hadoop-0.1-dev.jar on your classpath.
On 3/14/06, Tolga Erkal [EMAIL PROTECTED] wrote:
I am trying to use NGramProfile to create a profile and am getting the
following error. It is probably related to the classpath setting, but I could not
figure out how to make it work.
Any help?
PROTECTED] mailto:[EMAIL PROTECTED]
Magnotia.com | www.magnotia.com http://www.magnotia.com/
My Blog | x.magnotia.com http://x.magnotia.com/
917 495 1938
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2006 11:59 PM
To: nutch-user
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
Mr Tang:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
Weird! Are you running Nutch on the local file system or a distributed file
system?
Local file system
And can you find the same query, hasan, via Luke?
Nope
ok. As stepan said, can you get
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
ok. As stepan said, can you get any hit when you try to search http or
www?
No
Hey, can you zip the index and send it to me directly?
--
Cheers,
Hasan Diwan [EMAIL PROTECTED]
--
Keep
Tang [EMAIL PROTECTED] wrote:
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
ok. As stepan said, can you get any hit when you try to search http or
www?
No
Hey, can you zip the index and send it to me directly?
--
Cheers,
Hasan
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
You can still build it on local file system:)
Build, yes, but what of deployment? Can I use it in the same way?
Of course yes.
At
present, I don't have enough resources to run a distributed
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote:
hi,
I tried this, actually in my case, one site ends with
.net and the other is .org
so I modified it to
+^http://([a-z0-9]*\.)*(abc.net|def.org)/
I guess '.' is a metacharacter in regexps, so please try
+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/
Good
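To illustrate the point about '.', here is a minimal sketch in plain java.util.regex (not Nutch's own filter code). The class name DotEscapeDemo and the host www.abcxnet are made up for the example, and the leading '+' of the filter line is left out because it is the accept sign rather than part of the regexp. With the dot unescaped, a host such as abcxnet slips through:

```java
import java.util.regex.Pattern;

final class DotEscapeDemo {
    // '.' left unescaped, as in the first attempt: it matches any character.
    static final Pattern LOOSE =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");

    // '.' escaped, as suggested in the reply: it matches a literal dot only.
    static final Pattern STRICT =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");

    // The pattern only has to be found at the start of the URL (it is anchored with ^).
    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts(LOOSE, "http://www.abcxnet/"));   // unintended accept
        System.out.println(accepts(STRICT, "http://www.abcxnet/"));  // rejected
        System.out.println(accepts(STRICT, "http://www.abc.net/"));  // accepted
    }
}
```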
Hi
I think the URL filter checks whether the pattern is contained in the URL rather than requiring a full match.
/Jack
On 2/23/06, Elwin [EMAIL PROTECTED] wrote:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Will this pattern accept a URL like this: http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
I think not, but in
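The difference between "contains" and "match" can be sketched with java.util.regex alone. This is only an illustration of the two semantics, assuming the pattern from the filter file with the leading '+' removed; it does not verify which one the Nutch URL filter actually uses. The class name FindVsMatchesDemo and the sample URLs are hypothetical.

```java
import java.util.regex.Pattern;

final class FindVsMatchesDemo {
    static final Pattern HOSTS =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // "Contains" semantics: the pattern must be found somewhere in the URL.
    // Because of the ^ anchor, that means the URL must start with it.
    static boolean containsMatch(String url) {
        return HOSTS.matcher(url).find();
    }

    // "Match" semantics: the pattern must cover the whole URL.
    static boolean fullMatch(String url) {
        return HOSTS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String page = "http://MY.DOMAIN.NAME/some/page.html";
        System.out.println(containsMatch(page)); // path after the host is allowed
        System.out.println(fullMatch(page));     // path after the host is not covered
    }
}
```

Under find-style semantics anything may follow the trailing slash, so pages below the accepted hosts pass the filter.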
Hey
The simplest way is to copy the BasicQueryFilter class, rename it, and then
modify FIELDS/FIELD_BOOSTS, replacing them with your meta tags
from the Nutch config. And don't forget the configuration in your query
filter's plugin.xml.
Good luck!
/Jack
On 2/24/06, Vanderdray, Jacob [EMAIL PROTECTED]
Hi Stefan
The GUI looks great!
My idea is to add Ajax to reduce page reloads and show
job progress in real time. If contributions are welcome and no one is
working on this, I'd like to take it on.
Regards
/Jack
On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Daniel,
thanks
On 2/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Jack,
The GUI looks great!
I will forward this to Frank Henze; he did the design and the sample. :)
Thanks. I'll prepare some utility and debug JavaScript classes from now on :)
My idea is to add Ajax to reduce page reloads
On 2/24/06, sudhendra seshachala [EMAIL PROTECTED] wrote:
Thanks Stefan.
But when I compiled, the jar size was just 318 KB for 0.8-dev, whereas the
0.7.1 release was 718 KB.
Am I missing something?
I guess not.
All the MapReduce classes were separated from Nutch and are now hosted in the Hadoop project.
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/lucene/st ore/FSDirectory
^^^ Why one blank here?
On 2/24/06, Wong Ting Kiong [EMAIL PROTECTED] wrote:
hi,
I tried some Java code calling the Lucene lib lucene-1.9-rc1-dev.jar, but
got an error, my
Hi there.
I am facing the same question and looking for the same solution.
Your solution seems easy :) My question is: what file system does the
application run on?
LocalFileSystem or DistributedFileSystem?
Thanks
/Jack
On 2/9/06, Ravi Chintakunta [EMAIL PROTECTED] wrote:
Hi David,
Thanks for your
Please set plugin.folders (in nutch-default.xml or nutch-site.xml) to the
directory where the plugins are actually built. Of course, you can use an absolute path.
/Jack
On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote:
Hi
Do you mean I should create a dir called build and move dir plugins in?
It seems it doesn't work either
Hi Byron
I wonder whether it would be faster to do this offline. I mean you can
revisit the web db and link db and generate the index.
/Jack
On 2/8/06, Byron Miller [EMAIL PROTECTED] wrote:
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and i would
like
Hi Thomas
I suppose the only unique key for content in the web db is the page's URL. So
why not retrieve the content by URL directly?
/Jack
On 1/8/06, Thomas Delnoij [EMAIL PROTECTED] wrote:
I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me if I
am
Hi Jérôme
Would it be better to add an id to the hit's span so that we can turn
highlighting on or off in JavaScript?
/** Returns an HTML representation of this fragment. */
public String toString() {
  return "<span id=nutch-hl class=\"highlight\">" + super.toString() + "</span>";
}
Thanks
/Jack
On 1/12/06,
Hi
I am sorry, it should be the getTextHelper() method.
Say I want to index the content in this block:
<!--indexware-->
This is not Ads
<!--/indexware-->
The code may look like this:
boolean contentStart;
boolean contentEnd;
if (node.getNodeType() == Node.COMMENT_NODE) {
// you can move the value
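The marker idea can be sketched end to end with the JDK's own XML parser. This is a stand-alone illustration, not the parse-html plugin code: the class name IndexwareSketch is made up, the input is assumed to be well-formed XHTML, and it does not use Nutch's DOMContentUtils, whose getTextHelper() signature is not shown here. The walker switches a flag on at the indexware comment, off at the /indexware comment, and collects text only while the flag is set.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class IndexwareSketch {
    private boolean inside = false;                 // true between the two marker comments
    private final StringBuilder text = new StringBuilder();

    // Depth-first walk in document order, toggling the flag on marker comments.
    private void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String marker = node.getNodeValue().trim();
            if (marker.equals("indexware")) {
                inside = true;
            } else if (marker.equals("/indexware")) {
                inside = false;
            }
        } else if (node.getNodeType() == Node.TEXT_NODE && inside) {
            text.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child);
        }
    }

    // Returns only the text found between <!--indexware--> and <!--/indexware-->.
    static String extract(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            IndexwareSketch sketch = new IndexwareSketch();
            sketch.walk(doc.getDocumentElement());
            return sketch.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<html><body><p>Ads here</p><!--indexware--><p>This is not Ads</p>"
            + "<!--/indexware--><p>More ads</p></body></html>"));
    }
}
```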
Hi Jeff
Please refer to the getText() method in the
org.apache.nutch.parse.html.DOMContentUtils class (in the
parse-html plugin, of course). You can add your filter easily ;)
/Jack
On 12/27/05, Jeff Breidenbach jeff@jab.org wrote:
Hi all,
Another open source search engine, HtDig, allows web page authors to
Hi
Doug proposed the solution before.
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200511.mbox/[EMAIL
PROTECTED]
Hope it helps.
/Jack
On 12/16/05, Bostjan [EMAIL PROTECTED] wrote:
Hi,
I'm using nutch 0.7.
If I understand correctly, touching the web.xml of my webapp helps my
Hi Nguyen
I am going to face this problem too. Here are my thoughts. One field,
say uid, will be added to the index, and its value will be
generated from the URL. Say the URL is http://www.a.com/x/y/z.html
uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html"));
Is that OK? When I query
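A minimal sketch of the uid idea, using only the JDK (java.security.MessageDigest and java.net.URI). The names UidSketch, md5Hex and uid are hypothetical, and appending the query string to the path part is an assumption of this sketch, since the example URL above has none.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class UidSketch {
    // Hex-encoded MD5 of a string (32 lowercase hex characters).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // uid = md5(scheme://host) followed by md5(path[?query]).
    static String uid(String url) {
        URI u = URI.create(url);
        String site = u.getScheme() + "://" + u.getAuthority();
        String rest = u.getRawPath();
        if (u.getRawQuery() != null) {
            rest += "?" + u.getRawQuery();   // assumption: the query belongs to the page part
        }
        return md5Hex(site) + md5Hex(rest);
    }

    public static void main(String[] args) {
        System.out.println(uid("http://www.a.com/x/y/z.html"));
    }
}
```

With this layout, all pages of one site share the first 32 characters of the uid, so a prefix comparison on uid groups them by site.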
Hi
I am facing the same problem. However, my crawl only focuses on a few
websites, and I recognize the pagination URLs using a regexp and inject
them in every fetch cycle.
/Jack
On 12/8/05, K.A.Hussain Ali [EMAIL PROTECTED] wrote:
HI all,
Does Nutch crawl pages in any listing pages (pages with
Hi
Could you please post your ant log?
Thanks
/Jack
On 12/1/05, Vanderdray, Jacob [EMAIL PROTECTED] wrote:
I'm able to get things to build if I run just 'ant' or 'ant
jar'. I only get the error when I do 'ant war'. I've written a plugin
that extends the HTMLParseFilter,
Hi,
I have no idea whether something is wrong with NutchBean or with the web
application container. I suggest you run NutchBean outside the web
container first. Yeah, run it as a standalone application.
Regards
/Jack
On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote:
Hi,
The following is the exception
Hi
I hope this helps
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
/Jack
On 11/30/05, Arun Kumar Sharma [EMAIL PROTECTED] wrote:
Nutch Geeks-
I want to do local hard-disk crawling. I want to know what I need to
do for this. I find this article helpful
Hi
More logging info and the exceptions will help us deal with the problem quickly ;)
/Jack
On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote:
Hi all,
I am getting the following error while I use Nutch to search the
crawled Nutch database. What would be the problem?
Error:
On 11/23/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Stefan Groschupf wrote:
Can it do an image search like google?
No. ;-/
Yes ;-)
That is, if you have the image parser... which indeed is not so
difficult, what with JAI and other libraries. You could index the image
metadata.
Hi Jon
I think you can rewrite the URL, discarding the sid parameter, before
putting it into the fetchlist.
Regards
/Jack
On 9/28/05, Jon Shoberg [EMAIL PROTECTED] wrote:
Gal Nitzan wrote:
Jon Shoberg wrote:
I'm getting a ton of duplicate content from a forum with session IDs.
It's a phpBB which
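Dropping the sid parameter could look roughly like the sketch below. It is plain string rewriting with java.util.regex; the class name SessionIdStripper and the forum URLs are made up, and where such a step would hook into Nutch before the fetchlist is generated is not shown in this thread.

```java
import java.util.regex.Pattern;

final class SessionIdStripper {
    // "sid=<value>" as a whole query parameter (preceded by ? or &),
    // together with the '&' that follows it, if any.
    private static final Pattern SID = Pattern.compile("(?<=[?&])sid=[^&#]*&?");

    // A '?' or '&' left dangling at the end of the URL or before a fragment.
    private static final Pattern DANGLING = Pattern.compile("[?&](?=#|$)");

    static String stripSid(String url) {
        String withoutSid = SID.matcher(url).replaceAll("");
        return DANGLING.matcher(withoutSid).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripSid("http://forum.example/viewtopic.php?t=1&sid=abc123"));
        System.out.println(stripSid("http://forum.example/viewtopic.php?sid=abc123&t=1"));
    }
}
```

Two URLs that differ only in their session ID then normalize to the same string, so the page is fetched once.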
[EMAIL PROTECTED] wrote:
Hi Jack,
How can you discard URL from fetchlist?
Regards,
Gal
Jack Tang wrote:
Hi Jon
I think you can rewrite the URL, discarding the sid parameter, before
putting it into the fetchlist.
Regards
/Jack
On 9/28/05, Jon Shoberg [EMAIL PROTECTED] wrote:
Gal
Hi Gal
You can get the original paper from Google Labs
http://labs.google.com/papers/mapreduce.html
and some presentations on the Nutch wiki
http://wiki.apache.org/nutch/Presentations
Hope these resources help.
Regards
/Jack
On 9/27/05, Gal Nitzan [EMAIL PROTECTED]
Hi Andrzej
I think a JavaScript-function-to-URL mapping is a good solution.
Say
domainName.javascript:go = http://www.a.com/b.jsp?id={0}
go is the JavaScript function and it takes one param, and
"http://www.a.com/b.jsp?id={0}" is the URL template for the go function.
{0} is the actual param, it
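The {0} placeholder happens to be the syntax of java.text.MessageFormat, so the mapping can be sketched as a small lookup table. The class JsUrlTemplates and its methods are hypothetical, and extracting the function name and arguments from the page's JavaScript is assumed to happen elsewhere.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

final class JsUrlTemplates {
    // Key: "<domain>.javascript:<function>", value: URL template with {0}, {1}, ...
    private final Map<String, String> templates = new HashMap<>();

    void register(String domain, String function, String template) {
        templates.put(domain + ".javascript:" + function, template);
    }

    // Fills the template for the given function call, or returns null if none is registered.
    String resolve(String domain, String function, String... args) {
        String template = templates.get(domain + ".javascript:" + function);
        if (template == null) {
            return null;
        }
        // Arguments are passed as strings so no locale-specific number formatting is applied.
        return MessageFormat.format(template, (Object[]) args);
    }

    public static void main(String[] args) {
        JsUrlTemplates mapping = new JsUrlTemplates();
        mapping.register("www.a.com", "go", "http://www.a.com/b.jsp?id={0}");
        System.out.println(mapping.resolve("www.a.com", "go", "42"));
    }
}
```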
Hi Jake
Basic, but a pretty hard issue.
Right now we re-crawl the website by running the crawl command and put the index
into a temp dir. I think the core issue is how to swap the index on the fly.
Some indexes may be referenced by NutchBean. Should we shut it down?
Will MapReduce solve the problem? I mean can we
Hi Nils
Make sure the adapter configuration is right on your Linux box.
You can also search the Nutch mailing list for the nutch and linux box thread. I
think I posted about the problem before.
Regards
/Jack
On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote:
Hi
my Problem is:
I ve done everything
Hi Novelli
Do you insist on the HtmlParser in Nutch?
If alternatives are acceptable, maybe you can try the htmlparser project
hosted on sf.net:
http://htmlparser.sourceforge.net/
Regards
/Jack
On 7/29/05, Giovanni Novelli [EMAIL PROTECTED] wrote:
Hello,
I'm working to the development of a multi-agents
Hi Nils
Make sure the adapter configuration is right on your Linux box.
You can also search the Nutch mailing list for the nutch and linux box thread. I
think I posted about the problem before.
Regards
/Jack
On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote:
Hi
my Problem is:
I ve done everything as
Hi Matthias.
The website is interesting, but is any document about the implementation available?
Cuong.
I notice a lot of papers mention that HMMs are great for information
extraction, but I cannot find a single open-source demo :(
What are your thoughts?
Regards
/Jack
On 7/26/05, Matthias Jaekle [EMAIL
Michael
Error logs help. Please post them in the email. Thanks
/Jack
On 7/27/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote:
hi,
I checked my log file and found that the crawler generates an error
when it meets a page with a Word file and a PDF file inside.
Is there any configuration file I have to change to let the crawler
, there are several documents that may be useful. I don't
think they will release the source code.
Regards,
Cuong Hoang
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 26 July 2005 8:29 PM
To: nutch-user@lucene.apache.org
Subject: Re: Information extraction
Hi
Hi
Please search the mailing list.
I remember a thread a few days ago that talked about dynamically deleting
an index. Maybe it helps. Index hot-swap is not supported in Nutch by
default now.
Regards
/Jack
On 7/25/05, blackwater dev [EMAIL PROTECTED] wrote:
So there is no way to set up different
Hi Vacuum
I hope the Nutch wiki will help you a lot :)
http://wiki.apache.org/nutch/
Regards
/Jack
On 7/6/05, Vacuum Joe [EMAIL PROTECTED] wrote:
Hello Nutch-gurus,
I have some very straightforward and yet totally
newbie questions which I hope some kind person would
answer.
First of all,