RE: memory consumed by jakarta-oro

2010-02-12 Thread Fuad Efendi
I believe a simple regular expression (Pattern, Matcher) may create several
hundred 'child' instances of Perl5Repetition, Perl5Substitution, etc.

Same as with parsing XML.
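
For what it's worth, a rough sketch (jakarta-oro 2.0.x; the class and pattern
here are just for illustration, this is not Nutch code) of compiling once and
reusing the compiled Pattern, so those internal objects are allocated once
rather than on every call:

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

public class OroReuse {

  // Compile once: every call to Perl5Compiler.compile() allocates a fresh
  // graph of internal pattern objects, so reusing the compiled Pattern
  // keeps the object count bounded.
  private static final Pattern URL_PATTERN;
  static {
    try {
      URL_PATTERN = new Perl5Compiler().compile("^https?://\\S+$");
    } catch (MalformedPatternException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  // Perl5Matcher is not thread-safe, so keep one per thread and reuse it.
  private static final ThreadLocal<Perl5Matcher> MATCHER =
      new ThreadLocal<Perl5Matcher>() {
        protected Perl5Matcher initialValue() { return new Perl5Matcher(); }
      };

  public static boolean looksLikeUrl(String s) {
    return MATCHER.get().matches(s, URL_PATTERN);
  }
}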


-Fuad
http://www.tokenizer.ca


> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: February-12-10 7:54 PM
> To: nutch-user@lucene.apache.org
> Subject: memory consumed by jakarta-oro
> 
> Hi,
> We use jakarta-oro-2.0.7.jar
> I see the following from jmap output:
> 14369 instances of class org.apache.oro.text.regex.Perl5Repetition
> 4972 instances of class org.apache.oro.text.regex.PatternMatcherInput
> 4445 instances of class org.apache.hadoop.hbase.HColumnDescriptor
> 2969 instances of class org.apache.oro.text.regex.Perl5Substitution
> 
> I am wondering why so many objects from org.apache.oro.text.regex are
> held
> in memory. I see GC every 10 seconds.
> 
> Here is the list:
> 63916 instances of class
> org.apache.hadoop.hbase.io.ImmutableBytesWritable
> 26612 instances of class org.apache.hadoop.hbase.KeyValue
> 14369 instances of class org.apache.oro.text.regex.Perl5Repetition
> 4972 instances of class org.apache.oro.text.regex.PatternMatcherInput
> 4445 instances of class org.apache.hadoop.hbase.HColumnDescriptor
> 2969 instances of class org.apache.oro.text.regex.Perl5Substitution
> 2313 instances of class org.apache.nutch.util.domain.DomainSuffix
> 1709 instances of class org.apache.hadoop.hbase.client.Put
> 581 instances of class org.apache.nutch.parse.Outlink
> 553 instances of class org.apache.nutch.util.hbase.ColumnData
> 496 instances of class
> com.rialto.nutchbase.fetcher.FetcherReducer$FetchItem
> 495 instances of class org.apache.nutch.util.hbase.WebTableRow
> 422 instances of class org.apache.hadoop.hbase.HRegionLocation
> 422 instances of class org.apache.hadoop.hbase.HServerAddress
> 414 instances of class org.apache.hadoop.hbase.HRegionInfo
> 414 instances of class org.apache.hadoop.hbase.HTableDescriptor
> 412 instances of class org.apache.hadoop.hbase.util.SoftValue
> 293 instances of class org.apache.nutch.util.domain.TopLevelDomain
> 253 instances of class
> org.cyberneko.html.HTMLEntities$IntProperties$Entry
> 219 instances of class org.apache.oro.text.regex.Perl5Matcher
> 
> Your hint is helpful.




RE: A well-behaved crawler

2010-02-03 Thread Fuad Efendi
In my past experience, I was explicitly banned by about 60 sites (from 1
in my "vertical" list!) via explicit instructions in their robots.txt files.

After detailed analysis I found that about 50 of those sites were hosted on the same IP
address; I was using fetch-per-TLD instead of fetch-per-IP.

The remaining sites were simply not willing to appear in my search results list -
that's their right!


-Fuad
Tokenizer



> -Original Message-
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: February-03-10 2:50 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: A well-behaved crawler
> 
> When you say "banned by several sites", do you mean that you get back
> non-200 responses for pages that you know exist? Or something else?
> 
> Also, there's another constraint that many sites impose, which is the
> total number of page fetches/day. Unfortunately you don't know if
> you've hit this until you run into problems. A good rule of thumb is
> no more than 5K requests/day for a major site.
> 
> -- Ken
> 
> PS - You're not running in EC2 by any chance, are you?
> 
> On Feb 3, 2010, at 2:21am, Sjaiful Bahri wrote:
> 
> > "A well-behaved crawler needs to follow a set of loosely-defined
> > behaviors to be 'polite' - don't crawl a site too fast, don't crawl
> > any single IP address too fast, don't pull too much bandwidth from
> > small sites by e.g. downloading tons of full res media that will
> > never be indexed, meticulously obey robots.txt, identify itself with
> > user-agent string that points to a detailed web page explaining the
> > purpose of the bot, etc. "
> >
> > But my crawler still banned by several sites... :(
> >
> > cheers
> > iful
> >
> >
> > http://zipclue.com
> >
> >
> >
> >
> 
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
> 
> 
> 





RE: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Fuad Efendi
> Googling reveals
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you
> could try increasing the Java stack size in bin/nutch (-Xss), or use
> an alternate regexp if you can.
> 
> Just out of curiosity, why does a performance critical program such as
> Nutch use Sun's backtracking-based regexp implementation rather than
> an efficient Thompson-based one?  Do you need the additional
> expressiveness provided by PCRE?


Very interesting point... we should use it for BIXO too.


BTW, Sun's JDK has memory leaks with LinkedBlockingQueue,
http://bugs.sun.com/view_bug.do?bug_id=6806875
http://tech.groups.yahoo.com/group/bixo-dev/message/329


And, of course, java.net.URL is synchronized; Apache Tomcat uses a simplified version
of the URL class.
And, RegexUrlNormalizer is synchronized in Nutch...
And, in order to retrieve plain text from HTML we create a fat DOM
object (instead of using, for instance, filters in NekoHTML).
And more...
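
On the -Xss point quoted above, a tiny illustration (plain java.util.regex,
not Nutch code; the pattern and input size are made up) of how a backtracking
engine can blow the thread stack on a long input, which is exactly the failure
mode those Sun bugs describe:

import java.util.regex.Pattern;

public class RegexStackDemo {
  public static void main(String[] args) {
    // A long input with no terminating 'c': the greedy (a|b)* loop recurses
    // once per consumed character in Sun's backtracking implementation.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100000; i++) {
      sb.append('a');
    }
    Pattern p = Pattern.compile("(a|b)*c");
    try {
      p.matcher(sb).matches();
    } catch (StackOverflowError e) {
      // Remedies: raise the stack size (-Xss) for the JVM running the fetcher,
      // or rewrite the expression so it backtracks less.
      System.err.println("StackOverflowError while matching");
    }
  }
}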

-Fuad,
+1 416-993-2060
http://www.tokenizer.ca







RE: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Fuad Efendi
Also, put it in the Hadoop settings for child tasks (e.g. mapred.child.java.opts)...

http://www.tokenizer.ca/


> -Original Message-
> From: Godmar Back [mailto:god...@gmail.com]
> Sent: January-11-10 11:53 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Help Needed with Error: java.lang.StackOverflowError
> 
> On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood 
> wrote:
> > Do you have to set the -Xss flag somewhere else?
> 
> Yes, in bin/nutch - look for where it sets -Xmx
> 
>  - Godmar




RE: recrawl.sh stopped at depth 7/10 without error

2009-12-07 Thread Fuad Efendi
>crawl.log 2>&1 &

You forgot 2>&1... it redirects the error output...

Also, you need to close the SSH session _politely_ by executing "exit".
Without it, the pipe is broken and the OS will kill the process.


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search


> -Original Message-
> From: BELLINI ADAM [mailto:mbel...@msn.com]
> Sent: December-07-09 12:01 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: recrawl.sh stopped at depth 7/10 without error
> 
> 
> 
> 
> hi,
> 
> maybe i found my problem; it's not a nutch mistake. i believed that when running
> the crawl command as a background process, closing my console would
> not stop the process, but it seems that it really kills the process
> 
> 
> i launched the process like this: ./bin/nutch crawl urls -dir crawl -depth
> 10 > crawl.log &
> 
> but even with the '&' character, when closing my console it kills the
> process.
> 
> thx
> 
> > Date: Mon, 7 Dec 2009 19:00:37 +0800
> > Subject: Re: recrawl.sh stopped at depth 7/10 without error
> > From: yea...@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > I still want to know the reason.
> >
> > 2009/12/2 BELLINI ADAM 
> >
> > >
> > > hi,
> > >
> > > anay idea guys ??
> > >
> > >
> > >
> > > thanx
> > >
> > > > From: mbel...@msn.com
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: recrawl.sh stopped at depth 7/10 without error
> > > > Date: Fri, 27 Nov 2009 20:11:12 +
> > > >
> > > >
> > > >
> > > > hi,
> > > >
> > > > this is the main loop of my recrawl.sh
> > > >
> > > >
> > > > do
> > > >
> > > >   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> > > >   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> $topN \
> > > >   -adddays $adddays
> > > >   if [ $? -ne 0 ]
> > > >   then
> > > > echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> > > > break
> > > >   fi
> > > >   segment=`ls -d $crawl/segments/* | tail -1`
> > > >
> > > >   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> > > >   if [ $? -ne 0 ]
> > > >   then
> > > > echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> > > > echo "runbot: Deleting segment $segment."
> > > > rm $RMARGS $segment
> > > > continue
> > > >   fi
> > > >
> > > >   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> > > >
> > > > done
> > > >
> > > > echo "- Merge Segments (Step 3 of $steps) -"
> > > >
> > > >
> > > >
> > > > in my log file i never find the message "- Merge Segments (Step
> 3 of
> > > $steps) -" ! so it breaks the loop and stops the process.
> > > >
> > > > i dont understand why it stops at depth 7 without any errors !
> > > >
> > > >
> > > > > From: mbel...@msn.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: recrawl.sh stopped at depth 7/10 without error
> > > > > Date: Wed, 25 Nov 2009 15:43:33 +
> > > > >
> > > > >
> > > > >
> > > > > hi,
> > > > >
> > > > > i'm running recrawl.sh and it stops every time at depth 7/10
> without
> > > any error ! but when run the bin/crawl with the same crawl-urlfilter
> and the
> > > same seeds file it finishs softly in 1h50
> > > > >
> > > > > i checked the hadoop.log, and dont find any error there...i just
> find
> > > the last url it was parsing
> > > > > do fetching or crawling has a timeout ?
> > > > > my recrawl takes 2 hours before it stops. i set the time fetch
> interval
> > > 24 hours and i'm running the generate with adddays = 1
> > > > >
> > > > > best regards
> > > > >




RE: Simple vertical search engine question

2009-11-09 Thread Fuad Efendi
Premium Google publishers (>20 million pageviews per month) may use more
AdSense features, such as passing explicit keywords in the query (to Google).


> -Original Message-
> From: Carlos Vera [mailto:carlodesil...@gmail.com]
> Sent: November-09-09 10:53 AM
> To: nutch-user@lucene.apache.org
> Subject: Simple vertical search engine question
> 
> I have looked into a few vertical search engines like indeed.com and
> simplyhired.com.  Does anyone know how vertical search engines like indeed.com
> and simplyhired.com display relevant google ads for the searched keywords on
> their site?




RE: char encoding

2009-10-29 Thread Fuad Efendi
> > i dont have any special requirement for any special characters, i am
> > happy with usual utf-8
> >
> > any suggestion on the best way to configure this correctly; everything
> > seems quite ok looking at the code not sure whats missing.


Try to set UTF-8 in configuration file:
parser.character.encoding.default = UTF-8
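
To double-check that the override is actually picked up, something like this
(just a sketch; it only reads the property the way the parser plugins do,
assuming Nutch 1.0's NutchConfiguration and the windows-1252 default mentioned
below):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class EncodingCheck {
  public static void main(String[] args) {
    // Loads nutch-default.xml and nutch-site.xml from the classpath,
    // so this prints whatever value your nutch-site.xml override provides.
    Configuration conf = NutchConfiguration.create();
    String enc = conf.get("parser.character.encoding.default", "windows-1252");
    System.out.println("parser.character.encoding.default = " + enc);
  }
}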



> -----Original Message-
> From: Fuad Efendi [mailto:f...@efendi.ca]
> Sent: October-29-09 8:19 PM
> To: nutch-user@lucene.apache.org; fa...@butterflycluster.net
> Subject: RE: char encoding
> 
> Is it "?" or "¿" (Inverted Question Mark)?
> 
> Because ¿ is the replacement for character codes that have no representation in
> a specific encoding scheme; you may get it, for instance, if the binary stream is
> UTF-8 encoded and Nutch considers it to be Windows-1252. All bytes not
> having a representation in windows-1252 will be represented as "¿".
> 
> Nutch tries its best; however, it can't dedicate CPU the way browsers do; I
> agree with Ken. Browsers may fully ignore headers/meta and
> sniff and analyze the byte array to find the correct encoding (in case, for
> instance, the byte stream is UTF-8 and http/meta says windows-1252). Nutch
> can't do that (it requires a lot of CPU).
> 
> Windows-1252 is the default scheme for the html-parser in case Nutch can't find
> a correct HTTP/META...
> 
> 
> From HtmlParser API:
>* We need to do something similar to what's done by mozilla
>* (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
>* See also http://www.w3.org/TR/REC-xml/#sec-guessing
> 
> 
> private static String sniffCharacterEncoding(byte[] content) {...}
> 
> - it doesn't currently use HTTP Headers.
> - it tries to find META tag in first 2000 bytes.
> 
> 
> So, for instance, some weird sites (such as AJAX portals) may have a lot of
> generated JavaScript before the META tag; 2000 bytes could be too small.
> 
> Then, EncodingDetector is called:
>   detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> 
> - but it doesn't make sense...
> 
> 
>   public String guessEncoding(Content content, String defaultValue) {
> /*
>  * This algorithm could be replaced by something more sophisticated;
>  * ideally we would gather a bunch of data on where various clues
>  * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
>  * the correct answer, and use machine learning/some statistical method
>  * to generate a better heuristic.
>  */
> 
> 
> 
> TODO list... as a workaround, please check that for this site the META tag can be
> found in the first 2000 bytes...
> 
> 
> 
> -Fuad
> http://www.linkedin.com/in/liferay
> 
> 
> > -Original Message-
> > From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> > Sent: October-29-09 7:05 PM
> > To: nutch-user@lucene.apache.org
> > Subject: char encoding
> >
> > hi there,
> >
> > i am having issues with the HTMLParser failing to detect the char
> > encoding. so lots of non alpha-numeric chars end up as "?" ;
> >
> > i dont have any special requirement for any special characters, i am
> > happy with usual utf-8
> >
> > any suggestion on the best way to configure this correctly; everything
> > seems quite ok looking at the code not sure whats missing.
> >
> > thanks.
> >
> >
> 
> 





RE: char encoding

2009-10-29 Thread Fuad Efendi
Is it "?" or "¿" (Inverted Question Mark)?

Because ¿ is the replacement for character codes that have no representation in
a specific encoding scheme; you may get it, for instance, if the binary stream is
UTF-8 encoded and Nutch considers it to be Windows-1252. All bytes not
having a representation in windows-1252 will be represented as "¿".

Nutch tries its best; however, it can't dedicate CPU the way browsers do; I
agree with Ken. Browsers may fully ignore headers/meta and
sniff and analyze the byte array to find the correct encoding (in case, for
instance, the byte stream is UTF-8 and http/meta says windows-1252). Nutch
can't do that (it requires a lot of CPU).

Windows-1252 is the default scheme for the html-parser in case Nutch can't find
a correct HTTP/META...


From HtmlParser API:
   * We need to do something similar to what's done by mozilla
   * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
   * See also http://www.w3.org/TR/REC-xml/#sec-guessing


private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use HTTP Headers.
- it tries to find META tag in first 2000 bytes.


So, for instance, some weird sites (such as AJAX portals) may have a lot of
generated JavaScript before the META tag; 2000 bytes could be too small.

Then, EncodingDetector is called:
  detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make sense...


  public String guessEncoding(Content content, String defaultValue) {
/*
 * This algorithm could be replaced by something more sophisticated;
 * ideally we would gather a bunch of data on where various clues
 * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
 * the correct answer, and use machine learning/some statistical method
 * to generate a better heuristic.
 */



TODO list... as a workaround, please check that for this site the META tag can be
found in the first 2000 bytes...



-Fuad
http://www.linkedin.com/liferay


> -Original Message-
> From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> Sent: October-29-09 7:05 PM
> To: nutch-user@lucene.apache.org
> Subject: char encoding
> 
> hi there,
> 
> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;
> 
> i dont have any special requirement for any special characters, i am
> happy with usual utf-8
> 
> any suggestion on the best way to configure this correctly; everything
> seems quite ok looking at the code not sure whats missing.
> 
> thanks.
> 
> 





RE: http keep alive

2009-10-14 Thread Fuad Efendi
I'd like to add:

Keep-Alive is not polite. It uses a dedicated listener on the server side.
Establishing a TCP socket via the IP "handshake" takes time; that's why
Keep-Alive exists for web servers - to improve the performance of subsequent
requests. However, it allocates a dedicated listener for a specific IP port /
remote client...

What will happen with the classic setting of 150 processes in HTTPD 1.3 if
150 robots try to use the Keep-Alive feature?
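
As a rough sketch of the polite alternative (plain JDK HttpURLConnection, not
the protocol-httpclient plugin; the URL is made up): explicitly ask the server
to close the connection after each response, so no worker stays pinned to the
crawler between requests.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PoliteFetch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.example.com/");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // "Connection: close" tells the server not to keep the socket (and the
    // worker process serving it) around waiting for our next request.
    conn.setRequestProperty("Connection", "close");
    InputStream in = conn.getInputStream();
    while (in.read() != -1) {
      // drain and discard the body
    }
    in.close();
  }
}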

==
http://www.linkedin.com/in/liferay


> 
> protocol-httpclient can support keep-alive. However, I think that it
> won't help you much. Please consider that Fetcher needs to wait some
> time between requests, and in the meantime it will issue requests to
> other sites. This means that if you want to use keep-alive connections
> then the number of open connections will climb up quickly, depending on
> the number of unique sites on your fetchlist, until you run out of
> available sockets. On the other hand, if the number of unique sites is
> small, then most of the time the Fetcher will wait anyway, so the
> benefit from keep-alives (for you as a client) will be small - though
> there will be still some benefit for the server side.
> 
> 
> 
> --
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com





RE: how to "upgrade" a java application with nutch?

2009-10-02 Thread Fuad Efendi
Nutch alone is OK, but "embedded Nutch" is not OK... extremely hard!

You need to embed a "Nutch client" into your web application. Nutch should run
separately.

I believe Nutch supports "OpenSearch" or something similar (an XML,
REST-like protocol) - your web application should use it to interact with Nutch, but
you have to develop the client part. Sorry, I don't know the current status of Nutch
features...

That's why I posted about SOLR: SolrJ is an out-of-the-box client library for
Java, and to me it seems an extremely easy solution for small search indexes
(a few domains). Nutch has a plugin for SOLR (and a command-line option).

SOLR outputs everything as XML, JSON, etc. (instead of pure HTML, which is
hard to embed in another HTML page).
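
For illustration, a minimal SolrJ query from a web application (a sketch
assuming SolrJ 1.3/1.4 and a Solr core at the usual local URL; the query term
and field names are made up, though Nutch's Solr schema does define url and
title):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchClient {
  public static void main(String[] args) throws Exception {
    // Point at the Solr instance that the Nutch solrindex step writes to.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("content:nutch");
    query.setRows(10);
    QueryResponse rsp = solr.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("url") + " -> " + doc.getFieldValue("title"));
    }
  }
}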
-Fuad


> -Original Message-
> From: Jaime Martín [mailto:james...@gmail.com]
> Sent: October-02-09 5:43 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: how to "upgrade" a java application with nutch?
> 
> thank you for your responses. My scenario is this:
> I have a web application with some pages. I want from one of the web menu
> options to redirect to a search page. There you would be only able to search
> info that has been previously crawled.
> Everything is java, so I think nutch fits really well and on top of that it
> is opensource.
> So in my nutch downloaded distro I would configure the urls I want to crawl,
> change something of the GUI and change some features of the nutch business
> logic to customise for the specific project purposes.
> That's why I thought to customise nutch in a project, then create a library
> from it and then use it from the main application.
> So, for that.. nutch "alone" isn't proper and you need to work together with
> Solr? (anyway I'll have a look at bixo and Droids)
> thank you!
> 
> 
> 2009/10/1 Fuad Efendi 
> 
> > Hi Jaime,
> >
> > You don't have to embed; try (simplified) Nutch + SOLR (Nutch has plugin
> > for
> > SOLR). And use SolrJ client for SOLR from your application. This is very
> > easy.
> > -Fuad
> >
> >
> > http://www.linkedin.com/in/liferay
> >
> > > -Original Message-
> > > From: Jaime Martín [mailto:james...@gmail.com]
> > > Sent: October-01-09 5:59 AM
> > > To: nutch-user@lucene.apache.org
> > > Subject: how to "upgrade" a java application with nutch?
> > >
> > > Hi!
> > > I´ve a java application that I would like to "upgrade" with nutch.
What
> > jars
> > > should I add to my lib applicaction to make it possible to use nutch
> > > features from some of my app pages and business logic classes?
> > > I´ve tried with nutch-1.0.jar generated by "war" target without
success.
> > > I wonder what is the proper nutch build.xml target I should execute
for
> > this
> > > and what of the generated jars are to be included in my app. Maybe
apart
> > > from nutch-1.0.jar are all nutch-1.0\lib jars compulsory or just a few
of
> > > them?
> > > thanks in advance!
> >
> >
> >




RE: how to "upgrade" a java application with nutch?

2009-10-01 Thread Fuad Efendi
Hi Jaime,

You don't have to embed; try (simplified) Nutch + SOLR (Nutch has plugin for
SOLR). And use SolrJ client for SOLR from your application. This is very
easy.
-Fuad


http://www.linkedin.com/in/liferay

> -Original Message-
> From: Jaime Martín [mailto:james...@gmail.com]
> Sent: October-01-09 5:59 AM
> To: nutch-user@lucene.apache.org
> Subject: how to "upgrade" a java application with nutch?
> 
> Hi!
> I´ve a java application that I would like to "upgrade" with nutch. What
jars
> should I add to my lib applicaction to make it possible to use nutch
> features from some of my app pages and business logic classes?
> I´ve tried with nutch-1.0.jar generated by "war" target without success.
> I wonder what is the proper nutch build.xml target I should execute for
this
> and what of the generated jars are to be included in my app. Maybe apart
> from nutch-1.0.jar are all nutch-1.0\lib jars compulsory or just a few of
> them?
> thanks in advance!




RE: URL built by JavaScript Function - Can this be Crawled

2009-09-14 Thread Fuad Efendi
Google has sitemaps instead... initially designed to help find such
dynamic URLs (not necessarily built by JavaScript; they could be form submissions).

Evaluating JavaScript is extremely CPU-costly for crawlers (it isn't a
personal computer where you have a single JavaScript thread across two cores!)
- especially if you need to execute thousands of "use cases" (combinations of
method parameters) in order to find all possible return values...


Google may use some JavaScript emulation (sometimes!) in order to find
black-hat SEOs etc., and to evaluate landing page quality for AdWords
(do they use AdSense?) - but it is not a job for Googlebot...


Just generate a 'sitemap' (seed.txt file) for Nutch...
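
If a sitemap/seed list is not an option, the usual workaround is what Ken
suggests in the message quoted below: teach the HTML parser about the site's
own function. A crude, hypothetical sketch (the function name buildUrl and the
URL template are invented for illustration; this is not the parse-js plugin):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsOutlinkSketch {

  // Matches calls like: buildUrl('catalog', '12345')
  private static final Pattern CALL =
      Pattern.compile("buildUrl\\('([^']+)'\\s*,\\s*'([^']+)'\\)");

  // Rebuild the URLs the same way the site's JavaScript would, so they can be
  // added as ordinary outlinks instead of evaluating the script.
  public static List<String> extractOutlinks(String html, String base) {
    List<String> outlinks = new ArrayList<String>();
    Matcher m = CALL.matcher(html);
    while (m.find()) {
      outlinks.add(base + "/" + m.group(1) + "?id=" + m.group(2));
    }
    return outlinks;
  }
}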



> -Original Message-
> From: Mohamed Parvez [mailto:par...@gmail.com]
> Sent: September-14-09 12:36 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL built by JavaScript Function - Can this be Crawled
> 
> Thanks ken.
> If Google itself has not fully implemented, JavaScript analysis/execution
> for crawling
> I am going to stay away from it and look for alternate solution.
> 
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Mon, Sep 14, 2009 at 11:15 AM, Ken Krugler
> wrote:
> 
> > JavaScript code that creates dynamic URLs is always a problem for web
> > crawlers.
> >
> > Most web sites try to make their content crawlable by creating
alternative
> > static links to the content.
> >
> > I think Google now does some analysis/execution of JS code, but it's a
> > tricky problem.
> >
> > I would suggest modifying the HTML parser to explicitly look for calls
> > being made to your function, and generate appropriate outlinks.
> >
> > -- Ken
> >
> >
> >
> > On Sep 14, 2009, at 8:04am, Mohamed Parvez wrote:
> >
> >  Can anyone please through some light on this
> >>
> >> Thanks/Regards,
> >> Parvez
> >>
> >>
> >> On Fri, Sep 11, 2009 at 3:23 PM, Mohamed Parvez 
wrote:
> >>
> >>  We have a JavaScript function, which takes some prams and builds an
URL
> >>> and
> >>> then uses  window.location to send the user to that URL.
> >>>
> >>> Our website uses this feature a lot and most of the urls are built
using
> >>> this function.
> >>>
> >>> I am trying to crawl using Nutch and I am also using the parse-js
plugin.
> >>>
> >>> But it does not look like Nautch is able to crawl these URLs.
> >>>
> >>> Am I doing something wrong or Nutch is not able to crawl URLs build by
> >>> JavaScript function.
> >>>
> >>> 
> >>> Thanks/Regards,
> >>> Parvez
> >>>
> >>>
> >>>
> > --
> > Ken Krugler
> > TransPac Software, Inc.
> > 
> > +1 530-210-6378
> >
> >




RE: Ignoring Robots.txt

2009-09-11 Thread Fuad Efendi
> 
> My sysadm refuses to change the robots.txt citing the following reason:
> 
> The moment he allows a specific agent, a lot of crawlers impersonate
> as that user agent and tries to crawl that site.



Extremely strange reasoning from some smart sys-minds...

If a crawler wants to impersonate... it will, and it will ignore robots.txt, and
the sysadmin may ban such an IP... I don't know of any public crawler that does this, except some
desktop-based download agents such as WebCEO or Teleport, or even IE and
Firefox...

No way: Nutch must follow robots.txt.




RE: URL with Space

2009-09-04 Thread Fuad Efendi
My fault, sorry...


I did several tests with Perl5Compiler and XML.

1. \s can be used:
  <pattern>\s</pattern>

(although Java doesn't allow the unescaped [String s = "\s";])



2. A whitespace character can be used:
  <pattern> </pattern>


And, check Nutch configuration; ensure Normalizer is properly configured.

You can use + (plus) sign instead of %20 in URLs.
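
For reference, roughly the kind of Perl5Compiler test described above
(assuming jakarta-oro on the classpath; the sample URL is made up):

import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Substitution;
import org.apache.oro.text.regex.Util;

public class SpaceNormalizeTest {
  public static void main(String[] args) throws Exception {
    // "\\s" in Java source is the single regex token \s from regex-normalize.xml.
    Pattern ws = new Perl5Compiler().compile("\\s");
    Perl5Substitution sub = new Perl5Substitution("%20");
    String url = "http://example.com/some dir/a page.html";
    String fixed = Util.substitute(new Perl5Matcher(), ws, sub, url, Util.SUBSTITUTE_ALL);
    System.out.println(fixed);  // http://example.com/some%20dir/a%20page.html
  }
}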




RE: URL with Space

2009-09-04 Thread Fuad Efendi
> From: Fuad Efendi 
> I already posted here that URL Normalizer is called after extracting
> Outlinks from a Page.

-I was _wrong_, sorry.


Code from Injector:
  try {
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
url = filters.filter(url); // filter the url
  } catch (Exception e) {



You have to ensure that Nutch uses the proper config file (with the correct
normalizer).


Perl5Compiler in Java source should use the escaped \\s instead of \s; I am not sure if
one can use a literal whitespace character inside an XML node.


P.S.
Some "normalizers" in Nutch are synchronized singletons, so you will have an
obvious performance bottleneck.





RE: URL with Space

2009-09-04 Thread Fuad Efendi
I already posted here that URL Normalizer is called after extracting
Outlinks from a Page.

It won't work for injecting URLs from seed.txt.

Seed.txt must contain correct URLs (preferably root domain names)



> -Original Message-
> From: Kirby Bohling [mailto:kirby.bohl...@gmail.com]
> Sent: September-03-09 6:38 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL with Space
> 
> On Thu, Sep 3, 2009 at 5:03 PM, Mohamed Parvez wrote:
> > Thanks for the suggestion Kirby. It works for URL in the seed.txt file
but
> > wont work for URLs in the parsed content of a page
> >
> 
> Hmmm, I thought it worked for me.  We have a bunch of Wiki/Sharepoint
> sites internally that we crawl.  I'll never educate the users to
> remove the spaces.  I guess I need to double check that it is in fact
> fixing them.  I know the URL error message went away for me.  It might
> only work for URLs that are inside of an <a href=...> link.
> 
> Kirby
> 
> > I used a URL that has spaces in the cong/seed.txt file and it replaces
the
> > space with %20 and I was able to crawl the page.
> >
> > Senario-1:
> > urls/seed.txt:
> > --
> >
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
&_
>
pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&catego
ry
> name=SmallBusiness&portletTitle=Small
> > Business Features
> >
> >
> > In this scenario the URL gets translated to :
> >
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
&_
>
pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&catego
ry
> name=Small%20Business&portletTitle=Small%20Business%20Features
> >
> >
> > Senario-2:
> > urls/seed.txt:
> > ---
> >
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
&_
> pageLabel=SMBPortal_page_main_newsandresources
> >
> > The content of this page has many URLs that have space and Nutch can not
> > crawl beyond one level.
> > As it gets error when it encounters an URL with space, in the content of
the
> > page.
> >
> > Part of the content of the crawled page with Error:
> > ---
> >   Small Business Features         ERROR... URL Message
> >
>
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=t
ru
> e&_pageLabel=SMBPortal_page_main_newsandresources
> > Small Business Expert Advice       ERROR... URL Message
> >
>
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=t
ru
> e&_pageLabel=SMBPortal_page_main_newsandresources
> > Wall Street Journal       ERROR... URL Message
> >
>
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=t
ru
> e&_pageLabel=SMBPortal_page_main_newsandresources
> > Retail
> >
> >
> > 
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi  wrote:
> >
> >>
> >> But 'normalizer' can't be used with 'injector' (seed.txt)...
'normalizer'
> >> is
> >> called after Fetching-Parsing-Outlinks HTML...
> >>
> >>
> >> > -Original Message-
> >> > From: Mohamed Parvez [mailto:par...@gmail.com]
> >> > Sent: September-03-09 3:58 PM
> >> > To: nutch-user@lucene.apache.org
> >> > Subject: Re: URL with Space
> >> >
> >> > Thanks for the suggestion fuad.
> >> >
> >> > I used your suggestion but does not seem to work, the space does not
get
> >> > replaces by %20 or +
> >> >
> >> > Senario-1
> >> > urls/seed.txt:
> >> > --
> >> >
> >>
> >>
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
> >>
>
&_<http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=t
ru
> e%0A&_>
> >> >
> >>
> >>
>
pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&catego
> >> ry
> >> > name=SmallBusiness&portletTitle=Small
> >> > Business Features
> >> >
> >> > I get the fallowing error:
> >> > -
> >> > fetch of
> >> >
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb
> >> >
> >>
> >>
>
=true&_pageLabel=

RE: URL with Space

2009-09-03 Thread Fuad Efendi

But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer' is
called after Fetching-Parsing-Outlinks HTML... 


> -Original Message-
> From: Mohamed Parvez [mailto:par...@gmail.com]
> Sent: September-03-09 3:58 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL with Space
> 
> Thanks for the suggestion fuad.
> 
> I used your suggestion but does not seem to work, the space does not get
> replaces by %20 or +
> 
> Senario-1
> urls/seed.txt:
> --
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
&_
>
pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&catego
ry
> name=SmallBusiness&portletTitle=Small
> Business Features
> 
> I get the fallowing error:
> -
> fetch of
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb
>
=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553
&c
> at
> egoryname=Small Business&portletTitle=Small Business
> *Features failed with: Httpcode=406*
> 
> 
> But if I Start with an URL with %20 instead of space
> 
> Senario-2
> urls/seed.txt:
> --
>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true
&_
>
pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&catego
ry
> name=Small%20Business&portletTitle=Small%20Business%20Features
> 
> Everything works as expected.
> 
> 
> 
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi  wrote:
> 
> >
> > > I am suing the urlnormalizer plugin (urlnormalizer-(pass|regex|basic))
> > and
> > I
> > > put the below rule in the conf/regex-normalize.xml file
> > >
> > > <regex>
> > >   <pattern>\s</pattern>
> > >   <substitution>%20</substitution>
> > > </regex>
> > >
> >
> >
> > Should be escaped backslash:
> >  \\s
> >
> >
> > You can also use + (plus) instead of %20.
> >
> >
> >
> >
> >




RE: URL with Space

2009-09-03 Thread Fuad Efendi
 
> I am suing the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and
I
> put the below rule in the conf/regex-normalize.xml file
> 
> <regex>
>   <pattern>\s</pattern>
>   <substitution>%20</substitution>
> </regex>
> 


Should be escaped backslash:
  \\s


You can also use + (plus) instead of %20.






RE: Nutch truncating URL to 318 Chars

2009-09-01 Thread Fuad Efendi
What does it truncate, 'http://' or 'sId=386'? Or something inside the URL?


Just inject http://business.verizon.net/ ... Nutch should find the rest...

I believe Nutch doesn't have any limit on URL length, although some Web
servers are limited to about 4000 characters...


>
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel
=S
>
MBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFControll
er
>
_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fs
mb
>
%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_
1p
> roductsId=386
> 
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi  wrote:
> 
> > > I opened the part-0 file in the dump folder and there, is only ONE
> > url
> > > and it has been truncated to 318 chars
> > > How make Nutch consider URLs with length more than 318 chars
> >
> > Please provide original (before truncating) sample of such URL
> > Thanks
> >
> >
> >
> >
> >




RE: Nutch truncating URL to 318 Chars

2009-09-01 Thread Fuad Efendi
> I opened the part-0 file in the dump folder and there, is only ONE url
> and it has been truncated to 318 chars
> How make Nutch consider URLs with length more than 318 chars

Please provide original (before truncating) sample of such URL
Thanks






RE: content of hadoop-site.xml

2009-08-26 Thread Fuad Efendi
Unfortunately, you can't manage disk space usage via configuration
parameters... it is not that easy... just try to keep an eye on
services/processes/RAM/swap (disk swapping happens if RAM is not enough)
during the merge; even browse the files/folders and click the 'refresh' button to get an
idea... it is strange that 50G was not enough to merge 2G, maybe the problem is
somewhere else (OS X specifics, for instance)... try to play with Nutch with
smaller segment sizes and study its behaviour on your OS...
-Fuad


-Original Message-
From: alx...@aim.com [mailto:alx...@aim.com] 
Sent: August-26-09 6:41 PM
To: nutch-user@lucene.apache.org
Subject: Re: content of hadoop-site.xml


 


 Thanks for the response. 

How can I check disk swap? 
50GB was before running merge command. When it crashed available space was 1
kb. RAM in my MacPro is 2GB. I deleted tmp folders created by hadoop during
merge and after that OS X does not start. I plan to run merge again and need
to reduce disk space usage by merge. I have read on the net that for
reducing space we must use hadoop-site.xml. But, there is no
hadoop-default.xml file and hadoop-site.xml file is empty.


Thanks.
Alex.


 

-Original Message-----
From: Fuad Efendi 
To: nutch-user@lucene.apache.org
Sent: Wed, Aug 26, 2009 3:28 pm
Subject: RE: content of hadoop-site.xml










You can override default settings (nutch-default.xml) in nutch-site.xml, but
it won't help with disk space; an empty file is OK.

"merge" may generate temporary files, but 50GB against 2GB looks extremely
strange; try to empty the recycle bin, for instance... check disk swap... the OS may
report 50G available but you may still run out of space, for instance due to heavy disk
swap during the merge caused by low RAM...



-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org


-Original Message-
From: alx...@aim.com [mailto:alx...@aim.com] 
Sent: August-26-09 5:33 PM
To: nutch-user@lucene.apache.org
Subject: content of hadoop-site.xml

Hello,

I have run the merge script to merge two crawl dirs, one 1.6G, another 120MB.
But my MacPro with 50G free space did not start after the merge crashed with a "no
space" error. I have been told that OS X got corrupted.
I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. Can
anyone let me know what must be put inside this file in order for merge not
to take too much space.

Thanks in advance.
Alex.





 





RE: content of hadoop-site.xml

2009-08-26 Thread Fuad Efendi
You can override default settings (nutch-default.xml) in nutch-site.xml, but
it won't help with disk space; an empty file is OK.

"merge" may generate temporary files, but 50GB against 2GB looks extremely
strange; try to empty the recycle bin, for instance... check disk swap... the OS may
report 50G available but you may still run out of space, for instance due to heavy disk
swap during the merge caused by low RAM...



-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org


-Original Message-
From: alx...@aim.com [mailto:alx...@aim.com] 
Sent: August-26-09 5:33 PM
To: nutch-user@lucene.apache.org
Subject: content of hadoop-site.xml

Hello,

I have run the merge script to merge two crawl dirs, one 1.6G, another 120MB.
But my MacPro with 50G free space did not start after the merge crashed with a "no
space" error. I have been told that OS X got corrupted.
I looked inside my nutch-1.0/conf/hadoop-site.xml file and it is empty. Can
anyone let me know what must be put inside this file in order for merge not
to take too much space.

Thanks in advance.
Alex.




RE: Is Nutch purposely slowing down the crawl, or is it just really really inefficient?

2009-08-26 Thread Fuad Efendi

Only if the website provided "Last-Modified:" / "ETag:" response headers for the
initial retrieval, and only if it understands the "If-Modified-Since" request
header sent by Nutch... However, even in this case Nutch must be polite and not
issue frequent requests against the same site, even with "If-Modified-Since"
request headers: each HTTP request is logged, a fresh response might be sent
indeed instead of a 304, and each one (even a 304) needs to establish TCP/IP, a
client thread, CPU, etc. - that is, to use server-side resources.


You can have several threads concurrently accessing the same website with crawl
delay 0 only and only if this website is fully under your control and you have
permission.
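
A bare-bones illustration of the conditional request involved (plain JDK here,
not Nutch's protocol plugin; the URL and timestamp are made up):

import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {
  public static void main(String[] args) throws Exception {
    // Pretend we last fetched this page a day ago.
    long lastFetchTime = System.currentTimeMillis() - 24L * 60 * 60 * 1000;

    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://www.example.com/page.html").openConnection();
    conn.setIfModifiedSince(lastFetchTime);  // sends If-Modified-Since

    if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
      // 304: nothing to download, but the server still did real work to answer.
      System.out.println("Not modified; keep the cached copy.");
    } else {
      System.out.println("Changed; re-fetch and re-parse.");
    }
  }
}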


-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org



-Original Message-
From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
Tomblin
Sent: August-26-09 1:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: Is Nutch purposely slowing down the crawl, or is it just really
really inefficient?

On Wed, Aug 26, 2009 at 1:32 PM, MilleBii wrote:
> beware you could create a kind of Denial of Service attack if you search
the
> site too quickly.
>

Well, since I fixed Nutch so that it understands "Last-Modified" and
"If-Modified-Since", it won't be downloading the pages 99% of the
time, just getting a "304" response.

-- 
http://www.linkedin.com/in/paultomblin




RE: Limiting number of URL from the same site in a fetch cycle

2009-08-26 Thread Fuad Efendi
Probably this is suitable:



<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <description>The maximum number of urls per host in a single
  fetchlist.  -1 if unlimited.</description>
</property>



[-topN N] - Number of top URLs to be selected



-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: August-26-09 5:39 AM
To: nutch-user@lucene.apache.org
Subject: Re: Limiting number of URL from the same site in a fetch cycle

 db.max.outlinks.per.page will result in missing links ? Don't want that.
I just would want to balance them on a next fetch cycle.




2009/8/26 Fuad Efendi 

> You can filter some unnecessary "tail" using UrlFilter; for instance, some
> sites may have long forums which you don't need, or shopping cart / process
> to checkout pages which they forgot to restrict via robots.txt...
>
> Check regex-urlfilter.txt.template in /conf
>
>
> Another parameter which equalizes 'per-site' URLs is
> db.max.outlinks.per.page=100 (some sites may have 10 links per page, others
> - 1000...)
>
>
> -Fuad
> http://www.linkedin.com/in/liferay
> http://www.tokenizer.org
>
>
>
> -Original Message-
> From: MilleBii [mailto:mille...@gmail.com]
> Sent: August-25-09 5:48 PM
> To: nutch-user@lucene.apache.org
> Subject: Limiting number of URL from the same site in a fetch cycle
>
> I'm wondering if there is a setting by which you can limit the number of
> urls per site on a fetch list, not a on a total site.
> In this way I could avoid long tails in a fetch list all from the same site
> so it takes damn long (5s per URL), I'd like to fetch them on the next
> cycle.
>
> --
> -MilleBii-
>
>
>


-- 
-MilleBii-




RE: Limiting number of URL from the same site in a fetch cycle

2009-08-25 Thread Fuad Efendi
You can filter some unnecessary "tail" using UrlFilter; for instance, some 
sites may have long forums which you don't need, or shopping cart / process to 
checkout pages which they forgot to restrict via robots.txt...

Check regex-urlfilter.txt.template in /conf


Another parameter which equalizes 'per-site' URLs is 
db.max.outlinks.per.page=100 (some sites may have 10 links per page, others - 
1000...)


-Fuad
http://www.linkedin.com/in/liferay
http://www.tokenizer.org



-Original Message-
From: MilleBii [mailto:mille...@gmail.com] 
Sent: August-25-09 5:48 PM
To: nutch-user@lucene.apache.org
Subject: Limiting number of URL from the same site in a fetch cycle

I'm wondering if there is a setting by which you can limit the number of
urls per site on a fetch list, not a on a total site.
In this way I could avoid long tails in a fetch list all from the same site
so it takes damn long (5s per URL), I'd like to fetch them on the next
cycle.

-- 
-MilleBii-




RE: Nutch bug: can't handle urls with spaces in them

2009-08-25 Thread Fuad Efendi
I don't think this is a bug; modern browsers can deal with many HTML errors
on a "best guess" effort, but it does not mean that their guess is always
correct... an unquoted href value with spaces in it is a very common mistake...
About the URL - yes, I am using some modified pieces of Nutch, with some
preprocessing to guess the correct URL...


Certain documents aren't indexed due to Webmaster mistakes.



-Original Message-
From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
Tomblin
Sent: August-25-09 3:28 PM
To: nutch-user
Subject: Nutch bug: can't handle urls with spaces in them

In my browser, I can see a URL with spaces in it, but when I hover
over it, the browser has replaced the spaces with %20s, and when I
click on it I get the document.  However, when Nutch attempts to
follow the link, it doesn't do that, and so it gets a 404.  It should
do the same thing that web browsers do, or else I'm going to be facing
questions from my users about why certain documents aren't indexed
even though they can see them just fine.

If I do a view source, I can see the URLs with spaces in them:
<a href="http://localhost/Documents/pharma/DocSamples/Leg blood
clots.htm">Leg blood clots.htm</a>

But when I click on them, the URL got converted to:
http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm


-- 
http://www.linkedin.com/in/paultomblin




RE: Could anyone teache me how to index the title of txt?

2007-05-12 Thread Fuad Efendi
Google can't index the "title" of a txt file; what you see for this specific query
is just the first characters of the txt file.

>Some text from txt file, like google or anothe search engines.
>http://www.google.com/search?hl=en&client=opera&rls=en&hs=1FO&q=inurl%3A.tx
t&btnG=Search





RE: Nutch crawler problem

2006-12-06 Thread Fuad Efendi
Do you mean that you can search for an HTML tag and find it?
Some 'tags' should be included in the index, such as the value of
certain attributes, and XML comments (not sure)...
With the simplest HTML page and with default Nutch settings, Nutch must index
all plain text after detecting language settings and removing all HTML
tags... the 'body' of the <body> tag will be indexed.


-Original Message-
From: Damian Florczyk [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 06, 2006 9:19 AM
To: nutch-user@lucene.apache.org
Subject: Nutch crawler problem


Hi there,

I have a small problem: when I'm indexing a few sites, the crawler indexes html
tags too, and when I'm searching using this index there are some
results which are inside html tags. What can I do to remove those tags?


-- 
Damian Florczyk
Gentoo/NetBSD Development Lead




RE: lucene/nutch investigation

2006-12-06 Thread Fuad Efendi
Nutch is mostly an Internet search engine, although users experienced in Java
may develop their own protocol implementations as plugins to Nutch, and their own
parsers in addition to HTML, PDF, even MPG...
It repeats many ideas of Google, including PageRank.
It is difficult to implement Nutch without 'commercial support'...


-Original Message-
From: Phillip Rhodes [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 05, 2006 2:42 PM
To: nutch-user@lucene.apache.org
Subject: Re: lucene/nutch investigation


Bruce,

I recently wrestled with this same issue.

Nutch is good if you need something crawled. (e.g. apache web server, 
file system)
Lucene is good if you need to index something that can't be crawled 
(e.g. database)

While there are exceptions to the above, I would stick to that as a 
general rule of thumb when evaluating  lucene or nutch to use in a 
project.  Of course, an understanding of lucene will probably help out 
with nutch.

IMO.
Phillip

Insurance Squared Inc. wrote:

> Hi Bruce,
>
> This list is not only very active - it's full of people constantly 
> giving helpful, instructive answers.  If you've got questions, this is 
> the place.
>
> I would say based on my experience that nutch is a) excellent and b) 
> not for the faint of heart when it comes to java - you'll need someone 
> who knows what they're doing probably even to get it installed.
>
> g.
>
>
> bruce wrote:
>
>> anybody running lucene/nutch that i can talk to, to exchange 
>> information,
>> ideas.. i'm considering using lucene/nutch for a project, but i have 
>> zero
>> java experience. i'm around the cali/bay area.
>>
>> the guy who was going to be provide the java experience oversold his
>> expertise.. so i might have to bite the bullet on this one.
>>
>> thanks
>>
>> -bruce
>>
>>
>>
>>   
>
>
>







RE: nutch - functionality..

2006-06-24 Thread Fuad Efendi
Bruce,

I had a similar problem a year ago... I needed very specific crawling and data
mining; I decided to use a database, and I was able to rewrite everything
within a week (thanks to the Nutch developers!) (I needed to implement a very
specific business case).

My first approach was to modify the parse-html plugin; it writes 'path' and
'query params' directly to a database, and it writes to the database some
specific 'tokens' such as product name, price, etc.

What I found:
- performance of a database (such as Oracle or Postgres) is the main
bottleneck
- I needed to 'mine' everything in-memory and minimize file read/write
operations (minimize HDD I/O, and use pure Java)

I had some (maybe useful) ideas:
- using statistics (how many anchors with similar text point to the same page
over a period of time), define the 'category' of info, such as 'product
category', 'subcategory', 'manufacturer'
- define a 'dynamic' crawl, e.g. frequently re-crawl 'frequently-queried'
pages

I think existing Nutch is very 'generic', with a lot of plugins such as
'parse-mpg', 'parse-pdf',... It repeats the logic/functionality of Google...
'Anchor text is the true subject of a page!' - Google Bombing...

So, if it is a 'Data Mining' engine, I believe just creating an additional
plugin for Nutch is not enough: you have to define additional classes
('Outlink' does not have the functionality of 'Query Parameters', etc.)... And you
need to define a datastore; the existing WebDB interface is not enough... You will
need to rewrite Nutch... And there are no suitable 'extension points'...


If you need just HTML crawl/mining - focus on it...


-Original Message-
From: bruce 
Sent: Saturday, June 24, 2006 2:40 AM
To: nutch-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Subject: RE: nutch - functionality..


hi fuad,

it looks like you're looking at what i'm trying to do as though it's for a
search engine... it's not...

i'm looking to create a crawler to extract specific information. as such, i
need to emulate some of the function of a crawler. i also need to implement
other functionality that's apparently not in the usual spider/crawler
function. being able to selectively iterate/follow through forms (GET/POST)
in a recursive manner is a requirement. as is being able to selectively
define which form elements i'm going to use when i do the crawling

of course this approach is only possible because i have causal knowledge of
the structure of the site prior to me crawling it...

-bruce



-Original Message-
From: Fuad Efendi 
Sent: Friday, June 23, 2006 8:28 PM
To: nutch-user@lucene.apache.org
Subject: RE: nutch - functionality..


Nutch is plugin-based, similar to Eclipse.
You can extend Nutch functionality; just browse the src/plugin/parse-html source
folder as a sample; you can modify the Java code so that it will handle 'POST'
from forms (Outlink class instances). (I am well familiar with v.0.7.1; the new
version of Nutch is significantly richer.)
It's the easiest starting point: the parse-html plugin...

I don't see any reason why a search engine should return a list of pages found
via POST from forms, and a page found (on a Nutch
search results end-user screen) does not have POST functionality.

Only one case: a response may provide new Outlink instances, such as the response
from the 'Search' pages of e-commerce sites... And most probably such
'second-level' outlinks are reachable via GET; a sample is the 'Search' page with
POST on any e-commerce site...



-Original Message-
From: bruce
Subject: nutch - functionality..


hi...

i might be a little out of my league.. but here goes...

i'm in need of an app to crawl through sections of sites, and to return
pieces of information. i'm not looking to do any indexing, just returning
raw html/text...

however, i need the ability to set certain criteria to help define the
actual pages that get returned...

a given crawling process, would normally start at some URL, and iteratively
fetch files underneath the URL. nutch does this as well as providing some
additional functionality.

i need more functionality

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page

RE: nutch - functionality..

2006-06-23 Thread Fuad Efendi
Nutch is plugin-based, similar to Eclipse.
You can extend Nutch functionality; just browse the src/plugin/parse-html source
folder as a sample; you can modify the Java code so that it will handle 'POST'
from forms (Outlink class instances). (I am well familiar with v.0.7.1; the new
version of Nutch is significantly richer.)
It's the easiest starting point: the parse-html plugin...
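
As a hypothetical sketch of that kind of change (plain org.w3c.dom walking,
the way parse-html's DOMContentUtils walks the tree; it returns plain strings
rather than Nutch Outlink objects to stay version-neutral, and it only keeps
GET form actions, in line with the POST concerns discussed below):

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FormOutlinks {

  // Collect the action URLs of GET forms so the plugin can add them as outlinks.
  public static List<String> formActions(Node node) {
    List<String> actions = new ArrayList<String>();
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "form".equalsIgnoreCase(node.getNodeName())) {
      Element form = (Element) node;
      String method = form.getAttribute("method");
      // Only GET forms: POSTing crawler-generated data would be impolite.
      if (method.length() == 0 || "get".equalsIgnoreCase(method)) {
        actions.add(form.getAttribute("action"));
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      actions.addAll(formActions(children.item(i)));
    }
    return actions;
  }
}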

I don't see any reason why a search engine should return a list of pages found
via POST from forms, and a page found (on a Nutch
search results end-user screen) does not have POST functionality.

Only one case: a response may provide new Outlink instances, such as the response
from the 'Search' pages of e-commerce sites... And most probably such
'second-level' outlinks are reachable via GET; a sample is the 'Search' page with
POST on any e-commerce site...



-Original Message-
From: bruce 
Subject: nutch - functionality..


hi...

i might be a little out of my league.. but here goes...

i'm in need of an app to crawl through sections of sites, and to return
pieces of information. i'm not looking to do any indexing, just returning
raw html/text...

however, i need the ability to set certain criteria to help define the
actual pages that get returned...

a given crawling process, would normally start at some URL, and iteratively
fetch files underneath the URL. nutch does this as well as providing some
additional functionality.

i need more functionality

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page


this kind of functionality would allow the 'nutch' function to be relatively
selective regarding the ability to crawl through a site and extract the
required information

any thoughts/comments/ideas/etc.. regarding this process.

if i shouldn't use nutch, are there any suggestions as to what app i should
use.

thanks

-bruce







FW: following forms using nutch...

2006-06-21 Thread Fuad Efendi

According to HTTP/1.1 specs,
POST method, page 55, RFC 2616: 
"Responses to this method are not cacheable, unless the response 
includes appropriate Cache-Control or Expires header fields." 
http://www.ietf.org/rfc/rfc2616.txt 

So, Nutch _should_not_ store anywhere information retrieved via POST...
Web-Developers _expect_ that such pages won't be cached...

Suppose we have a form on a forum (or simple 'Contact Me' form), will Nutch
post dummy messages? It was fixed as a bug, and Nutch does not follow 'post'
anymore (I believe...)

Thanks


-Original Message-
From: Honda-Search Administrator 

Bruce,

There is no reason you shouldn't be able to use POST, especially if you use 
the opensearch method to display your results.

Matt
- Original Message - 
From: "bruce" 
> hi...
>
> not sure whether this should be a dev/user question...
>
> some of the archives seem to indicate that nutch doesn't/can't/perhaps
> shouldn't follow a form that uses POST... is this correct, and if it is, 
> can
> someone tell me why?
>
> can nutch hand forms that use GET??
>
> i'm looking to extract some information off of public college sites, and
> some of the sites use POST, while others use GET with their forms...
>
> thanks
>
> -bruce



RE: following forms using nutch...

2006-06-21 Thread Fuad Efendi
According to HTTP/1.1 specs,
POST method, page 55, RFC 2616: 
"Responses to this method are not cacheable, unless the response 
includes appropriate Cache-Control or Expires header fields." 
http://www.ietf.org/rfc/rfc2616.txt 

So, Nutch _should_not_ store anywhere information retrieved via POST...
Web-Developers _expect_ that such pages won't be cached...

Suppose we have a form on a forum (or simple 'Contact Me' form), will Nutch
post dummy messages? It was fixed as a bug, and Nutch does not follow 'post'
anymore (I believe...)

Thanks


-Original Message-
From: Honda-Search Administrator 

Bruce,

There is no reason you shouldn't be able to use POST, especially if you use 
the opensearch method to display your results.

Matt
- Original Message - 
From: "bruce" 
> hi...
>
> not sure whether this should be a dev/user question...
>
> some of the archives seem to indicate that nutch doesn't/can't/perhaps
> shouldn't follow a form that uses POST... is this correct, and if it is, 
> can
> someone tell me why?
>
> can nutch hand forms that use GET??
>
> i'm looking to extract some information off of public college sites, and
> some of the sites use POST, while others use GET with their forms...
>
> thanks
>
> -bruce



RE: takes too long to remove a page from WEBDB

2006-02-03 Thread Fuad Efendi
We have following code:

org.apache.nutch.parse.ParseOutputFormat.java
...
[94]toUrl = urlNormalizer.normalize(toUrl);
[95]toUrl = URLFilters.filter(toUrl);
...


It normalizes, then filters the normalized URL, then writes it to /crawl_parse.

In some cases the normalized URL is not the same as the raw URL, and it is not filtered.


-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 03, 2006 10:53 PM
To: nutch-user@lucene.apache.org
Subject: RE: takes too long to remove a page from WEBDB


It will also be generated if a non-filtered page does a "Send Redirect"
to another page (which should be filtered)...

I have same problem in my modified DOMContentUtils.java,
...
if (url.getHost().equals(base.getHost())) { outlinks.add(..); }
...

- it doesn't help, I see some URLs from "filtered" hosts again...
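
(For reference, a small self-contained version of that same-host check - a
case-insensitive host comparison; note it will not catch redirects, which is
the case described above:)

import java.net.MalformedURLException;
import java.net.URL;

class HostCheck {
  // Keep an outlink only when it points to the same host as the base page.
  static boolean sameHost(String outlink, String base) {
    try {
      String a = new URL(outlink).getHost();
      String b = new URL(base).getHost();
      return a.equalsIgnoreCase(b);      // host names are case-insensitive
    } catch (MalformedURLException e) {
      return false;                      // unparseable URLs are dropped
    }
  }
}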


-Original Message-
From: Keren Yu [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 03, 2006 4:01 PM
To: nutch-user@lucene.apache.org
Subject: Re: takes too long to remove a page from WEBDB


Hi Stefan,

As I understand, when you use 'nutch generate' to
generate the fetch list, it doesn't call the urlfilter. Only
'nutch updatedb' and 'nutch fetch' call the
urlfilter. So the page will be generated again after 30 days
even if you use a url filter to filter it.

Best regards,
Keren

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> not if you filter it in the url filter.
> There is a database based url filter I think in the
> jira somewhere  
> somehow, this can help to filter larger lists of
> urls.
> 
> Am 03.02.2006 um 21:35 schrieb Keren Yu:
> 
> > Hi Stefan,
> >
> > Thank you. You are right. I have to use a url
> filter
> > and remove it from the index. But after 30 days
> later,
> > the page will be generated again in generating
> fetch
> > list.
> >
> > Thanks,
> > Keren
> >
> > --- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >
> >> And also it makes no sense, since it will come
> back
> >> as soon the link
> >> is found on a page.
> >> Use a url filter instead  and remove it from the
> >> index.
> >> Removing from webdb makes no sense.
> >>
> >> Am 03.02.2006 um 21:27 schrieb Keren Yu:
> >>
> >>> Hi everyone,
> >>>
> >>> It took about 10 minutes to remove a page from
> >> WEBDB
> >>> using WebDBWriter. Does anyone know other method
> >> to
> >>> remove a page, which is faster.
> >>>
> >>> Thanks,
> >>> Keren
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> 
> 








RE: Recovering from Socket closed

2006-02-01 Thread Fuad Efendi
Nutch-Dev,
What do you use for JobTracker now?
sun.net.www.http.KeepAliveCache will keep HTTP 1.1 persistent connections
open for reuse. It launches a thread that will close the connection after a
timeout. This thread has a strong reference to your classloader (through a
ProtectionDomain). The thread will eventually die, but if you are doing
rapid redeployment, it could still be a problem. The only known solution is
to add "Connection: close" to your HTTP responses.

>However, it definitely seems as if the JobTracker is still waiting 
>for the job to finish (no failed jobs).

>My thought would be to first try stopping the TaskTracker process on 
>the slave.



RE: How many data have you got?

2006-01-31 Thread Fuad Efendi
>> When I performed a whole-web crawl test according to the tutorial, I got
>> Number of pages: 36668
>> Number of links: 46721.
>> Then how many have you got?

>I only played around with Nutch some month ago, and I got as many as
500.000 
>pages and several million links within a few days over my home DSL line.
Your 
>crawler might be stuck somewhere ...?

Number of pages - it's probably the number of Page instances, i.e. the number of
successfully retrieved web pages.
Number of links - probably the total number of Link instances in WebDB,
including links to non-retrieved pages and multiple links to the same Page instance.

Different pages may have different links (with different anchor text and
even different URLs) to the same Page instance; page equality is defined by an
MD5 hash (a checksum of all bytes in the plain HTTP response).
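
(For illustration, that kind of content checksum can be computed with the
standard JDK - a sketch, not the actual WebDB code:)

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class PageDigest {
  // Two fetched pages with the same MD5 over their raw bytes are treated as
  // the same page, regardless of which URLs/anchors pointed to them.
  static String md5Hex(byte[] rawHttpResponseBytes) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5").digest(rawHttpResponseBytes);
    return new BigInteger(1, digest).toString(16);  // hex form of the 128-bit hash
  }
}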

Single page may have hundreds of links, including links to foreign hosts.

Nutch 0.7.1



RE: interesting paper with competing index systems

2006-01-21 Thread Fuad Efendi
Interesting...

I am looking for some "data mining" concept, I found http://opennlp.org,
"Natural Language Processing"...

Information classification, finding new language terms/tokens such as "IBM
T42p", "SuSE Linux 10.0", "Red Rouge", "Break Barrel", etc...



-Original Message-
From: Otis

No, LingPipe is a different beast not to be compared with Nutch nor Lucene.
It doesn't "index" anything in the Lucene sense, although it does create
certain in-memory or on-disk language models.  The authors are very smart
guys!

Oh, also LingPipe was described in Lucene in Action's Case Study chapter,
along with Nutch, and others.

Otis

- Original Message 
From: Fuad

Another interesting tool to perform linguistic analysis on natural language
data:

http://www.alias-i.com/lingpipe/
- is it really "indexing" engine?

They are using NekoHTML parser.




RE: getOutlinks doesn't work properly

2006-01-19 Thread Fuad Efendi

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is larger than zero, content longer than it will be
  truncated; otherwise (zero or negative), no truncation at all.
  </description>
</property>


(default is 65536)



-Original Message-
From: Jack Tang

Hi

Please change the value of the "db.max.outlinks.per.page" property (default is 100)
to, say, 1000.


<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>


/Jack

On 1/20/06, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
>   Hi everyone,
>
>   I found that getOutlinks function in html-parser/DOMContentUtils.java
> doesn't work correctly for some cases. An example is this website:
> http://blog.donews.com/boyla/. The function returns only 170 records,
while
> in fact it contains a lot more (Firefox returns 356 links!).
>
>   When I compare the hyperlink list with the one returned by Firefox, the
> orders are exactly identical, meaning that the 170th link of getOutlinks
> function is the same as the 170th link of Firefox. Therefore, it seems
that
> the algorithm is correct, but there is some bug around. There is no
> threshold at this point, since the max outlinks parameter is set at
updatedb
> part. Even when I increase the max outlinks to 1000, the situation still
> remains.
>
>   Any suggestions are very appreciated.
>
>   Regards,
>   Giang
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars




RE: interesting paper with competing index systems

2006-01-19 Thread Fuad Efendi
Another interesting tool to perform linguistic analysis on natural language
data:

http://www.alias-i.com/lingpipe/
- is it really "indexing" engine?

They are using NekoHTML parser.


-Original Message-
From: Byron Miller 
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf
Anyone have any further details on this?






[off-topic] Web Crawlers Comparison.

2006-01-18 Thread Fuad Efendi
I am new to Linux (as an end user, desktop, SuSE 10.0); it seems that all
this "free" stuff does not work well.

KWebGet is a GUI for crawling, based on Wget. KWebGet is single-threaded!
It works really fast, but needs a lot of RAM for URLs (aka the fetch list);
it doesn't have any database of URLs.

Wget uses a few concurrent sessions to download a single "HTTP Response"
(including HTML, zip, doc, etc.). A few crawlers are built on top of Wget.

Pavuk is multithreaded, but buggy; it works fine with a single thread only.

On Windows, Teleport Ultra has many "netiquette" features, such as dynamic
bandwidth allocation for slow/fast web servers (do we need dynamic
configuration for Nutch?).

And, of course, we should try http://htmlparser.sourceforge.net (it has a
utility class just for crawling)

Thanks

P.S.
Mozilla Firefox is the best debugger (DOM introspector)



RE: throttling bandwidth

2006-01-18 Thread Fuad Efendi
I was totally wrong in my previous 3-4 posts.
ISPs route IP.
Thanks
-Original Message-
From: Andrzej Bialecki

Fuad Efendi wrote:
> For ISPs around-the-world, thew most important thing is the Number of
Active
> TCP Sessions.
>   

This is completely false. Having worked for an ISP I can assure you that 
the most important metric is the amount of traffic, and its behavior 
over time. TCP sessions? We don't need no stinking TCP, we route good 
ol' IP ;)

Please check your facts before claiming something about all ISPs around 
the world.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: throttling bandwidth

2006-01-17 Thread Fuad Efendi
I made a small assumption/mistake in a previous post. Not all of you are using
transport-layer routers (aka firewalls, or layer-4 routers).

But small in-house companies are almost always using SHDSL etc.: IP over
ATM, IP over Frame Relay, ...

Hardware between the crawler and the web site always has limitations such as CPU
and RAM; and IP packets (layer 3 of OSI), and in some cases TCP (layer 4), are
randomly/evenly distributed...

If the hardware allows sending 1,000,000 IP packets per second, and you are
trying to send 1,999,999 IP packets per second, no one else can get
access to the Internet but you, even if you are using just 10% of the total
available bandwidth.

In some cases the equipment gets overloaded even at 55-60% of the total
channel load.



-Original Message-
From: Fuad Efendi 
...
hardware allows to remember (due to RAM and CPU limitations) up to 1,000,000
of IP addresses, and 20,000 TCP ports for each "handshake". And his hardware
randomize bandwidth evenly between 1,000,000 x 20,000 = 20,000,000,000 TCP
connections...



RE: throttling bandwidth

2006-01-17 Thread Fuad Efendi
Andrzej,


I think I really need to provide more details here, just as a sample:

Ted Rogers is an ISP with an 8000 Mbps synchronous connection to UUNet. Its
hardware can remember (due to RAM and CPU limitations) up to 1,000,000
IP addresses, and 20,000 TCP ports for each "handshake". And its hardware
distributes bandwidth evenly between 1,000,000 x 20,000 = 20,000,000,000 TCP
connections. Why evenly? Because it is the cheapest solution: no CPU required,
no network latency.

So, for instance, you use more bandwidth if you use more TCP connections
(because the connection to the UUNet backbone is shared between many users).

Now, consider a big building with a router on the roof which allows only 1024
TCP sessions... If you are using 512 TCP threads, you are guaranteed to use
at least 50% of the total bandwidth of the shared channel (if your "last mile"
allows it).

This is the case when a 100 Mbps "last mile" is dedicated, and a 1000 Gbps "before
last mile" is shared between 256 users with a 65536-session limitation for
all of them.

ISPs' employees usually don't know such details. They are "help desk", and
they usually ask "Could you please use less than 50 GB download per month?
What is your IE version?!"

So, suggestions for crawlers:
1. Consider educating the ISP's employees
2. Decrease the number of concurrent "alive" threads (aka concurrently open TCP
sockets) - see the sketch after this list
3. Increase bandwidth
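
(A generic illustration of point 2 - not Nutch's fetcher - where a semaphore caps
how many sockets are open at once, no matter how many worker threads exist;
readAllBytes() needs Java 9+, otherwise copy the stream manually:)

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Semaphore;

public class BoundedFetcher {
  // At most 32 TCP connections open at any moment.
  private final Semaphore permits = new Semaphore(32);

  public byte[] fetch(String url) throws Exception {
    permits.acquire();                   // wait until a connection "slot" is free
    try {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestProperty("Connection", "close");   // no keep-alive
      try (InputStream in = conn.getInputStream()) {
        return in.readAllBytes();
      }
    } finally {
      permits.release();                 // free the slot even on errors
    }
  }
}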




-Original Message-
From: Andrzej Bialecki 


Fuad Efendi wrote:
> For ISPs around-the-world, thew most important thing is the Number of
Active
> TCP Sessions.
>   

This is completely false. Having worked for an ISP I can assure you that 
the most important metric is the amount of traffic, and its behavior 
over time. TCP sessions? We don't need no stinking TCP, we route good 
ol' IP ;)

Please check your facts before claiming something about all ISPs around 
the world.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: throttling bandwidth

2006-01-17 Thread Fuad Efendi
Yes, it is completely wrong, just because ISPs' employees usually ask
questions like "What is your OS version? What is your hard drive?" etc. I
gave very old info; maybe it was true 4-5 years ago.

Cisco licenses their PIX by the number of concurrent TCP sessions, and that is not
IP... It is on a different layer...

Of course, an ISP may have a different policy depending on their technology and
their connections to other ISPs; they are all intermediaries...

Which ISP have you worked for, UUNet? WorldCom...


TCP is over IP. Always.




-Original Message-
From: Andrzej Bialecki

Fuad Efendi wrote:
> For ISPs around-the-world, thew most important thing is the Number of
Active
> TCP Sessions.
>   

This is completely false. Having worked for an ISP I can assure you that 
the most important metric is the amount of traffic, and its behavior 
over time. TCP sessions? We don't need no stinking TCP, we route good 
ol' IP ;)

Please check your facts before claiming something about all ISPs around 
the world.

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: throttling bandwidth

2006-01-16 Thread Fuad Efendi
For ISPs around the world, the most important thing is the Number of Active
TCP Sessions.

Manufacturers such as Cisco sell/license their hardware with different
options: 1024 sessions, 65536 sessions, etc.

Backbones are shared between users, and you can starve everyone else by using
1024 sessions.

ISPs don't like "download accelerators" such as Wget, which use a few TCP
sessions for a single file download.


-Original Message-
From: Insurance Squared Inc. [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 16, 2006 6:03 PM
To: nutch-user@lucene.apache.org
Subject: throttling bandwidth


My ISP called and said my nutch crawler is chewing up 20mbits on a line 
he's only supposed to be using 10.   Is there an easy way to tinker with 
how much bandwidth we're using at once?  I know we can change the number 
of open threads the crawler has, but it seems to me this won't make a 
huge difference.  If I chop the number of open threads in half, it'll 
just download half the pages, twice as fast?  I stand to be corrected on 
this.

Any other thoughts? doesn't have to be correct or elegant as long as it 
works. 

Failing a reasonable solution in nutch, is there some sort of linux 
level tool that will easily allow me to throttle how much bandwidth the 
crawl is using at once?

Thanks.






RE: How can no URLs be fetched until the 11th round of fetching?

2006-01-16 Thread Fuad Efendi
You are fetching one single URL in a loop of 14 fetches (14 separate JVM
executions). It is possible that something was wrong with the DNS servers in your
LAN/WAN at the time of these fetches. At the 11th fetch, DNS-to-IP was resolved,
and Nutch was able to fetch the first page, and subsequent pages. Just as a sample.

Difficult to reproduce...


=
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173409

0 here means the size of a segment, NOT the number of pages in a fetch list. Check
org.apache.nutch.segment.SegmentReader; SegmentReader.size is the number of
entries in FetcherOutput.DIR_NAME.

The number of entries in FetcherOutput.DIR_NAME should probably be the total
of successes/errors (different HTTP responses); I can't go into very deep
detail now... 0 means that no pages were fetched (not even HTTP errors)...




-Original Message-
From: Bryan Woliner
Sent: Sunday, January 15, 2006 11:44 PM
To: 
Subject: Re: How can no URLs be fetched until the 11th round of fetching?


I don't think that I was completely clear in my first post. What you
are saying makes sense if I was doing a one-round fetch on a number of
different occasions. However, I am doing 14 rounds of fetching each
called by one script, in the pattern outlined in the nutch tutorial,
where my script does 14 loops of the following:
--

bin/nutch generate db segments
s[i]=`ls -d segments/2* | tail -1`
bin/nutch fetch $s[i]
bin/nutch updatedb db $s[i]
--

Do you think the possibilities you suggested makes sense in light of
the fact that I am doing each of these rounds of fetching within
seconds of each other, each being called by the same script?

I also have a couple of related questions?

(1) In the first round of fetching, the fetchlist is generated from
the database, which was injected with the one URL that comprises my
urls file. If in the first round of fetching, the one URL in the fetch
list can't be fetched and/or parsed, I am assuming that subsequent
rounds of fetching just used the same one-URL fetchlist until this URL
is successfully fetched and its outlinks added to the database. Is
that correct?

(2) When I call the following command, the resulting file has no
output for the rounds where no URLs were fetched. This leads me to
believe that the fact that no URLs were fetched is not a result of a
fetching or parsing error (since such errors usually show up in the
output of this command). Does this make sense? If it does, then what
caused no URLs to be fetched.

Thanks for any helpful suggestions,
Bryan

On 1/15/06, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Many things could happen.
>
> Sample1: website was unavailable during first 10 fetches
> Sample2: 11th fetch used different IP, DNS-to-IP mapping changed (or may
be
> finally resolved!)
> Sample3: Smth changed on a site, "redirect" added/changed, etc.
> Sample4: web-master modified robots.txt
> Sample5: big first HTML file, network errors during first 10 fetch
attempts,
> etc.
>
> It should be very uncommon behaviour, but it may happen...
>
>
> -Original Message-
> From: Bryan Woliner
>
> I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14
> rounds of fetching and an urls files with one URL in it. No urls were
> fetched during the first 10 rounds, but then in the 11th round one URL was
> fetched and increasing more URLs were fetched in rounds 12-14. I am basing
> the numbers of URLs fetched  on the  output from calling bin/nutch segread
> (included below). I don't understand how this can happen. If a URL is not
> fetched during a round are its outlinks still added to the database for
the
> next round of fetching? Why would I have 10 rounds of fetching with no
URLs
> fetched and then suddenly have one fetched successfully in the 11th round?
>
> Any suggestions are appreciated.
> -Bryan
>
> Here is the output when I call:
>
> bin/nutch segread -list -dir segments
>
> run java in /usr/local/j2sdk1.4.2_08
> 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
> 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
> 060115 205601 No FS indicated, using default:local
> 060115 205601 PARSED?   STARTED FINISHED
> COUNT   DIR NAME
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173409
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173413
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173417
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173421
> 060115 205601 true  19691231-18:00:00   19691231-18:00:00
> 0   ../segments/20060115173424
> 060115 205601 true  19691231-18:

RE: How can no URLs be fetched until the 11th round of fetching?

2006-01-15 Thread Fuad Efendi
Many things could happen.

Sample1: website was unavailable during first 10 fetches
Sample2: 11th fetch used different IP, DNS-to-IP mapping changed (or may be
finally resolved!)
Sample3: Smth changed on a site, "redirect" added/changed, etc.
Sample4: web-master modified robots.txt
Sample5: big first HTML file, network errors during first 10 fetch attempts,
etc.

It should be very uncommon behaviour, but it may happen...


-Original Message-
From: Bryan Woliner 

I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14
rounds of fetching and an urls files with one URL in it. No urls were
fetched during the first 10 rounds, but then in the 11th round one URL was
fetched and increasing more URLs were fetched in rounds 12-14. I am basing
the numbers of URLs fetched  on the  output from calling bin/nutch segread
(included below). I don't understand how this can happen. If a URL is not
fetched during a round are its outlinks still added to the database for the
next round of fetching? Why would I have 10 rounds of fetching with no URLs
fetched and then suddenly have one fetched successfully in the 11th round?

Any suggestions are appreciated.
-Bryan

Here is the output when I call:

bin/nutch segread -list -dir segments

run java in /usr/local/j2sdk1.4.2_08
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
060115 205601 No FS indicated, using default:local
060115 205601 PARSED?   STARTED FINISHED
COUNT   DIR NAME
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173409
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173413
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173417
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173421
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173424
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173428
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173432
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173436
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173440
060115 205601 true  19691231-18:00:00   19691231-18:00:00
0   ../segments/20060115173443
060115 205602 true  20060115-17:34:51   20060115-17:34:51
1   ../segments/20060115173447
060115 205602 true  20060115-17:34:57   20060115-17:41:07
42  ../segments/20060115173454
060115 205602 true  20060115-17:41:16   20060115-18:12:28
234 ../segments/20060115174113
060115 205602 true  20060115-18:12:37   20060115-19:51:07
738 ../segments/20060115181234
060115 205602 TOTAL: 1015 entries in 14 segments.



RE: [Nutch-general] Error running MapReduce - Jetty server & .jsp files

2006-01-15 Thread Fuad Efendi
Include tools.jar in your classpath.
tools.jar contains the javac compiler; it is not included with Sun's JRE.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Sunday, January 15, 2006 12:51 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Error running MapReduce - Jetty server & .jsp
files


I believe javac is only included in the full JDK (not the JRE).  Do 'which
javac' or 'locate javac', or just do a find in your $JAVA_HOME dir and look
for javac.

Otis

- Original Message 
From: Ken Krugler <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Sat 14 Jan 2006 05:50:00 PM EST
Subject: [Nutch-general] Error running MapReduce - Jetty server & .jsp files

Hi all,

After a bit of thrashing with RSA keys and having to move 
/Nutch/src/webapps up to /Nutch, I've gotten the 1/12/2006 build of 
Nutch running on three servers.

The "master" is running as a NameNode & JobTracker, and two slaves 
are running as DataNodes and TaskTrackers.

I'm running into a problem with using the JobTracker web interface.

I can see the two .jsp files (jobdetails.jsp and jobtracker.jsp) when 
I point my browser at http://master:50030, but when I actually try to 
run one of the JSPs (e.g. http://master:50030/jobdetails.jsp) I get a 
500 error.

The nutch-crawler-jobtracker-main1.log file on the master tells me 
that Jetty wasn't able to compile the .jsp because of a classpath 
problem. The relevant portion of the log says:

060114 110818 SEVERE Javac exception
Unable to find a javac compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK

[snip]

I've verified that JAVA_HOME is set to /usr/java/jre1.5.0_05

Is the problem because this isn't a full JDK?

What's confusing to me is that the classpath dumped by Jetty in the 
log looks like:

classpath=/tmp/Jetty__50030___24406:/usr/java/jre1.5.0_05/lib/ext/localedata
.jar:/usr/java/jre1.5.0_05/lib/ext/sunpkcs11.jar 
[snip]

So obviously somebody is using JAVA_HOME to build the path to these .jar
files.

But JAVA_HOME (the top-level path, ie /usr/java/jre1.5.0_05) isn't a 
member of this classpath.

Any help would be appreciated!

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200









RE: Would Someone Give Me Pointer On How to Index Database?

2005-10-26 Thread Fuad Efendi
Question: what do you need to index?

The simple answer "I need to index my MySQL" is not enough... MySQL has its own
indexes...

Nutch is an Internet search engine; Lucene is a framework for indexing and
searching any text information... Does your database contain huge text
fields, "Documents"?


-Original Message-
From: Sam Lee [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 26, 2005 3:35 PM
To: java-user@lucene.apache.org
Subject: Would Someone Give Me Pointer On How to Index Database?


Hi,
  I want to use Lucene/Nutch to index my mysql
database.  I think of using JDBC, is it a good idea? 
I searched all over the web, but all the examples are
non-lucene/Nutch related.  Would you guys give me
pointers or websites or examples on how to use JDBC on
Lucene/Nutch to index mysql database?  

Many thanks.







RE: New Nutch User

2005-10-14 Thread Fuad Efendi
Nutch does support A9's OpenSearch extensions to RSS.

I think it would be easier to start with pure Nutch, and then learn some
JSP/servlets... if you need your own crawler...


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 13, 2005 1:33 PM
To: nutch-user@lucene.apache.org
Subject: RE: New Nutch User


Thanks so much for your help.  One more question I've never written a
wrapper before.  I did some searching online and found SWIG
(http://www.swig.org) which seems like it can help me write a wrapper.

Does anyone have some examples of a wrapper I can use, or will SWIG be my
best bet?  Ultimately my goal for Nutch is to create a site similar to
Indeed.com.  Any suggestions would be greatly appreciated.  Thanks!

-Original Message-
From: Ngoc Giang Nguyen [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 12, 2005 1:31 PM
To: nutch-user@lucene.apache.org
Subject: Re: New Nutch User

I think Nutch 0.7 supports OpenSearch protocol, so that you don't need to
digest much on Java code. Just treat Nutch as a web service, and you can
write wrapper on any scripting language that you love to handle HTTP
requests/responses.


On 10/13/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]>
wrote:
>
> I am new to Nutch. I love it, but am not sure if I can handle putting 
> this together by myself. I run Red Hat Linux boxes with apache. I have 
> knowledge of HTML, some Java, MYSQL, PHP and Linux.
>
>
>
> Will I be able to get Nutch up and running to crawl multiple sites on 
> the internet the way a basic search engine does? What skills am I 
> missing or need to learn?
>
>
>
> My main problem, is that I am pretty confident that I can get Nutch 
> installed on my machines, but I'm not too sure how to integrate it 
> into the front end of my site. Is it just a simple POST or GET form, 
> or is it very JAVA intensive?
>
>
>
> Any suggestions would be greatly appreciated.
>
>
>
> Thanks
>
>
>





RE: SocketTimeoutException

2005-09-23 Thread Fuad Efendi
AJ,
1000 threads - very good, but... at least with the existing J2SE from
Sun, it will perform badly!

Preferable: 32 processes, 32 threads each... At least with "The Grinder"
and 2 GB of memory (I don't know about Nutch!)... You should really
calculate everything: CPU, memory for each thread, ...
1000 threads is too much...

P.S.
Increase "timeout", or decrease "threads".


-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 23, 2005 1:41 PM
To: nutch-user
Subject: SocketTimeoutException


In my crawling of large number of selected sites, the number of threads 
is automatically determined by the number of pages on the fetchlist in 
each fetch/updatedb cycle. Max number of threads is set to 1000.  When 
using 1000 threads, there are lots of SocketTimeoutException in fetching
toward the end of the fetch cycle. Any suggestion for reducing
SocketTimeoutException?  I also notice that the SocketTimeoutException 
errors are not counted in the error count for segment status. Why is
that?

Relevant parameters set:
http.timeout=1
http.max.delays=10
fetcher.threads.fetch= from 10 to 1000 depending on the size of
fetchlist

Appreciate your help,
AJ





RE: Nuch capability

2005-08-29 Thread Fuad Efendi
Nutch has plugins for PDF, MSWORD, JavaScript, HTML, Text, and RSS.
You may enable them using nutch-site.xml, copying/modifying this section
from nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|js)|index-basic|query-(basic|site|url)</value>
  ...
</property>

parse-(text|html|pdf|msword|js)

 
You may also design your own plugins (Excel, PowerPoint).

As I understand, the main purpose of a parser is to extract plain text for
indexing, the Outlink[] array, and metadata. If your PDFs have metadata
with an Author field, it should probably work...

I am currently working on specific anchor processing (indexing,
analyzing). You also have access to the WebDB database with Link objects,
and the anchor text elements linking to a Page.
 
Check WebDBReader, WebDBAnchors.

HTDIG can't do this...



-Original Message-
From: Valmir Macário [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 3:18 PM
To: nutch-user@lucene.apache.org
Subject: Nuch capability


Can someone tell me whether I can index other types of documents like doc, pdf,
ppt, xls... and whether I can return them from a repository on an intranet with
some characteristics which I can choose, like the author? I'm in doubt whether
to use Nutch or htdig for this purpose. Can someone help me?




RE: For day-to-day usage, which commands should I execute?

2005-08-28 Thread Fuad Efendi
Just kidding, I tried to execute Parse separately...

generate db segments
fetch -threads 1 segments\$s
updatedb db segments\$s
index segments\$s

Is that enough for simple Refresh/Update of a search engine? Executed
only once... One new segment, and "index segments"... 


-Original Message-
From: EM [mailto:[EMAIL PROTECTED] 
Sent: Sunday, August 28, 2005 8:21 PM
To: nutch-user@lucene.apache.org
Subject: RE: For day-to-day usage, which commands should I execute?


Why fetch -noParsing?


-Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Sunday, August 28, 2005 7:50 PM
To: nutch-user@lucene.apache.org
Subject: For day-to-day usage, which commands should I execute?

I created initial WebDB, Segments, Index; I executed bin/nutch crawl.

For day-to-day usage, which commands should I execute?

generate db segments
fetch -noParsing -threads 1 segments\$s
parse segments\$s
updatedb db segments\$s
index segments\$s

Am I right, is one cycle (one segment) enough for existing WebDB?

Thanks,
Fuad






RE: Index local file.

2005-08-22 Thread Fuad Efendi
Hi Benny,


Nutch is a generic web search engine, with distributed file system
support (NFS, for the hugest crawls and indexes), and a framework... It has
plugins; you can probably design and use a "FILE" plugin instead of
"PROTOCOL-HTTP".

However... What about hyperlinks and anchors?

Indexing the presentation layer of someone else's site is very difficult, and
fuzzy... HTML has formatting, and 95% of the extracted plain text (select
options, headers, footers, menus, reviews, ...) does not really need to be
indexed...

If you need to index local files, it is best to use Lucene directly,
possibly using the org.apache.nutch.searcher package for a web
front-end (if you really need a web front-end), especially if you have
access to the data layer (bypassing presentation such as HTML).

For all intranet-related tasks: Lucene.

If you have a small amount of HTML, you can index your web server directly
via HTTP without performance impact; it's easy... without any logic, you
will index everything...


Regards,
Fuad


-Original Message-
From: Benny [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 22, 2005 2:54 PM
To: nutch-user@lucene.apache.org
Subject: Index local file.


Hi,

Can someone give me some hints on how to index local files?

I have a lot of plain HTML files (more than 50K pages; the size is
around 2-3k/page). I'd prefer not to put them on a web server and
use URLs to index them. I'd like NUTCH to index them from the local HD. Is
it possible? If it is, what kind of URL do I need to inject into the db? For
example, with a web server we use

http://domain/file.html

What about the local HD files' format? I believe there is no more "http", so
what is the protocol supposed to be? These files are still in plain HTML format.


Benny




RE: How to view the content of fetched pages?

2005-08-22 Thread Fuad Efendi
Olena,

To serialize HTML you can use NekoHTML: create a plugin (like
parse-html) and execute "nutch parse".
http://people.apache.org/~andyc/neko/doc/html/filters.html#filters.serialize

"bin/nutch parse" will execute the parse-html plugin.
To access non-parsed content (after fetching) you can start from the
ParseSegment utility as a sample.



-Original Message-
From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 22, 2005 12:50 PM
To: nutch-user@lucene.apache.org
Subject: Re: How to view the content of fetched pages?


Hi,
You have to write some Java code (the easiest way to start is to use
SegmentReader) to access the Content objects stored in segments.
Regards,
Piotr

Olena Medelyan wrote:
> Hi,
> 
> I would like to use Nutch only as a (whole web) crawler, without the 
> indexing stage... After I've completed the fetching stage, how can I 
> access the database with the crawled data, in particular the texts of 
> the fetched pages? I tried to use segread and readdb from the command 
> line, unfortunately with no success.
> 
> Cheers, Olena
> 
> 





RE: [Nutch-general] Re: about the nutch function

2005-08-20 Thread Fuad Efendi
If you need a search engine, you can use Nutch.
For the front-end, you can use links to the images on the source sites; it's
better for the performance of your own site.

If you just need to collect image files, probably the best is
http://htmlparser.sourceforge.net/
(org.htmlparser.parserapplications.SiteCapturer).
However, I'd prefer to write some additional readers for Nutch! Better
for troubleshooting; I need it.
Thanks


-Original Message-
From: Zhou LiBing [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 20, 2005 7:59 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: about the nutch function


can I specify more than one URL to crawl the whole web? Are you sure? How
do I edit the crawl-urlfilter.txt to fetch the images? How could I extract
these *segment* images' features?
 thank you
   

 2005/8/19, Piotr Kosiorowski <[EMAIL PROTECTED]>: 
> 
> Yes.Yes. :).
> You can specify more than one url while injecting pages to WebDB. You 
> can fetch the image file (you have to edit crawl-urlfilter.txt or 
> regex-urlfilter.txt to allow particular extension as majority of image

> extensions are blocked by default). But such data would be only stored

> in segment - I do not think it would be accesible by search.
> P.
> 
> 
> On 8/19/05, Zhou LiBing <[EMAIL PROTECTED]> wrote:
> > Can Nutch use one or more start URL to crawl the WEB?
> > Can Nutch fetch the IMAGE file?
> > thank you
> >
> >
> > --
> > ---Letter From your friend Blue at HUST CGCL---
> >
> >
> 
> 
> 



-- 
---Letter From your friend Blue at HUST CGCL---



RE: [Nutch-general] Re: about the nutch function

2005-08-20 Thread Fuad Efendi
Such data is called "Content" (a Java class); it has HTTP headers, and you can
check those headers ("text/html", "jpg", etc. MIME types).
It is stored in the folder "content"...
You can extract it and save it to the hard drive as regular files; you need
specific code (Nutch does not have it). Simply extract...
Check the Fetcher class, "ArrayFile.Writer contentWriter;"
You also need to modify the url-filter files...

Content.getContentType() - the MIME type
byte[] Content.getContent() - the file content (JPG, HTML, DOC, PDF, ...)
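
(A tiny sketch using just those two methods, assuming the Content class from
org.apache.nutch.protocol; how you iterate over the segment's Content entries is
left out, and the file-naming part is made up:)

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.nutch.protocol.Content;

class ContentSaver {
  // Write the raw bytes of an image response to disk; 'name' is whatever
  // file name you derive from the URL.
  static void saveIfImage(Content content, File outDir, String name) throws IOException {
    String type = content.getContentType();          // MIME type of the response
    if (type != null && type.startsWith("image/")) {
      FileOutputStream out = new FileOutputStream(new File(outDir, name));
      try {
        out.write(content.getContent());             // raw response bytes
      } finally {
        out.close();
      }
    }
  }
}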




-Original Message-
From: Zhou LiBing [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 20, 2005 7:59 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: about the nutch function


can I specify more than one URL to crawl the whole web? Are you sure? How
do I edit the crawl-urlfilter.txt to fetch the images? How could I extract
these *segment* images' features?
 thank you
   

 2005/8/19, Piotr Kosiorowski <[EMAIL PROTECTED]>: 
> 
> Yes. Yes. :).
> You can specify more than one url while injecting pages to WebDB. You
> can fetch the image file (you have to edit crawl-urlfilter.txt or
> regex-urlfilter.txt to allow particular extensions, as the majority of image
> extensions are blocked by default). But such data would only be stored
> in the segment - I do not think it would be accessible by search.
> P.
> 
> 
> On 8/19/05, Zhou LiBing <[EMAIL PROTECTED]> wrote:
> > Can Nutch use one or more start URL to crawl the WEB?
> > Can Nutch fetch the IMAGE file?
> > thank you
> >
> >
> > --
> > ---Letter From your friend Blue at HUST CGCL---
> >
> >
> 
> 
> 



-- 
---Letter From your friend Blue at HUST CGCL---



RE: crawled page are not in HTML -- what should I do?

2005-08-17 Thread Fuad Efendi
I am a newbie too...

Look at src/plugin/parse-html, ParseHtml.java - here you can work
directly with the Content object (the binary HTTP response), split it into
HTTP headers, meta tags, and body, and parse it...

public Parse getParse(Content content){...}

This method is called from org.apache.nutch.Fetcher

It seems that Nutch stores only parsed data, in "gzip" format, and in my
case I don't need to store the plain HTML - only a subset of the HTML.
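
(As a rough standalone illustration - not a Nutch plugin - of pulling just the
text subset out of the raw HTML bytes, assuming NekoHTML and Xerces are on the
classpath and that NekoHTML's DOMParser accepts a SAX InputSource as in its
documented samples:)

import java.io.ByteArrayInputStream;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HtmlText {
  // Parse raw HTML bytes with NekoHTML and collect the visible text nodes.
  static String extractText(byte[] rawHtml) throws Exception {
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new ByteArrayInputStream(rawHtml)));
    StringBuffer text = new StringBuffer();
    collect(parser.getDocument(), text);
    return text.toString();
  }

  private static void collect(Node node, StringBuffer text) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      text.append(node.getNodeValue()).append(' ');
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collect(child, text);
    }
  }
}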


-Original Message-
From: Sarah Zhai [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 17, 2005 8:51 PM
To: nutch-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Subject: crawled page are not in HTML -- what should I do?


Hi,
I'm a newbie to Nutch.
I installed nutch and use it to do the crawling successfully.

The point is, I checked the crawled files under /segments/***/fetcher/ 
and they are not in .html or other similar format. 
(There are two files named "data" and "index" under each subfolder.)

Since I want to crawl thousands of web pages and parse the
HTML code of each web page...I was wondering, what should I 
do so that the crawled pages can be in HTML format?

Thanks.

--
sarah



RE: VOTE: (Re: RSS Feed Parser)

2005-08-12 Thread Fuad Efendi
+1
Need more samples!
(Can I vote? I am a novice... Just a few funny days!)


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 11, 2005 6:08 PM
To: nutch-user@lucene.apache.org
Subject: VOTE: (Re: RSS Feed Parser)


Chris Mattmann wrote:
> Hi Zaheed,
> 
>  Thanks for the nice comments. I've went ahead and wrote an HTML page 
> that summarizes what I sent to Zaheed with respect to installing the 
> parse-rss plugin. You can find the small guide here:
> 
> http://www-scf.usc.edu/~mattmann/parse-rss-install.html
> 

My apologies to Chris - I was supposed to import this plugin before the 
release, however due to changes in my travel plans and other work I ran 
out of time... :-(

We are in the no-commit period now, before the release. I could do the 
import now, if other committers approve this exception. As a safety 
measure against this short testing period I would leave it disabled by 
default.

Please vote +1 if I should commit it before the release, or -1 if after.

-- 
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web ___|||__||
\|  ||  |  Embedded Unix, System Integration http://www.sigram.com
Contact: info at sigram dot com





RE: [Nutch-general] How to extend Nutch

2005-08-11 Thread Fuad Efendi
OK, I see it...
It uses the nutch-site.xml file from the JAR file, not from the conf directory
(where I updated nutch-site.xml). Rebuild...


-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 11, 2005 5:12 PM
To: nutch-user@lucene.apache.org
Subject: RE: [Nutch-general] How to extend Nutch


I copied this into nutch-site.xml (added the CreativeCommons plugin):


<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|creativecommons</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>


That's enough? Doesn't work, need help...
Thanks


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 10, 2005 4:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: [Nutch-general] How to extend Nutch


nutch-site.xml is the only config file you should touch, by copying  
the appropriate section from nutch-default.xml and customizing it.

Yes, you will need to write a custom plugin like the creativecommons  
one.

 Erik


On Aug 10, 2005, at 2:44 PM, Fuad Efendi wrote:

>
> I probably need to work with plugins, and to modify config files... I
> need to add additional field to Document, and to show it on a web-page
>
> nutch-conf.xsl
> nutch-default.xml
> nutch-site.xml
>
> Am I right?
> Thanks
>
>
> -Original Message-
> From: Fuad Efendi [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 10, 2005 2:15 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: [Nutch-general] How to extend Nutch
>
>
> So, I need to modify some existing classes, isn't it?
>
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 10, 2005 1:48 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [Nutch-general] How to extend Nutch
>
>
> Probably IndexingFilter or HtmlParser for indexing and for indexing I
> think there is something in org.apache.nutch.search some class
> that
> starts with Raw  I just saw this in the Javadoc earlier.
>
> Otis
>
> --- Fuad Efendi <[EMAIL PROTECTED]> wrote:
>
>
>> I need specific pre-processing of a html-page, to add more fields to
>> Document before storing it in Index, and to modify web-interface 
>> accordingly.
>>
>> Where is the base point of extension?
>> Thanks!
>>
>










RE: [Nutch-general] How to extend Nutch

2005-08-10 Thread Fuad Efendi

I probably need to work with plugins, and to modify config files... I
need to add additional field to Document, and to show it on a web-page

nutch-conf.xsl
nutch-default.xml
nutch-site.xml

Am I right? 
Thanks


-Original Message-
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 10, 2005 2:15 PM
To: nutch-user@lucene.apache.org
Subject: RE: [Nutch-general] How to extend Nutch


So, I need to modify some existing classes, isn't it?


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 10, 2005 1:48 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] How to extend Nutch


Probably IndexingFilter or HtmlParser for indexing and for indexing I
think there is something in org.apache.nutch.search some class that
starts with Raw  I just saw this in the Javadoc earlier.

Otis

--- Fuad Efendi <[EMAIL PROTECTED]> wrote:

> I need specific pre-processing of a html-page, to add more fields to
> Document before storing it in Index, and to modify web-interface 
> accordingly.
> 
> Where is the base point of extension?
> Thanks!
> 



RE: [Nutch-general] How to extend Nutch

2005-08-10 Thread Fuad Efendi
So, I need to modify some existing classes, isn't it?


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 10, 2005 1:48 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] How to extend Nutch


Probably IndexingFilter or HtmlParser for indexing and for indexing I
think there is something in org.apache.nutch.search some class that
starts with Raw  I just saw this in the Javadoc earlier.

Otis

--- Fuad Efendi <[EMAIL PROTECTED]> wrote:

> I need specific pre-processing of a html-page, to add more fields to 
> Document before storing it in Index, and to modify web-interface 
> accordingly.
> 
> Where is the base point of extension?
> Thanks!
> 
> 
> 
> 





How to extend Nutch

2005-08-10 Thread Fuad Efendi
I need specific pre-processing of an HTML page, to add more fields to the
Document before storing it in the index, and to modify the web interface
accordingly.

Where is the base point of extension?
Thanks!