Re: Automating workflow using ndfs

2005-09-02 Thread Earl Cahill
 The goal is to avoid entering 100,000 regexes in
 crawl-urlfilter.xml and checking ALL these regexes for each URL.
 Any comment?

Sure seems like a simple hash lookup table could handle it.  I am
having a hard time seeing when you really need a regex and a fixed
list wouldn't do, especially if you have a forward and maybe a
backwards lookup as well in a multi-level hash, to include/exclude
at a certain subdomain level, like

include: com-site-good (for good.site.com stuff)
exclude: com-site-bad (for bad.site.com)

and walk the host backwards, kind of like DNS.  Then you could just
do a few hash lookups instead of 100,000 regexes.

I realize I am talking about host-level and not page-level filtering,
but if you want to include everything from your 100,000 sites, I
think such a strategy could work.

Hope this makes sense.  Maybe I could write some code and see if it
works in practice.  If nothing else, maybe the hash stuff could just
be another filter option in conf/crawl-urlfilter.txt.
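
Just to make the idea concrete, here is a rough, self-contained Java
sketch of the reversed-host walk.  All class and method names are made
up for illustration; this is not the actual Nutch URLFilter extension
point, just the lookup idea:

import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of the reversed-host lookup: decisions are keyed by a
 * reversed host ("com-site-good"), and lookups walk from the most
 * specific key down to the TLD, so a few hash probes replace a scan
 * over 100,000 regexes.
 */
public class HostListFilter {

  // true = include, false = exclude, keyed by reversed host
  private final Map<String, Boolean> rules = new HashMap<String, Boolean>();

  public void addRule(String reversedHost, boolean include) {
    rules.put(reversedHost, include);
  }

  // "good.site.com" -> "com-site-good"
  static String reverseHost(String host) {
    String[] parts = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      if (sb.length() > 0) sb.append('-');
      sb.append(parts[i]);
    }
    return sb.toString();
  }

  // Walk backwards like DNS: "com-site-good", then "com-site", then "com".
  // The most specific rule wins; hosts with no rule at all are excluded.
  public boolean accept(String host) {
    String key = reverseHost(host);
    while (true) {
      Boolean rule = rules.get(key);
      if (rule != null) {
        return rule.booleanValue();
      }
      int cut = key.lastIndexOf('-');
      if (cut < 0) {
        return false;
      }
      key = key.substring(0, cut);
    }
  }

  public static void main(String[] args) {
    HostListFilter filter = new HostListFilter();
    filter.addRule("com-site-good", true);   // include good.site.com
    filter.addRule("com-site-bad", false);   // exclude bad.site.com
    System.out.println(filter.accept("good.site.com"));     // true
    System.out.println(filter.accept("bad.site.com"));      // false
    System.out.println(filter.accept("other.example.com")); // false
  }
}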

Earl



Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
 
 to get regex-normalize.xml to work I must put:
 in nutch-site.xml
 In nutch-default.xml there is set:
 Is this a bug or a feature? =)

nutch-site.xml overrides properties defined in nutch-default. So:
* If you remove the urlnormalizer.class property from nutch-default,
it will still use the one defined in nutch-site.
* If you remove the urlnormalizer.class property from nutch-site, it
will use the one defined in nutch-default.
* ...
(If it works another way it is a bug; otherwise, the feature is that
nutch-site is read first, and nutch-default is used only for
properties not defined in nutch-site.)
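
For example, with the property present in both files (standard Nutch
<property> syntax; the property name and class names are the ones
discussed in this thread), the nutch-site.xml value is the one that
takes effect:

<!-- nutch-default.xml: the shipped default -->
<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
</property>

<!-- nutch-site.xml: local override, this value wins -->
<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
</property>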

Regards

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/


Re: regex-normalize.xml

2005-09-02 Thread Michael Weber

Hi Jérôme,

I think I expressed it wrong. The question was whether it's a feature
or a bug that regex-normalize.xml is used only after these changes.


Regards

Michael





Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
 
 I think I expressed it wrong. The question was whether it's a feature
 or a bug that regex-normalize.xml is used only after these changes.

regex-normalize.xml is used only after you specify that you want to use
the RegexUrlNormalizer implementation, i.e. only if you set
urlnormalizer.class=org.apache.nutch.net.RegexUrlNormalizer.
But it should also work if you remove the urlnormalizer.class =
org.apache.nutch.net.BasicUrlNormalizer entry from nutch-default.

Regards

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/


[jira] Closed: (NUTCH-21) parser plugin for MS PowerPoint slides

2005-09-02 Thread Jerome Charron (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-21?page=all ]
 
Jerome Charron closed NUTCH-21:
---

Fix Version: 0.8-dev
 Resolution: Fixed

Committed to trunk (http://svn.apache.org/viewcvs.cgi?rev=267226&view=rev).
Thanks to Stephan Strittmatter.

Note: Take care with the patches attached to this issue, since their unit
tests are platform dependent (they only succeed on Windows). The committed
code is platform independent (I hope). I tested it on Linux, so it would be
good if someone could test it on other platforms.


 parser plugin for MS PowerPoint slides
 --

  Key: NUTCH-21
  URL: http://issues.apache.org/jira/browse/NUTCH-21
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Reporter: Stefan Groschupf
 Priority: Trivial
  Fix For: 0.8-dev
  Attachments: MSPowerPointParser.java, build.xml.patch.txt, 
 parse-mspowerpoint.zip, parse-mspowerpoint.zip

 transfered from:
 http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356
 submitted by:
 Stephan Strittmatter




Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen

Matt,
This is great! It would be very useful to Nutch developers if your code
could be shared.  I'm sure quite a few applications will benefit from it,
because it fills a gap between whole-web crawling and single-site (or a
handful of sites) crawling.  I'll be interested in adapting your plugin
to Nutch conventions.

Thanks,
-AJ

Matt Kangas wrote:


AJ and Earl,

I've implemented URLFilters before. In fact, I have a WhitelistURLFilter
that implements just what you describe: a hashtable of regex-lists. We
implemented it specifically because we want to be able to crawl a large
number of known-good paths through sites, including paths through CGIs.
The hash is a Nutch ArrayFile, which provides low runtime overhead.
We've tested it on 200+ sites thus far, and haven't seen any indication
that it will have problems scaling further.


The filter and its supporting WhitelistWriter currently rely on a few
custom classes, but it should be straightforward to adapt to Nutch
naming conventions, etc. If you're interested in doing this work, I
can see if it's ok to publish our code.
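
For anyone who wants to experiment in the meantime, here is a very
rough sketch of the general "hashtable of regex-lists" shape in plain
Java.  It is not Matt's WhitelistURLFilter and is not wired into the
Nutch URLFilter plugin machinery; all names are invented for
illustration, and it keeps the table in memory rather than on disk:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Sketch of a whitelist keyed by host: one hash lookup selects the
 * handful of path patterns for that host, so only those few regexes
 * run per URL instead of one global list of 100,000.
 */
public class WhitelistSketch {

  private final Map<String, List<Pattern>> patternsByHost =
      new HashMap<String, List<Pattern>>();

  /** Whitelist one path regex (matched against path + query) for a host. */
  public void allow(String host, String pathRegex) {
    List<Pattern> patterns = patternsByHost.get(host);
    if (patterns == null) {
      patterns = new ArrayList<Pattern>();
      patternsByHost.put(host, patterns);
    }
    patterns.add(Pattern.compile(pathRegex));
  }

  /** Return the URL if whitelisted, or null to drop it (URLFilter-style). */
  public String filter(String urlString) {
    try {
      URL url = new URL(urlString);
      List<Pattern> patterns = patternsByHost.get(url.getHost());
      if (patterns == null) {
        return null;                        // unknown host: reject
      }
      String pathAndQuery = url.getFile();  // path plus query string
      for (Pattern p : patterns) {
        if (p.matcher(pathAndQuery).matches()) {
          return urlString;                 // matched a known-good path
        }
      }
    } catch (MalformedURLException e) {
      // unparseable URL: reject
    }
    return null;
  }

  public static void main(String[] args) {
    WhitelistSketch filter = new WhitelistSketch();
    filter.allow("good.site.com", "/articles/.*");
    filter.allow("good.site.com", "/cgi-bin/view\\.cgi\\?id=\\d+");
    System.out.println(filter.filter("http://good.site.com/articles/42.html"));
    System.out.println(filter.filter("http://bad.site.com/articles/42.html"));
  }
}

A real version would, as described above, keep the table on disk (e.g.
in a Nutch ArrayFile) rather than in an in-memory HashMap.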


BTW, we're currently alpha-testing the site that uses this plugin, and
preparing for a public beta. I'll be sure to post here when we're
finally open for business. :)


--Matt


On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:

From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
it seems that a new urlfilter is a good place to extend the inclusion
regex capability.  The new urlfilter will be defined by the
urlfilter.class property, which gets loaded by the URLFilterFactory.
Regex is necessary because you want to include URLs matching certain
patterns.


Can anybody who has implemented a URLFilter plugin before share some
thoughts about this approach? I expect the new filter must have all
the capabilities that the current RegexURLFilter.java has, so that it
won't require changes in any other classes. The difference is that
the new filter uses a hash table for efficiently looking up the
regexes for included domains (a large number!).


BTW, I can't find the urlfilter.class property in any of the
configuration files in Nutch-0.7. Does the 0.7 version still support
the urlfilter extension? Any difference relative to what's described
in the DissectingTheNutchCrawler doc cited above?


Thanks,
AJ





--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---



--
Matt Kangas / [EMAIL PROTECTED]





--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development

748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---


Re: Automating workflow using ndfs

2005-09-02 Thread Anjun Chen
I'm going to make a request in Jira now. -AJ

--- Matt Kangas [EMAIL PROTECTED] wrote:

 Great! Is there a ticket in JIRA requesting this feature? If not, we
 should file one and get a few votes in favor of it. AFAIK, that's the
 process for getting new features into Nutch.
 
 
 --
 Matt Kangas / [EMAIL PROTECTED]