[jira] Created: (NUTCH-452) Nutch JSF/My Faces Search Frontend

2007-03-01 Thread Zaheed Haque (JIRA)
Nutch JSF/My Faces Search Frontend
--

 Key: NUTCH-452
 URL: https://issues.apache.org/jira/browse/NUTCH-452
 Project: Nutch
  Issue Type: New Feature
  Components: web gui
 Environment: Java
Reporter: Zaheed Haque
 Fix For: 0.9.0


As per Doug's suggestion, a ticket is now open. Over the weekend I will write up
a short set of instructions and upload all the files necessary for the ticket.
(I need to remove all the libs and list them so that they can be downloaded
directly; this way the patch will probably stay within the 10 MB limit.) If you
have questions or comments, just let me know.

Cheers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Nutch JSF front-end code submission - Please advise on next steps

2007-02-28 Thread Zaheed Haque

Hello all:

Last year, for a client, together with some developers and a lot of help
from Andrzej Bialecki, I worked on a Nutch search frontend. The
web application uses JSF/MyFaces and is built with Maven. It is a
fully working user interface (as of rev. 478619) with all the bells and
whistles: themes, settings, etc.

The client project has now ended (it was an election search engine),
and it is now possible for me to submit the code to Apache, of course
under the Apache License.

For about a month I have been trying to find time to make the
necessary changes so that I could submit the code. Due to an enormous
workload I have been unable to find the time. I am not sure how
I should proceed. I have personally tried to contact some of you off
list (those I thought might be interested, as they discuss more web-app
related issues on the list), but it seems everyone is busy. So
this is my last effort. I would love for someone to do something
with the code rather than see it become obsolete.

I have a working version up and running with Nutch rev. 478619.
Furthermore, AB was involved during the project, so I am sure he will be
able to answer anything that I can't.

What should I do? I would appreciate your advice.

Regards.
Zaheed


Re: How to Become a Nutch Developer

2007-01-22 Thread Zaheed Haque

On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:



Well ... so far this process was very informal, because there were so
few key developers that they more or less knew what needs to be done,
and who is doing what.

Hadoop follows a much stricter and formalized model, which we could
adopt, since it apparently works well there. This should address the
issue of notifying others that the work is started on this or that item.


My 2 cents :-) .. I like the way the Hadoop guys work! It is strict, but to my
mind a structured/rigid process brings more benefit for the newbie developer,
because you can follow every issue from start to end, with all the comments in
between. I have noticed that some of the mailing list questions/answers related
to issues, for example, are not in the Nutch JIRA, so to follow an issue you
have to go back and forth between the mailing list and JIRA.

IMHO Nutch should adopt the Hadoop model. Furthermore, it is probably a good
idea to discuss it further, because Nutch will soon have a 0.9 release and that
is probably a good time to change to the Hadoop style :-)

Just some thoughts.

Cheers


Re: Reviving Nutch 0.7

2007-01-22 Thread Zaheed Haque

On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi,

I've been meaning to write this message for a while, and Andrzej's 
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, 
it will be even more valuable than it is today.  However, I think there is 
still a need for something much simpler, something like what Nutch 0.7 used to 
be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few 
developers to maintain and further develop both of these concepts, and the main 
Nutch developers need the more powerful version - 0.8 and beyond.  So, what is 
going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at 
least considering and discussing the possibility of somehow branching that 
version into a parallel project that's not just in a maintenance mode, but has 
its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?


I agree with you that there is a need for 0.7-style Nutch. I wouldn't
say reviving, but rather dissecting and re-directing :-). Here you go
--- my focus here is 0.7 style, i.e. mid-size, enterprise needs.

Solr could use a good crawler, because it has everything else (AFAIK).
This is probably not technically plug and pray :-) and I am not sure the
Solr community wants a crawler, but it could benefit from such an
add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins
could be refactored to fit into Solr.

I will forward the mail to the Solr community to see if there is any interest.

Cheers


Re: database exchange of 2 nutches (hybridity of nutch with yacy)

2007-01-02 Thread Zaheed Haque

Hi:

I am not sure p2p principles are good for web search, where result
speed is the number 1 concern, i.e. if your search engine is facing
consumers. However, a corporate environment, where various
corporate locations run their own Nutch installations and share indexes
via a common interface, could use p2p principles. Then again, just
transferring all the indexes to a single place is also a compelling
alternative.

In my view, yes, p2p adds flexibility, but it also adds tons of complexity
in terms of operations, which I would prefer not to deal with :-)

However, if there was a viable business model where you could use
Nutch in conjunction with Amazon S3 and EC2, where an organization
offers the crawling service and those wishing to use parts or all of
the index would pay a small fee .. yes, that would be nice. I suppose
soon enough we will see Yahoo offering such a service.

Cheers
Zaheed



On 1/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Hi

quite interesting projects out:
http://search.wikia.com/wiki/Search_Wikia

I want to suggest another one here.

Nutch is used for specified customers to index specified pages, or to have an 
open source engine for the worldwide web.

*Two* Nutch engines indexing the web make no sense.
It would be useful, if all Nutch - indexing the web - can be connected together 
and perform a database exchange.

Well you all know www.yacy.net - the p2p search engine - I do not want to 
suggest for nutch the same, but some interoperability of two nutch nodes.

Is it possible to add / import the indexed database of nutch A to nutch B ?

This import must be done manually, but why not within a network ?

If we have 5 nutch engines in the world indexing the web (I am not speaking of 
customer solutions for partial intranet webs), why not accumulate their 
indexes?

I want to suggest a structure that is a hybrid with yacy.net.

Would it be possible to build a database structure that is usable by yacy 
as well?


Then the nutch index could also be spread to yacy nodes and get a backup 
there; other nutches could then add the yacy-indexed media to their database.


So yacy p2p is the way to exchange and back up the database of several nutches, 
and nutch can back up and exchange with yacy nodes and with other nutch 
engines.

I think, therefore, any nutch should run a yacy node as well, and the database 
must be made interoperable.


Would this be possible?

Well, you know the emule-project.net filesharing structure. Or take gnutella 
with its ultrapeers. The emule servers support collecting urls/hashes, and there 
is also a p2p node system in emule called kademlia.


Would such a p2p engine structure be possible, if yacy is the p2p node and 
nutch the ultrapeer, indexing on its own but also backing up its database to 
the p2p yacy network and getting redundant urls from the network?

See then the wiki-search project of the link above.

As urls get a human ranking (exactly the page is ranked after it was seen with 
the yacy bar) the nutch database could get as well these human ranked urls over 
the database exchange.

Any idea whether a common database structure is possible, and whether nutch could 
implement a yacy node to hold connections to the dht network of yacy, so that 
nutch could (also) be a yacy node? As both are Java, this should work?

Thanks for subscribing as well to the yacy.net forums to play around with this 
node and toolbar and the already implemented (need to be developed) human 
ranking.

Thanks for collaboration ideas.
tom




Re: [jira] Updated: (NUTCH-251) Administration GUI

2006-11-23 Thread Zaheed Haque

Super Thanks! Now I can give it a go!

Cheers!

On 11/23/06, Enis Soztutar (JIRA) [EMAIL PROTECTED] wrote:

 [ http://issues.apache.org/jira/browse/NUTCH-251?page=all ]

Enis Soztutar updated NUTCH-251:


Attachment: Nutch-251-AdminGUI.tar.gz

I have updated the patch written by Stefan.
This version works with Nutch-0.9-dev and hadoop-0.7.1 (current version of
nutch so far)

First extract the tar.gz file into the root of nutch. It should copy
src/plugin/admin-*
lib/xalan.jar  lib/serializer.jar and lib/hadoop-0.7.2-dev.jar
hadoop_0.7.1_nutch_gui_v2.patch
nutch_0.9-dev_gui_v2.patch

then patch nutch with
  patch -p0 < nutch_0.9-dev_gui_v2.patch
  (you can test the patch first by running: patch -p0 --dry-run < nutch_0.9-dev_gui_v2.patch)

Patched hadoop is included in the archive, but if you wish you can patch
hadoop using
   patch -p0 < hadoop_0.7.1_nutch_gui_v2.patch


I have:
converted the necessary java.io.File fields and arguments to org.apache.hadoop.fs.Path
replaced deprecated LogFormatter usages with LogFactory
used generics with collections (changed only those that I've seen)
written PathSerializable, which implements the Serializable interface (needed for scheduling)
made some hadoop changes and some changes due to hadoop conflicts

I have not tested every feature of this plugin, so there may still be some
bugs.

 Administration GUI
 --

 Key: NUTCH-251
 URL: http://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.9.0

 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz,
nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web based administration interface would help to make nutch
administration and management much more user friendly.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira





Re: What's the status of Nutch-GUI?

2006-11-22 Thread Zaheed Haque

Scott:

Would you be kind enough to upload your Nutch-Gui patch which works
with current trunk? I would like to give it a try.

Regards

On 11/22/06, scott green [EMAIL PROTECTED] wrote:

On 11/22/06, Sami Siren [EMAIL PROTECTED] wrote:
 scott green wrote:
  Hi
 
  I am now porting Stefan's patch to my dev-box and get some errors; I hope
  someone can help me. When I start the embedded Jetty web application, I get
  these exceptions:
 
  06/11/22 02:28:10 INFO util.Credential: Checking Resource aliases
  06/11/22 02:28:11 INFO util.Container: Started
  [EMAIL PROTECTED]
  Exception in thread "main" java.lang.ClassNotFoundException: org.apache.jasper.servlet.JspServlet
  at java.net.URLClassLoader$1.run(Unknown Source)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(Unknown Source)
  at java.lang.ClassLoader.loadClass(Unknown Source)
  at java.lang.ClassLoader.loadClass(Unknown Source)
  at org.mortbay.http.HttpContext.loadClass(HttpContext.java:1262)
  at org.mortbay.jetty.servlet.Holder.start(Holder.java:188)
  at org.mortbay.jetty.servlet.ServletHolder.start(ServletHolder.java:219)
  at org.mortbay.jetty.servlet.ServletHandler.initializeServlets(ServletHandler.java:445)
  at org.mortbay.jetty.servlet.WebApplicationHandler.initializeServlets(WebApplicationHandler.java:323)
  at org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContext.java:511)
  at org.mortbay.util.Container.start(Container.java:72)
  at org.apache.nutch.admin.WebContainer.addComponentExtensions(WebContainer.java:152)
  at org.apache.nutch.admin.AdministrationApp.startContainer(AdministrationApp.java:41)
  at org.apache.nutch.admin.AdministrationApp.main(AdministrationApp.java:158)
  06/11/22 02:28:24 INFO util.Container: Started HttpContext[/,/]
 
  the code snippet:
   WebApplicationContext webContext = this.server.addWebApplication(
       contextName, new File(jsps).getCanonicalPath());
   webContext.setClassLoader(extension.getDescriptor().getClassLoader());
   webContext.setAttribute("component", component);
   webContext.setAttribute("components", components);
   if (instances != null) {
     webContext.setAttribute("instances", instances);
     webContext.setAttribute("container", this);
   }
   webContext.start();
 
  So how can I put some required jars into the classloader?
  Thanks

 Is there a start script (bin/nutch?) or something like that where you
 could add the jasper-compiler.jar so it gets onto the classpath of the JVM?

Hi Sami

You are right. I added the jars to the JVM classpath and now it works, thanks.

- Scott

 --
  Sami Siren
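
The fix discussed above, putting the missing jars on the JVM classpath from a start script, can be sketched roughly as follows. This is a minimal illustration only: the helper name build_classpath and the directory layout are assumptions made for the example and are not taken from bin/nutch or from the admin GUI patch.

```shell
#!/bin/sh
# Sketch only: build_classpath is a hypothetical helper, not part of bin/nutch.
# It appends every jar found in a directory to a starting classpath.
build_classpath() {
  cp="$1"       # starting classpath, e.g. a conf directory
  libdir="$2"   # directory holding extra jars, e.g. the Jasper jars
  for f in "$libdir"/*.jar; do
    # Skip the unexpanded pattern when the directory contains no jars.
    [ -f "$f" ] || continue
    cp="$cp:$f"
  done
  echo "$cp"
}

# Usage (paths are assumptions; the main class is the one in the stack trace):
# CLASSPATH=`build_classpath "$NUTCH_HOME/conf" "$NUTCH_HOME/lib"`
# java -classpath "$CLASSPATH" org.apache.nutch.admin.AdministrationApp
```

Collecting the jars in a loop keeps the start script independent of the exact jar names, so a missing class such as the JSP servlet only requires dropping the right jar into the directory.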




Re: [jira] Commented: (NUTCH-249) black- white list url filtering

2006-09-05 Thread Zaheed Haque

Hi

A lot of the patches/plugins in JIRA are not updated to reflect changes
in trunk. Probably the way to test this one would be to build it against
that specific revision of Nutch.

cheers

On 9/5/06, Uros Gruber (JIRA) [EMAIL PROTECTED] wrote:

[ 
http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12432584 ]

Uros Gruber commented on NUTCH-249:
---

I'm trying to test this patch but I'm having build problems

compile-core:
[javac] Compiling 2 source files to /usr/home/uros/nutch-wb/build/classes
[javac] 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:261: 
createJob(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path) in 
org.apache.nutch.crawl.CrawlDb cannot be applied to 
(org.apache.hadoop.conf.Configuration,java.io.File)
[javac] JobConf updateJob = CrawlDb.createJob(getConf(), crawlDb);
[javac]^
[javac] 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:267: 
install(org.apache.hadoop.mapred.JobConf,org.apache.hadoop.fs.Path) in 
org.apache.nutch.crawl.CrawlDb cannot be applied to 
(org.apache.hadoop.mapred.JobConf,java.io.File)
[javac] CrawlDb.install(updateJob, crawlDb);
[javac]^
[javac] Note: 
/usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java uses 
or overrides a deprecated API.



 black- white list url filtering
 ---

 Key: NUTCH-249
 URL: http://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Trivial
 Fix For: 0.9.0

 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch


 Existing url filter mechanisms need to process each url against each filter 
pattern. For very large filter sets this may not scale very well.






Re: need volunteer to develop search for apache.org

2006-01-26 Thread Zaheed Haque
Sounds very interesting! When are you guys planning to start?

Cheers
Zaheed

On 1/25/06, Doug Cutting [EMAIL PROTECTED] wrote:
 Would someone volunteer to develop Nutch-based site-search engine for
 all apache.org domains?  We now have a Solaris zone to host this.

 Thanks,

 Doug



patch for nutch and nutch-daemon.sh

2006-01-23 Thread Zaheed Haque
Hi:

Due to a bug in the if statement it is not possible to use symlinks
for the shell scripts: a relative link target that contains a slash is
treated as if it were an absolute path. Below you will find the patch.

Thanks
Zaheed

---

$ svn diff nutch
Index: nutch
===================================================================
--- nutch   (revision 371849)
+++ nutch   (working copy)
@@ -17,7 +17,7 @@
 while [ -h "$THIS" ]; do
   ls=`ls -ld "$THIS"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     THIS="$link"
   else
     THIS=`dirname "$THIS"`/"$link"
$ svn diff nutch-daemon.sh
Index: nutch-daemon.sh
===================================================================
--- nutch-daemon.sh (revision 371849)
+++ nutch-daemon.sh (working copy)
@@ -29,7 +29,7 @@
 while [ -h "$this" ]; do
   ls=`ls -ld "$this"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     this="$link"
   else
     this=`dirname "$this"`/"$link"
$
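
To see what the changed pattern does, here is a minimal standalone sketch of the link-resolution loop with the patched test. The function name resolve_this is made up for illustration and is not part of the Nutch scripts; the loop body mirrors the patched lines above. Because expr anchors its pattern at the start of the string, '/.*' matches only absolute targets, so a relative target such as ../real/nutch now falls through to the dirname branch.

```shell
#!/bin/sh
# Sketch only: resolve_this is a hypothetical wrapper around the patched loop.
# It follows a chain of symlinks and prints the final path.
resolve_this() {
  this="$1"
  while [ -h "$this" ]; do
    ls=`ls -ld "$this"`
    # Extract the link target that ls -ld prints after "-> ".
    link=`expr "$ls" : '.*-> \(.*\)$'`
    if expr "$link" : '/.*' > /dev/null; then
      # Absolute target: use it as is.
      this="$link"
    else
      # Relative target, with or without a slash: resolve it against
      # the directory that contains the link.
      this=`dirname "$this"`/"$link"
    fi
  done
  echo "$this"
}

# Usage: resolve_this "$0"
```

With the old pattern '.*/.*', a relative target containing a slash would have been used unchanged and therefore interpreted relative to the current working directory rather than the link's directory.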


Re: GettingNutchRunningOnUbuntu.html

2005-09-11 Thread Zaheed Haque
A documentation style would probably be too much to ask. Something like
the one below would be great!

http://httpd.apache.org/docs/2.0/

Cheers
Zaheed

On 9/11/05, Matt Kangas [EMAIL PROTECTED] wrote:
 Earl, I've been building binary .deb packages from Nutch 0.7 trunk
 straight from ant for a few months now. It makes deployments to
 Ubuntu much smoother. Combine that with the java-package utils for
 deb-ifying the JDK, and your rollouts will be greatly simplified.
 
 My Nutch packaging stuff consists of:
 
 package/nutch/build.xml
 package/nutch/DEBIAN/control.template
 package/nutch/DEBIAN/postinst
 package/nutch/default.properties
 
 It's tested to work on Mac OS X (fink) and Ubuntu Linux.
 
 If you're interested and possibly motivated to clean up the code a bit
 for general consumption ;), create a JIRA ticket so folks can vote on
 it, and I'll attach a tarball to the ticket.
 
 --Matt
 
 On Sep 10, 2005, at 6:35 PM, Earl Cahill wrote:
 
  Well, it may not be perfect, but I just wrote
 
  http://spack.net/nutch/GettingNutchRunningOnUbuntu.html
 
  which I think details pretty well everything I had to
  do to get nutch trunk working on my ubuntu athlon box.
 
 
  Any way I can get it added to the wiki?  I am happy to
  make edits first, if needs be.
 
  I next hope to write tutorials on getting nutch to
  work with mapreduce, in a few different ways, like
  local fs, ndfs, local crawl, distributed crawl, and
  the like.  I will likely need a little help :)
 
  If anyone has style ideas please let me know before I
  start this next one.  Right now, I could use a little
  more commentary, as some sections just outline what
  commands to run.  One dumb thing I would like is to be
  able to double click on a command and have just the
  command get highlighted instead of the whole line.
 
  I would also like to try and get a straight debian
  tutorial working.
 
  Enjoy!
  Earl
 
 
 --
 Matt Kangas / [EMAIL PROTECTED]
 
 
 


-- 
Best Regards
Zaheed Haque
Phone : +46 735 06
E.mail: [EMAIL PROTECTED]