[jira] Created: (NUTCH-452) Nutch JSF/My Faces Search Frontend
Nutch JSF/My Faces Search Frontend -- Key: NUTCH-452 URL: https://issues.apache.org/jira/browse/NUTCH-452 Project: Nutch Issue Type: New Feature Components: web gui Environment: Java Reporter: Zaheed Haque Fix For: 0.9.0 As per Doug's suggestion, a ticket is now open. Over the weekend I will write up short instructions and upload all the files needed for the ticket. (I need to remove all the libs and list them instead, so that they can be downloaded directly; that way the patch will probably fit within the 10 MB limit.) If you have questions or comments, just let me know. Cheers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Nutch JSF front-end code submission - Please advise on next steps
Hello all: Last year, for a client, together with some developers and a lot of help from Andrzej Bialecki, I worked on a Nutch search frontend. The web application uses JSF/My Faces and is built with Maven. It is a fully working user interface as of rev. 478619, with all the bells and whistles: themes, settings, etc. The client project (an election search engine) has now ended, so I can submit the code to Apache, of course under the Apache license. For about a month I have been trying to find time to make the changes needed to submit the code, but because of an enormous workload I have not managed to. I am not sure how I should proceed. I have personally tried to contact some of you off list (those I thought might be interested, since they discuss more web-app-related issues on the list), but it seems everyone is busy, so this is my last effort. I would love for someone to do something with the code rather than let it become obsolete. I have a working version up and running with Nutch rev. 478619. Furthermore, AB was involved during the project, so I am sure he will be able to answer anything I can't. What should I do? I would appreciate your advice. Regards. Zaheed
Re: How to Become a Nutch Developer
On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Well ... so far this process was very informal, because there were so few key developers that they more or less knew what needs to be done, and who is doing what. Hadoop follows a much stricter and formalized model, which we could adopt, since it apparently works well there. This should address the issue of notifying others that the work is started on this or that item. My 2 cents :-) .. I like the way the Hadoop guys work! It is strict, but to my mind a structured/rigid process brings more benefit to the newbie developer, because you can follow every issue from start to end with all the comments in between. I have noticed that some of the mailing list questions/answers related to issues, for example, are not in the Nutch JIRA, so to follow an issue you have to go back and forth between the mailing list and JIRA. IMHO Nutch should adopt the Hadoop model. Furthermore, it is probably a good idea to discuss this further, because Nutch will soon have a 0.9 release and that is probably a good time to change to the Hadoop style :-) Just some thoughts. Cheers
Re: Reviving Nutch 0.7
On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? I agree with you that there is a need for a 0.7-style Nutch. I wouldn't say reviving so much as dissecting and redirecting :-). Here you go --- my focus here is 0.7 style, i.e. mid-size, enterprise needs. Solr could use a good crawler because it has everything else (AFAIK). Technically this is probably not plug and pray :-) and I am not sure the Solr community wants a crawler, but it could benefit from such a Solr add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins could be refactored to fit into Solr. I will forward the mail to the Solr community to see if there is any interest. Cheers
Re: database exchange of 2 nutches (hybridity of nutch with yacy)
Hi: I am not sure p2p principles are good for web search, where result speed is the number one concern, i.e. if your search engine is facing consumers. However, a corporate environment, where various corporate locations run their own Nutch installations and share an index via a common interface, could use p2p principles. Then again, just transferring all the indexes to a single place is also a compelling alternative. In my view, yes, p2p adds flexibility, but it also adds tons of operational complexity, which I would prefer not to deal with :-) However, if there were a viable business model where you could use Nutch in conjunction with Amazon S3 and EC2, where an organization offers the crawling service and those wishing to use parts or all of the index pay a small fee .. yes, that would be nice. I suppose soon enough we will see Yahoo offering such a service. Cheers Zaheed On 1/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi quite interesting projects out: http://search.wikia.com/wiki/Search_Wikia I want to suggest another one here. Nutch is used for specified customers to index specified pages, or to have an open source engine for the worldwide web. *Two* Nutch engines indexing the web make no sense. It would be useful if all Nutch engines indexing the web could be connected together and perform a database exchange. Well, you all know www.yacy.net - the p2p search engine - I do not want to suggest the same for Nutch, but some interoperability between two Nutch nodes. Is it possible to add / import the indexed database of Nutch A to Nutch B? This import must be done manually, but why not within a network? If we have 5 Nutch engines in the world indexing the web (I am not speaking of customer solutions for partial intranet webs), why not then accumulate their indexes? I want to suggest a structure which is hybrid with yacy.net. Would it be possible to build a database structure which is usable for yacy as well? 
Then the Nutch index could be spread to yacy nodes as well and get a backup there, and other Nutch engines could then add the yacy-indexed media to their databases. So yacy p2p is the way to exchange and back up the databases of several Nutch engines, and Nutch can back up and exchange with yacy nodes and with other Nutch engines. I think therefore that any Nutch should run a yacy node as well, and the database must be made interoperable. Would this be possible? Well, you know the emule-project.net filesharing structure. Or take gnutella with its ultrapeers. The emule servers support collecting urls/hashes, and emule also has a p2p node system called kademlia. Would such a p2p engine structure be possible, with yacy as the p2p node and Nutch as the ultrapeer, indexing on its own but also backing up its database to the p2p yacy network and getting redundant urls from the network? See then the wiki-search project at the link above. As urls get a human ranking (the page is ranked after it has been seen with the yacy bar), the Nutch database could also get these human-ranked urls through the database exchange. Any idea whether a common database structure is possible, and whether Nutch could implement a yacy node to hold connections to the DHT network of yacy, so that Nutch could also be a yacy node? As both are Java, this should work? Thanks for subscribing to the yacy.net forums as well, to play around with this node and toolbar and the already implemented (still to be developed) human ranking. Thanks for collaboration ideas. tom
Re: [jira] Updated: (NUTCH-251) Administration GUI
Super Thanks! Now I can give it a go! Cheers! On 11/23/06, Enis Soztutar (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-251?page=all ] Enis Soztutar updated NUTCH-251: Attachment: Nutch-251-AdminGUI.tar.gz I have updated the patch written by Stefan. This version works with Nutch-0.9-dev and hadoop-0.7.1 (current version of nutch so far). First extract the tar.gz file into the root of nutch. It should copy src/plugin/admin-*, lib/xalan.jar, lib/serializer.jar, lib/hadoop-0.7.2-dev.jar, hadoop_0.7.1_nutch_gui_v2.patch and nutch_0.9-dev_gui_v2.patch. Then patch nutch with: patch -p0 < nutch_0.9-dev_gui_v2.patch (you can test the patch first by running: patch -p0 --dry-run < nutch_0.9-dev_gui_v2.patch). Patched hadoop is included in the archive, but if you wish you can patch hadoop using: patch -p0 < hadoop_0.7.1_nutch_gui_v2.patch I have: converted the necessary java.io.File fields and arguments to org.apache.hadoop.fs.Path; replaced deprecated LogFormatter's with LogFactory's; used generics with collections (changed only those that I've seen); written PathSerializable, which implements the Serializable interface (needed for scheduling); made some hadoop changes and some changes due to hadoop conflicts. I have not tested every feature of this plugin, so there can still be some bugs. Administration GUI -- Key: NUTCH-251 URL: http://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Fix For: 0.9.0 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
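The test-first-then-apply sequence from Enis's instructions could be wrapped in a small shell helper along these lines. This is only a sketch: the function name apply_checked is hypothetical and not part of the attachment, and the only patch options it relies on are -p0 and --dry-run, both quoted in the message.

```shell
#!/bin/sh
# Hypothetical helper (not part of the NUTCH-251 attachment): apply a patch
# only if a dry run succeeds, so a failing patch leaves the tree untouched.
apply_checked() {
  patch_file="$1"
  # --dry-run reports what would happen without modifying any file
  if patch -p0 --dry-run < "$patch_file" > /dev/null; then
    patch -p0 < "$patch_file"
  else
    echo "dry run failed for $patch_file; nothing was changed" >&2
    return 1
  fi
}
```

Run from the root of the nutch checkout, a call such as apply_checked nutch_0.9-dev_gui_v2.patch would then mirror the two commands given above.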
Re: What's the status of Nutch-GUI?
Scott: Would you be kind enough to upload your Nutch-GUI patch that works with the current trunk? I would like to give it a try. Regards On 11/22/06, scott green [EMAIL PROTECTED] wrote: On 11/22/06, Sami Siren [EMAIL PROTECTED] wrote: scott green wrote: Hi I am now porting Stefan's Nutch-GUI to my dev-box, and I get some errors; I hope someone can help me. When I start the embedded Jetty web application, I get these exceptions:
06/11/22 02:28:10 INFO util.Credential: Checking Resource aliases
06/11/22 02:28:11 INFO util.Container: Started [EMAIL PROTECTED]
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.jasper.servlet.JspServlet
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.mortbay.http.HttpContext.loadClass(HttpContext.java:1262)
    at org.mortbay.jetty.servlet.Holder.start(Holder.java:188)
    at org.mortbay.jetty.servlet.ServletHolder.start(ServletHolder.java:219)
    at org.mortbay.jetty.servlet.ServletHandler.initializeServlets(ServletHandler.java:445)
    at org.mortbay.jetty.servlet.WebApplicationHandler.initializeServlets(WebApplicationHandler.java:323)
    at org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContext.java:511)
    at org.mortbay.util.Container.start(Container.java:72)
    at org.apache.nutch.admin.WebContainer.addComponentExtensions(WebContainer.java:152)
    at org.apache.nutch.admin.AdministrationApp.startContainer(AdministrationApp.java:41)
    at org.apache.nutch.admin.AdministrationApp.main(AdministrationApp.java:158)
06/11/22 02:28:24 INFO util.Container: Started HttpContext[/,/]
the code snippets:
    WebApplicationContext webContext = this.server.addWebApplication(contextName, new File(jsps).getCanonicalPath());
    webContext.setClassLoader(extension.getDescriptor().getClassLoader());
    webContext.setAttribute("component", component);
    webContext.setAttribute("components", components);
    if (instances != null) {
      webContext.setAttribute("instances", instances);
      webContext.setAttribute("container", this);
    }
    webContext.start();
So how can I put some required jars into the classloader? Thanks Is there a start script (bin/nutch?) or something like that where you could add the jasper-compiler.jar so it gets into the classpath of the JVM? Hi Sami You are right. I added the jars to the JVM classpath and now it works, thanks. - Scott -- Sami Siren
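Sami's suggestion, getting the missing jar onto the JVM classpath from a start script, might look roughly like the following in a shell wrapper. This is a sketch under assumptions: build_classpath is a hypothetical helper, and the lib directory layout is assumed rather than taken from the thread.

```shell
#!/bin/sh
# Hypothetical helper: join every .jar file in a directory into a
# colon-separated classpath string and print it.
build_classpath() {
  cp=""
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue     # skip the unexpanded pattern when no jar exists
    cp="${cp:+$cp:}$jar"          # add a ':' separator only between entries
  done
  echo "$cp"
}

# Assumed usage in a start script (NUTCH_HOME is an assumption here):
#   CLASSPATH=`build_classpath "$NUTCH_HOME/lib"`
#   export CLASSPATH
```

With the jasper jar dropped into that directory, the class would then be visible to the JVM's own class loader rather than only to the plugin class loader set on the web context.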
Re: [jira] Commented: (NUTCH-249) black- white list url filtering
Hi A lot of the patches/plugins in JIRA have not been updated to reflect changes in trunk. Probably the way to test this one would be to build it against the specific revision of nutch it was written for. cheers On 9/5/06, Uros Gruber (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12432584 ] Uros Gruber commented on NUTCH-249: --- I'm trying to test this patch but I'm having build problems
compile-core:
[javac] Compiling 2 source files to /usr/home/uros/nutch-wb/build/classes
[javac] /usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:261: createJob(org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path) in org.apache.nutch.crawl.CrawlDb cannot be applied to (org.apache.hadoop.conf.Configuration,java.io.File)
[javac] JobConf updateJob = CrawlDb.createJob(getConf(), crawlDb);
[javac] ^
[javac] /usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java:267: install(org.apache.hadoop.mapred.JobConf,org.apache.hadoop.fs.Path) in org.apache.nutch.crawl.CrawlDb cannot be applied to (org.apache.hadoop.mapred.JobConf,java.io.File)
[javac] CrawlDb.install(updateJob, crawlDb);
[javac] ^
[javac] Note: /usr/home/uros/nutch-wb/src/java/org/apache/nutch/crawl/bw/BWUpdateDb.java uses or overrides a deprecated API.
black- white list url filtering --- Key: NUTCH-249 URL: http://issues.apache.org/jira/browse/NUTCH-249 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.9.0 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch Existing url filter mechanisms need to process each url against each filter pattern. For very large filter sets this may not scale very well.
Re: need volunteer to develop search for apache.org
Sounds very interesting! When are you guys planning to start? Cheers Zaheed On 1/25/06, Doug Cutting [EMAIL PROTECTED] wrote: Would someone volunteer to develop Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this. Thanks, Doug
patch for nutch and nutch-daemon.sh
Hi: Due to a bug in the if statement, it's not possible to run the shell scripts through symlinks. Below you will find the patch. Thanks Zaheed ---
$ svn diff nutch
Index: nutch
===
--- nutch (revision 371849)
+++ nutch (working copy)
@@ -17,7 +17,7 @@
 while [ -h "$THIS" ]; do
   ls=`ls -ld "$THIS"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     THIS="$link"
   else
     THIS=`dirname "$THIS"`/"$link"
$ svn diff nutch-daemon.sh
Index: nutch-daemon.sh
===
--- nutch-daemon.sh (revision 371849)
+++ nutch-daemon.sh (working copy)
@@ -29,7 +29,7 @@
 while [ -h "$this" ]; do
   ls=`ls -ld "$this"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     this="$link"
   else
     this=`dirname "$this"`/"$link"
$
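To see why the pattern change matters, the patched loop can be pulled out into a standalone function. This is a sketch for illustration only: resolve_script is a hypothetical name, and the scripts themselves run the loop inline on $0. The old test '.*/.*' treated any link target containing a slash as usable as-is, so a relative target such as ../real/bin/nutch was interpreted relative to the current directory; the new test '/.*' accepts only absolute targets and resolves everything else against the directory that holds the link.

```shell
#!/bin/sh
# Hypothetical wrapper around the patched loop: follow symlinks from the
# given path until a non-link is reached, then print the resulting path.
resolve_script() {
  this="$1"
  while [ -h "$this" ]; do
    ls=`ls -ld "$this"`
    # extract the link target, i.e. the text after "-> " in the ls output
    link=`expr "$ls" : '.*-> \(.*\)$'`
    if expr "$link" : '/.*' > /dev/null; then
      this="$link"                      # absolute target: use it directly
    else
      this=`dirname "$this"`/"$link"    # relative target: resolve against the link's directory
    fi
  done
  echo "$this"
}
```

For a link /opt/links/nutch pointing at ../real/bin/nutch, this prints /opt/links/../real/bin/nutch, which names the real script regardless of the directory the caller is in.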
Re: GettingNutchRunningOnUbuntu.html
Documentation style would probably be too much to ask. Something like the below would be great! http://httpd.apache.org/docs/2.0/ Cheers Zaheed On 9/11/05, Matt Kangas [EMAIL PROTECTED] wrote: Earl, I've been building binary .deb packages from Nutch 0.7 trunk straight from ant for a few months now. It makes deployments to Ubuntu much smoother. Combine that with the java-package utils for deb-ifying the JDK, and your rollouts will be greatly simplified. My Nutch packaging stuff consists of: package/nutch/build.xml package/nutch/DEBIAN/control.template package/nutch/DEBIAN/postinst package/nutch/default.properties It's tested to work on Mac OS X (fink) and Ubuntu Linux. If you're interested and possibly motivated to clean up the code a bit for general consumption ;), create a JIRA ticket so folks can vote on it and I'll attach a tarball to the ticket. --Matt On Sep 10, 2005, at 6:35 PM, Earl Cahill wrote: Well, it may not be perfect, but I just wrote http://spack.net/nutch/GettingNutchRunningOnUbuntu.html which I think details pretty well everything I had to do to get nutch trunk working on my ubuntu athlon box. Any way I can get it added to the wiki? I am happy to make edits first, if need be. I next hope to write tutorials on getting nutch to work with mapreduce, in a few different ways, like local fs, ndfs, local crawl, distributed crawl, and the like. I will likely need a little help :) If anyone has style ideas please let me know before I start this next one. Right now, I could use a little more commentary, as some sections just outline what commands to run. One dumb thing I would like is to be able to double click on a command and have just the command get highlighted instead of the whole line. I would also like to try and get a straight debian tutorial working. Enjoy! Earl -- Matt Kangas / [EMAIL PROTECTED] -- Best Regards Zaheed Haque Phone : +46 735 06 E.mail: [EMAIL PROTECTED]