Re: Remove Header Footer and Menus from crawled content
I have been using something similar to this for a while because we came from the Google Search Appliance and had googleon/googleoff comments all over the place. I don't really like having to patch the parse-html plugin every time I do an upgrade; I wish I could move that into its own plugin somehow.

Speaking of googleon/googleoff, is there any standard for denoting indexable elements? That one seems specific to the GSA; it would be nice if there were something other search engines might also take into consideration.

On Thu, Oct 1, 2015 at 7:20 AM, wrote:

> Camilo, thank you so much for sharing your changes. I am checking it out.
>
> On 9/30/15 3:37 PM, "Camilo Tejeiro" wrote:
>
> > I believe you can do it with Tika, but I did it a different way...
> > I recently had to do something similar, and I wrote a little parse-filter
> > plugin to accomplish this.
> >
> > For reference, look into JIRA issue NUTCH-585; it will give you some ideas.
> > https://issues.apache.org/jira/browse/NUTCH-585
> >
> > If it helps, here is my open Nutch install with the integrated plugin (look
> > for the parse-html-filter-select-nodes plugin). I haven't created a patch,
> > but you are free to use it if it helps you:
> > https://github.com/osohm/apache-nutch-1.10
> >
> > cheers,
> >
> > On Wed, Sep 30, 2015 at 11:57 AM, wrote:
> >
> > > Hi All,
> > >
> > > We need to remove the header, footer, and menu from the crawled content
> > > before we index it into Solr. I researched online and found references to
> > > removal via Tika's boilerpipe support (NUTCH-961).
> > >
> > > We are currently using Nutch 1.7, but I am looking into updating to Nutch
> > > 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
> > > better job of removing the extra content.
> > >
> > > I would be very thankful if you could let me know the best method and
> > > steps to achieve this goal, and how effective the removal is.
> > >
> > > Thanks so much,
> > > Madhvi
> >
> > --
> > Camilo Tejeiro
> > Be honest, be grateful, be humble.
> > https://www.linkedin.com/in/camilotejeiro
> > http://camilotejeiro.wordpress.com
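For anyone weighing the parse-filter route, the heart of such a plugin is just pruning unwanted nodes from the parsed DOM before text extraction. Here is a minimal sketch of that pruning step using plain org.w3c.dom; it is illustrative only, not the code from the repository above, and the blacklist entries are made-up examples:

    import java.util.Set;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class BoilerplatePruner {
        // Recursively remove every element whose tag name or id attribute
        // is on the blacklist (e.g. "header", "footer", "nav", "menu").
        public static void prune(Node node, Set<String> blacklist) {
            NodeList children = node.getChildNodes();
            // Walk backwards so removals don't shift indices we still need to visit.
            for (int i = children.getLength() - 1; i >= 0; i--) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    Element el = (Element) child;
                    String tag = el.getTagName().toLowerCase();
                    String id = el.getAttribute("id").toLowerCase();
                    if (blacklist.contains(tag) || blacklist.contains(id)) {
                        node.removeChild(child);
                    } else {
                        prune(child, blacklist);
                    }
                }
            }
        }
    }

In Nutch 1.x an HtmlParseFilter receives the page as a DocumentFragment (which is a Node), so a method like this can be applied to it before the text is extracted; packaging that in its own plugin is exactly what NUTCH-585 discusses.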
Re: [VOTE] Release Apache Nutch 1.10
+0

On Wed, Apr 29, 2015 at 7:22 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:

+1
- run small test crawl in local mode and index into Solr.

----- Original Message -----
From: Sebastian Nagel wastl.na...@googlemail.com
To: user@nutch.apache.org
Sent: Wednesday, April 29, 2015 6:19:59 PM
Subject: [MASSMAIL]Re: [VOTE] Release Apache Nutch 1.10

+1
- download bin package
- verified signature
- run small test crawl (local mode) and index to Solr

On 04/29/2015 11:54 PM, Lewis John Mcgibbney wrote:

Hi user@ dev@,

This thread is a VOTE for releasing Apache Nutch 1.10. The release candidate comprises the following components:
* A staging repository [0] containing various Maven artifacts
* A branch-1.10 of the trunk code [1]
* The tagged source upon which we are VOTE'ing [2]
* Finally, the release artifacts [3], which I would encourage you to verify for signatures and test.

You should use the following KEYS [4] file to verify the signatures of all release artifacts.

Please VOTE as follows:
[ ] +1 Push the release, I am happy :)
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release candidate (please state why)

Firstly, thank you to everyone that contributed to Nutch 1.10. Secondly, thank you to everyone that VOTEs. It is highly appreciated.

Thanks
Lewis
(on behalf of Nutch PMC)

p.s. Here's my +1

[0] https://repository.apache.org/content/repositories/orgapachenutch-1004
[1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.10
[2] https://svn.apache.org/repos/asf/nutch/tags/release-1.10
[3] https://dist.apache.org/repos/dist/dev/nutch/1.10/
[4] http://www.apache.org/dist/nutch/KEYS
Re: [VOTE] Release Apache Nutch 2.3
+0

Thanks to all the contributors for the hard work.

On Sat, Jan 10, 2015 at 1:36 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote:

Tests pass and signature looks good. Here is my +1 (non-binding).
Thanks for driving this Lewis!

Renato M.

2015-01-09 9:58 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:

Hi user@ dev@,

This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly, we addressed 143 issues, as per the release report: http://s.apache.org/nutch_2.3

The release candidate comprises the following components:
* A staging repository [0] containing various Maven artifacts
* A branch-2.3 [1] of the 2.x codebase from which the tag was subsequently cut
* The tagged source (tagged from the above branch-2.3) upon which we are VOTE'ing [2]
* Finally, the release artifacts [3], which I would encourage you to verify for signatures and test.

You should use the following KEYS [4] file to verify the signatures of all release artifacts.

Please VOTE as follows:
[ ] +1 Push the release, I am happy :)
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release candidate (please state why)

Firstly, thank you to everyone that contributed to Nutch. Secondly, thank you to everyone that VOTEs. It is appreciated.

Thanks
Lewis
(on behalf of Nutch PMC)

p.s. Here's my +1

[0] https://repository.apache.org/content/repositories/orgapachenutch-1003/
[1] http://svn.apache.org/repos/asf/nutch/branches/branch-2.3/
[2] http://svn.apache.org/repos/asf/nutch/tags/release-2.3/
[3] https://dist.apache.org/repos/dist/dev/nutch/
[4] http://www.apache.org/dist/nutch/KEYS

--
Lewis
File not found error
Using Nutch 1.7.

Out of the blue, all of my crawl jobs started failing a few days ago. I checked the user logs: nobody had logged into the server, and there were no reboots or other obvious issues. There is plenty of disk space. Here is the error I'm getting; any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)
Re: File not found error
Well, I'm just using Nutch in local mode, no HDFS (as far as I know)... My latest effort is trying to determine whether there is a filesystem issue. It's not really clear what file is not found. I have about 10 different configs; this is just one of them, and they all have the urls folder. The script worked for quite a while before this just started happening on its own. That's why I'm suspecting a filesystem error.

On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote:

> you might want to check to see if "Injector: urlDir: di/urls" still exists
> in your hdfs.
>
> On 06/24/2014 12:30 AM, John Lafitte wrote:
> > Using Nutch 1.7. Out of the blue, all of my crawl jobs started failing a
> > few days ago. [...]
>
> --
> Kaveh Minooie
Re: File not found error
Okay, I got it working again. I'm not sure exactly what happened, but fsck didn't help. I noticed the last line of the trace showed a native method, so I moved the native binaries out of the /lib folder. Lo and behold, the next time I ran it, it used the Java libs and displayed the filename it was having a problem with: /tmp/hadoop-root/mapred/staging/root850517656/.staging. Given that, I just moved the /tmp/hadoop-root directory aside, and then it started working again. Permissions looked fine, so it might have just been corrupt. Thanks for the help!

On Tue, Jun 24, 2014 at 9:03 PM, John Lafitte jlafi...@brandextract.com wrote:

> Well, I'm just using Nutch in local mode, no HDFS (as far as I know)... [...]
>
> On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote:
>
> > you might want to check to see if "Injector: urlDir: di/urls" still exists
> > in your hdfs. [...]
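For anyone who hits the same ENOENT coming out of the job staging directory in local mode, the fix described above amounts to the following; note the /tmp path is just the Hadoop local-mode default (hadoop.tmp.dir = /tmp/hadoop-${user.name}) and yours may differ:

    # Move the suspect local staging directory aside; Hadoop recreates it
    # on the next job submission.
    mv /tmp/hadoop-root /tmp/hadoop-root.bak

    # Re-run the failing step to confirm the staging dir is rebuilt cleanly.
    bin/nutch inject di/crawl/crawldb di/urls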
Re: Nutch 1.7 - deleting segments
What would be the case where you would want to keep the segments? I'm considering automatically deleting them after sending the data to Solr.

On May 3, 2014 2:29 AM, chethan chethan.p...@gmail.com wrote:

> Thanks for your reply!
>
> Regards,
> Chethan Prasad
>
> On Sat, May 3, 2014 at 12:22 PM, remi tassing tassingr...@gmail.com wrote:
>
> > you are correct
> >
> > On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote:
> >
> > > Hi,
> > >
> > > I have a Nutch crawl with 4 segments which are fully indexed using the
> > > bin/nutch solrindex command. Now I'm all out of storage on the box, so
> > > can I delete the 4 segments and retain only the crawldb and continue
> > > crawling from where I left off? Since all the segments are merged and
> > > indexed to Solr, I don't see a problem in deleting the segments, or am
> > > I wrong there?
> > >
> > > Regards,
> > > Chethan Prasad
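If you do automate it, the cleanup is just removing segment directories once solrindex has succeeded. A sketch, assuming the usual crawl/segments layout and a made-up 7-day retention window:

    # Delete segment directories older than 7 days; run this only after a
    # successful solrindex, since the raw fetched content lives in the segments.
    find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +7 -print -exec rm -r {} +

One thing to keep in mind: without the segments you cannot re-index into Solr without re-fetching, so keep them around if you expect schema changes.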
Re: Don't fetch all urls in a page
Hi Zabini,

I'm a little unclear whether you are having a problem with Nutch following the links or with indexing the pages. Have you tried both of these to verify the links and index data?

https://wiki.apache.org/nutch/bin/nutch%20parsechecker
https://wiki.apache.org/nutch/bin/nutch%20indexchecker

The second link above seems wrong to me; it shows *IndexingFiltersChecker*, but I think it should be *indexchecker*. That works for me.

On Wed, Apr 16, 2014 at 11:48 AM, Zabini antony.noli...@actimage.com wrote:

> Hi,
>
> I am facing a problem with the URLs Nutch fetches. I have a page with
> several URLs within it, but Nutch does not fetch them. They are allowed in
> the regex-urlfilter, and those URLs work fine if I put them in my URL seed
> list. Does anyone have any hint on what to do?
>
> Best Regards,
> Zabini
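For reference, both checks take a URL directly, which makes them handy for debugging a single page (the URL here is a placeholder):

    bin/nutch parsechecker http://www.example.com/somepage.html
    bin/nutch indexchecker http://www.example.com/somepage.html

parsechecker shows the outlinks and text Nutch extracted from the page, and indexchecker shows the fields that would actually be sent to the index.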
Re: Index web folders.
Hi Shane,

The way I save on bandwidth for internal sites is by adding the internal IP for the domain to the hosts file. If it's the local machine, you can probably just point it to 127.0.0.1, though I wonder whether you would be saving anything but a DNS lookup... I'm not a networking guru, but it should route to itself automatically.

On Tue, Apr 8, 2014 at 8:30 PM, Shane Wood sh...@cbm8bit.com wrote:

> I have a few sites I mirror via FTP and would like to index them with
> Nutch, but without hitting the site, to save bandwidth.
>
> /var/www/www.somesite.com/
>
> Let's say one is located in this folder on the same server Nutch is located
> on. How would you index it and have the search results point to the actual
> site, http://somesite.com?
>
> Thanks
> Shane.
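Concretely, the hosts-file trick looks like this (the IP is a made-up internal address; keep the public hostname so the URLs stored in the index stay correct):

    # /etc/hosts on the crawl machine: fetches for somesite.com resolve to
    # the internal mirror, while the indexed URLs keep the public hostname.
    10.0.0.5    www.somesite.com somesite.com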
Re: Control and monitor nutch-1.x via Web interface ?
gethue seems pretty awesome, but it looks like it's more for reporting and metrics. Will it really administer Nutch?

On Tue, Apr 8, 2014 at 7:07 AM, Talat Uyarer ta...@uyarer.com wrote:

> Hi anupamk,
>
> If you want to use a pretty UI, you can use Hue [0].
>
> [0] http://www.gethue.com
>
> Talat
>
> 2014-04-06 17:22 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:
>
> > Nutch is a Hadoop application, so you can use the MapReduce UI to monitor
> > your crawl.
> >
> > On 3 April 2014 23:17, anupamk anup...@usc.edu wrote:
> >
> > > Hi,
> > >
> > > I am not able to find the answer to my question on Google. The question
> > > is: is there a web interface that enables me to control and monitor
> > > Nutch 1.x?
> > >
> > > I am aware of the nutch-admin REST API for Nutch 2.x
> > > (http://wiki.apache.org/nutch/NutchAdministrationUserInterface#Download)
> > >
> > > Is there a ready-to-use web interface for Nutch 1.x? If not, can I
> > > adapt the 2.x web UI to Nutch 1.x? How do I go about doing it?
> >
> > --
> > Open Source Solutions for Text Engineering
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
> --
> Talat UYARER
> Websitesi: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: Unable to crawl wiki pages through Nutch
reddibabu,

I cannot resolve wiki.ibm.com, so I'm guessing Nutch can't either. Is that an internal DNS record?

On Wed, Apr 2, 2014 at 11:54 PM, reddibabu reddybabu...@gmail.com wrote:

> Hi All,
>
> I am using Apache Nutch 1.7. I am able to crawl and index almost all sites
> except wiki pages. While trying to crawl wiki pages, it says that the fetch
> of http://wiki.ibm.com/ failed with: java.net.UnknownHostException:
> wiki.ibm.com. Does it require any additional configuration for crawling
> wiki pages? Any assistance would be helpful.
>
> Thanks in advance,
> Reddi Babu
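A quick way to rule Nutch out is to test name resolution from the machine doing the fetching, since it's that box's resolver that matters:

    host wiki.ibm.com

If that fails too, the fix is DNS (an internal resolver or a hosts entry), not Nutch configuration.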
Re: Freegen and Solr score
Thanks Sebastian,

That did work when I set both of those to false, but now the URL I'm inserting has an abnormally high score. You mentioned two options; the first was to use FreeGenerator with an initial score, but I cannot find it documented anywhere how to do that. The only parameters I see are -normalize and -filter, and they don't take values. Can you point me in the right direction for that?

On Wed, Mar 26, 2014 at 6:59 AM, Sebastian Nagel wastl.na...@googlemail.com wrote:

> There may be no relevant links if all documents are from one single host
> (or domain) and (link.ignore.internal.host == true) resp.
> (link.ignore.internal.domain == true), cf. the explanations about that in
> the wiki.
>
> 2014-03-26 4:09 GMT+01:00 John Lafitte jlafi...@brandextract.com:
>
> > Thanks for that, Sebastian. Given the hint you've given me, I'm trying to
> > generate the scoring using this example:
> > https://wiki.apache.org/nutch/NewScoringIndexingExample
> >
> > But when it gets to the LinkRank part I get:
> >
> > 2014-03-26 02:57:14,208 INFO webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
> > 2014-03-26 02:57:14,913 INFO webgraph.LinkRank - Starting link counter job
> > 2014-03-26 02:57:17,927 INFO webgraph.LinkRank - Finished link counter job
> > 2014-03-26 02:57:17,928 INFO webgraph.LinkRank - Reading numlinks temp file
> > 2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
> >     at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
> >     at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
> >     at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)
> >
> > I can see the webgraph directory got created and there are directories
> > and files in there, but I'm guessing something is not getting populated
> > correctly. Any clue what I may be doing wrong?
> >
> > On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:
> >
> > > Hi John,
> > >
> > > FreeGenerator, unlike Injector, does not use db.score.injected (default
> > > = 1.0) but sets the initial score to 0.0. If all URLs stem from
> > > FreeGenerator, the total score in the link graph is also 0.0, and no
> > > linked documents can get a score higher than 0.0.
> > >
> > > As possible solutions:
> > > - use FreeGenerator with an initial score > 0.0 (but don't put in
> > >   thousands of URLs with a score of 1.0: if the total score is too high,
> > >   some pages may get unreasonably high scores)
> > > - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the
> > >   scores: the default scoring, OPIC, has the advantage of calculating
> > >   scores online while following links. It gives good and plausible
> > >   scores if the crawl is started from a few authoritative seeds. But
> > >   sometimes, esp. in continuous crawls, OPIC scores run out of control.
> > >
> > > Sebastian
> > >
> > > On 03/25/2014 08:31 PM, John Lafitte wrote:
> > > > I set up a script that uses freegen to manually index new/updated
> > > > URLs. I thought it was working great, but now I'm realizing that Solr
> > > > returns a score of 0 for these new documents. [...]
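For the archives: the two properties Sebastian mentions live in nutch-site.xml, and for a single-host crawl the webgraph stays empty unless they are switched off. This is just the change described above, written out:

    <!-- nutch-site.xml: count links between pages of the same host/domain
         when building the webgraph; without this, a single-site crawl has
         "no links to process". -->
    <property>
      <name>link.ignore.internal.host</name>
      <value>false</value>
    </property>
    <property>
      <name>link.ignore.internal.domain</name>
      <value>false</value>
    </property>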
Freegen and Solr score
I set up a script that uses freegen to manually index new/updated URLs. I thought it was working great, but now I'm realizing that Solr returns a score of 0 for these new documents. I thought the score was calculated independently of what Nutch does, using just the content and other metadata, but that doesn't seem to be the case. Does anyone have a clue what might be causing this? The content and other metadata look normal, and I reloaded the core to no avail.
Re: Freegen and Solr score
Thanks for that, Sebastian. Given the hint you've given me, I'm trying to generate the scoring using this example:
https://wiki.apache.org/nutch/NewScoringIndexingExample

But when it gets to the LinkRank part I get:

2014-03-26 02:57:14,208 INFO webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
2014-03-26 02:57:14,913 INFO webgraph.LinkRank - Starting link counter job
2014-03-26 02:57:17,927 INFO webgraph.LinkRank - Finished link counter job
2014-03-26 02:57:17,928 INFO webgraph.LinkRank - Reading numlinks temp file
2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)

I can see the webgraph directory got created and there are directories and files in there, but I'm guessing something is not getting populated correctly. Any clue what I may be doing wrong?

On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:

> Hi John,
>
> FreeGenerator, unlike Injector, does not use db.score.injected (default =
> 1.0) but sets the initial score to 0.0. If all URLs stem from
> FreeGenerator, the total score in the link graph is also 0.0, and no
> linked documents can get a score higher than 0.0.
>
> As possible solutions:
> - use FreeGenerator with an initial score > 0.0 (but don't put in
>   thousands of URLs with a score of 1.0: if the total score is too high,
>   some pages may get unreasonably high scores)
> - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the
>   scores: the default scoring, OPIC, has the advantage of calculating
>   scores online while following links. It gives good and plausible scores
>   if the crawl is started from a few authoritative seeds. But sometimes,
>   esp. in continuous crawls, OPIC scores run out of control.
>
> Sebastian
>
> On 03/25/2014 08:31 PM, John Lafitte wrote:
> > I set up a script that uses freegen to manually index new/updated URLs.
> > I thought it was working great, but now I'm realizing that Solr returns
> > a score of 0 for these new documents. [...]
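For anyone following along with the NewScoringIndexingExample page, the sequence it walks through boils down to roughly this (check the bin/nutch webgraph usage output for the exact flags in your version; the paths are examples):

    # Build the link graph from the fetched segments, compute LinkRank,
    # and push the resulting scores back into the crawldb.
    bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
    bin/nutch linkrank -webgraphdb crawl/webgraphdb
    bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb

The "No links to process" error above means the webgraph has no edges, which is what happens when every link is internal and the link.ignore.internal.* settings are left at their defaults.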
Re: Ranking Algorithm
I am still new to this, but I believe Solr creates the score field. There is also a boost field from Nutch that you can save into Solr. You'll have to create your Solr query to sort by both; score, boost, title was the most logical sorting to me.

On Sat, Mar 22, 2014 at 11:16 PM, azhar2007 azhar2...@outlook.com wrote:

> Which folders and files is the ranking algorithm stored in? Do Nutch and
> Solr each have one?
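A sketch of what that sort looks like in a Solr query; the core name and field names are whatever your schema uses, and sorting on title assumes it is a single-valued, non-tokenized field:

    http://localhost:8983/solr/collection1/select?q=nutch&sort=score+desc,boost+desc,title+asc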
Re: Ready to use nutch
This is a good question. I was looking for an out-of-the-box Solr/Nutch solution to replace our Google Mini. We use Rackspace, and they don't seem to offer one. Even though it was a bit of a learning curve, I feel it was good to learn how to actually set it up and configure it. You might look at https://qbox.io though; they seem to offer Amazon and Rackspace instances that they keep in their account, with a managed version of Elasticsearch. We didn't go far enough to evaluate how good it is, but it might work for you.

On Fri, Mar 21, 2014 at 4:21 AM, Jayadeep Reddy jayad...@ehealthaccess.com wrote:

> Are there any ready-to-use Nutch instances in Amazon or any other cloud?
Re: Crawling an authenticated site
I haven't done it myself, but it's documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

I'm not sure how you would do it with forms-based auth, but if it's a custom app, you might be able to just automatically grant the crawler access if the user agent and/or IP match up.

On Fri, Mar 21, 2014 at 12:32 PM, Laura McCord lmcc...@ucmerced.edu wrote:

> Hi,
>
> I have another question... If I have an authenticated site that I want to
> crawl, to which I have access with my username/password, is there a
> configuration step where I would add my credentials, or is this something
> that has to be customized on my end?
>
> Thanks again,
> Laura
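Going by that wiki page, credentials for Basic/Digest/NTLM-style auth go into conf/httpclient-auth.xml, with the protocol-httpclient plugin enabled in plugin.includes. A sketch with placeholder values (the host, port, username, and password are made up):

    <auth-configuration>
      <!-- Credentials are scoped to a host/port so they are only ever
           sent to the matching server. -->
      <credentials username="crawler" password="secret">
        <authscope host="intranet.example.com" port="443"/>
      </credentials>
    </auth-configuration>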
Re: Usage Scenarios
Thanks Remi. I presume I basically just need my own version of the crawl script that uses freegen instead of generate?

For the BOM issue, I searched all over for it and just now found that someone has already brought it up, so I'll try that patch out:
https://issues.apache.org/jira/browse/NUTCH-1733

On Tue, Mar 18, 2014 at 8:18 AM, remi tassing tassingr...@gmail.com wrote:

> Hi John,
>
> Try freegen for the second question:
> http://wiki.apache.org/nutch/bin/nutch_freegen
>
> Remi
>
> On Tuesday, March 18, 2014, John Lafitte jlafi...@brandextract.com wrote:
>
> > We are just starting out using Nutch and Solr, but I have a couple of
> > issues I can't find any answers for. [...]
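For the archives, the freegen-based recrawl of a single URL looks roughly like this (the paths and Solr URL are examples; check the bin/nutch usage output for the exact flag spellings in your version):

    # Generate a segment directly from a list of URLs, skipping crawldb
    # scheduling, then run the normal fetch/parse/update/index steps on it.
    mkdir -p refetch
    echo "http://www.example.com/updated-page.html" > refetch/urls.txt
    bin/nutch freegen refetch crawl/segments
    s=crawl/segments/$(ls -t crawl/segments | head -1)   # the segment freegen just created
    bin/nutch fetch "$s"
    bin/nutch parse "$s"
    bin/nutch updatedb crawl/crawldb "$s"
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb "$s"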
Usage Scenarios
We are just starting out using Nutch and Solr, but I have a couple of issues I can't find any answers for.

1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch seems to capture it and store it as some strange characters. I can fix it by removing the BOM, and indexchecker confirms it will no longer index those strange characters. Is there a way to prevent this from happening without modifying all of the HTML files that contain it?

2. Often a URL gets updated and we want to recrawl/reindex a specific URL on demand. I see no way to do this currently without deleting the crawl directory and starting over. What is the proper way to handle this situation?

These are somewhat related, because even though I can go through the files and manually remove the BOM, I can't figure out how to have Nutch reindex them. We are using Nutch 1.7, but I have patched a few things and would be happy to upgrade if it fixes any of this. Thanks in advance for any help.
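If you do end up stripping the BOMs at the source, a one-liner for reference (GNU sed; the filename is a placeholder, and -i edits the file in place, so back up first):

    # Remove a UTF-8 BOM (EF BB BF) from the start of the file, if present.
    sed -i '1s/^\xEF\xBB\xBF//' page.html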
Re: [VOTE] Apache Nutch 1.8 Release Candidate #1
0

On Tue, Mar 4, 2014 at 1:28 PM, S.L simpleliving...@gmail.com wrote:

+1

On Tue, Mar 4, 2014 at 12:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi All Nutch'ers,

This thread is a VOTE for releasing Apache Nutch 1.8. The release candidate comprises the following components:
* A staging repository [0] containing various Maven artifacts
* A branch-1.8 of the trunk code [1]
* The tagged source upon which we are VOTE'ing [2]
* Finally, the release artifacts [3], which I would encourage you to verify for signatures and test.

You should use the following KEYS [4] file to verify the signatures of all release artifacts.

Please VOTE as follows:
[ ] +1 Push the release, I am happy :)
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release candidate (please state why)

Firstly, thank you to everyone that contributed to Nutch. Secondly, thank you to everyone that VOTEs. It is appreciated.

Thanks
Lewis
(on behalf of Nutch PMC)

p.s. Here's my +1

[0] https://repository.apache.org/content/repositories/orgapachenutch-1000/
[1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.8
[2] https://svn.apache.org/repos/asf/nutch/tags/release-1.8/
[3] http://people.apache.org/~lewismc/nutch/1.8/ https://dist.apache.org/repos/dist/dev/nutch/1.8
[4] http://people.apache.org/~lewismc/nutch/KEYS http://www.apache.org/dist/nutch/KEYS

--
Lewis
multivalues returned unexpectedly
I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem indexing an RSS feed that has channel/title and then channel/image/title: Nutch tries to add both of them, and solrindex then fails because title isn't multiValued. I've run nutch indexchecker and I see the two titles being returned; the extra title is the value in the Content-Disposition: filename HTTP header. I only see one title when I run nutch readseg, so I'm a little confused about why it's being added. I have made title multiValued in the Solr schema, and it seems to work that way, but it feels wrong to me; documents shouldn't have more than one title. What is the correct way to fix this?
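For reference, the schema workaround mentioned above looks like this in schema.xml (the field type here is a placeholder; keep whatever type your schema already declares for title, with multiValued as the only change):

    <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>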