Re: Remove Header Footer and Menus from crawled content

2015-10-01 Thread John Lafitte
I have been using something similar to this for a while, because we came
from the Google Search Appliance and had googleon and googleoff all over the
place.  I don't really like having to patch the parse-html plugin every time
I do an upgrade; I wish I could move that into its own plugin somehow.
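
Roughly, what I have in mind is a standalone HtmlParseFilter along these
lines (an untested sketch: the class name and the element ids are made up,
and the text rebuild is deliberately simplistic):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Drops named page furniture from the DOM and rebuilds the parse text
// from whatever is left.
public class StripChromeParseFilter implements HtmlParseFilter {

  // ids to prune (placeholders; make these configurable)
  private static final Set<String> SKIP_IDS =
      new HashSet<String>(Arrays.asList("header", "footer", "menu"));

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    StringBuilder text = new StringBuilder();
    collectText(doc, text);
    // keep the original ParseData, swap in the pruned text
    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text.toString().trim(), parse.getData()));
  }

  // depth-first text extraction that skips unwanted subtrees
  private void collectText(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Node id = node.getAttributes().getNamedItem("id");
      if (id != null && SKIP_IDS.contains(id.getNodeValue())) {
        return; // prune this subtree entirely
      }
    } else if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      collectText(child, sb);
    }
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Plus the usual plugin.xml and build boilerplate, and an entry in
plugin.includes.  Since HtmlParseFilters run after parse-html has already
extracted the text, the filter has to rebuild the ParseText itself.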

Speaking of googleon/googleoff, is there any standard for denoting
indexable elements?  That one seems specific to the GSA; it would be nice if
there were something other search engines might also take into consideration.
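
For anyone unfamiliar, the GSA markers we have scattered around look like
this (index is one of several flags the appliance understands):

<!--googleoff: index-->
  ... markup the appliance should not index ...
<!--googleon: index-->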

On Thu, Oct 1, 2015 at 7:20 AM,  wrote:

> Camilo, thank you so much for sharing your changes. I am checking it out.
>
>
> On 9/30/15 3:37 PM, "Camilo Tejeiro"  wrote:
>
> >I believe you can do it with Tika,
> >
> >I did it a different way...
> >I recently had to do something similar and I wrote a little parse-filter
> >plugin to accomplish this.
> >
> >For reference, look at Jira issue NUTCH-585; it will give you some ideas.
> >https://issues.apache.org/jira/browse/NUTCH-585
> >
> >If it helps, here is my open Nutch install with the integrated plugin (look
> >for the parse-html-filter-select-nodes plugin). I haven't created a patch,
> >but you are free to use it if it helps you...
> >https://github.com/osohm/apache-nutch-1.10
> >
> >cheers,
> >
> >On Wed, Sep 30, 2015 at 11:57 AM,  wrote:
> >
> >> Hi All,
> >>
> >> We need to remove the header, footer and menu from the crawled content
> >> before we index it into Solr. I researched online and found references to
> >> removal via Tika's boilerpipe support (NUTCH-961).
> >>
> >> We are currently using Nutch 1.7 but I am looking into updating to Nutch
> >> 1.10. I am hoping that the newer version of Tika in Nutch 1.10 will do a
> >> better job of removing extra content.
> >>
> >> I would be very thankful if you could let me know the best method and
> >> steps to achieve this, and how effective the removal is.
> >>
> >> Thanks so much,
> >> Madhvi
> >>
> >>
> >
> >
> >--
> >Camilo Tejeiro
> >*Be **honest, be grateful, be humble.*
> >https://www.linkedin.com/in/camilotejeiro
> >http://camilotejeiro.wordpress.com
>
>


Re: [MASSMAIL]Re: [VOTE] Release Apache Nutch 1.10

2015-04-29 Thread John Lafitte
+0

On Wed, Apr 29, 2015 at 7:22 PM, Jorge Luis Betancourt González 
jlbetanco...@uci.cu wrote:

 +1

 - run small test crawl in local mode and index into Solr.

 - Original Message -
 From: Sebastian Nagel wastl.na...@googlemail.com
 To: user@nutch.apache.org
 Sent: Wednesday, April 29, 2015 6:19:59 PM
 Subject: [MASSMAIL]Re: [VOTE] Release Apache Nutch 1.10

 +1

 - download bin package
 - verified signature
 - run small test crawl (local mode) and index to Solr
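 (the signature check amounts to, e.g.:
    gpg --import KEYS
    gpg --verify apache-nutch-1.10-bin.tar.gz.asc)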

 On 04/29/2015 11:54 PM, Lewis John Mcgibbney wrote:
  Hi user@ & dev@,
 
  This thread is a VOTE for releasing Apache Nutch 1.10.
  The release candidate comprises the following components.
 
  * A staging repository [0] containing various Maven artifacts
  * A branch-1.10 of the trunk code [1]
  * The tagged source upon which we are VOTE'ing [2]
  * Finally, the release artifacts [3] which I would encourage you to
  verify for signatures and test.
 
  You should use the following KEYS [4] file to verify the signatures of all
  release artifacts.
 
  Please VOTE as follows
 
  [ ] +1 Push the release, I am happy :)
  [ ] +0 I am not bothered either way
  [ ] -1 I am not happy with this release candidate (please state why)
 
  Firstly thank you to everyone that contributed to Nutch 1.10.
  Secondly, thank you to everyone that VOTE's. It is highly appreciated.
 
  Thanks
  Lewis
  (on behalf of Nutch PMC)
 
  p.s. Here's my +1
 
  [0] https://repository.apache.org/content/repositories/orgapachenutch-1004
  [1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.10
  [2] https://svn.apache.org/repos/asf/nutch/tags/release-1.10
  [3] https://dist.apache.org/repos/dist/dev/nutch/1.10/
  [4] http://www.apache.org/dist/nutch/KEYS
 
 
 




Re: [VOTE] Release Apache Nutch 2.3

2015-01-10 Thread John Lafitte
+0

Thanks to all the contributors for the hard work.

On Sat, Jan 10, 2015 at 1:36 PM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Tests pass and signature looks good.
 Here is my +1 (non-binding) Thanks for driving this Lewis!


 Renato M.

 2015-01-09 9:58 GMT+01:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 :

  Hi user@ & dev@,
 
  This thread is a VOTE for releasing Apache Nutch 2.3.
  Quite incredibly we addressed 143 issues as per the release report
  http://s.apache.org/nutch_2.3
  The release candidate comprises the following components.
 
  * A staging repository [0] containing various Maven artifacts
  * A branch-2.3 [1] of the 2.x codebase from which the tag was
 subsequently
  cut
  * The tagged source (tagged from the above branch-2.3) upon which we are
  VOTE'ing [2]
  * Finally, the release artifacts [3] which I would encourage you to
 verify
  for signatures and test.
 
  You should use the following KEYS [4] file to verify the signatures of
 all
  release artifacts.
 
  Please VOTE as follows
 
  [ ] +1 Push the release, I am happy :)
  [ ] +0 I am not bothered either way
  [ ] -1 I am not happy with this release candidate (please state why)
 
  Firstly thank you to everyone that contributed to Nutch. Secondly, thank
  you to everyone that VOTE's. It is appreciated.
 
  Thanks
  Lewis
  (on behalf of Nutch PMC)
 
  p.s. Here's my +1
 
  [0]
  https://repository.apache.org/content/repositories/orgapachenutch-1003/
  [1] http://svn.apache.org/repos/asf/nutch/branches/branch-2.3/
  [2] http://svn.apache.org/repos/asf/nutch/tags/release-2.3/
  [3] https://dist.apache.org/repos/dist/dev/nutch/
  [4] http://www.apache.org/dist/nutch/KEYS
 
  --
  *Lewis*
 



File not found error

2014-06-24 Thread John Lafitte
Using Nutch 1.7

Out of the blue, all of my crawl jobs started failing a few days ago.  I
checked the user logs: nobody had logged into the server, and there were no
reboots or any other obvious issues.  There is plenty of disk space.  Here
is the error I'm getting; any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
        at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
        at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
        at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
        at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
        at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
        at org.apache.nutch.crawl.Injector.run(Injector.java:318)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Injector.main(Injector.java:308)


Re: File not found error

2014-06-24 Thread John Lafitte
Well, I'm just using Nutch in local mode, no HDFS (as far as I know)...  My
latest theory is that there is a filesystem issue.  It's not really clear
what file is not found.  I have about 10 different configs; this is just one
of them, and they all have the urls folder.  The script worked for quite a
while before this just started happening on its own.  That's why I'm
suspecting a filesystem error.


On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote:

 you might want to check to see if

  Injector: urlDir: di/urls

 still exists in your HDFS.




 On 06/24/2014 12:30 AM, John Lafitte wrote:

 Using Nutch 1.7

 Out of the blue, all of my crawl jobs started failing a few days ago.  I
 checked the user logs: nobody had logged into the server, and there were no
 reboots or any other obvious issues.  There is plenty of disk space.  Here
 is the error I'm getting; any help is appreciated:

 Injector: starting at 2014-06-24 07:26:54
 Injector: crawlDb: di/crawl/crawldb
 Injector: urlDir: di/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: ENOENT: No such file or directory


 --
 Kaveh Minooie



Re: File not found error

2014-06-24 Thread John Lafitte
Okay, I got it working again.  Not sure exactly what happened, but fsck
didn't help.  I noticed the last line of the trace showed a native method,
so I moved the native binaries out of the /lib folder.  Lo and behold, the
next time I ran it, it used the Java libs and displayed the filename it was
having a problem with:
/tmp/hadoop-root/mapred/staging/root850517656/.staging.  Given that, I just
went and moved the /tmp/hadoop-root directory aside, and then it started
working again.  Permissions looked fine, so it might have just been corrupt.
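
In other words, the fix boiled down to something like this (the staging
path will differ per install; Hadoop recreates it on the next run):

mv /tmp/hadoop-root /tmp/hadoop-root.bak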

Thanks for the help!


On Tue, Jun 24, 2014 at 9:03 PM, John Lafitte jlafi...@brandextract.com
wrote:

 Well, I'm just using Nutch in local mode, no HDFS (as far as I know)...  My
 latest theory is that there is a filesystem issue.  It's not really clear
 what file is not found.  I have about 10 different configs; this is just one
 of them, and they all have the urls folder.  The script worked for quite a
 while before this just started happening on its own.  That's why I'm
 suspecting a filesystem error.


 On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote:

 you might want to check to see if

  Injector: urlDir: di/urls

 still exists in your HDFS.




 On 06/24/2014 12:30 AM, John Lafitte wrote:

 Using Nutch 1.7

 Out of the blue, all of my crawl jobs started failing a few days ago.  I
 checked the user logs: nobody had logged into the server, and there were no
 reboots or any other obvious issues.  There is plenty of disk space.  Here
 is the error I'm getting; any help is appreciated:

 Injector: starting at 2014-06-24 07:26:54
 Injector: crawlDb: di/crawl/crawldb
 Injector: urlDir: di/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: ENOENT: No such file or directory


 --
 Kaveh Minooie





Re: Nutch 1.7 - deleting segments

2014-05-03 Thread John Lafitte
What would be the case where you would want to keep the segments?  I'm
considering automatically deleting them after sending the data to Solr;
a sketch of what I mean is below.
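
Something like this at the end of the crawl script, maybe (untested; the
paths and Solr URL are examples):

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  -linkdb crawl/linkdb crawl/segments/* \
  && rm -rf crawl/segments/*

The && keeps the delete from running if indexing fails.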
On May 3, 2014 2:29 AM, chethan chethan.p...@gmail.com wrote:

 Thanks for your reply!

 Regards,

 --
 Chethan Prasad


 On Sat, May 3, 2014 at 12:22 PM, remi tassing tassingr...@gmail.com
 wrote:

  you are correct
 
 
  On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote:
 
   Hi,
  
   I have a Nutch crawl with 4 segments which are fully indexed using the
   bin/nutch solrindex command. Now I'm all out of storage on the box, so
   can I delete the 4 segments, retain only the crawldb, and continue
   crawling from where I left off?
  
   Since all the segments are merged and indexed to Solr, I don't see a
   problem in deleting the segments, or am I wrong there?
  
   Regards,
  
   --
   Chethan Prasad
  
 



Re: Don't fetch all urls in a page

2014-04-16 Thread John Lafitte
Hi Zabini,

I'm a little unclear on whether you are having a problem with Nutch
following the links or indexing the pages.  Have you tried both of these to
verify the links and the index data?

https://wiki.apache.org/nutch/bin/nutch%20parsechecker
https://wiki.apache.org/nutch/bin/nutch%20indexchecker

The second link above seems wrong to me; it shows *IndexingFiltersChecker*,
but I think it should be *indexchecker*.  That works for me.
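
For example:

bin/nutch parsechecker -dumpText http://www.example.com/somepage.html
bin/nutch indexchecker http://www.example.com/somepage.html

The first shows the outlinks Nutch actually extracts from the page; the
second shows the fields that would be sent to the indexer.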


On Wed, Apr 16, 2014 at 11:48 AM, Zabini antony.noli...@actimage.com wrote:

 Hi,

 I am facing a problem with the URLs Nutch fetches.

 I have a page with several URLs within it, but Nutch does not fetch them.
 They are allowed in regex-urlfilter.txt, and those URLs work fine if I put
 them in my seed list.

 Does anyone have any hint on what to do?

 Best Regards,
 Zabini



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Don-t-fetch-all-urls-in-a-page-tp4131531.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Index web folders.

2014-04-09 Thread John Lafitte
Hi Shane,

The way I save bandwidth for internal sites is by adding the internal IP
to the hosts file for the domain.  If it's the local machine you can
probably just point it to 127.0.0.1, but I kind of wonder if you would be
saving anything more than a DNS lookup...  I'm not a networking guru, but
it should route to itself automatically.
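
E.g., in /etc/hosts on the machine running Nutch (the IP is a placeholder):

10.0.0.12   www.somesite.com

The crawled URLs keep the public hostname, so the search results still
point to the live site; only the fetch traffic goes to the local copy.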


On Tue, Apr 8, 2014 at 8:30 PM, Shane Wood sh...@cbm8bit.com wrote:

 I have a few sites I mirror via FTP and would like to index them with
 Nutch without hitting the sites, to save bandwidth.
 Let's say one of them is located in /var/www/www.somesite.com/ on the same
 server Nutch is on. How would you index it and have the search results
 point to the actual site, http://somesite.com ?

 Thanks
 Shane.



Re: Control and monitor nutch-1.x via Web interface ?

2014-04-08 Thread John Lafitte
gethue seems pretty awesome, but it looks like it's more for reporting and
metrics; will it really administer Nutch?


On Tue, Apr 8, 2014 at 7:07 AM, Talat Uyarer ta...@uyarer.com wrote:

 Hi anunpak,

 If you want a pretty UI, you can use Hue [0].

 [0] http://www.gethue.com

 Talat

 2014-04-06 17:22 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:
  Nutch is a Hadoop application, so you can use the MapReduce UI to monitor
  your crawl.
 
 
  On 3 April 2014 23:17, anupamk anup...@usc.edu wrote:
 
  Hi,
 
  I am not able to find the answer to my question on Google.
 
  The question is: is there a web interface that enables me to control and
  monitor Nutch 1.x?
 
  I am aware of the nutch-admin REST API for Nutch 2.x
  (http://wiki.apache.org/nutch/NutchAdministrationUserInterface#Download).
 
  Is there a ready-to-use web interface for Nutch 1.x? If not, can I adapt
  the 2.x web UI to Nutch 1.x? How do I go about doing it?
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Control-and-monitor-nutch-1-x-via-Web-interface-tp4128995.html
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble



 --
 Talat UYARER
 Websitesi: http://talat.uyarer.com
 Twitter: http://twitter.com/talatuyarer
 Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304



Re: Unable to crawl wiki pages through Nutch

2014-04-02 Thread John Lafitte
reddibabu,

I cannot resolve wiki.ibm.com, so I'm guessing Nutch can't either.  Is that
an internal DNS record?
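
You can check what the crawl box actually sees with, e.g.:

host wiki.ibm.com

If that fails, the box needs to use whatever internal DNS serves that zone
(or a hosts file entry).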


On Wed, Apr 2, 2014 at 11:54 PM, reddibabu reddybabu...@gmail.com wrote:

 Hi All,

 I am using Apache Nutch 1.7.  I am able to crawl and index almost all
 sites except wiki pages.
 While trying to crawl wiki pages, the fetch of http://wiki.ibm.com/ fails
 with: java.net.UnknownHostException: wiki.ibm.com.

 Does it require any additional configuration for crawling wiki pages?
 Any help on this would be much appreciated.


 Thanks in advance.
 Reddi Babu



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Unable-to-crawl-wiki-pages-through-Nutch-tp4128772.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Freegen and Solr score

2014-03-26 Thread John Lafitte
Thanks Sebastian,

That did work when I set both of those to false (what I changed is shown
below), but now the URL I'm inserting has an abnormally high score.  You
mentioned two options; the first was to use FreeGenerator with an initial
score, but I cannot find it documented anywhere how to do that.  The only
parameters I see are -filter and -normalize, and they don't take values.
Can you point me in the right direction for that?
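
For reference, this is what I set in nutch-site.xml:

<property>
  <name>link.ignore.internal.host</name>
  <value>false</value>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>false</value>
</property>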


On Wed, Mar 26, 2014 at 6:59 AM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 There may be no relevant links if all documents are from one single host
 (or domain) and
  (link.ignore.internal.host == true)
 resp.
  (link.ignore.internal.domain == true)
 cf. explanations about that in the wiki.


 2014-03-26 4:09 GMT+01:00 John Lafitte jlafi...@brandextract.com:

  Thanks for that Sebastian.  So given the hint you've given me, I'm trying
  to generate the scoring using this example:
  https://wiki.apache.org/nutch/NewScoringIndexingExample
 
  But when it gets to the LinkRank part I get:
 
  2014-03-26 02:57:14,208 INFO  webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
  2014-03-26 02:57:14,913 INFO  webgraph.LinkRank - Starting link counter job
  2014-03-26 02:57:17,927 INFO  webgraph.LinkRank - Finished link counter job
  2014-03-26 02:57:17,928 INFO  webgraph.LinkRank - Reading numlinks temp file
  2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis:
  java.io.IOException: No links to process, is the webgra$
          at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
          at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
          at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)
 
  I can see the webgraph directory got created and there are directories
 and
  files in there, but I'm guessing something is not getting populated
  correctly.  Any clue what I may be doing wrong?
 
 
  On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel 
  wastl.na...@googlemail.com
   wrote:
 
   Hi John,
  
   FreeGenerator, unlike Injector, does not use db.score.injected (default =
   1.0) but sets the initial score to 0.0. If all URLs stem from
   FreeGenerator, the total score in the link graph is also 0.0, and no
   linked documents can get a higher score than 0.0.
   As possible solutions:
   - use FreeGenerator with an initial score > 0.0
     (but don't put in thousands of URLs with a score of 1.0:
      if the total score is too high some pages may get unreasonably
      high scores)
   - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the
     scores: the default scoring OPIC has the advantage of calculating
     scores online while following links. It gives good and plausible
     scores if the crawl is started from few authoritative seeds. But
     sometimes, esp. in continuous crawls, OPIC scores run out of control.
  
   Sebastian
  
   On 03/25/2014 08:31 PM, John Lafitte wrote:
 I set up a script that uses freegen to manually index new/updated URLs.
 I thought it was working great, but now I'm realizing that Solr returns
 a score of 0 for these new documents.  I thought the score was calculated
 independently of what Nutch does, using just the content and other
 metadata, but that doesn't seem to be the case.  Does anyone have a clue
 what might be causing this?  The content and other metadata look normal,
 and I reloaded the core to no avail.
   
  
  
 



Freegen and Solr score

2014-03-25 Thread John Lafitte
I set up a script that uses freegen to manually index new/updated URLs.  I
thought it was working great, but now I'm realizing that Solr returns a
score of 0 for these new documents.  I thought the score was calculated
independently of what Nutch does, using just the content and other metadata,
but that doesn't seem to be the case.  Does anyone have a clue what might be
causing this?  The content and other metadata look normal, and I reloaded
the core to no avail.


Re: Freegen and Solr score

2014-03-25 Thread John Lafitte
Thanks for that Sebastian.  So given the hint you've given me, I'm trying
to generate the scoring using this example:
https://wiki.apache.org/nutch/NewScoringIndexingExample

But when it gets to the LinkRank part I get:

2014-03-26 02:57:14,208 INFO  webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
2014-03-26 02:57:14,913 INFO  webgraph.LinkRank - Starting link counter job
2014-03-26 02:57:17,927 INFO  webgraph.LinkRank - Finished link counter job
2014-03-26 02:57:17,928 INFO  webgraph.LinkRank - Reading numlinks temp file
2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis:
java.io.IOException: No links to process, is the webgra$
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)

I can see the webgraph directory got created and there are directories and
files in there, but I'm guessing something is not getting populated
correctly.  Any clue what I may be doing wrong?
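
For reference, the sequence I'm running, as far as I understand the wiki
page (the crawl/* paths are just my layout):

bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
bin/nutch linkrank -webgraphdb crawl/webgraphdb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb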


On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

 Hi John,

 FreeGenerator, unlike Injector, does not use db.score.injected (default =
 1.0) but sets the initial score to 0.0. If all URLs stem from FreeGenerator,
 the total score in the link graph is also 0.0, and no linked documents can
 get a higher score than 0.0.
 As possible solutions:
 - use FreeGenerator with an initial score > 0.0
   (but don't put in thousands of URLs with a score of 1.0:
    if the total score is too high some pages may get unreasonably
    high scores)
 - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the scores:
   the default scoring OPIC has the advantage of calculating scores online
   while following links. It gives good and plausible scores if the crawl is
   started from few authoritative seeds. But sometimes, esp. in continuous
   crawls, OPIC scores run out of control.

 Sebastian

 On 03/25/2014 08:31 PM, John Lafitte wrote:
  I set up a script that uses freegen to manually index new/updated URLs.
  I thought it was working great, but now I'm realizing that Solr returns
  a score of 0 for these new documents.  I thought the score was calculated
  independently of what Nutch does, using just the content and other
  metadata, but that doesn't seem to be the case.  Does anyone have a clue
  what might be causing this?  The content and other metadata look normal,
  and I reloaded the core to no avail.
 




Re: Ranking Algorithm

2014-03-23 Thread John Lafitte
I am still new to this, but I believe Solr creates the score field.  There
is also a boost field from Nutch that you can save into Solr.  You'll have
to write your Solr query to sort by both.  Score, boost, title was the most
logical sorting to me.
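
For example, something like this (the core and field names are whatever
your schema uses):

http://localhost:8983/solr/select?q=some+query&sort=score+desc,boost+desc,title+asc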


On Sat, Mar 22, 2014 at 11:16 PM, azhar2007 azhar2...@outlook.com wrote:

 Which folder and files is the ranking algorithm stored in? Do Nutch and
 Solr each have one?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Ranking-Algorithm-tp4126294.html
 Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Ready to use nutch

2014-03-21 Thread John Lafitte
This is a good question.  I was looking for an out-of-the-box Solr/Nutch
solution to replace our Google Mini.  We use Rackspace and they don't seem
to offer one.  Even though it was a bit of a learning curve, I feel it was
good to learn how to actually set it up and configure it.

You might look at https://qbox.io though; they seem to offer Amazon and
Rackspace instances that they keep in their own account, with a managed
version of Elasticsearch.  We didn't go far enough to evaluate how good it
is, but it might work for you.


On Fri, Mar 21, 2014 at 4:21 AM, Jayadeep Reddy
jayad...@ehealthaccess.com wrote:

 Are there any ready-to-use Nutch instances on Amazon or any other cloud?



Re: Crawling an authenticated site

2014-03-21 Thread John Lafitte
I haven't done it myself, but it's documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
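
If I'm reading that page right, the gist is switching to protocol-httpclient
and adding credentials to conf/httpclient-auth.xml, something like this
(host and credentials are placeholders):

<auth-configuration>
  <credentials username="crawler" password="secret">
    <authscope host="intranet.example.com" port="80"/>
  </credentials>
</auth-configuration>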

I'm not sure how you would do it with forms-based auth, but if it's a
custom app you might be able to just automatically grant access if the
user agent and/or IP match up.


On Fri, Mar 21, 2014 at 12:32 PM, Laura McCord lmcc...@ucmerced.edu wrote:

 Hi,

 I have another question... If I have an authenticated site that I want to
 crawl in which I have access with my username/password. Is there a
 configuration step where I would add my credentials or is this something
 that had to be customized on my end?

 Thanks Again,
  Laura



Re: Usage Scenarios

2014-03-18 Thread John Lafitte
Thanks, Remi.  I presume I basically just need my own version of the crawl
script that uses freegen instead of generate?  Something like the sketch
below is what I have in mind.
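
Untested, but roughly (the paths and Solr URL are examples):

echo "http://www.example.com/updated-page.html" > recrawl/urls.txt
bin/nutch freegen recrawl/urls.txt crawl/segments
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT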

For the BOM issue, I searched all over for it, but just now found that
someone has already brought it up.  So I'll try that patch out.
https://issues.apache.org/jira/browse/NUTCH-1733
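
In the meantime, stripping the BOM from the files themselves also works;
with GNU sed, for example (make backups first):

sed -i '1s/^\xEF\xBB\xBF//' page.html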


On Tue, Mar 18, 2014 at 8:18 AM, remi tassing tassingr...@gmail.com wrote:

 Hi John,

 Try freegen for the second question:
 http://wiki.apache.org/nutch/bin/nutch_freegen

 Remi

 On Tuesday, March 18, 2014, John Lafitte jlafi...@brandextract.com
 wrote:

  We are just starting out with Nutch and Solr, but I have a couple of
  issues I can't find any answers for.
 
  1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
  seems to capture it and store it as some strange characters.  I can fix
  it by removing the BOM, and indexchecker confirms it no longer indexes
  those strange characters.  Is there a way to prevent this from happening
  without modifying all of the HTML files that contain it?
 
  2. Often a URL gets updated and we want to recrawl/reindex a specific URL
  on demand.  I see no way to do this currently without deleting the crawl
  directory and starting over.  What is the proper way to handle this
  situation?
 
  These are somewhat related because even though I can go through the files
  and manually remove the BOM, I can't figure out how to have Nutch reindex
  them.  We are using Nutch 1.7, but I have patched a few things and would
  be happy to upgrade if that fixes any of this.
 
  Thanks in advance for the help.
 



Usage Scenarios

2014-03-17 Thread John Lafitte
We are just starting out with Nutch and Solr, but I have a couple of issues
I can't find any answers for.

1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
seems to capture it and store it as some strange characters.  I can fix it
by removing the BOM, and indexchecker confirms it no longer indexes those
strange characters.  Is there a way to prevent this from happening without
modifying all of the HTML files that contain it?

2. Often a URL gets updated and we want to recrawl/reindex a specific URL
on demand.  I see no way to do this currently without deleting the crawl
directory and starting over.  What is the proper way to handle this
situation?

These are somewhat related because even though I can go through the files
and manually remove the BOM, I can't figure out how to have Nutch reindex
them.  We are using Nutch 1.7, but I have patched a few things and would be
happy to upgrade if that fixes any of this.

Thanks in advance for the help.


Re: [VOTE] Apache Nutch 1.8 Release Candidate #1

2014-03-04 Thread John Lafitte
+0


On Tue, Mar 4, 2014 at 1:28 PM, S.L simpleliving...@gmail.com wrote:

 +1


 On Tue, Mar 4, 2014 at 12:50 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Hi All Nutch'ers,
 
  This thread is a VOTE for releasing Apache Nutch 1.8. The release
 candidate
  comprises the following components.
 
  * A staging repository [0] containing various Maven artifacts
  * A branch-1.8 of the trunk code [1]
  * The tagged source upon which we are VOTE'ing [2]
  * Finally, the release artifacts [3] which I would encourage you to
 verify
  for signatures and test.
 
  You should use the following KEYS [4] file to verify the signatures of
 all
  release artifacts.
 
  Please VOTE as follows
 
  [ ] +1 Push the release, I am happy :)
  [ ] +0 I am not bothered either way
  [ ] -1 I am not happy with this release candidate (please state why)
 
  Firstly thank you to everyone that contributed to Nutch. Secondly, thank
  you to everyone that VOTE's. It is appreciated.
 
  Thanks
  Lewis
  (on behalf of Nutch PMC)
 
  p.s. Here's my +1
 
  [0]
  https://repository.apache.org/content/repositories/orgapachenutch-1000/
  [1] https://svn.apache.org/repos/asf/nutch/branches/branch-1.8
  [2] https://svn.apache.org/repos/asf/nutch/tags/release-1.8/
  [3] http://people.apache.org/~lewismc/nutch/1.8/
  https://dist.apache.org/repos/dist/dev/nutch/1.8
  [4] http://people.apache.org/~lewismc/nutch/KEYS
  http://www.apache.org/dist/nutch/KEYS
 
  --
  *Lewis*
 



multivalues returned unexpectedly

2014-02-24 Thread John Lafitte
I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem indexing RSS
that has channel/title and then channel/image/title: Nutch tries to add
both of them, and solrindex then fails because title isn't multivalued.

I've used nutch indexchecker and I see the two titles being returned.  The
extra title is the value in the Content-Disposition: filename HTTP header.
I only see one title when I run nutch readseg, so I'm a little confused
about where the second title comes from.

I have made title multivalued in the Solr schema and it seems to work that
way, but it feels wrong to me.  Documents shouldn't have more than one
title.  What is the correct way to fix this?