Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Andrzej Bialecki
Robert Young wrote:
 In org.apache.nutch.crawl.LinkDb on line 261 it creates a working
 directory (newLinkDb) based on the current working directory. This
 should be configurable rather than being based on where Tomcat was
 started. I am planning on writing a patch to pull the hadoop.tmp.dir
 setting if it is available, falling back to the current directory.
 
 Can anyone see any obvious problems with doing this?

I'm not sure what Tomcat has to do with this. LinkDb does it this way in 
order to avoid a rename() operation across physical volumes - if you 
invoke rename() on a local FS it may trigger a costly copy operation.
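
As a rough illustration of the concern above, the following sketch checks
whether two existing directories sit on the same file store, which is the
case where a local rename can normally avoid copying data. It uses only
the standard java.nio.file API, not Hadoop's FileSystem; the class and
method names (VolumeCheck, sameVolume) are made up for this example and
are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch or Hadoop.
final class VolumeCheck {

    /** Returns true if both existing paths are on the same file store (volume). */
    static boolean sameVolume(Path a, Path b) throws IOException {
        FileStore storeA = Files.getFileStore(a);
        FileStore storeB = Files.getFileStore(b);
        return storeA.equals(storeB);
    }

    public static void main(String[] args) throws IOException {
        // A directory and its freshly created child are expected to share a volume.
        Path base = Files.createTempDirectory("linkdb-demo");
        Path work = Files.createDirectory(base.resolve("work"));
        System.out.println("same volume: " + sameVolume(base, work));
    }
}
```

A check like this could be run before choosing a working directory, so
that the final rename stays on one volume.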


-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Robert Young
Tomcat only comes into it because we have to start Tomcat in the
searcher directory; I'm guessing it's the same however you choose to
use Nutch. Would it not still have to do a rename across physical
volumes if searcher.dir is set to something different?

How does this sound as a solution? Allow the user to set a
configuration option for the linkdb working dir, or a configuration
flag that tells LinkDb to take the base dir from another particular
configuration option. Otherwise fall back to the default, which is
the current working directory.

Cheers
Rob
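
The fallback order proposed above could look roughly like the sketch
below. It is only an illustration under stated assumptions:
java.util.Properties stands in for Hadoop's Configuration so the example
is self-contained, "linkdb.working.dir" is a hypothetical property name
(no such Nutch setting is confirmed in this thread), and the class
LinkDbWorkDir is invented for the example. hadoop.tmp.dir is the setting
mentioned in the original proposal.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of Nutch.
final class LinkDbWorkDir {

    // Hypothetical property name, used here for illustration only.
    static final String WORK_DIR_KEY = "linkdb.working.dir";
    // Hadoop setting named in the original proposal.
    static final String HADOOP_TMP_KEY = "hadoop.tmp.dir";

    /** Picks the explicit option, then hadoop.tmp.dir, then the current directory. */
    static Path resolve(Properties conf) {
        String dir = conf.getProperty(WORK_DIR_KEY);
        if (dir == null || dir.trim().isEmpty()) {
            dir = conf.getProperty(HADOOP_TMP_KEY);
        }
        if (dir == null || dir.trim().isEmpty()) {
            // Default behaviour described in the thread: the current working directory.
            dir = System.getProperty("user.dir");
        }
        return Paths.get(dir).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default:  " + resolve(conf));
        conf.setProperty(WORK_DIR_KEY, "/data/nutch/work");
        System.out.println("explicit: " + resolve(conf));
    }
}
```

Note that a working directory on a different volume from the linkdb
itself would bring back the cross-volume rename cost raised earlier in
the thread, so the default is kept as the last fallback.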



Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Briggs
I don't use the Nutch web application, but you don't have to
start Nutch in the searcher directory.  You can set the location of
the searcher dir within the nutch-site.xml config file.

Add this node and set the location of your index:

<property>
  <name>searcher.dir</name>
  <value>/your/path/to/your/index</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory index containing
  merged indexes, or the directory segments containing segment
  indexes.
  </description>
</property>

-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Robert Young
Yes, I do this for the searcher directory, but the LinkDb class
refers to a Path which is relative (just for a temporary working
directory). This is the problem: if I start Tomcat in a directory
where the Java user does not have permission to create a directory,
then LinkDb fails.
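
To make the failure mode concrete, here is a small sketch in plain
java.nio.file (not Hadoop's Path class) showing that a relative path is
resolved against the directory the JVM was started in, and how one might
check up front whether a directory can be created there. The directory
name "linkdb-tmp" and the class RelativePathDemo are placeholders for
this example, not names taken from LinkDb.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class, not part of Nutch.
final class RelativePathDemo {

    /** Resolves a relative name the way the JVM does: against its start directory. */
    static Path resolveAgainstCwd(String relative) {
        return Paths.get(relative).toAbsolutePath();
    }

    /** Returns true if dir exists, is a directory, and is writable by this process. */
    static boolean canCreateIn(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        // "linkdb-tmp" is a placeholder name for the temporary working directory.
        Path work = resolveAgainstCwd("linkdb-tmp");
        System.out.println("resolved to: " + work);
        System.out.println("parent writable: " + canCreateIn(work.getParent()));
    }
}
```

When the servlet container is started from a directory the Java user
cannot write to, the second check would report false, which matches the
failure described above.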



Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Briggs
Ahh, now I see what you are referring to.  Thanks for the question.
Now I know why I was getting garbage in my directory a while back.
So, I guess you may need to edit that class.  Are you using hadoop in
local mode?


-- 
Conscious decisions by conscious minds are what make reality real
