Re: [Nutch-dev] Looking to fix relative path issue in linkdb
Robert Young wrote:
> In org.apache.nutch.crawl.LinkDb on line 261 it creates a working directory (newLinkDb) based on the current working directory. This should be configurable rather than being based on where Tomcat was started. I am planning on writing a patch to pull the hadoop.tmp.dir setting if it is available, falling back to the current directory. Can anyone see any obvious problems with doing this?

I'm not sure what Tomcat has to do with this. LinkDb does it this way in order to avoid a rename() operation across physical volumes - if you invoke rename() across volumes on a local FS, it may trigger a costly copy operation.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
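The trade-off Andrzej describes can be illustrated with plain java.nio.file instead of Hadoop's FileSystem API. That substitution is an assumption made to keep the sketch self-contained, and this is not LinkDb's actual code. The idea: build the new data in a sibling of the target so the final rename stays on one volume. Requesting ATOMIC_MOVE makes Java throw rather than quietly fall back to a copy if the two paths turn out to be on different file stores. The class name, the "newLinkDb-" prefix, and the "part-00000" file are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SameVolumeRename {

    // Builds new data in a temporary sibling of the target, then renames it into place.
    // Both paths share a parent directory, so the rename stays on one volume.
    static Path buildAndInstall(Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        Path work = Files.createTempDirectory(parent, "newLinkDb-");
        Files.write(work.resolve("part-00000"), "example".getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException instead of copying
        // when source and target are on different file stores.
        return Files.move(work, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("crawl-");
        Path installed = buildAndInstall(base.resolve("linkdb"));
        System.out.println("installed " + installed);
    }
}
```

If the working directory were taken from an unrelated location such as a temp directory on another disk, the same move would either fail (with ATOMIC_MOVE) or degrade to a copy followed by a delete, which is the cost the current LinkDb layout avoids.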
Re: [Nutch-dev] Looking to fix relative path issue in linkdb
Tomcat only comes into it because we have to start Tomcat in the searcher directory; I'm guessing it's the same however you choose to use Nutch. It would still have to do a rename across physical volumes if searcher.dir is set to something different, would it not?

How does this sound as a solution? Allow the user to set a configuration option for the LinkDb working directory, or a configuration flag that names another configuration option to use as the base directory. Otherwise, fall back to the default, which is the current working directory.

Cheers,
Rob

On 7/19/07, Andrzej Bialecki wrote:
> I'm not sure what Tomcat has to do with this. LinkDb does it this way in order to avoid a rename() operation across physical volumes - if you invoke rename() on a local FS it may trigger a costly copy operation.
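The lookup order Rob proposes could be sketched roughly as follows. This is only an illustration: java.util.Properties stands in for the Nutch/Hadoop configuration object, and both property names (linkdb.work.dir and linkdb.work.dir.base.property) are hypothetical and do not exist in Nutch.

```java
import java.util.Properties;

final class LinkDbWorkDir {

    // Hypothetical property names, invented for this sketch; neither exists in Nutch.
    static final String WORK_DIR = "linkdb.work.dir";
    static final String BASE_DIR_PROPERTY = "linkdb.work.dir.base.property";

    // Resolution order: an explicit working directory, then the value of the
    // property named by the indirection option (for example hadoop.tmp.dir),
    // then the current working directory.
    static String resolve(Properties conf, String currentDir) {
        String explicit = conf.getProperty(WORK_DIR);
        if (explicit != null && !explicit.isEmpty()) {
            return explicit;
        }
        String baseProperty = conf.getProperty(BASE_DIR_PROPERTY);
        if (baseProperty != null) {
            String base = conf.getProperty(baseProperty);
            if (base != null && !base.isEmpty()) {
                return base;
            }
        }
        return currentDir;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("hadoop.tmp.dir", "/tmp/hadoop");
        conf.setProperty(BASE_DIR_PROPERTY, "hadoop.tmp.dir");
        System.out.println(resolve(conf, System.getProperty("user.dir")));
    }
}
```

Keeping the current directory as the last fallback preserves today's behaviour for anyone who sets neither option. Whichever directory is chosen, the rename concern still applies if it sits on a different volume from the linkdb itself.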
Re: [Nutch-dev] Looking to fix relative path issue in linkdb
I don't use the Nutch web application, but you don't have to start Nutch in the searcher directory. You can set the location of the searcher directory in the nutch-site.xml config file. Add this node and set the location of your index:

<property>
  <name>searcher.dir</name>
  <value>/your/path/to/your/index</value>
  <description>Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory index containing merged indexes, or the directory segments containing segment indexes.</description>
</property>

On 7/19/07, Robert Young wrote:
> Tomcat only comes into it because we have to start Tomcat in the searcher directory; I'm guessing it's the same however you choose to use Nutch. It would still have to do a rename across physical volumes if searcher.dir is set to something different, would it not?
>
> How does this sound as a solution? Allow the user to set a configuration option for the LinkDb working directory, or a configuration flag that names another configuration option to use as the base directory. Otherwise, fall back to the default, which is the current working directory.
>
> Cheers,
> Rob

--
Conscious decisions by conscious minds are what make reality real
Re: [Nutch-dev] Looking to fix relative path issue in linkdb
Yes, I do this for the searcher directory, but the LinkDb class refers to a relative Path (just for a temporary working directory). This is the problem: if I start Tomcat in a directory where the Java user does not have permission to create a directory, LinkDb fails.

On 7/19/07, Briggs wrote:
> I don't use the Nutch web application, but you don't have to start Nutch in the searcher directory. You can set the location of the searcher directory in the nutch-site.xml config file. Add this node and set the location of your index:
>
> <property>
>   <name>searcher.dir</name>
>   <value>/your/path/to/your/index</value>
>   <description>Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory index containing merged indexes, or the directory segments containing segment indexes.</description>
> </property>
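The failure Rob describes comes from a relative path being resolved against whatever directory the JVM was started in. A minimal sketch of checking that up front, using only java.nio.file (the class and method names are hypothetical, and this is not how LinkDb behaves today):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WorkDirCheck {

    // True when the directory exists and the current process may create entries in it.
    static boolean canCreateIn(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        // An empty relative path resolves against the directory the JVM was started in.
        Path cwd = Paths.get("").toAbsolutePath();
        System.out.println(cwd + " usable for temporary files: " + canCreateIn(cwd));
    }
}
```

A check like this would at least turn the failure into a clear message about the start directory instead of an error partway through a LinkDb update.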
Re: [Nutch-dev] Looking to fix relative path issue in linkdb
Ahh, now I see what you are referring to. Thanks for the question. Now I know why I was getting garbage in my directory a while back. So, I guess you may need to edit that class. Are you using Hadoop in local mode?

On 7/19/07, Robert Young wrote:
> Yes, I do this for the searcher directory, but the LinkDb class refers to a relative Path (just for a temporary working directory). This is the problem: if I start Tomcat in a directory where the Java user does not have permission to create a directory, LinkDb fails.

--
Conscious decisions by conscious minds are what make reality real