Re: Global locking

2006-02-16 Thread Gal Nitzan
Well, at the moment it solves the problem I mentioned yesterday, where
all the tasktrackers access the same site under Hadoop. It seems that
using job.setBoolean("mapred.speculative.execution", false); didn't
help, and I'm not sure why.

However, though it is one more piece of software, it removes the need
for special treatment of the fetcher, i.e. special fetchlists built by
the generator. Now the fetcher/tasktracker is supposed to access hosts
politely even though its list contains various hosts. I also noticed
that sometimes the generator put both hosts (there are only 2 hosts in
the seed) into the same fetchlist, which made only one tasktracker work
instead of two.
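
As a toy illustration (not the actual Nutch generator code): if urls
are assigned to fetchlists by hashing the host modulo the number of
lists, two seed hosts share a fetchlist whenever their hashes collide:

public class PartitionDemo {
  public static void main(String[] args) {
    int numLists = 2;  // two fetchlists, one per tasktracker
    String[] seedHosts = { "hostA.example.com", "hostB.example.org" };
    for (String host : seedHosts) {
      // mask the sign bit so the modulo result is never negative
      int list = (host.hashCode() & Integer.MAX_VALUE) % numLists;
      System.out.println(host + " -> fetchlist " + list);
    }
    // with 2 hosts and 2 lists, both hosts land in the same list about
    // half the time, leaving one tasktracker with nothing to fetch
  }
}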

I'm sorry if it sounds a little confusing :) or unreasonable... :)

Gal



On Thu, 2006-02-16 at 13:47 -0800, Doug Cutting wrote:
> Gal Nitzan wrote:
> > I have implemented a down and dirty Global Locking:
> >  [ ... ]
> > 
> > I changed the FetcherThread constructor to create an instance of
> > SyncManager.
> > 
> > Also, in the run method I try to get a lock on the host. If not
> > successful, I add the url into an ArrayList for later
> > processing...
> > 
> > I also changed the generator to put each url into a separate array
> > so all fetchlists are even.
> 
> What problem does this fix?
> 
> Doug
> 




Re: Global locking

2006-02-16 Thread Doug Cutting

Gal Nitzan wrote:

> I have implemented a down and dirty Global Locking:
>  [ ... ]
> 
> I changed the FetcherThread constructor to create an instance of
> SyncManager.
> 
> Also, in the run method I try to get a lock on the host. If not
> successful, I add the url into an ArrayList for later
> processing...
> 
> I also changed the generator to put each url into a separate array
> so all fetchlists are even.


What problem does this fix?

Doug


Global locking

2006-02-16 Thread Gal Nitzan
I have implemented a down and dirty Global Locking:

I am currently testing it, but I would like to get other people's ideas
on this.

I used RMI for this purpose:

An RMI server which implements two methods {
  boolean lock(String urlString);
  void unlock(String urlString);
}
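
To be callable remotely, the interface has to extend java.rmi.Remote
and declare RemoteException on each method — a minimal sketch, with an
interface name of my own since the message doesn't give one:

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface SyncService extends Remote {
  // true if the caller may fetch from this url's host right now
  boolean lock(String urlString) throws RemoteException;
  // release one lock previously taken on this url's host
  void unlock(String urlString) throws RemoteException;
}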

The server holds a map whose key is an Integer (the host hash) and
whose value is a very simplistic class:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Date;

public class LockObj {
  private int hash;
  private long start;
  private long timeout;
  private int max_locks;
  private int locks = 0;

  public LockObj(int hash, long timeout, int max_locks) {
    this.hash = hash;
    this.timeout = timeout;
    this.start = new Date().getTime();
    this.max_locks = max_locks;
  }

  // grant a lock only while fewer than max_locks are held, so that
  // at most max_locks fetchers hit one host at a time
  public synchronized boolean lock() {
    if (locks < max_locks) {
      locks++;
      return true;
    }
    return false;
  }

  public synchronized void unlock() {
    if (locks > 0) {
      locks--;
    }
  }

  public synchronized int locks() {
    return locks;
  }

  // convert the host part of a url to a hash;
  // on a malformed url, hash the whole input string instead
  public static int make_hash(String urlString) {
    URL url = null;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      // fall through and hash the raw string
    }
    return (url == null ? urlString : url.getHost()).hashCode();
  }

  // check whether this object's timeout has been reached
  // (later: implement a listener event)
  public boolean timeout_reached() {
    long current = new Date().getTime();
    return (current - start) > timeout;
  }

  // free all locks
  public synchronized void unlock_all() {
    locks = 0;
  }

  public int hash() {
    return hash;
  }
}

Not the prettiest thing, but I just cleared the first barrier... it
worked!!!
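
For context, a minimal sketch of how the server side might tie the map
and LockObj together (the class, field, and parameter names here are
illustrative guesses; only LockObj above is from the actual code):

import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.HashMap;
import java.util.Map;

public class SyncServer extends UnicastRemoteObject implements SyncService {
  // host hash -> per-host lock state
  private final Map<Integer, LockObj> hosts = new HashMap<Integer, LockObj>();
  private final long timeout;
  private final int maxLocks;

  public SyncServer(long timeout, int maxLocks) throws RemoteException {
    this.timeout = timeout;
    this.maxLocks = maxLocks;
  }

  public synchronized boolean lock(String urlString) throws RemoteException {
    int hash = LockObj.make_hash(urlString);
    LockObj lock = hosts.get(hash);
    if (lock == null || lock.timeout_reached()) {
      // first request for this host, or a stale entry: start fresh
      lock = new LockObj(hash, timeout, maxLocks);
      hosts.put(hash, lock);
    }
    return lock.lock();
  }

  public synchronized void unlock(String urlString) throws RemoteException {
    LockObj lock = hosts.get(LockObj.make_hash(urlString));
    if (lock != null) {
      lock.unlock();
    }
  }
}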


I changed the FetcherThread constructor to create an instance of
SyncManager.

Also, in the run method I try to get a lock on the host. If not
successful, I add the url into an ArrayList for later
processing...

I also changed the generator to put each url into a separate array so
all fetchlists are even.
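
In code, the run-method flow described above might look roughly like
this (SyncService is the remote interface sketched earlier;
PoliteFetchLoop and fetch() are placeholders, not the actual patch):

import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

public class PoliteFetchLoop {
  private final SyncService sync;

  public PoliteFetchLoop(SyncService sync) {
    this.sync = sync;
  }

  public void run(List<String> urls) throws RemoteException {
    List<String> deferred = new ArrayList<String>();
    for (String url : urls) {
      if (sync.lock(url)) {   // host lock granted by the RMI server
        try {
          fetch(url);
        } finally {
          sync.unlock(url);   // always release the host lock
        }
      } else {
        deferred.add(url);    // host busy: keep the url for a later pass
      }
    }
    // a real implementation would go around again on 'deferred' here
  }

  private void fetch(String url) { /* actual fetching elided */ }
}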

I would appreciate your comments and any ideas for improvement.

The RMI is a little cumbersome, but hey... for now it works for 5
tasktrackers without a problem (so it seems) :)


Gal




On Wed, 2006-02-15 at 14:55 -0800, Doug Cutting wrote:
> Andrzej Bialecki wrote:
> > (FYI: if you wonder how it was working before, the trick was to generate 
> > just 1 split for the fetch job, which then lead to just one task being 
> > created for any input fetchlist.
> 
> I don't think that's right.  The generator uses setNumReduceTasks() to
> set the desired number of fetch tasks, which controls how many
> host-disjoint fetchlists are generated.  Then the fetcher does not
> permit input files to be split, so that fetch tasks remain
> host-disjoint.  So lots of splits can be generated, by default as many
> as mapred.map.tasks, permitting lots of parallel fetching.
> 
> This should still work.  If it does not, I'd be interested to hear more 
> details.
> 
> Doug
>
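
To make the two mechanisms Doug describes concrete, a rough sketch
against the old mapred API of that era (simplified and from memory, not
the actual Nutch source):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class FetchJobSetup {

  // Generator side: one reduce task per desired fetchlist; a
  // host-based partitioner then keeps each output list host-disjoint.
  public static void configureGenerator(JobConf job, int numFetchLists) {
    job.setNumReduceTasks(numFetchLists);
  }

  // Fetcher side: refuse to split input files, so each fetchlist is
  // consumed whole by a single map task and tasks stay host-disjoint,
  // while many fetchlists still fetch in parallel.
  public static class FetchInputFormat extends SequenceFileInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }
}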