Re: RF=1 w/ hadoop jobs

2011-09-05 Thread Mick Semb Wever
On Fri, 2011-09-02 at 09:28 +0200, Patrik Modesto wrote:
 We use Cassandra as storage for web pages; we store the HTML, all
 URLs that have the same HTML data, and some computed data. We run Hadoop
 MR jobs to compute lexical and thematical data for each page and to
 export the data to binary files for later use. A URL gets into
 Cassandra on a user request (a pageview), so if we delete a URL, it comes
 back quickly if the page is active. Because of that, and because there
 is lots of data, we have the keyspace set to RF=1. We can drop the
 whole keyspace and it will regenerate quickly and contain only
 fresh data, so we don't care about losing a node.

I've entered a jira issue covering this request.
https://issues.apache.org/jira/browse/CASSANDRA-3136

Would you mind attaching your patch to the issue?
(No review of it will happen anywhere else.)

~mck

-- 
“Innovators and creative geniuses cannot be reared in schools. They are
precisely the men who defy what the school has taught them.” - Ludwig
von Mises 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |





Re: RF=1 w/ hadoop jobs

2011-09-05 Thread Patrik Modesto
On Mon, Sep 5, 2011 at 09:39, Mick Semb Wever m...@apache.org wrote:
 I've entered a jira issue covering this request.
 https://issues.apache.org/jira/browse/CASSANDRA-3136

 Would you mind attaching your patch to the issue.
 (No review of it will happen anywhere else.)

I see Jonathan didn't change his mind, as the ticket was resolved as
Won't Fix. So that's it. I'm not going to attach the patch until
another core Cassandra developer is interested in the use-cases
mentioned there.

I'm not sure about 0.8.x and 0.7.9 (to be released today with your
patch), but 0.7.8 will fail even with RF=1 when there is a Hadoop
TaskTracker without a local Cassandra node. So increasing RF is not a
solution.

Regards,
Patrik


Re: RF=1 w/ hadoop jobs

2011-09-05 Thread Mick Semb Wever
On Mon, 2011-09-05 at 21:52 +0200, Patrik Modesto wrote:
 I'm not sure about 0.8.x and 0.7.9 (to be released today with your
 patch), but 0.7.8 will fail even with RF=1 when there is a Hadoop
 TaskTracker without a local Cassandra node. So increasing RF is not a
 solution.

This isn't true (or not the intention).

If you increase RF then yes, the task will fail, but it will get re-run on
the next replica. So the job takes longer, but should still work.
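To make the re-run behaviour concrete, here is a minimal toy model (not Cassandra or Hadoop code; all names are illustrative) of why RF > 1 lets a failed task eventually succeed: the split is retried against each replica in turn, so the job only fails once every replica for the range is down.

```java
import java.util.List;
import java.util.Set;

// Toy model of task retry across replicas. With RF=1 a single down node
// fails the job; with RF>1 a re-run lands on the next live replica.
class ReplicaRetry {
    /** Returns the first live replica the task can read from,
     *  or null when every replica is down (i.e. the job fails). */
    static String runSplit(List<String> replicas, Set<String> downNodes) {
        for (String replica : replicas) {
            if (!downNodes.contains(replica)) {
                return replica; // re-run of the task succeeds here
            }
        }
        return null; // all replicas exhausted: the whole job fails
    }
}
```

With RF=3 and one node down the split still completes on another replica; with RF=1 the single replica being down fails the job, which matches Patrik's observation.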

~mck

-- 
This is my simple religion. There is no need for temples; no need for
complicated philosophy. Our own brain, our own heart is our temple; the
philosophy is kindness. The Dalai Lama 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |




Re: RF=1 w/ hadoop jobs

2011-09-02 Thread Patrik Modesto
Hi,

On Thu, Sep 1, 2011 at 12:36, Mck m...@apache.org wrote:
 It's available here: http://pastebin.com/hhrr8m9P (for version 0.7.8)

 I'm interested in this patch and see its usefulness, but no one will act
 until you attach it to an issue. (I think a new issue is appropriate
 here.)

I'm glad someone finds my patch useful. As Jonathan already explained
himself, "ignoring unavailable ranges is a misfeature, imo", I'm
thinking opening a new ticket without support from more users is
useless ATM. Please test the patch, and if you like it, then there is
time for a ticket.

Regards,
P.


Re: RF=1 w/ hadoop jobs

2011-09-02 Thread Mick Semb Wever
On Fri, 2011-09-02 at 08:20 +0200, Patrik Modesto wrote:
 As Jonathan
 already explained himself: ignoring unavailable ranges is a
 misfeature, imo 

Generally it's not what one would want, I think.
But I can see the case where the data is treated as volatile and ignoring
unavailable ranges may be acceptable.

For example, if you're searching for something or some pattern and one hit
is enough: if you get the hit, it's a positive result regardless of whether
ranges were ignored; if you don't, and you *know* a range was
ignored along the way, you can re-run the job later. The worst-case
scenario here is no worse than the job always failing on you. Some
indication of which ranges were ignored is required, though.

Another example is when you're just trying to extract a small random
sample (like a Pig SAMPLE) of data out of Cassandra.

Patrik: is it possible to describe the use-case you have here?

~mck

-- 
“The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore, all
progress depends on the unreasonable man.” - George Bernard Shaw 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |





Re: RF=1 w/ hadoop jobs

2011-09-02 Thread Patrik Modesto
On Fri, Sep 2, 2011 at 08:54, Mick Semb Wever m...@apache.org wrote:
 Patrik: is it possible to describe the use-case you have here?

Sure.

We use Cassandra as storage for web pages; we store the HTML, all
URLs that have the same HTML data, and some computed data. We run Hadoop
MR jobs to compute lexical and thematical data for each page and to
export the data to binary files for later use. A URL gets into
Cassandra on a user request (a pageview), so if we delete a URL, it comes
back quickly if the page is active. Because of that, and because there
is lots of data, we have the keyspace set to RF=1. We can drop the
whole keyspace and it will regenerate quickly and contain only
fresh data, so we don't care about losing a node. But Hadoop does
care; to be specific, the Cassandra ColumnFamilyInputFormat and
ColumnFamilyRecordReader are the problem parts. If I stop one Cassandra
node, all MR jobs that read/write Cassandra fail. In our case it doesn't
matter; we can skip that range of URLs. The MR jobs run in a tight
loop, so when the node is back with its data, we use it. It's not
only about some HW crash: it makes maintenance quite difficult. To
stop a Cassandra node, you have to stop the tasktracker there too, which is
unfortunate, as there are other MR jobs that don't need Cassandra and
could happily keep running.

Regards,
P.


Re: RF=1 w/ hadoop jobs

2011-09-01 Thread Mck
On Thu, 2011-08-18 at 08:54 +0200, Patrik Modesto wrote:
 But there is another problem with Hadoop-Cassandra: if there is no
 node available for a range of keys, it fails with a RuntimeException. For
 example, having a keyspace with RF=1 and a node down, all MapReduce
 tasks fail.

CASSANDRA-2388 is related but not the same.

Before 0.8.4 the behaviour was: if the local cassandra node didn't have
the split's data, the tasktracker would connect to another cassandra node
where the split's data could be found.

So even before 0.8.4, with RF=1, you would have your hadoop job fail.

Although I've reopened CASSANDRA-2388 (and reverted the code locally),
because the new behaviour in 0.8.4 leads to abysmal tasktracker
throughput (for me, task allocation doesn't seem to honour data locality
according to split.getLocations()).
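As a sketch of the data-locality decision being discussed (a toy model, not Hadoop's scheduler; names are illustrative): each split advertises preferred hosts, which is what split.getLocations() returns, and a locality-aware scheduler hands a tasktracker a local split first, falling back to a remote one only when nothing local remains. The throughput problem appears when the fallback path dominates.

```java
import java.util.List;
import java.util.Set;

// Toy model of locality-aware split assignment. splitLocations.get(i)
// plays the role of split.getLocations() for split i.
class LocalityScheduler {
    /** Pick a split index for the given tasktracker host, preferring
     *  splits whose data lives on that host. */
    static int pickSplit(List<List<String>> splitLocations,
                         Set<Integer> unassigned, String host) {
        for (int i : unassigned) {                  // first pass: local splits only
            if (splitLocations.get(i).contains(host)) return i;
        }
        return unassigned.iterator().next();        // fallback: remote read
    }
}
```

When assignment ignores the first pass entirely, every read becomes the remote-fallback case, which is the "abysmal throughput" symptom described above.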

 I've reworked my previous patch that was addressing this
 issue, and now there are ConfigHelper methods to enable/disable
 ignoring unavailable ranges.
 It's available here: http://pastebin.com/hhrr8m9P (for version 0.7.8)

I'm interested in this patch and see its usefulness, but no one will act
until you attach it to an issue. (I think a new issue is appropriate
here.)
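For readers who can't fetch the pastebin: the shape of what such a patch adds can be sketched as below. This is a hypothetical model, not the actual patch — the property key, class, and method names are made up for illustration — showing a ConfigHelper-style boolean flag that lets the input format skip a range whose replicas are all down instead of failing the whole job.

```java
import java.util.Map;

// Hypothetical sketch of an "ignore unavailable ranges" switch.
// The conf map stands in for a Hadoop Configuration.
class IgnoreRangesSketch {
    static final String IGNORE_KEY = "cassandra.input.ignore.unavailable.ranges";

    static boolean ignoreUnavailable(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(IGNORE_KEY, "false"));
    }

    /** Returns true if the split was processed, false if it was skipped. */
    static boolean readSplit(Map<String, String> conf, boolean rangeAvailable) {
        if (rangeAvailable) return true;
        if (ignoreUnavailable(conf)) return false;  // skip the range, job continues
        // default behaviour: no live replica means the task (and job) fails
        throw new RuntimeException("unavailable range and ignoring is disabled");
    }
}
```

With the flag off, an unavailable range fails the job (today's behaviour); with it on, the range is silently skipped, which is acceptable only for volatile data like the crawl use-case in this thread.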

~mck