Re: Storing and querying IP ranges in Cassandra

2011-11-02 Thread Tamas Marki
On Tue, Nov 1, 2011 at 6:30 PM, Brandon Williams dri...@gmail.com wrote:

 On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote:
  Hello,
 
  I'm new to the list and also to Cassandra. I found it when I was
 searching
  for something to replace our busy mysql server.
 
  One of the things we use the server for is filtering IPs based on a list
 of
  IP ranges. These ranges can be small and big, and there are about 50k of
  them in the database.
 
  In mysql this is pretty quick: they are stored as integers, and the query
  basically looks like (say ip is the ip we want to find the all the ranges
  for):
 
  select range from rangelist where ip_start=ip and ip_end=ip;
 
  I tried to move this schema to Cassandra, but it turned out to be very
 slow,
  even with indexes on both columns. Since I also had to have an EQ
 expression
  in the query, I added an indexed text field which was the same for all
 rows,
  so the query in cassandra was something like this:
 
  select range from rangelist where type='ip' and ip_start=ip and
 ip_end=ip;
 
  This was very slow, and I imagine it is because it has to scan through
 all
  the rows, making the index useless.

 This basically boils down to a binary search problem, so you don't
 really need an index.  Assuming IPv4, what I would do is make the
 first two bytes (class A and B, respectively) the row key. This will
 give you 65025 rows, each with possibly 65025 columns (each column
 name will be the other two bytes.)  When you need to find an ip, you
 go to the row key and then slice the columns to find a match.  This
 works well until you need to search an entire class A, in which case
 you'll need to do 255 checks, but in parallel this won't be too bad,
 especially because the bloom filter will save you on non-existent
 rows.  Presumably there is no need to search all class As, unless for
 some reason you don't know the first byte, which would be somewhat
 strange.  If you do need to span a few class As this will begin to
 fall apart, but hopefully that's not a common use case.


Thans, this sounds like a good plan, as I also need to store some info
about the ranges, and this way the values can store that.
I'll try to implement this and will get back to you on how it performs.


-- 
Tamas Marki


Re: Storing and querying IP ranges in Cassandra

2011-11-02 Thread Tamas Marki
On Tue, Nov 1, 2011 at 7:28 PM, Aaron Turner synfina...@gmail.com wrote:

 On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki tma...@gmail.com wrote:
  Hello,
 
  I'm new to the list and also to Cassandra. I found it when I was
 searching
  for something to replace our busy mysql server.
 
  One of the things we use the server for is filtering IPs based on a list
 of
  IP ranges. These ranges can be small and big, and there are about 50k of
  them in the database.
 
  In mysql this is pretty quick: they are stored as integers, and the query
  basically looks like (say ip is the ip we want to find the all the ranges
  for):
 
  select range from rangelist where ip_start=ip and ip_end=ip;
 
  I tried to move this schema to Cassandra, but it turned out to be very
 slow,
  even with indexes on both columns. Since I also had to have an EQ
 expression
  in the query, I added an indexed text field which was the same for all
 rows,
  so the query in cassandra was something like this:
 
  select range from rangelist where type='ip' and ip_start=ip and
 ip_end=ip;
 
  This was very slow, and I imagine it is because it has to scan through
 all
  the rows, making the index useless.
 
  The second thing I tried was to just expand the ranges and store
 individual
  IPs as the keys to a column family. This is very fast to query, but the
  problem is that I now have over 2.7 million rows, because some of the
 ranges
  are quite large.
 
  As the number of ranges could change, this method could be a problem -
  imagine we add a whole A-class range, it would explode into millions of
  rows.
 
  My question is, is there a more sane way to store this information, while
  still being able to find all the IP ranges that have the given IP in
 them?
 
  I've been only dealing with Cassandra for a week or two, so I don't know
  about the inner details of what can be done, but I do have programming
  experience and am not afraid to get my hands dirty, in case it can be
 solved
  by writing some extension to Cassandra.

 So, this is just off the top of my head, and I'm not an expert, but
 perhaps this will give you some ideas:

 I'm assuming you're talking IPv4.  If it's IPv6, you'll need to do some
 tweaking

 Create a CF with RowKey of AsciiType and Column Names of LongType.
 Values are BytesType.  Basically you create one Row for each /8 and
 name the row with the network address of that row.  So you'd have:

 1.0.0.0
 2.0.0.0
 3.0.0.0

 etc as row keys

 Then for every /16 under that, store a single name/value pair where
 the name is the inet_aton encoded value of the IP address of the /16
 network and the value is a bitmask representing the IP's of the /16.
 Set 1 = filter, 0 = don't filter.  Basically the first bit would be
 0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc.

 So basically you'll have:
 255 rows
 each row with 255 columns
 each column would store 8K bytes (since 2^16/8 = 8K)

 Alternatively, you could store 16K columns per row (each column is a
 /24) and each column would have 8 bytes.  Off the top of my head I'm
 not sure which would be faster, but the first solution would be more
 disk space efficient.  If you need to update your bitmasks regularly,
 you're probably better off with the second solution.

 Wrap a little API around this and you have fast and direct access to
 know if a given IP should be filtered or not.


Thanks, these are also good suggestions, but I'll go with Brandon's
suggestion first (as it seems to be more suitable for my scenario - I not
only need to know if the IP needs to be filtered, but also what range was
it found in.


-- 
Tamas Marki


Storing and querying IP ranges in Cassandra

2011-11-01 Thread Tamas Marki
Hello,

I'm new to the list and also to Cassandra. I found it when I was searching
for something to replace our busy mysql server.

One of the things we use the server for is filtering IPs based on a list of
IP ranges. These ranges can be small and big, and there are about 50k of
them in the database.

In mysql this is pretty quick: they are stored as integers, and the query
basically looks like (say ip is the ip we want to find the all the ranges
for):

select range from rangelist where ip_start=ip and ip_end=ip;

I tried to move this schema to Cassandra, but it turned out to be very
slow, even with indexes on both columns. Since I also had to have an EQ
expression in the query, I added an indexed text field which was the same
for all rows, so the query in cassandra was something like this:

select range from rangelist where type='ip' and ip_start=ip and ip_end=ip;

This was very slow, and I imagine it is because it has to scan through all
the rows, making the index useless.

The second thing I tried was to just expand the ranges and store individual
IPs as the keys to a column family. This is very fast to query, but the
problem is that I now have over 2.7 million rows, because some of the
ranges are quite large.

As the number of ranges could change, this method could be a problem -
imagine we add a whole A-class range, it would explode into millions of
rows.

My question is, is there a more sane way to store this information, while
still being able to find all the IP ranges that have the given IP in them?

I've been only dealing with Cassandra for a week or two, so I don't know
about the inner details of what can be done, but I do have programming
experience and am not afraid to get my hands dirty, in case it can be
solved by writing some extension to Cassandra.

Looking forward to any suggestions.

Thanks,
Tamas


Re: Storing and querying IP ranges in Cassandra

2011-11-01 Thread Zach Richardson
How many total ranges to you expect to have long term?

On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote:
 Hello,

 I'm new to the list and also to Cassandra. I found it when I was searching
 for something to replace our busy mysql server.

 One of the things we use the server for is filtering IPs based on a list of
 IP ranges. These ranges can be small and big, and there are about 50k of
 them in the database.

 In mysql this is pretty quick: they are stored as integers, and the query
 basically looks like (say ip is the ip we want to find the all the ranges
 for):

 select range from rangelist where ip_start=ip and ip_end=ip;

 I tried to move this schema to Cassandra, but it turned out to be very slow,
 even with indexes on both columns. Since I also had to have an EQ expression
 in the query, I added an indexed text field which was the same for all rows,
 so the query in cassandra was something like this:

 select range from rangelist where type='ip' and ip_start=ip and ip_end=ip;

 This was very slow, and I imagine it is because it has to scan through all
 the rows, making the index useless.

 The second thing I tried was to just expand the ranges and store individual
 IPs as the keys to a column family. This is very fast to query, but the
 problem is that I now have over 2.7 million rows, because some of the ranges
 are quite large.

 As the number of ranges could change, this method could be a problem -
 imagine we add a whole A-class range, it would explode into millions of
 rows.

 My question is, is there a more sane way to store this information, while
 still being able to find all the IP ranges that have the given IP in them?

 I've been only dealing with Cassandra for a week or two, so I don't know
 about the inner details of what can be done, but I do have programming
 experience and am not afraid to get my hands dirty, in case it can be solved
 by writing some extension to Cassandra.

 Looking forward to any suggestions.

 Thanks,
 Tamas




Re: Storing and querying IP ranges in Cassandra

2011-11-01 Thread Tamas Marki
It's unpredictable, there could be hundreds of thousands, even millions
(but unlikely).

Tamas

On Tue, Nov 1, 2011 at 5:38 PM, Zach Richardson j.zach.richard...@gmail.com
 wrote:

 How many total ranges to you expect to have long term?

 On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote:
  Hello,
 
  I'm new to the list and also to Cassandra. I found it when I was
 searching
  for something to replace our busy mysql server.
 
  One of the things we use the server for is filtering IPs based on a list
 of
  IP ranges. These ranges can be small and big, and there are about 50k of
  them in the database.
 
  In mysql this is pretty quick: they are stored as integers, and the query
  basically looks like (say ip is the ip we want to find the all the ranges
  for):
 
  select range from rangelist where ip_start=ip and ip_end=ip;
 
  I tried to move this schema to Cassandra, but it turned out to be very
 slow,
  even with indexes on both columns. Since I also had to have an EQ
 expression
  in the query, I added an indexed text field which was the same for all
 rows,
  so the query in cassandra was something like this:
 
  select range from rangelist where type='ip' and ip_start=ip and
 ip_end=ip;
 
  This was very slow, and I imagine it is because it has to scan through
 all
  the rows, making the index useless.
 
  The second thing I tried was to just expand the ranges and store
 individual
  IPs as the keys to a column family. This is very fast to query, but the
  problem is that I now have over 2.7 million rows, because some of the
 ranges
  are quite large.
 
  As the number of ranges could change, this method could be a problem -
  imagine we add a whole A-class range, it would explode into millions of
  rows.
 
  My question is, is there a more sane way to store this information, while
  still being able to find all the IP ranges that have the given IP in
 them?
 
  I've been only dealing with Cassandra for a week or two, so I don't know
  about the inner details of what can be done, but I do have programming
  experience and am not afraid to get my hands dirty, in case it can be
 solved
  by writing some extension to Cassandra.
 
  Looking forward to any suggestions.
 
  Thanks,
  Tamas
 
 




-- 
Tamas Marki


Re: Storing and querying IP ranges in Cassandra

2011-11-01 Thread Brandon Williams
On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote:
 Hello,

 I'm new to the list and also to Cassandra. I found it when I was searching
 for something to replace our busy mysql server.

 One of the things we use the server for is filtering IPs based on a list of
 IP ranges. These ranges can be small and big, and there are about 50k of
 them in the database.

 In mysql this is pretty quick: they are stored as integers, and the query
 basically looks like (say ip is the ip we want to find the all the ranges
 for):

 select range from rangelist where ip_start=ip and ip_end=ip;

 I tried to move this schema to Cassandra, but it turned out to be very slow,
 even with indexes on both columns. Since I also had to have an EQ expression
 in the query, I added an indexed text field which was the same for all rows,
 so the query in cassandra was something like this:

 select range from rangelist where type='ip' and ip_start=ip and ip_end=ip;

 This was very slow, and I imagine it is because it has to scan through all
 the rows, making the index useless.

This basically boils down to a binary search problem, so you don't
really need an index.  Assuming IPv4, what I would do is make the
first two bytes (class A and B, respectively) the row key. This will
give you 65025 rows, each with possibly 65025 columns (each column
name will be the other two bytes.)  When you need to find an ip, you
go to the row key and then slice the columns to find a match.  This
works well until you need to search an entire class A, in which case
you'll need to do 255 checks, but in parallel this won't be too bad,
especially because the bloom filter will save you on non-existent
rows.  Presumably there is no need to search all class As, unless for
some reason you don't know the first byte, which would be somewhat
strange.  If you do need to span a few class As this will begin to
fall apart, but hopefully that's not a common use case.

-Brandon


Re: Storing and querying IP ranges in Cassandra

2011-11-01 Thread Aaron Turner
On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki tma...@gmail.com wrote:
 Hello,

 I'm new to the list and also to Cassandra. I found it when I was searching
 for something to replace our busy mysql server.

 One of the things we use the server for is filtering IPs based on a list of
 IP ranges. These ranges can be small and big, and there are about 50k of
 them in the database.

 In mysql this is pretty quick: they are stored as integers, and the query
 basically looks like (say ip is the ip we want to find the all the ranges
 for):

 select range from rangelist where ip_start=ip and ip_end=ip;

 I tried to move this schema to Cassandra, but it turned out to be very slow,
 even with indexes on both columns. Since I also had to have an EQ expression
 in the query, I added an indexed text field which was the same for all rows,
 so the query in cassandra was something like this:

 select range from rangelist where type='ip' and ip_start=ip and ip_end=ip;

 This was very slow, and I imagine it is because it has to scan through all
 the rows, making the index useless.

 The second thing I tried was to just expand the ranges and store individual
 IPs as the keys to a column family. This is very fast to query, but the
 problem is that I now have over 2.7 million rows, because some of the ranges
 are quite large.

 As the number of ranges could change, this method could be a problem -
 imagine we add a whole A-class range, it would explode into millions of
 rows.

 My question is, is there a more sane way to store this information, while
 still being able to find all the IP ranges that have the given IP in them?

 I've been only dealing with Cassandra for a week or two, so I don't know
 about the inner details of what can be done, but I do have programming
 experience and am not afraid to get my hands dirty, in case it can be solved
 by writing some extension to Cassandra.

So, this is just off the top of my head, and I'm not an expert, but
perhaps this will give you some ideas:

I'm assuming you're talking IPv4.  If it's IPv6, you'll need to do some tweaking

Create a CF with RowKey of AsciiType and Column Names of LongType.
Values are BytesType.  Basically you create one Row for each /8 and
name the row with the network address of that row.  So you'd have:

1.0.0.0
2.0.0.0
3.0.0.0

etc as row keys

Then for every /16 under that, store a single name/value pair where
the name is the inet_aton encoded value of the IP address of the /16
network and the value is a bitmask representing the IP's of the /16.
Set 1 = filter, 0 = don't filter.  Basically the first bit would be
0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc.

So basically you'll have:
255 rows
each row with 255 columns
each column would store 8K bytes (since 2^16/8 = 8K)

Alternatively, you could store 16K columns per row (each column is a
/24) and each column would have 8 bytes.  Off the top of my head I'm
not sure which would be faster, but the first solution would be more
disk space efficient.  If you need to update your bitmasks regularly,
you're probably better off with the second solution.

Wrap a little API around this and you have fast and direct access to
know if a given IP should be filtered or not.

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix  Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
carpe diem quam minimum credula postero