Re: Storing and querying IP ranges in Cassandra
On Tue, Nov 1, 2011 at 6:30 PM, Brandon Williams dri...@gmail.com wrote: On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. This basically boils down to a binary search problem, so you don't really need an index. Assuming IPv4, what I would do is make the first two bytes (class A and B, respectively) the row key. This will give you 65025 rows, each with possibly 65025 columns (each column name will be the other two bytes.) When you need to find an ip, you go to the row key and then slice the columns to find a match. This works well until you need to search an entire class A, in which case you'll need to do 255 checks, but in parallel this won't be too bad, especially because the bloom filter will save you on non-existent rows. Presumably there is no need to search all class As, unless for some reason you don't know the first byte, which would be somewhat strange. If you do need to span a few class As this will begin to fall apart, but hopefully that's not a common use case. Thans, this sounds like a good plan, as I also need to store some info about the ranges, and this way the values can store that. I'll try to implement this and will get back to you on how it performs. -- Tamas Marki
Re: Storing and querying IP ranges in Cassandra
On Tue, Nov 1, 2011 at 7:28 PM, Aaron Turner synfina...@gmail.com wrote: On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. The second thing I tried was to just expand the ranges and store individual IPs as the keys to a column family. This is very fast to query, but the problem is that I now have over 2.7 million rows, because some of the ranges are quite large. As the number of ranges could change, this method could be a problem - imagine we add a whole A-class range, it would explode into millions of rows. My question is, is there a more sane way to store this information, while still being able to find all the IP ranges that have the given IP in them? I've been only dealing with Cassandra for a week or two, so I don't know about the inner details of what can be done, but I do have programming experience and am not afraid to get my hands dirty, in case it can be solved by writing some extension to Cassandra. So, this is just off the top of my head, and I'm not an expert, but perhaps this will give you some ideas: I'm assuming you're talking IPv4. If it's IPv6, you'll need to do some tweaking Create a CF with RowKey of AsciiType and Column Names of LongType. Values are BytesType. Basically you create one Row for each /8 and name the row with the network address of that row. So you'd have: 1.0.0.0 2.0.0.0 3.0.0.0 etc as row keys Then for every /16 under that, store a single name/value pair where the name is the inet_aton encoded value of the IP address of the /16 network and the value is a bitmask representing the IP's of the /16. Set 1 = filter, 0 = don't filter. Basically the first bit would be 0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc. So basically you'll have: 255 rows each row with 255 columns each column would store 8K bytes (since 2^16/8 = 8K) Alternatively, you could store 16K columns per row (each column is a /24) and each column would have 8 bytes. Off the top of my head I'm not sure which would be faster, but the first solution would be more disk space efficient. If you need to update your bitmasks regularly, you're probably better off with the second solution. Wrap a little API around this and you have fast and direct access to know if a given IP should be filtered or not. Thanks, these are also good suggestions, but I'll go with Brandon's suggestion first (as it seems to be more suitable for my scenario - I not only need to know if the IP needs to be filtered, but also what range was it found in. -- Tamas Marki
Storing and querying IP ranges in Cassandra
Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. The second thing I tried was to just expand the ranges and store individual IPs as the keys to a column family. This is very fast to query, but the problem is that I now have over 2.7 million rows, because some of the ranges are quite large. As the number of ranges could change, this method could be a problem - imagine we add a whole A-class range, it would explode into millions of rows. My question is, is there a more sane way to store this information, while still being able to find all the IP ranges that have the given IP in them? I've been only dealing with Cassandra for a week or two, so I don't know about the inner details of what can be done, but I do have programming experience and am not afraid to get my hands dirty, in case it can be solved by writing some extension to Cassandra. Looking forward to any suggestions. Thanks, Tamas
Re: Storing and querying IP ranges in Cassandra
How many total ranges to you expect to have long term? On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. The second thing I tried was to just expand the ranges and store individual IPs as the keys to a column family. This is very fast to query, but the problem is that I now have over 2.7 million rows, because some of the ranges are quite large. As the number of ranges could change, this method could be a problem - imagine we add a whole A-class range, it would explode into millions of rows. My question is, is there a more sane way to store this information, while still being able to find all the IP ranges that have the given IP in them? I've been only dealing with Cassandra for a week or two, so I don't know about the inner details of what can be done, but I do have programming experience and am not afraid to get my hands dirty, in case it can be solved by writing some extension to Cassandra. Looking forward to any suggestions. Thanks, Tamas
Re: Storing and querying IP ranges in Cassandra
It's unpredictable, there could be hundreds of thousands, even millions (but unlikely). Tamas On Tue, Nov 1, 2011 at 5:38 PM, Zach Richardson j.zach.richard...@gmail.com wrote: How many total ranges to you expect to have long term? On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. The second thing I tried was to just expand the ranges and store individual IPs as the keys to a column family. This is very fast to query, but the problem is that I now have over 2.7 million rows, because some of the ranges are quite large. As the number of ranges could change, this method could be a problem - imagine we add a whole A-class range, it would explode into millions of rows. My question is, is there a more sane way to store this information, while still being able to find all the IP ranges that have the given IP in them? I've been only dealing with Cassandra for a week or two, so I don't know about the inner details of what can be done, but I do have programming experience and am not afraid to get my hands dirty, in case it can be solved by writing some extension to Cassandra. Looking forward to any suggestions. Thanks, Tamas -- Tamas Marki
Re: Storing and querying IP ranges in Cassandra
On Tue, Nov 1, 2011 at 11:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. This basically boils down to a binary search problem, so you don't really need an index. Assuming IPv4, what I would do is make the first two bytes (class A and B, respectively) the row key. This will give you 65025 rows, each with possibly 65025 columns (each column name will be the other two bytes.) When you need to find an ip, you go to the row key and then slice the columns to find a match. This works well until you need to search an entire class A, in which case you'll need to do 255 checks, but in parallel this won't be too bad, especially because the bloom filter will save you on non-existent rows. Presumably there is no need to search all class As, unless for some reason you don't know the first byte, which would be somewhat strange. If you do need to span a few class As this will begin to fall apart, but hopefully that's not a common use case. -Brandon
Re: Storing and querying IP ranges in Cassandra
On Tue, Nov 1, 2011 at 9:17 AM, Tamas Marki tma...@gmail.com wrote: Hello, I'm new to the list and also to Cassandra. I found it when I was searching for something to replace our busy mysql server. One of the things we use the server for is filtering IPs based on a list of IP ranges. These ranges can be small and big, and there are about 50k of them in the database. In mysql this is pretty quick: they are stored as integers, and the query basically looks like (say ip is the ip we want to find the all the ranges for): select range from rangelist where ip_start=ip and ip_end=ip; I tried to move this schema to Cassandra, but it turned out to be very slow, even with indexes on both columns. Since I also had to have an EQ expression in the query, I added an indexed text field which was the same for all rows, so the query in cassandra was something like this: select range from rangelist where type='ip' and ip_start=ip and ip_end=ip; This was very slow, and I imagine it is because it has to scan through all the rows, making the index useless. The second thing I tried was to just expand the ranges and store individual IPs as the keys to a column family. This is very fast to query, but the problem is that I now have over 2.7 million rows, because some of the ranges are quite large. As the number of ranges could change, this method could be a problem - imagine we add a whole A-class range, it would explode into millions of rows. My question is, is there a more sane way to store this information, while still being able to find all the IP ranges that have the given IP in them? I've been only dealing with Cassandra for a week or two, so I don't know about the inner details of what can be done, but I do have programming experience and am not afraid to get my hands dirty, in case it can be solved by writing some extension to Cassandra. So, this is just off the top of my head, and I'm not an expert, but perhaps this will give you some ideas: I'm assuming you're talking IPv4. If it's IPv6, you'll need to do some tweaking Create a CF with RowKey of AsciiType and Column Names of LongType. Values are BytesType. Basically you create one Row for each /8 and name the row with the network address of that row. So you'd have: 1.0.0.0 2.0.0.0 3.0.0.0 etc as row keys Then for every /16 under that, store a single name/value pair where the name is the inet_aton encoded value of the IP address of the /16 network and the value is a bitmask representing the IP's of the /16. Set 1 = filter, 0 = don't filter. Basically the first bit would be 0.1, 255th bit would be 0.255 and the 256th bit 1.0, etc. So basically you'll have: 255 rows each row with 255 columns each column would store 8K bytes (since 2^16/8 = 8K) Alternatively, you could store 16K columns per row (each column is a /24) and each column would have 8 bytes. Off the top of my head I'm not sure which would be faster, but the first solution would be more disk space efficient. If you need to update your bitmasks regularly, you're probably better off with the second solution. Wrap a little API around this and you have fast and direct access to know if a given IP should be filtered or not. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero