Re: Storing large files for later processing through hadoop
> 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations : a single column value may not be larger than 2GB; in practice, single digits of MB is a more reasonable limit, since there is no streaming or random access of blob values. CASSANDRA-16 only covers pushing those objects through compaction. Getting the objects in and out of the heap during normal requests is still a problem. You could manually chunk them down to 64MB pieces.

> 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the file from cassandra to HDFS when I want to process it in hadoop cluster?

We keep HDFS as a volatile filesystem, simply for hadoop internals: no need for backups of it, no need to upgrade its data, and we're free to wipe it whenever hadoop has been stopped. Otherwise all our hadoop jobs still read from and write to Cassandra. Cassandra is our big data platform, with hadoop/spark just providing additional aggregation abilities. I think this is the effective approach, rather than trying to completely gut out HDFS. There was a DataStax project for replacing HDFS with Cassandra, but I don't think it's alive anymore.

~mck
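To make the chunking suggestion above concrete, here is a minimal write-side sketch (not from the original thread): it assumes a hypothetical files.blobs table of (file_id, chunk_idx, data) and the DataStax Java driver; every name in it is illustrative.

```java
// Illustrative write-side chunker. Assumes a hypothetical table:
//   CREATE TABLE files.blobs (
//     file_id text, chunk_idx int, data blob,
//     PRIMARY KEY (file_id, chunk_idx));
// and the DataStax Java driver on the classpath. args[0] doubles as the
// local path and the file_id, purely to keep the sketch short.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class BlobChunker {
    // Well under the 64MB ceiling; the limitations page suggests
    // single-digit-MB values behave much better in practice.
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    public static void main(String[] args) throws IOException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("files");
        PreparedStatement insert = session.prepare(
                "INSERT INTO blobs (file_id, chunk_idx, data) VALUES (?, ?, ?)");
        try (FileInputStream in = new FileInputStream(args[0])) {
            byte[] buf = new byte[CHUNK_SIZE];
            int idx = 0;
            int read;
            while ((read = in.read(buf)) > 0) {
                // Wrap only the bytes actually read so the last chunk isn't padded.
                session.execute(insert.bind(args[0], idx++, ByteBuffer.wrap(buf, 0, read)));
                buf = new byte[CHUNK_SIZE]; // fresh buffer: wrap() does not copy
            }
        }
        cluster.close();
    }
}
```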
Re: Storing large files for later processing through hadoop
On Fri, Jan 2, 2015 at 5:54 PM, mck m...@apache.org wrote:
> You could manually chunk them down to 64MB pieces.

Can this split and combine be done automatically by cassandra when inserting/fetching the file, without the application being bothered about it?

> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the file from cassandra to HDFS when I want to process it in hadoop cluster?
> We keep HDFS as a volatile filesystem simply for hadoop internals. No need for backups of it, no need to upgrade data, and we're free to wipe it whenever hadoop has been stopped. ~mck

Since the hadoop MR streaming job requires the file to be processed to be present in HDFS, I was thinking whether it can get it directly from mongodb instead of me manually fetching it and placing it in a directory before submitting the hadoop job?

> There was a datastax project before in being able to replace HDFS with Cassandra, but i don't think it's alive anymore.

I think you are referring to the Brisk project (http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/) but I don't know its current status.

Can I use http://gerrymcnicol.azurewebsites.net/ for my task at hand?

Regards,
Seenu.
Storing large files for later processing through hadoop
Hi All,

The problem I am trying to address is: store the raw files (the files are in xml format and around 700MB in size) in cassandra, later fetch them and process them in a hadoop cluster, and populate the processed data back into cassandra. Regarding this, I wanted a few clarifications:

1) The FAQ (https://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage) says that I can have only files of around 64 MB, but at the same time talks about a jira issue, https://issues.apache.org/jira/browse/CASSANDRA-16, which was solved back in version 0.6. So, in the present version of cassandra (2.0.11), is there any limit on the size of a file in a column, and if so, what is it?

2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the file from cassandra to HDFS when I want to process it in the hadoop cluster?

Regards,
Seenu.
Re: Number of SSTables grows after repair
On Mon, Dec 15, 2014 at 1:51 AM, Michał Łowicki mlowi...@gmail.com wrote:
> We've noticed that the number of SSTables grows radically after running *repair*. What we did today was compact everything, so that for each node the number of SSTables was 10. After repair it jumped to ~1600 on each node. What is interesting is that many are very small in size; the smallest ones are ~60 bytes (http://paste.ofcode.org/6yyH2X52emPNrKdw3WXW3d)

This is semi-expected if using vnodes. There are various tickets open to address aspects of this issue.

> Table information - http://paste.ofcode.org/32RijfxQkNeb9cx9GAAnM45
> We're using Cassandra 2.1.2.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

=Rob
Re: Storing large files for later processing through hadoop
Hi,

perhaps I totally misunderstood your problem, but why bother with cassandra for the storing in the first place? If your MR job is only run once for each file (as you wrote above), why not copy the data directly to hdfs, run your MR job, and use cassandra as the sink?

As hdfs and yarn are more or less completely independent, you could perhaps use the master as ResourceManager (yarn) AND NameNode and DataNode (hdfs), launch your MR job directly, and as mentioned use Cassandra as the sink for the reduced data. This way you won't need dedicated hardware, as you only need the hdfs once: process the files and delete them afterwards.

Best wishes,
Wilm
Re: is primary key( foo, bar ) the same as primary key ( foo ) with a 'set' of bars?
On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan doanduy...@gmail.com wrote:
> 2) collections and maps are loaded entirely by Cassandra for each query, whereas with clustering columns you can select a slice of columns

And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk.

=Rob
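A quick sketch of the read-side distinction DuyHai describes, with entirely hypothetical schemas and names (note that the "stored entirely for each UPDATE" claim is corrected later in this thread): with clustering columns the server returns only a slice of bars, while a set column always comes back whole.

```java
// Hypothetical schemas for the two models being compared:
//   CREATE TABLE ks.foo_bars (foo text, bar text, PRIMARY KEY (foo, bar));
//   CREATE TABLE ks.foo_set  (foo text PRIMARY KEY, bars set<text>);
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SliceVsSet {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Clustering columns: the server reads and returns only the slice.
        for (Row row : session.execute(
                "SELECT bar FROM ks.foo_bars WHERE foo = ? AND bar >= ? AND bar < ?",
                "k1", "a", "m")) {
            System.out.println(row.getString("bar"));
        }

        // Collection column: the whole set is always returned for the row.
        Row row = session.execute(
                "SELECT bars FROM ks.foo_set WHERE foo = ?", "k1").one();
        System.out.println(row.getSet("bars", String.class));

        cluster.close();
    }
}
```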
Re: Storing large files for later processing through hadoop
> Since the hadoop MR streaming job requires the file to be processed to be present in HDFS, I was thinking whether it can get it directly from mongodb instead of me manually fetching it and placing it in a directory before submitting the hadoop job?

Hadoop M/R can get data directly from Cassandra. See CqlInputFormat.

~mck
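A rough sketch of the job wiring mck is pointing at. Hedged: the class and method names below follow the Hadoop support classes shipped in cassandra-all around 2.0/2.1 (CqlInputFormat lives in org.apache.cassandra.hadoop.cql3), so treat the exact names and values as assumptions to verify against your version.

```java
// Hedged sketch of configuring a Hadoop M/R job to read from Cassandra.
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-from-cassandra");
        job.setJarByClass(CassandraInputJob.class);
        job.setInputFormatClass(CqlInputFormat.class);

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "files", "blobs"); // keyspace, table
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");

        // job.setMapperClass(...), output formats, etc. go here as usual.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```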
Re: Best Time Series insert strategy
On Tue, Dec 16, 2014 at 1:16 PM, Arne Claassen a...@emotient.com wrote:
> 3) Go to consistency ANY.

Consistency level ANY should probably be renamed to NEVER and removed from the software. It is almost never the correct solution to any problem.

=Rob
sstable structure
Hi, for some time I have been trying to find the structure of an SSTable. Is it documented somewhere, or can anyone explain it to me? I am speaking about the hex dump of the bytes stored on the disk.

Nick.
Re: Storing large files for later processing through hadoop
If it's for auditing, I'd recommend pushing the files out somewhere reasonably external; Amazon S3 works well for this type of thing, and you don't have to worry too much about backups and the like.

On 3 Jan 2015, at 5:07 pm, Srinivasa T N seen...@gmail.com wrote:
> Hi Wilm,
> The reason is that for some auditing purpose, I want to store the original files also.
> Regards,
> Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wilm.schumac...@gmail.com wrote:
> perhaps I totally misunderstood your problem, but why bother with cassandra for storing in the first place? [...]
Re: is primary key( foo, bar ) the same as primary key ( foo ) with a 'set' of bars?
> And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk.

Is this true? I thought updates (adds, removes, but not overwrites) affected just the indicated columns. Isn't it just the reads that involve reading the entire collection? The DS docs talk about reading whole collections, but I don't see anything about having to overwrite the entire collection each time. That would imply a read-then-write style operation, which is antipatterny.

  "When you query a table containing a collection, Cassandra retrieves the collection in its entirety"
  http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_set_t.html

On Fri, Jan 2, 2015 at 11:48 AM, Robert Coli rc...@eventbrite.com wrote:
> On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan doanduy...@gmail.com wrote:
> > 2) collections and maps are loaded entirely by Cassandra for each query, whereas with clustering columns you can select a slice of columns
> And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk.
> =Rob
Re: STCS limitation with JBOD?
Forcing a major compaction is usually a bad idea. What is your reason for doing that?

--
Colin Clark
+1-320-221-9531

On Jan 2, 2015, at 1:17 PM, Dan Kinder dkin...@turnitin.com wrote:
> Forcing a major compaction (using nodetool compact) with STCS will result in a single sstable (ignoring repair data). However this seems like it could be a problem for large JBOD setups. [...]
STCS limitation with JBOD?
Hi,

Forcing a major compaction (using nodetool compact, http://datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCompact.html) with STCS will result in a single sstable (ignoring repair data). However, this seems like it could be a problem for large JBOD setups. For example, if I have 12 disks of 1T each, then it seems that on this node I cannot have one column family store more than 1T worth of data (more or less), because all the data will end up in a single sstable that can exist only on one disk. Is this accurate?

The compaction write path docs (http://datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_write_path_c.html) give a bit of hope that cassandra could split the one final sstable across the disks, but I doubt it is able to and want to confirm.

I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in JBOD mode could be solutions to this (with their own problems), but I want to verify that this actually is a problem.

-dan
Re: Storing large files for later processing through hadoop
Hi Wilm,

The reason is that for some auditing purpose, I want to store the original files also.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wilm.schumac...@gmail.com wrote:
> perhaps I totally misunderstood your problem, but why bother with cassandra for storing in the first place? If your MR for hadoop is only run once for each file (as you wrote above), why not copy the data directly to hdfs, run your MR job and use cassandra as sink? [...]
Re: is primary key( foo, bar ) the same as primary key ( foo ) with a 'set' of bars?
On Fri, Jan 2, 2015 at 1:13 PM, Eric Stevens migh...@gmail.com wrote:
> > And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk.
> Is this true? I thought updates (adds, removes, but not overwrites) affected just the indicated columns. Isn't it just the reads that involve reading the entire collection?

This is not true (with one minor exception). All operations on sets and maps require no reads. The same is true for appends and prepends on lists, but delete and set operations on lists with (non-zero) indexes require the list to be read first. However, the entire list does not need to be re-written to disk.

--
Tyler Hobbs
DataStax http://datastax.com/
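To make Tyler's breakdown concrete, an illustrative sketch (hypothetical table t with a set, a map, and a list column) marking which writes are read-free and which force the internal list read.

```java
// Hypothetical table:
//   CREATE TABLE ks.t (id int PRIMARY KEY, tags set<text>,
//                      attrs map<text,text>, events list<text>);
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Set and map operations: no internal read before the write.
        session.execute("UPDATE t SET tags = tags + {'new'} WHERE id = 1");
        session.execute("UPDATE t SET attrs['color'] = 'red' WHERE id = 1");

        // List append and prepend: also read-free.
        session.execute("UPDATE t SET events = events + ['e1'] WHERE id = 1");
        session.execute("UPDATE t SET events = ['e0'] + events WHERE id = 1");

        // Set-by-index and delete-by-index: the one exception. Cassandra
        // must read the list to resolve the index, though only that element
        // is rewritten, not the whole list.
        session.execute("UPDATE t SET events[2] = 'fixed' WHERE id = 1");
        session.execute("DELETE events[2] FROM t WHERE id = 1");

        cluster.close();
    }
}
```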
Re: is primary key( foo, bar ) the same as primary key ( foo ) with a 'set' of bars?
On Fri, Jan 2, 2015 at 11:35 AM, Tyler Hobbs ty...@datastax.com wrote:
> This is not true (with one minor exception). All operations on sets and maps require no reads. The same is true for appends and prepends on lists, but delete and set operations on lists with (non-zero) indexes require the list to be read first. However, the entire list does not need to be re-written to disk.

Thank you guys for the correction; a case where I am glad to be wrong. I must have been thinking about the delete/set operations and drawn an erroneous inference. :)

=Rob
Re: STCS limitation with JBOD?
On Fri, Jan 2, 2015 at 11:28 AM, Colin co...@clark.ws wrote:
> Forcing a major compaction is usually a bad idea. What is your reason for doing that?

I'd say "often" rather than "usually". Lots of people have schemas where they create way too much garbage, and major compaction can be a good response. The docs' historic incoherent FUD notwithstanding.

=Rob
Re: Tombstones without DELETE
No worries! They're a data type that was introduced in 1.2:
http://www.datastax.com/dev/blog/cql3_collections

On Fri, Jan 2, 2015 at 12:07 PM, Nikolay Mihaylov n...@nmmm.nu wrote:
> Hi Tyler, sorry for a very stupid question - what is a collection?
> Nick

--
Tyler Hobbs
DataStax http://datastax.com/
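For anyone collecting this thread's tombstone cases in one place, an illustrative sketch (hypothetical table and names) of collection writes that do and don't produce tombstones, per Tyler's point above.

```java
// Hypothetical table: CREATE TABLE ks.t (id int PRIMARY KEY, tags set<text>);
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionTombstones {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Overwriting the whole collection: Cassandra first writes a
        // tombstone to shadow any previous elements, then the new contents.
        session.execute("UPDATE t SET tags = {'a', 'b'} WHERE id = 1");

        // Adding elements to the existing collection: no tombstone.
        session.execute("UPDATE t SET tags = tags + {'c'} WHERE id = 1");

        cluster.close();
    }
}
```

The other cases named in the thread remain: DELETE, setting a column to null, leaving values unbound in prepared statements, and expiring TTLs.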
Re: Storing large files for later processing through hadoop
> Can this split and combine be done automatically by cassandra when inserting/fetching the file without the application being bothered about it?

There are client libraries which offer recipes for this, but in general, no. You're trying to do something with Cassandra that it's not designed to do. You can get there from here, but you're not going to have a good time. If you need a document store, you should use a NoSQL solution designed with that in mind (Cassandra is a columnar store). If you need a distributed filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you should definitely not do it on the same cluster that handles normal clients, as the kind of workload you'd be subjecting this cluster to is going to cause all sorts of trouble for normal clients, particularly with respect to GC pressure, compaction and streaming problems, and many other consequences of vastly exceeding recommended limits.

On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N seen...@gmail.com wrote:
> Can this split and combine be done automatically by cassandra when inserting/fetching the file without application being bothered about it? [...]
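As a companion to the write-side chunker earlier in this digest, a read-side sketch against the same hypothetical files.blobs table: chunks come back ordered by the clustering column, and a small fetch size keeps only a few multi-MB chunks on the heap at a time.

```java
// Read-side counterpart to the write-side chunker (same hypothetical
// files.blobs table); args[0] is the file_id, args[1] the output path.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

import java.io.FileOutputStream;
import java.io.IOException;

public class BlobReassembler {
    public static void main(String[] args) throws IOException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("files");

        // Chunks arrive ordered by chunk_idx (the clustering column); the
        // driver pages through them rather than loading the whole blob.
        Statement select = new SimpleStatement(
                "SELECT data FROM blobs WHERE file_id = ?", args[0]);
        select.setFetchSize(2);

        try (FileOutputStream out = new FileOutputStream(args[1])) {
            for (Row row : session.execute(select)) {
                out.getChannel().write(row.getBytes("data"));
            }
        }
        cluster.close();
    }
}
```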
Re: Tombstones without DELETE
Hi Tyler, sorry for a very stupid question - what is a collection?

Nick

On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs ty...@datastax.com wrote:
> Overwriting an entire collection also results in a tombstone being inserted.

On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla rsvi...@datastax.com wrote:
> You should probably ask on the Cassandra user mailing list. However, TTL is the only other case I can think of.

On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino i...@daddye.it wrote:
> Following this: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ
>
> Under what conditions does Cassandra generate a tombstone? Basically I have a not-even-big table in cassandra (90M rows); in my code there is no delete, and I use prepared statements (binding all necessary values). I'm aware that a tombstone gets created when:
>
> 1. You delete the row
> 2. You set a column to null while previously it had a value
> 3. You use prepared statements and don't bind all the values
>
> Anything else that I should be aware of? Thanks!
Re: Storing large files for later processing through hadoop
I agree that cassandra is a columnar store. The storing of the raw xml file, parsing it using hadoop, and then storing the extracted values happens only once. The extracted data, on which further operations will be done, suits well the timeseries storage provided by cassandra, and that is the reason I am trying to get things done for which it is not designed.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 10:42 PM, Eric Stevens migh...@gmail.com wrote:
> There are client libraries which offer recipes for this, but in general, no. You're trying to do something with Cassandra that it's not designed to do. [...]
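Since the thread ends on the time-series plan, here is a sketch of what that extracted-data model often looks like; the schema, keyspace, and names are entirely hypothetical, bucketing by day so partitions stay bounded.

```java
// Hypothetical schema for the extracted values:
//   CREATE TABLE metrics.readings (source text, day text, ts timestamp,
//                                  value double,
//                                  PRIMARY KEY ((source, day), ts));
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Date;

public class TimeSeriesSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("metrics");

        // One extracted data point lands in its (source, day) bucket.
        session.execute(
                "INSERT INTO readings (source, day, ts, value) VALUES (?, ?, ?, ?)",
                "sensor-1", "2015-01-02", new Date(), 42.0);

        // A time slice within the bucket is a contiguous read thanks to the
        // ts clustering column. (getDate is the driver-2.x getter for
        // timestamps; later drivers renamed it getTimestamp.)
        for (Row r : session.execute(
                "SELECT ts, value FROM readings WHERE source = ? AND day = ? AND ts >= ?",
                "sensor-1", "2015-01-02", new Date(0))) {
            System.out.println(r.getDate("ts") + " " + r.getDouble("value"));
        }

        cluster.close();
    }
}
```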