Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations
 A single column value may not be larger than 2GB; in practice, single
 digits of MB is a more reasonable limit, since there is no streaming
 or random access of blob values.

CASSANDRA-16  only covers pushing those objects through compaction.
Getting the objects in and out of the heap during normal requests is
still a problem.

You could manually chunk them down to 64 MB pieces.
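Roughly, a sketch with the DataStax Java driver (the keyspace/table, column
names and chunk size below are made up for illustration, not a recipe):

// Hypothetical table:
//   CREATE TABLE files.chunks (
//       file_id text, chunk_idx int, data blob,
//       PRIMARY KEY (file_id, chunk_idx));
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ChunkedUpload {
    // Chunk size is a choice; the FAQ quoted above suggests staying in the
    // single digits of MB per column value.
    static final int CHUNK_SIZE = 1 << 20; // 1 MB

    public static void main(String[] args) throws IOException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("files");
        PreparedStatement insert = session.prepare(
            "INSERT INTO chunks (file_id, chunk_idx, data) VALUES (?, ?, ?)");

        File f = new File(args[0]);
        try (FileInputStream in = new FileInputStream(f)) {
            byte[] buf = new byte[CHUNK_SIZE];
            int idx = 0;
            int read;
            while ((read = in.read(buf)) > 0) {
                // Copy only the bytes actually read (the last chunk is short).
                ByteBuffer data = ByteBuffer.wrap(Arrays.copyOf(buf, read));
                session.execute(insert.bind(f.getName(), idx++, data));
            }
        }
        cluster.close();
    }
}

Reading the file back is the reverse: select the chunks ordered by chunk_idx
and concatenate them.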


 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
 the file from cassandra to HDFS when I want to process it in hadoop cluster?


We keep HDFS as a volatile filesystem simply for hadoop internals. No
need for backups of it, no need to upgrade data, and we're free to wipe
it whenever hadoop has been stopped.

Otherwise all our hadoop jobs still read from and write to Cassandra.
Cassandra is our big data platform, with hadoop/spark just providing
additional aggregation abilities. I think this is the more effective approach,
rather than trying to completely gut out HDFS.

There was a datastax project a while back aimed at replacing HDFS with
Cassandra, but I don't think it's alive anymore.

~mck


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
On Fri, Jan 2, 2015 at 5:54 PM, mck m...@apache.org wrote:


 You could manually chunk them down to 64Mb pieces.

 Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?



  2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
  the file from cassandra to HDFS when I want to process it in hadoop
 cluster?


 We keep HDFS as a volatile filesystem simply for hadoop internals. No
 need for backups of it, no need to upgrade data, and we're free to wipe
 it whenever hadoop has been stopped.
 ~mck


Since the hadoop MR streaming job requires the file to be processed to be
present in HDFS, I was thinking whether it can get the file directly from
mongodb instead of me manually fetching it and placing it in a directory
before submitting the hadoop job?


 There was a datastax project before in being able to replace HDFS with
 Cassandra, but i don't think it's alive anymore.

I think you are referring to the Brisk project (
http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
but I don't know its current status.

Can I use http://gerrymcnicol.azurewebsites.net/ for the task at hand?

Regards,
Seenu.


Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi All,
   The problem I am trying to address is: store the raw files (the files are
in xml format and around 700 MB in size) in cassandra, later fetch them and
process them in a hadoop cluster, and populate the processed data back into
cassandra.  Regarding this, I wanted a few clarifications:

1) The FAQ (
https://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage) says
that I can only have files of around 64 MB, but at the same time it points to
the jira issue https://issues.apache.org/jira/browse/CASSANDRA-16,
which was resolved back in version 0.6.  So, in the present version of
cassandra (2.0.11), is there any limit on the size of a file in a column,
and if so, what is it?
2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the
file from cassandra to HDFS when I want to process it in hadoop cluster?

Regards,
Seenu.


Re: Number of SSTables grows after repair

2015-01-02 Thread Robert Coli
On Mon, Dec 15, 2014 at 1:51 AM, Michał Łowicki mlowi...@gmail.com wrote:

 We've noticed that number of SSTables grows radically after running
 *repair*. What we did today is to compact everything so for each node
 the number of SSTables was < 10. After repair it jumped to ~1600 on each node. What
 is interesting is that size of many is very small. The smallest ones are
 ~60 bytes in size (http://paste.ofcode.org/6yyH2X52emPNrKdw3WXW3d)


This is semi-expected if using vnodes. There are various tickets open to
address aspects of this issue.


 Table information - http://paste.ofcode.org/32RijfxQkNeb9cx9GAAnM45
 We're using Cassandra 2.1.2.


https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

=Rob


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Wilm Schumacher
Hi,

perhaps I totally misunderstood your problem, but why bother with
cassandra for storing in the first place?

If your MR for hadoop is only run once for each file (as you wrote
above), why not copy the data directly to hdfs, run your MR job and use
cassandra as sink?

As hdfs and yarn are more or less completely independent, you could
perhaps use the master as the ResourceManager (yarn) AND the NameNode and
DataNode (hdfs), launch your MR job directly, and, as mentioned, use
Cassandra as the sink for the reduced data. That way you won't need dedicated
hardware, as you only need the hdfs once: process the files and delete them
afterwards.

Best wishes,

Wilm


Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Robert Coli
On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan doanduy...@gmail.com wrote:

 2) collections and maps are loaded entirely by Cassandra for each query,
 whereas with clustering columns you can select a slice of columns


And also stored entirely for each UPDATE. Change one element, re-serialize
the whole thing to disk.

=Rob


Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
 Since the hadoop MR streaming job requires the file to be processed to be 
 present in HDFS,
  I was thinking whether can it get directly from mongodb instead of me 
 manually fetching it 
 and placing it in a directory before submitting the hadoop job?


Hadoop M/R can get data directly from Cassandra. See CqlInputFormat.
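Wiring that up looks roughly like the sketch below (loosely modelled on the
hadoop_cql3_word_count example shipped with the Cassandra source; keyspace and
table names are placeholders, and the helper methods vary a little between
Cassandra versions, so check the bundled example for your release):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-from-cassandra");
        job.setJarByClass(CassandraInputJob.class);

        // Read input splits straight out of Cassandra instead of HDFS.
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "files", "chunks");

        // Set mapper/reducer/output as usual; CqlOutputFormat can write the
        // results back into Cassandra. The mapper's key/value types depend on
        // the Cassandra version, so follow the bundled example for those.
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}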

~mck


Re: Best Time Series insert strategy

2015-01-02 Thread Robert Coli
On Tue, Dec 16, 2014 at 1:16 PM, Arne Claassen a...@emotient.com wrote:

 3) Go to consistency ANY.


Consistency level ANY should probably be renamed to NEVER and removed from
the software.

It is almost never the correct solution to any problem.

=Rob


sstable structure

2015-01-02 Thread Nikolay Mihaylov
Hi

For some time I have been trying to find the structure of an sstable. Is it
documented somewhere, or can anyone explain it to me?

I am speaking about the hex dump of the bytes stored on disk.

Nick.


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Jacob Rhoden
If it's for auditing, I'd recommend pushing the files out somewhere reasonably
external; Amazon S3 works well for this type of thing, and you don't have to
worry too much about backups and the like.

__
Sent from iPhone

 On 3 Jan 2015, at 5:07 pm, Srinivasa T N seen...@gmail.com wrote:
 
 Hi Wilm,
The reason is that for some auditing purpose, I want to store the original 
 files also.
 
 Regards,
 Seenu.
 
 On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wilm.schumac...@gmail.com 
 wrote:
 Hi,
 
 perhaps I totally misunderstood your problem, but why bother with
 cassandra for storing in the first place?
 
 If your MR for hadoop is only run once for each file (as you wrote
 above), why not copy the data directly to hdfs, run your MR job and use
 cassandra as sink?
 
 As hdfs and yarn are more or less completely independent you could
 perhaps use the master as ResourceManager (yarn) AND NameNode and
 DataNode (hdfs) and launch your MR job directly and as mentioned use
 Cassandra as sink for the reduced data. By this you won't need dedicated
 hardware, as you only need the hdfs once, process and delete the files
 afterwards.
 
 Best wishes,
 
 Wilm
 


Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Eric Stevens
 And also stored entirely for each UPDATE. Change one element,
re-serialize the whole thing to disk.

Is this true?  I thought updates (adds, removes, but not overwrites)
affected just the indicated columns.  Isn't it just the reads that involve
reading the entire collection?

DS docs talk about reading whole collections, but I don't see anything
about having to overwrite the entire collection each time.  That would
indicate a read then write style operation, which is antipatterny.

 When you query a table containing a collection, Cassandra retrieves the
collection in its entirety
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_set_t.html



On Fri, Jan 2, 2015 at 11:48 AM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan doanduy...@gmail.com wrote:

 2) collections and maps are loaded entirely by Cassandra for each query,
 whereas with clustering columns you can select a slice of columns


 And also stored entirely for each UPDATE. Change one element, re-serialize
 the whole thing to disk.

 =Rob



Re: STCS limitation with JBOD?

2015-01-02 Thread Colin
Forcing a major compaction is usually a bad idea.  What is your reason for 
doing that?

--
Colin Clark 
+1-320-221-9531
 

 On Jan 2, 2015, at 1:17 PM, Dan Kinder dkin...@turnitin.com wrote:
 
 Hi,
 
 Forcing a major compaction (using nodetool compact) with STCS will result in 
 a single sstable (ignoring repair data). However this seems like it could be 
 a problem for large JBOD setups. For example if I have 12 disks, 1T each, 
 then it seems like on this node I cannot have one column family store more 
 than 1T worth of data (more or less), because all the data will end up in a 
 single sstable that can exist only on one disk. Is this accurate? The 
 compaction write path docs give a bit of hope that cassandra could split the 
 one final sstable across the disks, but I doubt it is able to and want to 
 confirm.
 
 I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in 
 JBOD mode could be solutions to this (with their own problems), but want to 
 verify that this actually is a problem.
 
 -dan


STCS limitation with JBOD?

2015-01-02 Thread Dan Kinder
Hi,

Forcing a major compaction (using nodetool compact
http://datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCompact.html)
with STCS will result in a single sstable (ignoring repair data). However
this seems like it could be a problem for large JBOD setups. For example if
I have 12 disks, 1T each, then it seems like on this node I cannot have one
column family store more than 1T worth of data (more or less), because all
the data will end up in a single sstable that can exist only on one disk.
Is this accurate? The compaction write path docs
http://datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_write_path_c.html
give a bit of hope that cassandra could split the one final sstable across
the disks, but I doubt it is able to and want to confirm.

I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in
JBOD mode could be solutions to this (with their own problems), but want to
verify that this actually is a problem.

-dan


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi Wilm,
   The reason is that for some auditing purpose, I want to store the
original files also.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wilm.schumac...@gmail.com
wrote:

 Hi,

 perhaps I totally misunderstood your problem, but why bother with
 cassandra for storing in the first place?

 If your MR for hadoop is only run once for each file (as you wrote
 above), why not copy the data directly to hdfs, run your MR job and use
 cassandra as sink?

 As hdfs and yarn are more or less completely independent you could
 perhaps use the master as ResourceManager (yarn) AND NameNode and
 DataNode (hdfs) and launch your MR job directly and as mentioned use
 Cassandra as sink for the reduced data. By this you won't need dedicated
 hardware, as you only need the hdfs once, process and delete the files
 afterwards.

 Best wishes,

 Wilm



Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Tyler Hobbs
On Fri, Jan 2, 2015 at 1:13 PM, Eric Stevens migh...@gmail.com wrote:

  And also stored entirely for each UPDATE. Change one element,
 re-serialize the whole thing to disk.

 Is this true?  I thought updates (adds, removes, but not overwrites)
 affected just the indicated columns.  Isn't it just the reads that involve
 reading the entire collection?


This is not true (with one minor exception).  All operations on sets and
maps require no reads.  The same is true for appends and prepends on lists,
but delete and set operations on lists with (non-zero) indexes require the
list to be read first.  However, the entire list does not need to be
re-written to disk.
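A small sketch of the difference (keyspace/table/column names are hypothetical),
using the Java driver:

// Hypothetical table:
//   CREATE TABLE ks.users (id text PRIMARY KEY, tags set<text>, aliases list<text>);
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Set/map mutations and list appends/prepends are pure writes: no read.
        session.execute("UPDATE users SET tags = tags + {'vip'} WHERE id = 'u1'");
        session.execute("UPDATE users SET aliases = aliases + ['bob'] WHERE id = 'u1'");

        // Setting or deleting a list element by index makes Cassandra read the
        // list first, but the whole list is not rewritten on disk.
        session.execute("UPDATE users SET aliases[1] = 'robert' WHERE id = 'u1'");
        session.execute("DELETE aliases[1] FROM users WHERE id = 'u1'");

        cluster.close();
    }
}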

-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: is primary key( foo, bar) the same as primary key ( foo ) with a 'set' of bars?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:35 AM, Tyler Hobbs ty...@datastax.com wrote:


 This is not true (with one minor exception).  All operations on sets and
 maps require no reads.  The same is true for appends and prepends on lists,
 but delete and set operations on lists with (non-zero) indexes require the
 list to be read first.  However, the entire list does not need to be
 re-written to disk.


Thank you guys for the correction; a case where I am glad to be wrong. I
must have been thinking about the delete/set operations and have drawn an
erroneous inference. :)

=Rob


Re: STCS limitation with JBOD?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:28 AM, Colin co...@clark.ws wrote:

 Forcing a major compaction is usually a bad idea.  What is your reason for
 doing that?


I'd say often, not usually. Lots of people have schemas where they
create way too much garbage, and major compaction can be a good response.
The docs' historic incoherent FUD notwithstanding.

=Rob


Re: Tombstones without DELETE

2015-01-02 Thread Tyler Hobbs
No worries!  They're a data type that was introduced in 1.2:
http://www.datastax.com/dev/blog/cql3_collections
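On the overwrite case quoted further down: assigning a whole collection shadows
the previous contents with a tombstone, while adding elements does not. A tiny
sketch with hypothetical names, using the Java driver:

// Hypothetical table:
//   CREATE TABLE ks.users (id text PRIMARY KEY, tags set<text>);
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionOverwrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("ks");

        // Whole-collection assignment: the old contents are deleted (tombstoned)
        // before the new elements are written.
        session.execute("UPDATE users SET tags = {'a', 'b'} WHERE id = 'u1'");

        // Element-wise addition: no tombstone involved.
        session.execute("UPDATE users SET tags = tags + {'c'} WHERE id = 'u1'");

        cluster.close();
    }
}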

On Fri, Jan 2, 2015 at 12:07 PM, Nikolay Mihaylov n...@nmmm.nu wrote:

 Hi Tyler,

 sorry for very stupid question - what is a collection ?

 Nick

 On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs ty...@datastax.com wrote:

 Overwriting an entire collection also results in a tombstone being
 inserted.

 On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 You should probably ask on the Cassandra user mailling list.

 However, TTL is the only other case I can think of.

 On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino i...@daddye.it
 wrote:

 Hi there,

 Following this:
 https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ

 Under what conditions Cassandra generates a tombstone?

 Basically I have not even big table on cassandra (90M rows) in my code
 there is no delete and I use prepared statements (but binding all necessary
 values).

 I'm aware that a tombstone gets created when:

 1. You delete the row
 2. You set a column to null while previously it had a value
 3. When you use prepared statements and you don't bind all the values

 Anything else that I should be aware of?

 Thanks!

 To unsubscribe from this group and stop receiving emails from it, send
 an email to java-driver-user+unsubscr...@lists.datastax.com.




 --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect

 [image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 Datastax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the
 database technology and transactional backbone of choice for the worlds
 most innovative companies such as Netflix, Adobe, Intuit, and eBay.




 --
 Tyler Hobbs
 DataStax http://datastax.com/





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Eric Stevens
 Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?

There are client libraries which offer recipes for this, but in general,
no.

You're trying to do something with Cassandra that it's not designed to do.
You can get there from here, but you're not going to have a good time.  If
you need a document store, you should use a NoSQL solution designed with
that in mind (Cassandra is a columnar store).  If you need a distributed
filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you
should definitely not do it on the same cluster that handles normal clients,
as the kind of workload you'd be subjecting this cluster to is going to
cause all sorts of trouble for normal clients, particularly with respect
to GC pressure, compaction and streaming problems, and many other
consequences of vastly exceeding recommended limits.

On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N seen...@gmail.com wrote:



 On Fri, Jan 2, 2015 at 5:54 PM, mck m...@apache.org wrote:


 You could manually chunk them down to 64Mb pieces.

 Can this split and combine be done automatically by cassandra when
 inserting/fetching the file without application being bothered about it?



  2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
  the file from cassandra to HDFS when I want to process it in hadoop
 cluster?


  We keep HDFS as a volatile filesystem simply for hadoop internals. No
 need for backups of it, no need to upgrade data, and we're free to wipe
 it whenever hadoop has been stopped.
 ~mck


 Since the hadoop MR streaming job requires the file to be processed to be
 present in HDFS, I was thinking whether can it get directly from mongodb
 instead of me manually fetching it and placing it in a directory before
 submitting the hadoop job?


  There was a datastax project before in being able to replace HDFS with
  Cassandra, but i don't think it's alive anymore.

 I think you are referring to Brisk project (
 http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
 but I don't know its current status.

 Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?

 Regards,
 Seenu.



Re: Tombstones without DELETE

2015-01-02 Thread Nikolay Mihaylov
Hi Tyler,

sorry for the very stupid question - what is a collection?

Nick

On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs ty...@datastax.com wrote:

 Overwriting an entire collection also results in a tombstone being
 inserted.

 On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla rsvi...@datastax.com wrote:

 You should probably ask on the Cassandra user mailling list.

 However, TTL is the only other case I can think of.

 On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino i...@daddye.it
 wrote:

 Hi there,

 Following this:
 https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ

 Under what conditions Cassandra generates a tombstone?

 Basically I have not even big table on cassandra (90M rows) in my code
 there is no delete and I use prepared statements (but binding all necessary
 values).

 I'm aware that a tombstone gets created when:

 1. You delete the row
 2. You set a column to null while previously it had a value
 3. When you use prepared statements and you don't bind all the values

 Anything else that I should be aware of?

 Thanks!

 To unsubscribe from this group and stop receiving emails from it, send
 an email to java-driver-user+unsubscr...@lists.datastax.com.




 --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect

 [image: twitter.png] https://twitter.com/foundev [image: linkedin.png]
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 Datastax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the
 database technology and transactional backbone of choice for the worlds
 most innovative companies such as Netflix, Adobe, Intuit, and eBay.




 --
 Tyler Hobbs
 DataStax http://datastax.com/



Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
I agree that cassandra is a columnar store.  Storing the raw xml
file, parsing it using hadoop and then storing the extracted values happens
only once.  The extracted data, on which further operations will be done,
fits well with the timeseries storage provided by cassandra, and that is the
reason I am trying to get it to do something it is not designed for.

Regards,
Seenu.



On Fri, Jan 2, 2015 at 10:42 PM, Eric Stevens migh...@gmail.com wrote:

  Can this split and combine be done automatically by cassandra when
 inserting/fetching the file without application being bothered about it?

 There are client libraries which offer recipes for this, but in general,
 no.

 You're trying to do something with Cassandra that it's not designed to
 do.  You can get there from here, but you're not going to have a good
 time.  If you need a document store, you should use a NoSQL solution
 designed with that in mind (Cassandra is a columnar store).  If you need a
 distributed filesystem, you should use one of those.

 If you do want to continue forward and do this with Cassandra, then you
 should definitely not do this on the same cluster as handles normal clients
 as the kind of workload you'd be subjecting this cluster to is going to
 cause all sorts of troubles for normal clients, particularly with respect
 to GC pressure, compaction and streaming problems, and many other
 consequences of vastly exceeding recommended limits.

 On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N seen...@gmail.com wrote:



 On Fri, Jan 2, 2015 at 5:54 PM, mck m...@apache.org wrote:


 You could manually chunk them down to 64Mb pieces.

 Can this split and combine be done automatically by cassandra when
 inserting/fetching the file without application being bothered about it?



  2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
  the file from cassandra to HDFS when I want to process it in hadoop
 cluster?


  We keep HDFS as a volatile filesystem simply for hadoop internals. No
 need for backups of it, no need to upgrade data, and we're free to wipe
 it whenever hadoop has been stopped.
 ~mck


 Since the hadoop MR streaming job requires the file to be processed to be
 present in HDFS, I was thinking whether can it get directly from mongodb
 instead of me manually fetching it and placing it in a directory before
 submitting the hadoop job?


  There was a datastax project before in being able to replace HDFS with
  Cassandra, but i don't think it's alive anymore.

 I think you are referring to Brisk project (
 http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
 but I don't know its current status.

 Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?

 Regards,
 Seenu.