[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: https://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=57&rev2=58

Comment: Updating and clarifying some of the troubleshooting information.

== Troubleshooting ==
If you are running into timeout exceptions, you might need to tweak one or both of these settings:
- * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
- * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml`. The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce the chances of timing out.
+ * Each input split is divided into sequential batches of rows that are requested from Cassandra one batch at a time. This is the '''cassandra.range.batch.size''' property and it defaults to 4096. If you are experiencing timeouts, you might first try reducing the batch size so that each request can more easily complete within the timeout. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
+ * Starting in Cassandra 1.2, there is a range-request-specific timeout called '''range_request_timeout_in_ms''' in the cassandra.yaml. Hadoop requests data in sequential batches and each request has to complete within this timeout. Prior to Cassandra 1.2, you can set the general '''rpc_timeout_in_ms''' higher, which affects timeouts for reads, writes, and truncate operations in addition to range requests.

If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or tasktracker mapred-site.xml):
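For the first bullet, a minimal sketch of the programmatic route, assuming a standard Hadoop `Job` object (the job name and the value 1024 are purely illustrative):

{{{
// Lower the per-request row batch size so each sequential range request
// is more likely to complete within the server's timeout.
Job job = new Job(new Configuration(), "my-cassandra-job");
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1024);
// Equivalent to setting cassandra.range.batch.size in the Hadoop configuration.
}}}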
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: https://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=59&rev2=60

Comment: noting new consistency level default

}}}
The settings normally default to 4 each, but some find that too conservative. If you set it too low, you might have blacklisted tasktrackers and failed jobs because of occasional timeout exceptions. If you set them too high, jobs that would otherwise fail quickly take a long time to fail, sacrificing efficiency. Keep in mind that this can just cover up a problem. It may be that you always want these settings to be higher when operating against Cassandra. However, if you run into these exceptions too frequently, there may be a problem with your Cassandra or Hadoop configuration.

+ If you are seeing inconsistent data coming back, consider the consistency level at which you read ('''cassandra.consistencylevel.read''') and write ('''cassandra.consistencylevel.write'''). Both properties default to !ConsistencyLevel.LOCAL_ONE (previously [[https://issues.apache.org/jira/browse/CASSANDRA-6214|ONE]]).
- If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at. The two relevant properties are:
-
- * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
- * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.

Also hadoop integration uses range scans underneath, which do not do read repair. However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read. See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.
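If your job sets these through the raw Hadoop configuration, a sketch could look like the following (property names as documented above; QUORUM is chosen purely for illustration):

{{{
// Read and write at QUORUM so reads reconcile differences among the
// replicas consulted, since range scans do not trigger read repair.
Configuration conf = job.getConfiguration();
conf.set("cassandra.consistencylevel.read", "QUORUM");
conf.set("cassandra.consistencylevel.write", "QUORUM");
}}}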
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=55&rev2=56

If you are running into timeout exceptions, you might need to tweak one or both of these settings:
 * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
- * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.
+ * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml`. The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.

If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or tasktracker mapred-site.xml):
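For the second bullet, a sketch of the relevant `cassandra.yaml` line (the value is illustrative, not a recommendation):

{{{
# Inter-node rpc timeout in milliseconds; raise to reduce the chance
# of range requests timing out under Hadoop load.
rpc_timeout_in_ms: 20000
}}}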
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=46&rev2=47

Comment: removed redundant hadoop property.

  <value>20</value>
</property>
<property>
- <name>mapred.max.tracker.failures</name>
- <value>20</value>
- </property>
- <property>
  <name>mapred.map.max.attempts</name>
  <value>20</value>
</property>
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=45&rev2=46

Comment: Correcting the hive support information.

Anchor(Hive)
== Hive ==
- Hive comes bundled as part of the open-source Brisk project. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].
+ Hive support is currently a standalone project but will become part of the main Cassandra source tree in the future. See [[https://github.com/riptano/hive]] for details.

[[#Top|Top]]
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=41&rev2=42

Comment: Adding a link to pygmalion in the pig section.

 * Set the `HADOOP_HOME` environment variable to `hadoop_dir`, e.g. `/opt/hadoop` or `/etc/hadoop`
 * Set the `PIG_CONF` environment variable to `hadoop_dir/conf`
 * Set the `JAVA_HOME`
+
+ [[https://github.com/jeromatron/pygmalion/|Pygmalion]] is a project created to help with using Pig with Cassandra, especially for tabular (static column names) data.

[[#Top|Top]]
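On a typical Linux install, the environment setup listed above might look like the following (all paths are examples that depend on your installation):

{{{
export HADOOP_HOME=/opt/hadoop            # your hadoop_dir
export PIG_CONF=$HADOOP_HOME/conf         # hadoop_dir/conf
export JAVA_HOME=/usr/lib/jvm/java-6-sun  # wherever your JDK lives
}}}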
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=39&rev2=40

Comment: Adding more troubleshooting information and a caveat to OSS Brisk in the main description

== Overview ==
Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]])
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code is no longer going to be maintained by DataStax. Future development of Brisk is now part of a pay-for offering.

[[#Top|Top]]

@@ -92, +92 @@

 * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
 * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.

+ If you still see timeout exceptions with resultant failed jobs and/or blacklisted tasktrackers, there are settings that can give Cassandra more latitude before failing the jobs. An example of usage (in either the job configuration or tasktracker mapred-site.xml):
+ {{{
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.map.max.attempts</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.reduce.max.attempts</name>
+   <value>20</value>
+ </property>
+ }}}
+ The settings normally default to 4 each, but some find that too conservative. If you set it too low, you might have blacklisted tasktrackers and failed jobs because of occasional timeout exceptions. If you set them too high, jobs that would otherwise fail quickly take a long time to fail, sacrificing efficiency. Keep in mind that this can just cover up a problem. It may be that you always want these settings to be higher when operating against Cassandra. However, if you run into these exceptions too frequently, there may be a problem with your Cassandra or Hadoop configuration.
+
 If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at. The two relevant properties are:
 * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
 * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=40&rev2=41

Comment: Added a section on Oozie and another cluster configuration note about speculative execution.

 * [[#MapReduce|MapReduce]]
 * [[#Pig|Pig]]
 * [[#Hive|Hive]]
+ * [[#Oozie|Oozie]]
 * [[#ClusterConfig|Cluster Configuration]]
 * [[#Troubleshooting|Troubleshooting]]
 * [[#Support|Support]]

@@ -16, +17 @@

== Overview ==
Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code is no longer going to be maintained by DataStax. Future development of Brisk is now part of a pay-for offering.
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code is no longer going to be maintained by DataStax. Future DataStax development of Brisk is now part of a pay-for offering.

[[#Top|Top]]

@@ -34, +35 @@

ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
}}}
As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details.
+
Output To Cassandra
As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one to output data to Cassandra (default) using this new mechanism. See that example in the latest release for details.

@@ -65, +67 @@

[[#Top|Top]]

+ Anchor(Oozie)
+
+ == Oozie ==
+ [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the oozie action configuration like so:
+ {{{
+ <property>
+   <name>cassandra.thrift.address</name>
+   <value>${cassandraHost}</value>
+ </property>
+ <property>
+   <name>cassandra.thrift.port</name>
+   <value>${cassandraPort}</value>
+ </property>
+ <property>
+   <name>cassandra.partitioner.class</name>
+   <value>org.apache.cassandra.dht.RandomPartitioner</value>
+ </property>
+ <property>
+   <name>cassandra.consistencylevel.read</name>
+   <value>${cassandraReadConsistencyLevel}</value>
+ </property>
+ <property>
+   <name>cassandra.consistencylevel.write</name>
+   <value>${cassandraWriteConsistencyLevel}</value>
+ </property>
+ <property>
+   <name>cassandra.range.batch.size</name>
+   <value>${cassandraRangeBatchSize}</value>
+ </property>
+ }}}
+ Note that with Oozie you can specify values outright, like the partitioner here, or via a variable that is typically found in the properties file.
+ One other item of note is that Oozie assumes that it can detect a filemarker for successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed but the Oozie job that called it will fail because filemarkers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify this property to avoid that check. Oozie will still get completion updates from a callback from the job tracker; it just won't look for the filemarker.
+ {{{
+ <property>
+   <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
+   <value>false</value>
+ </property>
+ }}}
+
+ [[#Top|Top]]
+
+ Anchor(ClusterConfig)

== Cluster Configuration ==

@@ -74, +117 @@

Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node. Hadoop requires a distributed filesystem in which to store dependency jars, static data, and intermediate results. The nice thing about having a `TaskTracker` on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle around your data once you've performed analytics on it - you simply output to
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=37&rev2=38

Comment: Updated cluster config section. Took out single datanode thought as intermediate results need more than that realistically. Added bit about Brisk.

Anchor(ClusterConfig)
== Cluster Configuration ==
- If you would like to configure a Cassandra cluster so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop namenode/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `Jobtracker` to assign tasks to the Cassandra nodes that contain data for those tasks. At least one node in your cluster will also need to be a datanode. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly.
+
+ The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk, the open-source packaging of Cassandra with Hadoop. That will start the `JobTracker` and `TaskTracker` processes for you. It also uses CFS, an HDFS-compatible distributed filesystem built on Cassandra that removes the need for Hadoop `NameNode` and `DataNode` processes. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].
+
+ Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node. Hadoop requires a distributed filesystem in which to store dependency jars, static data, and intermediate results. The nice thing about having a `TaskTracker` on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle around your data once you've performed analytics on it - you simply output to Cassandra and you are able to access that data with high random-read performance.

@@ -79, +82 @@

}}}
Virtual Datacenter
One thing that many have asked about is whether Cassandra with Hadoop will be usable from a random access perspective. For example, you may need to use Cassandra for serving web latency requests. You may also need to run analytics over your data. In Cassandra 0.7+ there is the !NetworkTopologyStrategy which allows you to customize your cluster's replication strategy by datacenter. What you can do with this is create a 'virtual datacenter' to separate nodes that serve data with high random-read performance from nodes that are meant to be used for analytics. You need to have a snitch configured with your topology, and then according to the datacenters defined there (either explicitly or implicitly), you can indicate how many replicas you would like in each datacenter. You would install task trackers on nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your !NetworkTopologyStrategy configuration. The practical upshot of this is that your analytics nodes always have current data and your high random-read performance nodes always serve data with predictable performance.

-
- For an example of configuring Cassandra with Hadoop in the cloud, see the [[http://github.com/digitalreasoning/PyStratus|PyStratus]] project on Github.

[[#Top|Top]]
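To make the 'virtual datacenter' concrete, a hypothetical keyspace definition in cassandra-cli might look like the following (the keyspace, datacenter names, and replica counts are invented, and the exact syntax varies by Cassandra version):

{{{
create keyspace MyApp
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC_REALTIME : 2, DC_ANALYTICS : 1}];
}}}

With task trackers running only on the DC_ANALYTICS nodes, analytics jobs always read a current local replica while the DC_REALTIME nodes keep serving low-latency reads.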
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=35&rev2=36

Comment: Removing old troubleshooting tip about pre 0.6.2 connection leak and added remarks about range scans and CL.

 * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
 * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.

+ If you are seeing inconsistent data coming back, consider the consistency level that you are reading and writing at. The two relevant properties are:
+ * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
+ * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ Also, hadoop integration uses range scans underneath, which do not do read repair. However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read. See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.

- Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly). Depending on your local setup you may hit this issue, and may work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException.
-
- If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can work around it by reducing the number of concurrent tasks
-
- {{{
- Configuration conf = job.getConfiguration();
- conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size in rows of the batch you are reading from Cassandra
-
- {{{
- ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}

[[#Top|Top]]
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=32&rev2=33

Comment: Adding updated Hive support info

Anchor(Hive)
== Hive ==
- Work is being finalized to add support for Hive - see [[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]].
+ Hive comes bundled as part of the open-source Brisk project. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].

[[#Top|Top]]
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=33&rev2=34

Comment: Updated the Streaming section.

As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one to output data to Cassandra (default) using this new mechanism. See that example in the latest release for details.

Hadoop Streaming
- As of 0.7, there is support for [[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop Streaming]]. For examples on how to use Streaming with Cassandra, see the contrib section of the Cassandra source. The relevant tickets are [[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
+ Hadoop output streaming was introduced in 0.7 but was removed from 0.8 due to lack of interest and the additional complexity it added to the Hadoop integration code. To use output streaming with 0.7.x, see the contrib directory of the source download of Cassandra.

[[#Top|Top]]
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna. The comment on this change is: Adding a bit of info on the pig storefunc.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=28&rev2=29

--
== Pig ==
Cassandra 0.6+ also adds support for [[http://pig.apache.org|Pig]] with its own implementation of [[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]]. This allows Pig queries to be run against data stored in Cassandra. For an example of this, see the `contrib/pig` example in 0.6 and later.
+
+ Cassandra 0.7.4+ brings additional support in the form of a [[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/StoreFunc.html|StoreFunc]] implementation. This allows Pig queries to output data to Cassandra. It is handled by the same class as the `LoadFunc`: `CassandraStorage`. See the `README` in `contrib/pig` for more information.

When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the `README` notes in the `cassandra_src/contrib/pig` directory, the [[#ClusterConfig|Cluster Configuration]] section on this page, and some additional notes here:
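As a rough sketch of what the load/store round trip through `CassandraStorage` looks like in a Pig script (the keyspace and column family names here are invented):

{{{
-- Load rows from one column family and write them back out to another.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();
STORE rows INTO 'cassandra://MyKeyspace/OtherColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();
}}}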
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna. The comment on this change is: Adding some more troubleshooting info in a separate section.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=26&rev2=27

--
 * [[#Pig|Pig]]
 * [[#Hive|Hive]]
 * [[#ClusterConfig|Cluster Configuration]]
+ * [[#Troubleshooting|Troubleshooting]]
 * [[#Support|Support]]

Anchor(Overview)

@@ -37, +38 @@

Hadoop Streaming
As of 0.7, there is support for [[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop Streaming]]. For examples on how to use Streaming with Cassandra, see the contrib section of the Cassandra source. The relevant tickets are [[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].

- Some troubleshooting
- Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly). Depending on your local setup you may hit this issue, and may work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException.
-
- If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can work around it by reducing the number of concurrent tasks
-
- {{{
- Configuration conf = job.getConfiguration();
- conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size in rows of the batch you are reading from Cassandra
-
- {{{
- ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}

[[#Top|Top]]

Anchor(Pig)

@@ -93, +79 @@

[[#Top|Top]]

+ Anchor(Troubleshooting)
+
+ == Troubleshooting ==
+ If you are running into timeout exceptions, you might need to tweak one or both of these settings:
+ * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data. This is either specified in your hadoop configuration or using `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
+ * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`). The rpc timeout is not for timing out from the client but between nodes. This can be increased to reduce chances of timing out.
+
+ Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail (connections are not released properly). Depending on your local setup you may hit this issue, and may work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). The error will be reported on the hadoop job side as a thrift !TimedOutException.
+
+ If you are testing the integration against a single node and you obtain some failures, this may be normal: you are probably overloading the single machine, which may again result in timeout errors. You can work around it by reducing the number of concurrent tasks
+
+ {{{
+ Configuration conf = job.getConfiguration();
+ conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
+ }}}
+ Also, you may reduce the size in rows of the batch you are reading from Cassandra
+
+ {{{
+ ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
+ }}}
+
+ [[#Top|Top]]
+
Anchor(Support)

== Support ==
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna. The comment on this change is: Updating with more information about the virtual datacenter concept and more configuration help.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=19&rev2=20

--
 * [[#Pig|Pig Support]]
 * [[#Hive|Hive Support]]
 * [[#ClusterConfig|Cluster Configuration]]
+ * [[#Support|Support]]

Anchor(Overview)
== Overview ==
- Cassandra version 0.6 and later enable certain Hadoop functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://hadoop.apache.org/pig/|Pig]] and [[http://hive.apache.org/|Hive]].
+ Cassandra 0.6+ enables certain Hadoop functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].

[[#Top|Top]]

@@ -21, +22 @@

== MapReduce ==
Input from Cassandra
- Cassandra 0.6 (and later) adds support for retrieving data from Cassandra. This is based on implementations of [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]], [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]], and [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]] so that Hadoop !MapReduce jobs can retrieve data from Cassandra. For an example of how this works, see the contrib/word_count example in 0.6 or later. Cassandra rows or row fragments (that is, pairs of key + `SortedMap` of columns) are input to Map tasks for processing by your job, as specified by a `SlicePredicate` that describes which columns to fetch from each row.
+ Cassandra 0.6+ adds support for retrieving data from Cassandra. This is based on implementations of [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]], [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]], and [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]] so that Hadoop !MapReduce jobs can retrieve data from Cassandra. For an example of how this works, see the contrib/word_count example in 0.6 or later. Cassandra rows or row fragments (that is, pairs of key + `SortedMap` of columns) are input to Map tasks for processing by your job, as specified by a `SlicePredicate` that describes which columns to fetch from each row.

Here's how this looks in the word_count example, which selects just one configurable columnName from each row:

@@ -31, +32 @@

ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
}}}
- As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the `README` in the word_count and pig contrib modules for more details.
+ As of 0.7, configuration for Hadoop no longer resides in your job's specific storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details.

Output To Cassandra
- As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The contrib/word_count example in 0.7 contains two reducers - one for outputting data to the filesystem (default) and one to output data to Cassandra using this new mechanism. See that example in the latest release for details.
+ As of 0.7, there is a basic mechanism included in Cassandra for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem (default) and one to output data to Cassandra using this new mechanism. See that example in the latest release for details.

Hadoop Streaming

@@ -62, +63 @@

Anchor(Pig)
== Pig ==
- Cassandra 0.6+ also adds support for [[http://hadoop.apache.org/pig/|Pig]] with its own implementation of [[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]]. This allows Pig queries to be run against data stored in Cassandra. For an example of this, see the contrib/pig example in 0.6 and later.
+ Cassandra 0.6+ also adds support for [[http://pig.apache.org|Pig]] with its own implementation of [[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]]. This allows Pig queries to be run against data stored in Cassandra. For an example of this, see the `contrib/pig` example in 0.6 and later.
+
+ When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=22&rev2=23

--
== Contents ==
 * [[#overview|Overview]]
- * [[#MapReduce|MapReduce Support]]
+ * [[#MapReduce|MapReduce]]
- * [[#Pig|Pig Support]]
+ * [[#Pig|Pig]]
- * [[#Hive|Hive Support]]
+ * [[#Hive|Hive]]
 * [[#ClusterConfig|Cluster Configuration]]
 * [[#Support|Support]]
[Cassandra Wiki] Update of HadoopSupport by jeremyhanna
Dear Wiki user, You have subscribed to a wiki page or wiki category on Cassandra Wiki for change notification. The HadoopSupport page has been changed by jeremyhanna. The comment on this change is: Adding a support options link for hadoop as well.. http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=23&rev2=24

--
== Support ==
Sometimes configuration and integration can get tricky. To get support for this functionality, start with the `contrib` examples in the source download of Cassandra. Make sure you are following instructions in the `README` file for that example. You can search the Cassandra user mailing list or post on there as it is very active. You can also ask in the #cassandra irc channel on freenode for help. Other channels that might be of use are #hadoop, #hadoop-pig, and #hive. Those projects' mailing lists are also very active.
- There are professional support options for Cassandra that can help you get everything working together. For more information, see ThirdPartySupport.
+ There are professional support options for Cassandra that can help you get everything working together. For more information, see ThirdPartySupport. There are also professional support options specifically for Hadoop. For more information on that, see Hadoop's third party support [[http://wiki.apache.org/hadoop/Support|wiki page]].

[[#Top|Top]]