[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2014-05-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=57&rev2=58

Comment:
Updating and clarifying some of the troubleshooting information.

  == Troubleshooting ==
  If you are running into timeout exceptions, you might need to tweak one or 
both of these settings:
  
-  * '''cassandra.range.batch.size''' - the default is 4096, but you may need 
to lower this depending on your data.  This is either specified in your hadoop 
configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
-  * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml`.  The rpc 
timeout is not for timing out from the client but between nodes.  This can be 
increased to reduce chances of timing out.
+  * Each input split is divided into sequential batches of rows requested at a 
time from Cassandra.  This is the '''cassandra.range.batch.size''' property and 
it defaults to 4096.  If you are experiencing timeouts, you might first try 
reducing the batch size so that each request can more easily complete within 
the timeout.  This is specified either in your Hadoop configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize` (see the sketch 
after this list).
+  * Starting in Cassandra 1.2, there is a range-request-specific timeout called 
'''range_request_timeout_in_ms''' in cassandra.yaml.  Hadoop requests data in 
sequential batches, and each request has to complete within this timeout.  
Prior to Cassandra 1.2, you can set the general '''rpc_timeout_in_ms''' higher, 
which affects timeouts for reads, writes, and truncate operations in addition 
to range requests.
  
  If you still see timeout exceptions with resultant failed jobs and/or 
blacklisted tasktrackers, there are settings that can give Cassandra more 
latitude before failing the jobs.  An example of usage (in either the job 
configuration or tasktracker mapred-site.xml):
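  The property block itself (using the values of 20 shown in full elsewhere on 
this page) looks like:
  {{{
  <property>
    <name>mapred.max.tracker.failures</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.map.max.attempts</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.reduce.max.attempts</name>
    <value>20</value>
  </property>
  }}}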
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2014-05-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=59&rev2=60

Comment:
noting new consistency level default

  }}}
  The settings normally default to 4 each, but some find that too conservative. 
 If you set them too low, you might have blacklisted tasktrackers and failed 
jobs because of occasional timeout exceptions.  If you set them too high, jobs 
that would otherwise fail quickly take a long time to fail, sacrificing 
efficiency.  Keep in mind that this can simply mask a problem.  It may be that 
you always want these settings to be higher when operating against Cassandra.  
However, if you run into these exceptions too frequently, there may be a 
problem with your Cassandra or Hadoop configuration.
  
+ If you are seeing inconsistent data coming back, consider the consistency 
level at which you read ('''cassandra.consistencylevel.read''') and write 
('''cassandra.consistencylevel.write''').  Both properties default to 
!ConsistencyLevel.LOCAL_ONE (previously 
[[https://issues.apache.org/jira/browse/CASSANDRA-6214|ONE]]).
- If you are seeing inconsistent data coming back, consider the consistency 
level that you are reading and writing at.  The two relevant properties are:
- 
-  * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
-  * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
  
  Also, Hadoop integration uses range scans underneath, which do not perform 
read repair.  However, reading at !ConsistencyLevel.QUORUM will reconcile 
differences among the nodes read.  See the ReadRepair section as well as the 
!ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] 
page for more details.
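  As a minimal sketch of setting both levels from the job side (the property 
names are the ones above; QUORUM is only an illustrative choice):
  {{{
  import org.apache.hadoop.conf.Configuration;
  
  Configuration conf = new Configuration();
  // Read and write at QUORUM so range scans reconcile replica differences.
  conf.set("cassandra.consistencylevel.read", "QUORUM");
  conf.set("cassandra.consistencylevel.write", "QUORUM");
  }}}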
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2013-02-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=55&rev2=56

  If you are running into timeout exceptions, you might need to tweak one or 
both of these settings:
  
   * '''cassandra.range.batch.size''' - the default is 4096, but you may need 
to lower this depending on your data.  This is either specified in your hadoop 
configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
-  * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 
it's `RpcTimeoutInMillis` in `storage-conf.xml`).  The rpc timeout is not for 
timing out from the client but between nodes.  This can be increased to reduce 
chances of timing out.
+  * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml`.  The rpc 
timeout is not for timing out from the client but between nodes.  This can be 
increased to reduce chances of timing out.
  
  If you still see timeout exceptions with resultant failed jobs and/or 
blacklisted tasktrackers, there are settings that can give Cassandra more 
latitude before failing the jobs.  An example of usage (in either the job 
configuration or tasktracker mapred-site.xml):
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2012-02-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=46&rev2=47

Comment:
removed redundant hadoop property.

<value>20</value>
  </property>
  <property>
-   <name>mapred.max.tracker.failures</name>
-   <value>20</value>
- </property>
- <property>
    <name>mapred.map.max.attempts</name>
    <value>20</value>
  </property>


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2012-01-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=45&rev2=46

Comment:
Correcting the hive support information.

  <<Anchor(Hive)>>
  
  == Hive ==
- Hive comes bundled as part of the open-source Brisk project. For details, see 
the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and 
[[http://github.com/riptano/brisk|code]]
+ Hive support is currently a standalone project but will become part of the 
main Cassandra source tree in the future.  See 
[[https://github.com/riptano/hive]] for details.
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-09-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=41&rev2=42

Comment:
Adding a link to pygmalion in the pig section.

   * Set the `HADOOP_HOME` environment variable to `hadoop_dir`, e.g. 
`/opt/hadoop` or `/etc/hadoop`
   * Set the `PIG_CONF` environment variable to `hadoop_dir/conf`
   * Set the `JAVA_HOME` environment variable; for example:
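  For example (the paths are illustrative; use your own install locations):
  {{{
  export HADOOP_HOME=/opt/hadoop            # your Hadoop install directory
  export PIG_CONF=/opt/hadoop/conf          # hadoop_dir/conf, per the note above
  export JAVA_HOME=/usr/lib/jvm/java-6-sun  # illustrative JDK path
  }}}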
+ 
+ [[https://github.com/jeromatron/pygmalion/|Pygmalion]] is a project created 
to help with using Pig with Cassandra, especially for tabular (static column 
names) data.
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-09-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=39&rev2=40

Comment:
Adding more troubleshooting information and a caveat to OSS Brisk in the main 
description

  == Overview ==
  Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] 
functionality against Cassandra's data store.  Specifically, support has been 
added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop 
distribution called Brisk. 
([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) 
([[http://github.com/riptano/brisk|Code]]) 
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop 
distribution called Brisk. 
([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) 
([[http://github.com/riptano/brisk|Code]]) However this code is no longer going 
to be maintained by DataStax.  Future development of Brisk is now part of a 
pay-for offering.
  
  [[#Top|Top]]
  
@@ -92, +92 @@

   * '''cassandra.range.batch.size''' - the default is 4096, but you may need 
to lower this depending on your data.  This is either specified in your hadoop 
configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
   * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 
it's `RpcTimeoutInMillis` in `storage-conf.xml`).  The rpc timeout is not for 
timing out from the client but between nodes.  This can be increased to reduce 
chances of timing out.
  
+ If you still see timeout exceptions with resultant failed jobs and/or 
blacklisted tasktrackers, there are settings that can give Cassandra more 
latitude before failing the jobs.  An example of usage (in either the job 
configuration or tasktracker mapred-site.xml):
+ {{{
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.map.max.attempts</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.reduce.max.attempts</name>
+   <value>20</value>
+ </property>
+ }}}
+ The settings normally default to 4 each, but some find that too conservative. 
 If you set them too low, you might have blacklisted tasktrackers and failed 
jobs because of occasional timeout exceptions.  If you set them too high, jobs 
that would otherwise fail quickly take a long time to fail, sacrificing 
efficiency.  Keep in mind that this can simply mask a problem.  It may be that 
you always want these settings to be higher when operating against Cassandra.  
However, if you run into these exceptions too frequently, there may be a 
problem with your Cassandra or Hadoop configuration.
+ 
  If you are seeing inconsistent data coming back, consider the consistency 
level that you are reading and writing at.  The two relevant properties are:
   * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
   * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-09-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=40&rev2=41

Comment:
Added a section on Oozie and another cluster configuration note about 
speculative execution.

   * [[#MapReduce|MapReduce]]
   * [[#Pig|Pig]]
   * [[#Hive|Hive]]
+  * [[#Oozie|Oozie]]
   * [[#ClusterConfig|Cluster Configuration]]
   * [[#Troubleshooting|Troubleshooting]]
   * [[#Support|Support]]
@@ -16, +17 @@

  == Overview ==
  Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] 
functionality against Cassandra's data store.  Specifically, support has been 
added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop 
distribution called Brisk. 
([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) 
([[http://github.com/riptano/brisk|Code]]) However this code is no longer going 
to be maintained by DataStax.  Future development of Brisk is now part of a 
pay-for offering.
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop 
distribution called Brisk. 
([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) 
([[http://github.com/riptano/brisk|Code]]) However this code is no longer going 
to be maintained by DataStax.  Future DataStax development of Brisk is now part 
of a pay-for offering.
  
  [[#Top|Top]]
  
@@ -34, +35 @@

  ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
  As of 0.7, configuration for Hadoop no longer resides in your job's specific 
storage-conf.xml. See the `README` in the `word_count` and `pig` contrib 
modules for more details.
+ 
  
  === Output To Cassandra ===
  As of 0.7, there is a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The `contrib/word_count` example in 0.7 contains two 
reducers - one for outputting data to the filesystem and one to output data to 
Cassandra (default) using this new mechanism.  See that example in the latest 
release for details.
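  A minimal sketch of wiring a job to that mechanism (class and helper names 
are from the 0.7-era `org.apache.cassandra.hadoop` package; the keyspace and 
column family names are illustrative):
  {{{
  import java.nio.ByteBuffer;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
  import org.apache.cassandra.hadoop.ConfigHelper;
  
  Configuration conf = new Configuration();
  Job job = new Job(conf, "word_count");
  // Reducers emit (row key, list of mutations) pairs straight into Cassandra.
  job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
  job.setOutputKeyClass(ByteBuffer.class);
  job.setOutputValueClass(List.class);  // a List<Mutation> per row key
  ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "Keyspace1", "WordCount");
  }}}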
@@ -65, +67 @@

  
  [[#Top|Top]]
  
+ <<Anchor(Oozie)>>
+ 
+ == Oozie ==
+ [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine 
originally from Yahoo!, can be used with Cassandra/Hadoop.  Cassandra 
configuration information needs to go into the Oozie action configuration like 
so:
+ {{{
+ <property>
+   <name>cassandra.thrift.address</name>
+   <value>${cassandraHost}</value>
+ </property>
+ <property>
+   <name>cassandra.thrift.port</name>
+   <value>${cassandraPort}</value>
+ </property>
+ <property>
+   <name>cassandra.partitioner.class</name>
+   <value>org.apache.cassandra.dht.RandomPartitioner</value>
+ </property>
+ <property>
+   <name>cassandra.consistencylevel.read</name>
+   <value>${cassandraReadConsistencyLevel}</value>
+ </property>
+ <property>
+   <name>cassandra.consistencylevel.write</name>
+   <value>${cassandraWriteConsistencyLevel}</value>
+ </property>
+ <property>
+   <name>cassandra.range.batch.size</name>
+   <value>${cassandraRangeBatchSize}</value>
+ </property>
+ }}}
+ Note that with Oozie you can specify values outright, like the partitioner 
here, or via a variable that is typically defined in the properties file.
+ One other item of note is that Oozie assumes it can detect a file marker for 
successful completion of a job.  This means that when writing to Cassandra 
with, for example, Pig, the Pig script will succeed but the Oozie job that 
called it will fail, because file markers aren't written to Cassandra.  So when 
you write to Cassandra with Hadoop, specify the following property to avoid 
that check.  Oozie will still get completion updates from a job tracker 
callback; it just won't look for the file marker.
+ {{{
+ <property>
+   <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
+   <value>false</value>
+ </property>
+ }}}
+ 
+ [[#Top|Top]]
+ 
  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
@@ -74, +117 @@

  Otherwise, if you would like to configure a Cassandra cluster yourself so 
that Hadoop may operate over its data, it's best to overlay a Hadoop cluster 
over your Cassandra nodes.  You'll want to have a separate server for your 
Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop `TaskTracker` on each of 
your Cassandra nodes.  That will allow the `JobTracker` to assign tasks to the 
Cassandra nodes that contain data for those tasks.  Also install a Hadoop 
`DataNode` on each Cassandra node; Hadoop requires a distributed filesystem 
for storing dependency jars, static data, and intermediate results.
  
  The nice thing about having a `TaskTracker` on every node is that you get 
data locality and your analytics engine scales with your data. You also never 
need to shuttle around your data once you've performed analytics on it - you 
simply output to Cassandra and you are able to access that data with high 
random-read performance.

[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-08-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=37&rev2=38

Comment:
Updated cluster config section.  Took out single datanode thought as 
intermediate results need more than that realistically.  Added bit about Brisk.

  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- If you would like to configure a Cassandra cluster so that Hadoop may operate 
over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. 
 You'll want to have a separate server for your Hadoop namenode/`JobTracker`.  
Then install a Hadoop `TaskTracker` on each of your Cassandra nodes.  That will 
allow the `Jobtracker` to assign tasks to the Cassandra nodes that contain data 
for those tasks.  At least one node in your cluster will also need to be a 
datanode.  That's because Hadoop uses HDFS to store information like jar 
dependencies for your job, static data (like stop words for a word count), and 
things like that - it's the distributed cache.  It's a very small amount of 
data but the Hadoop cluster needs it to run properly.
+ 
+ The simplest way to configure your cluster to run Cassandra with Hadoop is to 
use Brisk, the open-source packaging of Cassandra with Hadoop.  That will start 
the `JobTracker` and `TaskTracker` processes for you.  It also uses CFS, an 
HDFS-compatible distributed filesystem built on Cassandra that removes the need 
for Hadoop `NameNode` and `DataNode` processes.  For details, see the Brisk 
[[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and 
[[http://github.com/riptano/brisk|code]].
+ 
+ Otherwise, if you would like to configure a Cassandra cluster yourself so 
that Hadoop may operate over its data, it's best to overlay a Hadoop cluster 
over your Cassandra nodes.  You'll want to have a separate server for your 
Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop `TaskTracker` on each of 
your Cassandra nodes.  That will allow the `JobTracker` to assign tasks to the 
Cassandra nodes that contain data for those tasks.  Also install a Hadoop 
`DataNode` on each Cassandra node; Hadoop requires a distributed filesystem 
for storing dependency jars, static data, and intermediate results.
  
  The nice thing about having a `TaskTracker` on every node is that you get 
data locality and your analytics engine scales with your data. You also never 
need to shuttle around your data once you've performed analytics on it - you 
simply output to Cassandra and you are able to access that data with high 
random-read performance.
  
@@ -79, +82 @@

  }}}
  === Virtual Datacenter ===
  One thing that many have asked about is whether Cassandra with Hadoop will be 
usable from a random access perspective. For example, you may need to use 
Cassandra for serving web latency requests. You may also need to run analytics 
over your data. In Cassandra 0.7+ there is the !NetworkTopologyStrategy which 
allows you to customize your cluster's replication strategy by datacenter. What 
you can do with this is create a 'virtual datacenter' to separate nodes that 
serve data with high random-read performance from nodes that are meant to be 
used for analytics. You need to have a snitch configured with your topology and 
then according to the datacenters defined there (either explicitly or 
implicitly), you can indicate how many replicas you would like in each 
datacenter. You would install task trackers on nodes in your analytics section 
and make sure that a replica is written to that 'datacenter' in your 
!NetworkTopologyStrategy configuration. The practical upshot of this is your 
analytics nodes always have current data and your high random-read performance 
nodes always serve data with predictable performance.
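  As a rough sketch of such a keyspace definition (0.8-era `cassandra-cli` 
syntax, which varies by release; the datacenter names `Cassandra` and 
`Analytics` are illustrative and must match the datacenters your snitch 
defines):
  {{{
  create keyspace MyKS
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = [{Cassandra:2, Analytics:1}];
  }}}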
- 
- For an example of configuring Cassandra with Hadoop in the cloud, see the 
[[http://github.com/digitalreasoning/PyStratus|PyStratus]] project on Github.
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-07-25 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=35&rev2=36

Comment:
Removing old troubleshooting tip about pre 0.6.2 connection leak and added 
remarks about range scans and CL.

   * '''cassandra.range.batch.size''' - the default is 4096, but you may need 
to lower this depending on your data.  This is either specified in your hadoop 
configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
   * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 
it's `RpcTimeoutInMillis` in `storage-conf.xml`).  The rpc timeout is not for 
timing out from the client but between nodes.  This can be increased to reduce 
chances of timing out.
  
+ If you are seeing inconsistent data coming back, consider the consistency 
level that you are reading and writing at.  The two relevant properties are:
+  * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
+  * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ Also, since Hadoop integration uses range scans underneath, read repair does 
not occur.  However, reading at !ConsistencyLevel.QUORUM will reconcile 
differences among the nodes read.  See the ReadRepair section as well as the 
!ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] 
page for more details.
- Releases before 0.6.2/0.7 are affected by a small resource leak that may 
cause jobs to fail (connections are not released properly).  Depending on your 
local setup you may hit this issue, and can work around it by raising the limit 
of open file descriptors for the process (e.g. in linux/bash using 
`ulimit -n 32000`).  The error will be reported on the Hadoop job side as a 
thrift !TimedOutException.
- 
- If you are testing the integration against a single node and you obtain some 
failures, this may be normal: you are probably overloading the single machine, 
which may again result in timeout errors. You can work around it by reducing 
the number of concurrent tasks:
- 
- {{{
-  Configuration conf = job.getConfiguration();
-  conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size in rows of the batch you are reading from 
Cassandra:
- 
- {{{
-  ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=32&rev2=33

Comment:
Adding updated Hive support info

  <<Anchor(Hive)>>
  
  == Hive ==
- Work is being finalized to add support for Hive - see 
[[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]].
+ Hive comes bundled as part of the open-source Brisk project. For details, see 
the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and 
[[http://github.com/riptano/brisk|code]].
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=33&rev2=34

Comment:
Updated the Streaming section.

  As of 0.7, there is a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The `contrib/word_count` example in 0.7 contains two 
reducers - one for outputting data to the filesystem and one to output data to 
Cassandra (default) using this new mechanism.  See that example in the latest 
release for details.
  
  === Hadoop Streaming ===
- As of 0.7, there is support for 
[[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop 
Streaming]].  For examples on how to use Streaming with Cassandra, see the 
contrib section of the Cassandra source.  The relevant tickets are 
[[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
+ Hadoop output streaming was introduced in 0.7 but was removed from 0.8 due to 
lack of interest and the additional complexity it added to the Hadoop 
integration code.  To use output streaming with 0.7.x, see the contrib 
directory of the source download of Cassandra.
  
  [[#Top|Top]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-03-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna.
The comment on this change is: Adding a bit of info on the pig storefunc..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=28&rev2=29

--

  
  == Pig ==
  Cassandra 0.6+ also adds support for [[http://pig.apache.org|Pig]] with its 
own implementation of 
[[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the `contrib/pig` example in 0.6 and later.
+ 
+ Cassandra 0.7.4+ brings additional support in the form of a 
[[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/StoreFunc.html|StoreFunc]]
 implementation.  This allows Pig queries to output data to Cassandra.  It is 
handled by the same class as the `LoadFunc`: `CassandraStorage`.  See the 
`README` in `contrib/pig` for more information.
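+ A minimal sketch of loading and storing through `CassandraStorage` (the jar 
path, keyspace, and column family names are illustrative; the 
`cassandra://keyspace/column_family` URI scheme is the one contrib/pig uses):
+ {{{
+ register /path/to/cassandra/contrib/pig/cassandra_loadfunc.jar;  -- illustrative path
+ define CassandraStorage org.apache.cassandra.hadoop.pig.CassandraStorage();
+ 
+ rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage();
+ -- ...transform rows here...
+ STORE rows INTO 'cassandra://Keyspace1/Standard2' USING CassandraStorage();
+ }}}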
  
  When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the 
`README` notes in the `cassandra_src/contrib/pig` directory, the 
[[#ClusterConfig|Cluster Configuration]] section on this page, and some 
additional notes here:
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-03-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna.
The comment on this change is: Adding some more troubleshooting info in a 
separate section..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=26&rev2=27

--

   * [[#Pig|Pig]]
   * [[#Hive|Hive]]
   * [[#ClusterConfig|Cluster Configuration]]
+  * [[#Troubleshooting|Troubleshooting]]
   * [[#Support|Support]]
  
  <<Anchor(Overview)>>
@@ -37, +38 @@

  
  === Hadoop Streaming ===
  As of 0.7, there is support for 
[[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop 
Streaming]].  For examples on how to use Streaming with Cassandra, see the 
contrib section of the Cassandra source.  The relevant tickets are 
[[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
- 
- === Some troubleshooting ===
- Releases before 0.6.2/0.7 are affected by a small resource leak that may 
cause jobs to fail (connections are not released properly).  Depending on your 
local setup you may hit this issue, and can work around it by raising the limit 
of open file descriptors for the process (e.g. in linux/bash using 
`ulimit -n 32000`).  The error will be reported on the Hadoop job side as a 
thrift !TimedOutException.
- 
- If you are testing the integration against a single node and you obtain some 
failures, this may be normal: you are probably overloading the single machine, 
which may again result in timeout errors. You can work around it by reducing 
the number of concurrent tasks:
- 
- {{{
-  Configuration conf = job.getConfiguration();
-  conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size in rows of the batch you are reading from 
Cassandra:
- 
- {{{
-  ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}
  [[#Top|Top]]
  
  <<Anchor(Pig)>>
@@ -93, +79 @@

  
  [[#Top|Top]]
  
+ <<Anchor(Troubleshooting)>>
+ 
+ == Troubleshooting ==
+ If you are running into timeout exceptions, you might need to tweak one or 
both of these settings:
+  * '''cassandra.range.batch.size''' - the default is 4096, but you may need 
to lower this depending on your data.  This is either specified in your hadoop 
configuration or using 
`org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
+  * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 
it's `RpcTimeoutInMillis` in `storage-conf.xml`).  The rpc timeout is not for 
timing out from the client but between nodes.  This can be increased to reduce 
chances of timing out.
+ 
+ Releases before 0.6.2/0.7 are affected by a small resource leak that may 
cause jobs to fail (connections are not released properly).  Depending on your 
local setup you may hit this issue, and can work around it by raising the limit 
of open file descriptors for the process (e.g. in linux/bash using 
`ulimit -n 32000`).  The error will be reported on the Hadoop job side as a 
thrift !TimedOutException.
+ 
+ If you are testing the integration against a single node and you obtain some 
failures, this may be normal: you are probably overloading the single machine, 
which may again result in timeout errors. You can work around it by reducing the 
number of concurrent tasks:
+ 
+ {{{
+  Configuration conf = job.getConfiguration();
+  conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
+ }}}
+ Also, you may reduce the size in rows of the batch you are reading from 
Cassandra:
+ 
+ {{{
+  ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
+ }}}
+ 
+ [[#Top|Top]]
+ 
  <<Anchor(Support)>>
  
  == Support ==


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-03-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna.
The comment on this change is: Updating with more information about the virtual 
datacenter concept and more configuration help..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=19&rev2=20

--

   * [[#Pig|Pig Support]]
   * [[#Hive|Hive Support]]
   * [[#ClusterConfig|Cluster Configuration]]
+  * [[#Support|Support]]
  
  <<Anchor(Overview)>>
  
  == Overview ==
- Cassandra version 0.6 and later enable certain Hadoop functionality against 
Cassandra's data store.  Specifically, support has been added for 
[[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://hadoop.apache.org/pig/|Pig]] and [[http://hive.apache.org/|Hive]].
+ Cassandra 0.6+ enables certain Hadoop functionality against Cassandra's data 
store.  Specifically, support has been added for 
[[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
  [[#Top|Top]]
  
@@ -21, +22 @@

  == MapReduce ==
  
  === Input from Cassandra ===
- Cassandra 0.6 (and later) adds support for retrieving data from Cassandra.  
This is based on implementations of 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
 and 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
 so that Hadoop !MapReduce jobs can retrieve data from Cassandra.  For an 
example of how this works, see the contrib/word_count example in 0.6 or later.  
Cassandra rows or row  fragments (that is, pairs of key + `SortedMap`  of 
columns) are input to Map tasks for  processing by your job, as specified by a 
`SlicePredicate`  that describes which columns to fetch from each row.
+ Cassandra 0.6+ adds support for retrieving data from Cassandra.  This is 
based on implementations of 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
 and 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
 so that Hadoop !MapReduce jobs can retrieve data from Cassandra.  For an 
example of how this works, see the contrib/word_count example in 0.6 or later.  
Cassandra rows or row fragments (that is, pairs of key + `SortedMap` of 
columns) are input to Map tasks for processing by your job, as specified by a 
`SlicePredicate` that describes which columns to fetch from each row.
  
  Here's how this looks in the word_count example, which selects just one 
configurable columnName from each row:
@@ -31, +32 @@

  ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
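  For reference, the lines just above that call look roughly like this (a 
reconstruction against the 0.7-era Thrift API; `columnName` is the job's 
configurable column):
  {{{
  import java.nio.ByteBuffer;
  import java.util.Arrays;
  import org.apache.cassandra.hadoop.ConfigHelper;
  import org.apache.cassandra.thrift.SlicePredicate;
  
  // Fetch a single named column from each row.
  SlicePredicate predicate = new SlicePredicate()
      .setColumn_names(Arrays.asList(ByteBuffer.wrap(columnName.getBytes())));
  ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}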
  
- As of 0.7, configuration for Hadoop no longer resides in your job's specific 
storage-conf.xml. See the `README` in the word_count and pig contrib modules 
for more details.
+ As of 0.7, configuration for Hadoop no longer resides in your job's specific 
storage-conf.xml. See the `README` in the `word_count` and `pig` contrib 
modules for more details.
  
  === Output To Cassandra ===
  
- As of 0.7, there is be a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The contrib/word_count example in 0.7 contains two reducers 
- one for outputting data to the filesystem (default) and one to output data to 
Cassandra using this new mechanism.  See that example in the latest release for 
details.
+ As of 0.7, there is a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The `contrib/word_count` example in 0.7 contains two 
reducers - one for outputting data to the filesystem (default) and one to 
output data to Cassandra using this new mechanism.  See that example in the 
latest release for details.
  
  === Hadoop Streaming ===
  
@@ -62, +63 @@

  <<Anchor(Pig)>>
  
  == Pig ==
- Cassandra 0.6+ also adds support for [[http://hadoop.apache.org/pig/|Pig]] 
with its own implementation of 
[[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the contrib/pig example in 0.6 and later.
+ Cassandra 0.6+ also adds support for [[http://pig.apache.org|Pig]] with its 
own implementation of 
[[http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the `contrib/pig` example in 0.6 and later.
+ 
+ When running Pig with Cassandra + Hadoop on a cluster, be sure to follow the 

[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-03-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=22&rev2=23

--

  
  == Contents ==
   * [[#overview|Overview]]
-  * [[#MapReduce|MapReduce Support]]
+  * [[#MapReduce|MapReduce]]
-  * [[#Pig|Pig Support]]
+  * [[#Pig|Pig]]
-  * [[#Hive|Hive Support]]
+  * [[#Hive|Hive]]
   * [[#ClusterConfig|Cluster Configuration]]
   * [[#Support|Support]]
  


[Cassandra Wiki] Update of HadoopSupport by jeremyhanna

2011-03-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HadoopSupport page has been changed by jeremyhanna.
The comment on this change is: Adding a support options link for hadoop as 
well..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=23&rev2=24

--

  == Support ==
  Sometimes configuration and integration can get tricky.  To get support for 
this functionality, start with the `contrib` examples in the source download of 
Cassandra.  Make sure you are following the instructions in the `README` file 
for that example.  You can search the Cassandra user mailing list, or post 
there, as it is very active.  You can also ask for help in the #cassandra IRC 
channel on freenode.  Other channels that might be of use are #hadoop, 
#hadoop-pig, and #hive.  Those projects' mailing lists are also very active.
  
- There are professional support options for Cassandra that can help you get 
everything working together. For more information, see ThirdPartySupport.
+ There are professional support options for Cassandra that can help you get 
everything working together. For more information, see ThirdPartySupport. There 
are also professional support options specifically for Hadoop. For more 
information on that, see Hadoop's third party support 
[[http://wiki.apache.org/hadoop/Support|wiki page]].
  
  [[#Top|Top]]