Hi,

IIRC, the code in Crunch is inherently sequential and meant for small(ish) amounts of data. After all, a distributed read from an RDBMS with Hadoop is often considered a DDoS attack :)
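For illustration, a data-driven split amounts roughly to turning one query into several range-bounded ones, and those are what would hit the database concurrently if each ran on its own node. Below is a minimal sketch, assuming an integer key column. The class, the method, and the table name `events` are hypothetical, and this is not the actual Hadoop or Crunch code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop or Crunch.
public class RangeSplitSketch {

    /**
     * Builds one WHERE condition per split, covering the inclusive
     * range [min, max] of an integer column without gaps or overlap.
     */
    public static List<String> splitConditions(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("need numSplits >= 1 and max >= min");
        }
        List<String> conditions = new ArrayList<>();
        long lo = min;
        for (int i = 0; i < numSplits && lo <= max; i++) {
            long remainingSplits = numSplits - i;
            // Ceiling division spreads the remaining values evenly.
            long size = (max - lo + 1 + remainingSplits - 1) / remainingSplits;
            long hi = lo + size - 1;
            conditions.add(column + " >= " + lo + " AND " + column + " <= " + hi);
            lo = hi + 1;
        }
        return conditions;
    }

    public static void main(String[] args) {
        // Each printed query would be one connection if run by its own task.
        for (String condition : splitConditions("id", 1, 100, 4)) {
            System.out.println("SELECT * FROM events WHERE " + condition);
        }
    }
}
```

With four splits, that is four simultaneous queries against the same server, and the number grows with the cluster.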
Regards,
Matthias

On Monday, 2013-03-18, Josh Wills wrote:
> Hey Martijn,
>
> I don't have any intuition on this one-- is this code that you could post
> as a gist or something so I could play with it and see if I see anything
> amiss? The trick will be figuring out if the problem is in Crunch, the
> underlying DB library, or the config.
>
> J
>
>
> On Mon, Mar 18, 2013 at 6:50 AM, Martijn Lenderink
> <[email protected]> wrote:
>
> > Hello,
> >
> > I have a working JDBC connection to get data from an MSSQL source.
> > It all works great, except my cluster only opens one connection to the
> > MSSQL server.
> >
> > I have multiple nodes running, but the data gets pulled from only one node
> > and then gets sent to the other nodes for processing.
> >
> > I am using code similar to the following:
> >
> > https://github.com/apache/incubator-crunch/blob/master/crunch-contrib/src/it/java/org/apache/crunch/contrib/io/jdbc/DataBaseSourceIT.java
> >
> > The only difference is that I am using the DataDrivenDBInputFormat.
> >
> > When I debug the source code, the query gets split into multiple queries,
> > but they only get executed on one machine.
> > Why isn't this executed in parallel with multiple connections to the MSSQL
> > server?
> >
> > Greetings,
> > Martijn Lenderink
> >
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
