Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Sean Owen
True, although a number of other little issues make me, personally, not want to continue down this road:
- There are already a lot of build profiles to try to cover Hadoop versions
- I don't think it's quite right to have vendor-specific builds in Spark to begin with
- We should be moving to only

The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Hi all, Related to https://issues.apache.org/jira/browse/SPARK-3039, the default CDH4 build, which is built with “mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package”, pulls in avro-mapred hadoop1, as opposed to avro-mapred hadoop2. This ends up with the same error as mentioned in
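
One way an application build can work around the hadoop1 flavour being pulled in transitively is to exclude it and depend on the hadoop2-classified artifact explicitly. A minimal sbt sketch; the Spark and Avro versions shown are illustrative assumptions, not what the thread prescribes:

    // build.sbt fragment: drop the transitively resolved avro-mapred (hadoop1 flavour)
    // and add the hadoop2 classifier explicitly. Versions are illustrative.
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "1.2.1")
        .exclude("org.apache.avro", "avro-mapred"),
      "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
    )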

OSGI bundles for spark project..

2015-02-20 Thread Niranda Perera
Hi, I am interested in a Spark OSGI bundle. While checking the Maven repository I found that it has not been implemented yet. Can we expect an OSGI bundle to be released soon? Is it on the Spark project roadmap? Rgds -- Niranda

Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi all, I'm running Spark 1.2.0, in standalone mode, on different cluster and server sizes. All of my data is cached in memory. Basically I have a mass of data, about 8GB, with about 37k columns, and I'm running different configs of a BinaryLogisticRegressionBFGS. When I put Spark to run on 9
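
The thread does not include the actual training code; as a point of reference, a minimal single-model sketch against the Spark 1.2 MLlib API, assuming an existing SparkContext `sc` and a CSV layout with the label in the first column (both assumptions):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Parse and cache the training data once; every L-BFGS iteration re-reads it.
    val training: RDD[LabeledPoint] = sc.textFile("hdfs:///data/training.csv").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.head.toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()

    val lr = new LogisticRegressionWithLBFGS()
    lr.optimizer.setNumIterations(100).setRegParam(0.01)   // illustrative settings
    val model = lr.run(training)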

Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Sean Owen
It sounds like your computation just isn't CPU bound, right? Or maybe only some stages are. It's not clear what work you are doing beyond the core LR. Stages don't wait on each other unless one depends on the other. You'd have to clarify what you mean by running stages in parallel, like what

Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Sean Owen
Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is the bottleneck? The model building or other stages? I would think you can get the model building to be CPU bound, unless you have chopped it up into really small partitions. I think it's best to look further into what stages are

Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi Sean, I'm trying to increase the CPU usage by running logistic regression on different datasets in parallel. They shouldn't depend on each other. I train several logistic regression models from different column combinations of a main dataset. I processed the combinations in a ParArray in an
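
A sketch of that pattern, assuming a cached RDD[LabeledPoint] holding the full feature vector; the column combinations are illustrative. Driver-side parallelism via ParArray only helps if the scheduler can actually run the resulting jobs concurrently (e.g. a fair scheduler pool or enough free cores):

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Train one model per column combination, submitting the jobs from a ParArray
    // so several independent Spark jobs can be in flight at the same time.
    def trainAll(full: RDD[LabeledPoint],
                 combinations: Array[Array[Int]]): List[LogisticRegressionModel] =
      combinations.par.map { cols =>
        val projected = full.map { p =>
          LabeledPoint(p.label, Vectors.dense(cols.map(i => p.features(i))))
        }.cache()
        try new LogisticRegressionWithLBFGS().run(projected)
        finally projected.unpersist()
      }.toList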

Re: Spark SQL, Hive Parquet data types

2015-02-20 Thread Cheng Lian
For the second question, we do plan to support Hive 0.14, possibly in Spark 1.4.0. For the first question: 1. In Spark 1.2.0, the Parquet support code doesn’t support the timestamp type, so you can’t. 2. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses its own Parquet support
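
A minimal round-trip sketch against the Spark 1.3 API described above (native Parquet path, non-partitioned output); the path, the schema, and an existing SparkContext `sc` are assumptions:

    import java.sql.Timestamp

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

    val sqlContext = new SQLContext(sc)

    val schema = StructType(Seq(
      StructField("event", StringType, nullable = true),
      StructField("ts", TimestampType, nullable = true)))

    val rows = sc.parallelize(Seq(Row("login", new Timestamp(System.currentTimeMillis()))))
    val df = sqlContext.createDataFrame(rows, schema)

    // Spark 1.3 writes the timestamp column through its own Parquet code path
    // (for non-partitioned tables); 1.2 would reject TimestampType here.
    df.saveAsParquetFile("/tmp/events_parquet")
    val readBack = sqlContext.parquetFile("/tmp/events_parquet")
    readBack.printSchema()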

Re: Spark SQL, Hive Parquet data types

2015-02-20 Thread The Watcher
1. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses its own Parquet support to handle both the read path and the write path when dealing with Parquet tables declared in the Hive metastore, as long as you’re not writing to a partitioned table. So yes, you can. Ah, I had

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-20 Thread Tom Graves
Trying to run pyspark on yarn in client mode with the basic wordcount example, I see the following error when doing the collect: Error from python worker: /usr/bin/python: No module named sql. PYTHONPATH was:

Re: Spark SQL, Hive Parquet data types

2015-02-20 Thread yash datta
For the old Parquet path (available in 1.2.1), I made a few changes to be able to read from and write to a table partitioned on a timestamp-type column: https://github.com/apache/spark/pull/4469 On Fri, Feb 20, 2015 at 8:28 PM, The Watcher watche...@gmail.com wrote: 1. In Spark 1.3.0,

Spark 1.3 RC1 Generate schema based on string of schema

2015-02-20 Thread Denny Lee
In the Spark SQL 1.2 Programming Guide, we can generate the schema from a string of the schema via val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) But when running this on Spark 1.3.0 (RC1), I get the error: val schema =

Re: Spark 1.3 RC1 Generate schema based on string of schema

2015-02-20 Thread Denny Lee
Oh, I just realized that I never imported all of sql._ . My bad! On Fri Feb 20 2015 at 7:51:32 AM Denny Lee denny.g@gmail.com wrote: In the Spark SQL 1.2 Programmers Guide, we can generate the schema based on the string of schema via val schema = StructType( schemaString.split(
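
For reference, the snippet compiles on 1.3 once the DataType classes are imported from their new location, org.apache.spark.sql.types (they lived directly under org.apache.spark.sql in 1.2); the schemaString value here is illustrative:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Build a flat all-string schema from a space-separated list of column names.
    val schemaString = "name age city"   // illustrative
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))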

Re: OSGI bundles for spark project..

2015-02-20 Thread Niranda Perera
Hi Sean, does it mean that Spark is not encouraged to be embedded in other products? On Fri, Feb 20, 2015 at 3:29 PM, Sean Owen so...@cloudera.com wrote: I don't think an OSGI bundle makes sense for Spark. It's part JAR, part lifecycle manager. Spark has its own lifecycle management and is

Re: OSGI bundles for spark project..

2015-02-20 Thread Sean Owen
No, you usually run Spark apps via the spark-submit script, and the Spark machinery is already deployed on a cluster. Although it's possible to embed the driver and get it working that way, it's not supported. On Fri, Feb 20, 2015 at 4:48 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi

Re: The default CDH4 build uses avro-mapred hadoop1

2015-02-20 Thread Mingyu Kim
Thanks for the explanation. To be clear, I meant to speak for any Hadoop 2 releases before 2.2, which have profiles in Spark. I referred to CDH4, since that's the only Hadoop 2.0/2.1 version Spark ships a prebuilt package for. I understand the hesitation of making a code change if Spark doesn't