Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Thanks Judy, this is exactly what I'm looking for. However, and please forgive
me if it's a dumb question: it seems to me that Thrift is the same as the
hive2 JDBC driver. Does this mean that starting the Thrift server will start Hive
as well on the server?

On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:

  You can use the Thrift server for this purpose, then test it with Beeline.



 See doc:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server





 *From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
 *Sent:* Monday, December 8, 2014 11:01 AM
 *To:* user@spark.apache.org
 *Subject:* Spark-SQL JDBC driver



 Hello Everyone,



 I'm brand new to spark and was wondering if there's a JDBC driver to
 access spark-SQL directly. I'm running spark in standalone mode and don't
 have hadoop in this environment.



 --



 *Best Regards/أطيب المنى,*



 *Anas Mosaad*






-- 

*Best Regards/أطيب المنى,*

*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*


Re: do not assemble the spark example jar

2014-12-09 Thread lihu
Can the assembly get faster if we do not need Spark SQL or some other
component of Spark? For example, if we only need Spark core.

On Wed, Nov 26, 2014 at 3:37 PM, lihu lihu...@gmail.com wrote:

 Matei, sorry for the typo in my last message. The tip saves about 30s on
 my machine.

 On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote:

 Matei, thank you very much!
 After taking your advice, the assembly time went from about 20 minutes down to
 6 minutes on my machine. That's a very big improvement.

 On Wed, Nov 26, 2014 at 12:32 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 BTW as another tip, it helps to keep the SBT console open as you make
 source changes (by just running sbt/sbt with no args). It's a lot faster
 the second time it builds something.

 Matei

 On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 You can do sbt/sbt assembly/assembly to assemble only the main package.

 Matei

 On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote:

 Hi,
 The Spark assembly is time-costly. I only need
 spark-assembly-1.1.0-hadoop2.3.0.jar and do not need
 spark-examples-1.1.0-hadoop2.3.0.jar.  How can I configure Spark to
 avoid assembling the examples jar? I know the *export SPARK_PREPEND_CLASSES=true*
 method can reduce the assembly time, but I do not
 develop locally. Any advice?

 --
 *Best Wishes!*







 --
 *Best Wishes!*

 *Li Hu(李浒) | Graduate Student*

 *Institute for Interdisciplinary Information Sciences(IIIS
 http://iiis.tsinghua.edu.cn/)*
 *Tsinghua University, China*

 *Email: lihu...@gmail.com*
 *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/*





 --
 *Best Wishes!*

 *Li Hu(李浒) | Graduate Student*

 *Institute for Interdisciplinary Information Sciences(IIIS
 http://iiis.tsinghua.edu.cn/)*
 *Tsinghua University, China*

 *Email: lihu...@gmail.com*
 *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/*





-- 
*Best Wishes!*

*Li Hu(李浒) | Graduate Student*

*Institute for Interdisciplinary Information Sciences(IIIS
http://iiis.tsinghua.edu.cn/)*
*Tsinghua University, China*

*Email: lihu...@gmail.com*
*Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/*


Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of 
HiveServer2. You don't need to run Hive, but you do need a working 
Metastore.
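
As a side note, a minimal sketch of what the client side can look like once the
Thrift server is up, assuming the default HiveServer2 port 10000 and no
authentication (these are assumptions, not settings from this thread); the driver
class is the ordinary HiveServer2 JDBC driver, which is also what Beeline uses:

import java.sql.DriverManager

object SparkSqlJdbcClient {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")            // HiveServer2 JDBC driver
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SHOW TABLES")                   // any Spark SQL statement works here
    while (rs.next()) println(rs.getString(1))
    rs.close(); stmt.close(); conn.close()
  }
}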


On 12/9/14 3:59 PM, Anas Mosaad wrote:
Thanks Judy, this is exactly what I'm looking for. However, and please 
forgive me if it's a dumb question: it seems to me that Thrift is 
the same as the hive2 JDBC driver. Does this mean that starting the Thrift 
server will start Hive as well on the server?


On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash 
judyn...@exchange.microsoft.com wrote:


You can use the Thrift server for this purpose, then test it with Beeline.

See doc:


https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server

*From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
*Sent:* Monday, December 8, 2014 11:01 AM
*To:* user@spark.apache.org
*Subject:* Spark-SQL JDBC driver
*Subject:* Spark-SQL JDBC driver

Hello Everyone,

I'm brand new to spark and was wondering if there's a JDBC driver
to access spark-SQL directly. I'm running spark in standalone mode
and don't have hadoop in this environment.

-- 


*Best Regards/أطيب المنى,*

*Anas Mosaad*




--

*Best Regards/أطيب المنى,*
*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*




Re: How can I create an RDD with millions of entries created programmatically

2014-12-09 Thread Daniel Darabos
Ah... I think you're right about the flatMap then :). Or you could use
mapPartitions. (I'm not sure if it makes a difference.)
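
A hedged Scala sketch of that idea (not code from the thread; generateRandomObject
below is a stand-in for the poster's generator and the counts are illustrative).
The seed collection is small, and the inflation happens lazily inside each
partition on the executors, so the full set of objects never has to fit in memory
at once:

import org.apache.spark.{SparkConf, SparkContext}

object InflateRDD {
  // stand-in for the poster's generator of large Java objects (hypothetical)
  def generateRandomObject(i: Int): Array[Byte] = Array.fill(1024)(i.toByte)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InflateRDD").setMaster("local[4]"))
    // a small seed RDD on the driver, inflated by a factor of 1000 on the executors
    val seeds = sc.parallelize(1 to 1000, numSlices = 100)
    val objects = seeds.flatMap(seed => (1 to 1000).iterator.map(i => generateRandomObject(seed * 1000 + i)))
    println(objects.count())   // 1,000,000 objects, generated partition by partition
    sc.stop()
  }
}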

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 Looks good, but how do I say that in Java?
 As far as I can see, sc.parallelize (in Java) has only one implementation,
 which takes a List, requiring an in-memory representation.

 On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos 
 daniel.dara...@lynxanalytics.com wrote:

 Hi,
 I think you have the right idea. I would not even worry about flatMap.

 val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x =>
 generateRandomObject(x))

 Then when you try to evaluate something on this RDD, it will happen
 partition-by-partition. So 1000 random objects will be generated at a time
 per executor thread.

 On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

  I have a function which generates a Java object, and I want to explore
 failures which only happen when processing large numbers of these objects.
 The real code reads a many-gigabyte file, but in the test code I can
 generate similar objects programmatically. I could create a small list,
 parallelize it, and then use flatMap to inflate it several times by a factor
 of 1000 (remember, I can hold a list of 1000 items in memory but not a
 million).
 Are there better ideas - remember I want to create more objects than can
 be held in memory at once.





 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com




KafkaUtils explicit acks

2014-12-09 Thread Mukesh Jha
Hello Experts,

I'm working on a Spark app which reads data from Kafka and persists it in
HBase.

The Spark documentation states (see *[1]* below) that in case of worker failure
we can lose some data. If so, how can I make my Kafka stream more reliable?
I have seen that there is a simple consumer *[2]*, but I'm not sure if it has
been used/tested extensively.

I was wondering if there is a way to explicitly acknowledge the Kafka
offsets once they are replicated in the memory of other worker nodes (if it's
not already done) to tackle this issue.

Any help is appreciated in advance.


   1. *Using any input source that receives data through a network* - For
   network-based data sources like *Kafka* and Flume, the received input
   data is replicated in memory between nodes of the cluster (default
   replication factor is 2). So if a worker node fails, then the system can
   recompute the lost data from the left-over copy of the input data. However,
   if the *worker node where a network receiver was running fails, then a
   tiny bit of data may be lost*, that is, the data received by the system
   but not yet replicated to other node(s). The receiver will be started on a
   different node and it will continue to receive data.
   2. https://github.com/dibbhatt/kafka-spark-consumer
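
(Not from the thread, just a sketch of the replication the quoted passage above
describes: in the receiver-based Scala API, the storage-level argument is what
controls how many in-memory copies of each received block are kept. The ZooKeeper
quorum, group and topic below are placeholders.)

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReplicatedKafkaStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReplicatedKafkaStream").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER_2)   // the trailing _2 keeps a second copy on another executor
    stream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}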

Txz,

*Mukesh Jha me.mukesh@gmail.com*


Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Thanks Cheng,

I thought spark-sql is using the same exact metastore, right? However, it
didn't work as expected. Here's what I did.

In spark-shell, I loaded a CSV file and registered the table, say
countries.
Started the Thrift server.
Connected using Beeline. When I run show tables or !tables, I get an empty
list of tables, as follows:

*0: jdbc:hive2://localhost:1 !tables*

*++--+-+-+--+*

*| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |*

*++--+-+-+--+*

*++--+-+-+--+*

*0: jdbc:hive2://localhost:1 show tables ;*

*+-+*

*| result  |*

*+-+*

*+-+*

*No rows selected (0.106 seconds)*

*0: jdbc:hive2://localhost:1 *



Kindly advise, what am I missing? I want to read the RDD using SQL from
outside spark-shell (i.e. like any other relational database).


On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com wrote:

  Essentially, the Spark SQL JDBC Thrift server is just a Spark port of
 HiveServer2. You don't need to run Hive, but you do need a working
 Metastore.


 On 12/9/14 3:59 PM, Anas Mosaad wrote:

 Thanks Judy, this is exactly what I'm looking for. However, and please
 forgive me if it's a dumb question: it seems to me that Thrift is the
 same as the hive2 JDBC driver. Does this mean that starting the Thrift server
 will start Hive as well on the server?

 On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash judyn...@exchange.microsoft.com
  wrote:

  You can use the Thrift server for this purpose, then test it with Beeline.



 See doc:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server





 *From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
 *Sent:* Monday, December 8, 2014 11:01 AM
 *To:* user@spark.apache.org
 *Subject:* Spark-SQL JDBC driver



 Hello Everyone,



 I'm brand new to spark and was wondering if there's a JDBC driver to
 access spark-SQL directly. I'm running spark in standalone mode and don't
 have hadoop in this environment.



 --



 *Best Regards/أطيب المنى,*



 *Anas Mosaad*






  --

 *Best Regards/أطيب المنى,*

  *Anas Mosaad*
 *Incorta Inc.*
 *+20-100-743-4510*





-- 

*Best Regards/أطيب المنى,*

*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*


Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian

How did you register the table under spark-shell? Two things to notice:

1. To interact with Hive, HiveContext instead of SQLContext must be used.
2. `registerTempTable` doesn't persist the table into Hive metastore, 
and the table is lost after quitting spark-shell. Instead, you must use 
`saveAsTable`.
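
A hedged Scala sketch of those two points (the case class, table name and data
below are made up, not from the thread): build the SchemaRDD against a
HiveContext and persist it with saveAsTable, so the Thrift server, which talks to
the same metastore, can see it from Beeline.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PersistToMetastore {
  case class Country(id: Int, name: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistToMetastore").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)      // HiveContext, not SQLContext
    import hiveContext.createSchemaRDD         // implicit RDD[case class] -> SchemaRDD in Spark 1.1/1.2

    val countries = sc.parallelize(Seq(Country(1, "Egypt"), Country(2, "France")))
    countries.saveAsTable("countries")         // persisted in the Hive metastore,
                                               // so Beeline's "show tables" should list it
    sc.stop()
  }
}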


On 12/9/14 5:27 PM, Anas Mosaad wrote:

Thanks Cheng,

I thought spark-sql is using the same exact metastore, right? However, 
it didn't work as expected. Here's what I did.


In spark-shell, I loaded a CSV file and registered the table, say 
countries.

Started the thrift server.
Connected using Beeline. When I run show tables or !tables, I get 
an empty list of tables, as follows:


/0: jdbc:hive2://localhost:1 !tables/

/++--+-+-+--+/

/| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |/

/++--+-+-+--+/

/++--+-+-+--+/

/0: jdbc:hive2://localhost:1 show tables ;/

/+-+/

/| result  |/

/+-+/

/+-+/

/No rows selected (0.106 seconds)/

/0: jdbc:hive2://localhost:1 /



Kindly advise, what am I missing? I want to read the RDD using SQL 
from outside spark-shell (i.e. like any other relational database).



On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com 
mailto:lian.cs@gmail.com wrote:


Essentially, the Spark SQL JDBC Thrift server is just a Spark port
of HiveServer2. You don't need to run Hive, but you do need a
working Metastore.


On 12/9/14 3:59 PM, Anas Mosaad wrote:

Thanks Judy, this is exactly what I'm looking for. However, and
please forgive me if it's a dumb question: it seems to me that
Thrift is the same as the hive2 JDBC driver. Does this mean that
starting the Thrift server will start Hive as well on the server?

On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash
judyn...@exchange.microsoft.com wrote:

You can use the Thrift server for this purpose, then test it with
Beeline.

See doc:


https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server

*From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
*Sent:* Monday, December 8, 2014 11:01 AM
*To:* user@spark.apache.org
*Subject:* Spark-SQL JDBC driver
*Subject:* Spark-SQL JDBC driver

Hello Everyone,

I'm brand new to spark and was wondering if there's a JDBC
driver to access spark-SQL directly. I'm running spark in
standalone mode and don't have hadoop in this environment.

-- 


*Best Regards/أطيب المنى,*

*Anas Mosaad*




-- 


*Best Regards/أطيب المنى,*
*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*





--

*Best Regards/أطيب المنى,*
*Anas Mosaad*
*Incorta Inc.*
*+20-100-743-4510*




spark 1.1.1 Maven dependency

2014-12-09 Thread sivarani
Dear All,

I am using Spark Streaming. It was working fine when I was using Spark 1.0.2;
now I repeatedly get a few issues, like class not found.

I am using the same pom.xml with the updated version for all Spark modules.
I am using the spark-core, spark-streaming and spark-streaming-kafka modules.

It constantly keeps throwing errors for missing commons-configuration,
commons-lang, logging.

How do I get all the dependencies for running Spark Streaming? Is there any
way, or do we just have to find them by trial and error?

My pom dependencies:

<dependencies>
<dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>servlet-api</artifactId>
    <version>2.5</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.0.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.0.2</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.5</version>
</dependency>
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.1.1</version>
</dependency>
<dependency>
    <groupId>commons-configuration</groupId>
    <artifactId>commons-configuration</artifactId>
    <version>1.6</version>
</dependency>

Am I missing something here?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-1-Maven-dependency-tp20590.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Issues on schemaRDD's function got stuck

2014-12-09 Thread LinQili
I checked my code again and located the issue: if we do the `load data
inpath` before the select statement, the application gets stuck; if we don't, it
won't get stuck.

Log info:
14/12/09 17:29:33 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:34 WARN io.nio: java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:34 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:35 WARN io.nio: java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:35 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkDriver]
From: lin_q...@outlook.com
To: u...@spark.incubator.apache.org
Subject: Issues on schemaRDD's function got stuck
Date: Tue, 9 Dec 2014 15:54:14 +0800




Hi all: I was running HiveFromSpark on yarn-cluster. When I got the hive
select's result SchemaRDD and tried to run `collect()` on it, the application
got stuck, and I don't know what's wrong with it. Here is my code:

val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat) // got the select's result SchemaRDD
val rows = result.collect()           // This is where the application gets stuck

It was ok when running in yarn-client mode.
Here is the log:
===
14/12/09 15:40:58 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:41:31 WARN util.AkkaUtils: Error sending message in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:42:04 WARN util.AkkaUtils: Error sending message in 3 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:42:07 WARN executor.Executor: Issue communicating with driver in 
heartbeater
org.apache.spark.SparkException: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@a810606,BlockManagerId(2, longzhou-hdp1.lz.dscc, 
53356, 0))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
... 1 more
14/12/09 15:42:47 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at 

RE: Issues on schemaRDD's function got stuck

2014-12-09 Thread LinQili
I checked my code again and located the issue: if we do the `load data
inpath` before the select statement, the application gets stuck; if we don't, it
won't get stuck.

Get-stuck code:

val sqlLoadData = s"LOAD DATA INPATH '$currentFile' OVERWRITE INTO TABLE $tableName"
hiveContext.hql(sqlLoadData)
val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat) // got the select's result SchemaRDD
val rows = result.collect()           // This is where the application gets stuck

Log info:
14/12/09 17:29:33 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:34 WARN io.nio: java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:34 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:35 WARN io.nio: java.lang.OutOfMemoryError: PermGen space
14/12/09 17:29:35 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkDriver]
From: lin_q...@outlook.com
To: u...@spark.incubator.apache.org
Subject: Issues on schemaRDD's function got stuck
Date: Tue, 9 Dec 2014 15:54:14 +0800




Hi all: I was running HiveFromSpark on yarn-cluster. When I got the hive
select's result SchemaRDD and tried to run `collect()` on it, the application
got stuck, and I don't know what's wrong with it. Here is my code:

val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat) // got the select's result SchemaRDD
val rows = result.collect()           // This is where the application gets stuck

It was ok when running in yarn-client mode.
Here is the log:
===
14/12/09 15:40:58 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:41:31 WARN util.AkkaUtils: Error sending message in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:42:04 WARN util.AkkaUtils: Error sending message in 3 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
14/12/09 15:42:07 WARN executor.Executor: Issue communicating with driver in 
heartbeater
org.apache.spark.SparkException: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@a810606,BlockManagerId(2, longzhou-hdp1.lz.dscc, 
53356, 0))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:373)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  

NoSuchMethodError: writing spark-streaming data to cassandra

2014-12-09 Thread m.sarosh

Hi,

I am intending to save the streaming data from Kafka into Cassandra, using
spark-streaming.
But there seems to be a problem with the line
javaFunctions(data).writerBuilder("testkeyspace", "test_table",
mapToRow(TestTable.class)).saveToCassandra();
I am getting a NoSuchMethodError.
The code, the error log and the POM.xml dependencies are listed below.
Please help me find the reason why this is happening.


public class SparkStream {
    static int key=0;
    public static void main(String args[]) throws Exception
    {
      if(args.length != 3)
      {
        System.out.println("SparkStream <zookeeper_ip> <group_nm> <topic1,topic2,...>");
        System.exit(1);
      }

      Logger.getLogger("org").setLevel(Level.OFF);
      Logger.getLogger("akka").setLevel(Level.OFF);
      Map<String,Integer> topicMap = new HashMap<String,Integer>();
      String[] topic = args[2].split(",");
      for(String t: topic)
      {
        topicMap.put(t, new Integer(3));
      }

      /* Connection to Spark */
      SparkConf conf = new SparkConf();
      JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream", conf);
      JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(3000));


      /* Receive Kafka streaming inputs */
      JavaPairReceiverInputDStream<String, String> messages =
          KafkaUtils.createStream(jssc, args[0], args[1], topicMap );


      /* Create DStream */
      JavaDStream<TestTable> data = messages.map(new Function<Tuple2<String,String>, TestTable>()
      {
        public TestTable call(Tuple2<String, String> message)
        {
          return new TestTable(new Integer(++key), message._2() );
        }
      }
      );

      /* Write to cassandra */
      javaFunctions(data,TestTable.class).saveToCassandra("testkeyspace","test_table");
      //  data.print(); //creates console output stream.


      jssc.start();
      jssc.awaitTermination();

    }
}

class TestTable implements Serializable
{
    Integer key;
    String value;

    public TestTable() {}

    public TestTable(Integer k, String v)
    {
      key=k;
      value=v;
    }

    public Integer getKey(){
      return key;
    }

    public void setKey(Integer k){
      key=k;
    }

    public String getValue(){
      return value;
    }

    public void setValue(String v){
      value=v;
    }

    public String toString(){
      return MessageFormat.format("TestTable'{'key={0},value={1}'}'", key, value);
    }
}

The output log is:
Exception in thread "main" java.lang.NoSuchMethodError: 
com.datastax.spark.connector.streaming.DStreamFunctions.<init>(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;)V
at 
com.datastax.spark.connector.DStreamJavaFunctions.<init>(DStreamJavaFunctions.java:17)
at 
com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(CassandraJavaUtil.java:89)
at com.spark.SparkStream.main(SparkStream.java:83)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


And the POM dependencies are:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.1.0</version>
</dependency>

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.0.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.1</version>
</dependency>


<dependency>
    <groupId>com.msiops.footing</groupId>

Re: Spark-SQL JDBC driver

2014-12-09 Thread Anas Mosaad
Back to the first question: does this mean that Hive must be up and running?

When I try it, I get the following exception. The documentation says that
this method works only on a SchemaRDD. I thought that countries.saveAsTable
did not work for that reason, so I created a tmp that contains the results
from the registered temp table, which I could validate is a
SchemaRDD, as shown below.


*@Judy,* I really do appreciate your kind support and I want to understand,
and of course don't want to waste your time. If you can direct me to the
documentation describing these details, that would be great.

scala> val tmp = sqlContext.sql("select * from countries")

tmp: org.apache.spark.sql.SchemaRDD =

SchemaRDD[12] at RDD at SchemaRDD.scala:108

== Query Plan ==

== Physical Plan ==

PhysicalRDD
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29],
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36


scala> tmp.saveAsTable("Countries")

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
plan found, tree:

'CreateTableAsSelect None, Countries, false, None

 Project
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]

  Subquery countries

   LogicalRDD
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29],
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36


 at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)

at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)

at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)

at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)

at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)

at
org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126)

at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108)

at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)

at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)

at $iwC$$iwC$$iwC.<init>(<console>:29)

at $iwC$$iwC.<init>(<console>:31)

at $iwC.<init>(<console>:33)

at <init>(<console>:35)

at .<init>(<console>:39)

at .<clinit>(<console>)

at .<init>(<console>:7)

at .<clinit>(<console>)

at $print(<console>)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 

Re: Mllib native netlib-java/OpenBLAS

2014-12-09 Thread Jaonary Rabarisoa
+1 with 1.3-SNAPSHOT.

On Mon, Dec 1, 2014 at 5:49 PM, agg212 alexander_galaka...@brown.edu
wrote:

 Thanks for your reply, but I'm still running into issues
 installing/configuring the native libraries for MLlib.  Here are the steps
 I've taken, please let me know if anything is incorrect.

 - Download Spark source
 - unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean package `
 - Run `sbt/sbt publish-local`

 The last step fails with the following error (full stack trace is attached
 here: error.txt,
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n20110/error.txt ):
 [error] (sql/compile:compile) java.lang.AssertionError: assertion failed:
 List(object package$DebugNode, object package$DebugNode)

 Do I still have to install OPENBLAS/anything else if I build Spark from the
 source using the -Pnetlib-lgpl flag?  Also, do I change the Spark version
 (from 1.1.0 to 1.2.0-SNAPSHOT) in the .sbt file for my app?

 Thanks!
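
(Regarding the .sbt question above: a hedged build.sbt sketch of that version
change, assuming the locally published 1.2.0-SNAPSHOT artifacts from
publish-local. The "provided" scope is just the usual choice when running against
a Spark assembly, not something stated in this thread.)

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.2.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.2.0-SNAPSHOT" % "provided"
)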



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p20110.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Using S3 block file system

2014-12-09 Thread Paul Colomiets
Hi,

I'm trying to use the S3 block file system in Spark, i.e. s3:// URLs
(*not* s3n://), and I always get the following error:

Py4JJavaError: An error occurred while calling o3188.saveAsParquetFile.
: org.apache.hadoop.fs.s3.S3FileSystemException: Not a Hadoop S3 file.
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.checkMetadata(Jets3tFileSystemStore.java:206)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:165)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy24.retrieveINode(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.mkdir(S3FileSystem.java:158)
at org.apache.hadoop.fs.s3.S3FileSystem.mkdirs(S3FileSystem.java:151)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
at org.apache.hadoop.fs.s3.S3FileSystem.create(S3FileSystem.java:234)
[.. snip ..]

I believe that I must somehow initialize the file system (in particular
the metadata), but I can't find out how to do it.

I use Spark 1.2.0rc1 with Hadoop 2.4 and Riak CS (instead of S3), if
that matters. The s3n:// protocol with the same settings works.


Thanks.
-- 
Paul

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NoSuchMethodError: writing spark-streaming data to cassandra

2014-12-09 Thread Gerard Maas
You're using two conflicting versions of the connector: the Scala version
at 1.1.0 and the Java version at 1.0.4.

I don't use Java, but I guess you only need the java dependency for your
job - and with the version fixed.

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>*1.1.0*</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>*1.0.4*</version>
</dependency>

On Tue, Dec 9, 2014 at 11:16 AM, m.sar...@accenture.com wrote:


 Hi,

 I am intending to save the streaming data from Kafka into Cassandra, using
 spark-streaming.
 But there seems to be a problem with the line
 javaFunctions(data).writerBuilder("testkeyspace", "test_table",
 mapToRow(TestTable.class)).saveToCassandra();
 I am getting a NoSuchMethodError.
 The code, the error log and the POM.xml dependencies are listed below.
 Please help me find the reason why this is happening.


 public class SparkStream {
     static int key=0;
     public static void main(String args[]) throws Exception
     {
       if(args.length != 3)
       {
         System.out.println("SparkStream <zookeeper_ip> <group_nm> <topic1,topic2,...>");
         System.exit(1);
       }

       Logger.getLogger("org").setLevel(Level.OFF);
       Logger.getLogger("akka").setLevel(Level.OFF);
       Map<String,Integer> topicMap = new HashMap<String,Integer>();
       String[] topic = args[2].split(",");
       for(String t: topic)
       {
         topicMap.put(t, new Integer(3));
       }

       /* Connection to Spark */
       SparkConf conf = new SparkConf();
       JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream", conf);
       JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(3000));


       /* Receive Kafka streaming inputs */
       JavaPairReceiverInputDStream<String, String> messages =
           KafkaUtils.createStream(jssc, args[0], args[1], topicMap );


       /* Create DStream */
       JavaDStream<TestTable> data = messages.map(new Function<Tuple2<String,String>, TestTable>()
       {
         public TestTable call(Tuple2<String, String> message)
         {
           return new TestTable(new Integer(++key), message._2() );
         }
       }
       );

       /* Write to cassandra */
       javaFunctions(data,TestTable.class).saveToCassandra("testkeyspace","test_table");
       //  data.print(); //creates console output stream.


       jssc.start();
       jssc.awaitTermination();

     }
 }

 class TestTable implements Serializable
 {
     Integer key;
     String value;

     public TestTable() {}

     public TestTable(Integer k, String v)
     {
       key=k;
       value=v;
     }

     public Integer getKey(){
       return key;
     }

     public void setKey(Integer k){
       key=k;
     }

     public String getValue(){
       return value;
     }

     public void setValue(String v){
       value=v;
     }

     public String toString(){
       return MessageFormat.format("TestTable'{'key={0},value={1}'}'", key, value);
     }
 }

 The output log is:
 Exception in thread "main" java.lang.NoSuchMethodError:
 com.datastax.spark.connector.streaming.DStreamFunctions.<init>(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;)V
 at
 com.datastax.spark.connector.DStreamJavaFunctions.<init>(DStreamJavaFunctions.java:17)
 at
 com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(CassandraJavaUtil.java:89)
 at com.spark.SparkStream.main(SparkStream.java:83)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 And the POM dependencies are:

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka_2.10</artifactId>
   <version>1.1.0</version>
 </dependency>

 <dependency>

Stack overflow Error while executing spark SQL

2014-12-09 Thread jishnu.prathap
Hi

I am getting a stack overflow error:
Exception in main: java.lang.StackOverflowError
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

while executing the following code:
sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

The complete code is from github
https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans

/**
 * Examine the collected tweets and trains a model based on them.
 */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    /*if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        " <tweetInput> <outputModelDir> <numClusters> <numIterations>")
      System.exit(1)
    }
    */
    val outputModelDir = "C:\\MLModel"
    val tweetInput = "C:\\MLInput"
    val numClusters = 10
    val numIterations = 20

    //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args

    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Pretty print some of the tweets.
    val tweets = sc.textFile(tweetInput)
    println("Sample JSON Tweets---")
    for (tweet <- tweets.take(5)) {
      println(gson.toJson(jsonParser.parse(tweet)))
    }
    val tweetTable = sqlContext.jsonFile(tweetInput).cache()
    tweetTable.registerTempTable("tweetTable")
    println("--Tweet table Schema---")
    tweetTable.printSchema()
    println("Sample Tweet Text-")

    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

    println("--Sample Lang, Name, text---")
    sqlContext.sql("SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000").collect().foreach(println)
    println("--Total count by languages Lang, count(*)---")
    sqlContext.sql("SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25").collect.foreach(println)
    println("--- Training the model and persist it")
    val texts = sqlContext.sql("SELECT text from tweetTable").map(_.head.toString)
    // Cache the vectors RDD since it will be used for all the KMeans iterations.
    val vectors = texts.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.
    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)
    val some_tweets = texts.take(100)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}

Thanks & Regards
Jishnu Menath Prathap




Re: spark 1.1.1 Maven dependency

2014-12-09 Thread Sean Owen
Are you using the Commons libs in your app? then you need to depend on
them directly, and not rely on them accidentally being provided by
Spark. There is no trial and error; you must declare all the
dependencies you use in your own code.

Otherwise perhaps you are not actually running with the Spark assembly
JAR at runtime but somehow only trying to run against the Spark jar
itself. This has none of the dependencies that Spark needs.

On Tue, Dec 9, 2014 at 3:46 AM, sivarani whitefeathers...@gmail.com wrote:
 Dear All,

 I am using Spark Streaming. It was working fine when I was using Spark 1.0.2;
 now I repeatedly get a few issues, like class not found.

 I am using the same pom.xml with the updated version for all Spark modules.
 I am using the spark-core, spark-streaming and spark-streaming-kafka modules.

 It constantly keeps throwing errors for missing commons-configuration,
 commons-lang, logging.

 How do I get all the dependencies for running Spark Streaming? Is there any
 way, or do we just have to find them by trial and error?

 My pom dependencies:

 <dependencies>
 <dependency>
   <groupId>javax.servlet</groupId>
   <artifactId>servlet-api</artifactId>
   <version>2.5</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.0.2</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.10</artifactId>
   <version>1.0.2</version>
 </dependency>
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka_2.10</artifactId>
   <version>1.0.2</version>
 </dependency>
 <dependency>
   <groupId>org.slf4j</groupId>
   <artifactId>slf4j-log4j12</artifactId>
   <version>1.7.5</version>
 </dependency>
 <dependency>
   <groupId>commons-logging</groupId>
   <artifactId>commons-logging</artifactId>
   <version>1.1.1</version>
 </dependency>
 <dependency>
   <groupId>commons-configuration</groupId>
   <artifactId>commons-configuration</artifactId>
   <version>1.6</version>
 </dependency>

 Am I missing something here?






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-1-Maven-dependency-tp20590.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Saving Data only if Dstream is not empty

2014-12-09 Thread Sean Owen
I don't believe you can do this unless you implement the save to HDFS
logic yourself. To keep the semantics consistent, these saveAs*
methods will always output a file per partition.
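
A hedged sketch of implementing it yourself (this is not from the thread; it
assumes the `streams` DStream from the quoted message below, and the output path
is illustrative): check each batch RDD for emptiness inside foreachRDD and only
then write it. take(1) is used because it is much cheaper than count() on a
large batch.

streams.foreachRDD { (rdd, time) =>
  if (rdd.take(1).nonEmpty) {            // skip empty batches entirely
    rdd.saveAsTextFile(s"hdfs:///some/output/prefix-${time.milliseconds}")
  }
}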

On Mon, Dec 8, 2014 at 11:53 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote:
 Hi Experts!

 I want to save a DStream to HDFS only if it is not empty, i.e. it contains
 some Kafka messages to be stored. What is an efficient way to do this?

var data = KafkaUtils.createStream[Array[Byte], Array[Byte],
 DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
 StorageLevel.MEMORY_ONLY).map(_._2)


 val streams = data.window(Seconds(interval*4),
 Seconds(interval*2)).map(x => new String(x))
 //streams.foreachRDD(rdd => rdd.foreach(println))

 //what condition can be applied here to store only non empty DStream
 streams.saveAsTextFiles(sink, msg)
 Thanks




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



reg JDBCRDD code

2014-12-09 Thread Deepa Jayaveer
Hi All,
I am new to Spark. I tried to connect to MySQL using Spark and want to write
the code in Java, but I am
getting a runtime exception. I guess that the issue is with the Function0
and Function1 objects being passed to JdbcRDD.

I tried my level best and attached the code; can you please help us fix
the issue?



Thanks 
Deepa
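
For what it's worth, a minimal Scala sketch of what the Function0/Function1
arguments of JdbcRDD amount to (this is not the attached JdbcRddTest.java; the
MySQL URL, credentials, table and columns are made up for illustration, and the
MySQL JDBC driver is assumed to be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JdbcRddExample").setMaster("local[2]"))
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass"),  // the Function0: opens a connection on each executor
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",   // must contain two '?' bound to the bounds below
      1, 1000, 4,                                                // lowerBound, upperBound, numPartitions
      (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))  // the Function1: maps each ResultSet row
    println(rdd.count())
    sc.stop()
  }
}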
=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you




JdbcRddTest.java
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

registerTempTable: Table not found

2014-12-09 Thread Hao Ren
Hi, 

I am using Spark SQL on 1.2.1-snapshot.

Here is the problem I encountered. Basically, I want to save a SchemaRDD to
the HiveContext.

val scm = StructType(
  Seq(
    StructField("name", StringType, nullable = false),
    StructField("cnt", IntegerType, nullable = false)
  ))

val schRdd = hiveContext.applySchema(ranked, scm)
// ranked above is an RDD[Row] whose rows contain 2 fields
schRdd.registerTempTable("schRdd")

hiveContext sql "select count(name) from schRdd limit 20" // => ok

hiveContext sql "create table t as select * from schRdd" // => table not
found


A query like select works well and gives the correct answer, but when I
try to save the temp table into the HiveContext with createTableAsSelect, it
does not work.

*Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32
Table not found 'schRdd'
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)*

I thought that was caused by registerTempTable, so I replaced it with
saveAsTable. That does not work either.

*Exception in thread main java.lang.AssertionError: assertion failed: No
plan for CreateTableAsSelect Some(sephcn), schRdd, false, None
 LogicalRDD [name#6,cnt#7], MappedRDD[3] at map at Creation.scala:70

at scala.Predef$.assert(Predef.scala:179)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)*

I also checked source code of QueryPlanner:

  def apply(plan: LogicalPlan): Iterator[PhysicalPlan] = {
    // Obviously a lot to do here still...
    val iter = strategies.view.flatMap(_(plan)).toIterator
    assert(iter.hasNext, s"No plan for $plan")
    iter
  }

The comment shows that there is still some work to do here.

Any help is appreciated.

Thx.

Hao



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/registerTempTable-Table-not-found-tp20592.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Help related with spark streaming usage error

2014-12-09 Thread Saurabh Pateriya
Hi ,

I am running Spark Streaming in standalone mode on a Twitter source, on a single
machine (using the HDP virtual box). I receive statuses from the streaming context and I
can print them, but when I try to save those statuses as an RDD into Hadoop
using rdd.saveAsTextFiles or
saveAsHadoopFiles("hdfs://10.20.32.204:50070/user/hue/test","txt"), I get the below
connection error.
My Hadoop version: 2.4.0.2.1.1.0-385
Spark: 1.1.0

ERROR-

14/12/09 04:45:12 ERROR scheduler.JobScheduler: Error running job streaming job 
141812911 ms.1
java.io.IOException: Call to /10.20.32.204:50070 failed on local exception: 
java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapred.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:193)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:685)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:762)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:760)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:811)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:749)
[error] (run-main-0) java.io.IOException: Call to /10.20.32.204:50070 failed on 
local exception: java.io.EOFException
java.io.IOException: Call to /10.20.32.204:50070 failed on local exception: 
java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapred.SparkHadoopWriter$.createPathFromString(SparkHadoopWriter.scala:193)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:685)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894)
at 

Re: registerTempTable: Table not found

2014-12-09 Thread nitin
Looks like this issue has been fixed very recently and should be available in the
next RC:

http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/registerTempTable-Table-not-found-tp20592p20593.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Specifying number of executors in Mesos

2014-12-09 Thread Gerard Maas
Hi,

We have a number of Spark Streaming / Kafka jobs that would benefit from an even
spread of consumers over physical hosts in order to maximize network usage.
As far as I can see, the Spark Mesos scheduler accepts resource offers
until all required Mem + CPU allocation has been satisfied.

This basic resource allocation policy results in large executors spread
over few nodes, resulting in many Kafka consumers in a single node (e.g.
from 12 consumers, I've seen allocations of 7/3/2)

Is there a way to tune this behavior to achieve executor allocation on a
given number of hosts?

-kr, Gerard.


Re: registerTempTable: Table not found

2014-12-09 Thread Hao Ren
Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/registerTempTable-Table-not-found-tp20592p20594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
According to the stacktrace, you were still using SQLContext rather than 
HiveContext. To interact with Hive, HiveContext *must* be used.


Please refer to this page 
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
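For reference, a minimal sketch of what I mean (assuming a Spark build with Hive support and a hive-site.xml on the classpath; "countries.json" is just a made-up input file):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// load and register some data
val countries = hiveContext.jsonFile("countries.json")
countries.registerTempTable("countries")
// saveAsTable needs the Hive metastore behind HiveContext, which is why it fails on a plain SQLContext
hiveContext.sql("SELECT * FROM countries").saveAsTable("countries_persisted")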



On 12/9/14 6:26 PM, Anas Mosaad wrote:
Back to the first question: will this mandate that Hive is up and
running?


When I try it, I get the following exception. The documentation says 
that this method works only on a SchemaRDD. I thought that 
countries.saveAsTable did not work for that reason, so I created a 
tmp that contains the results from the registered temp table, which I 
could validate is a SchemaRDD as shown below.


*@Judy,* I really appreciate your kind support and I want to 
understand, and of course I don't want to waste your time. If you can 
direct me to the documentation describing these details, that would be great.


scala> val tmp = sqlContext.sql("select * from countries")

tmp: org.apache.spark.sql.SchemaRDD =

SchemaRDD[12] at RDD at SchemaRDD.scala:108

== Query Plan ==

== Physical Plan ==

PhysicalRDD 
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], 
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36



scala> tmp.saveAsTable("Countries")

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Unresolved plan found, tree:


'CreateTableAsSelect None, Countries, false, None

 Project 
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]


  Subquery countries

   LogicalRDD 
[COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], 
MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36



at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)


at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)


at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)


at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)


at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)


at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)


at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)


at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)


at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)


at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)


at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)


at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)


at scala.collection.immutable.List.foreach(List.scala:318)

at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)


at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)


at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)


at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)


at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)


at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)


at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)


at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)


at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)


at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)


at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)


at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)


at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)


at 
org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126)


at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108)

at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)

at $iwC$$iwC$$iwC$$iwC.init(console:27)

at $iwC$$iwC$$iwC.init(console:29)

at $iwC$$iwC.init(console:31)

at $iwC.init(console:33)

at init(console:35)

at .init(console:39)

at .clinit(console)

at .init(console:7)

at .clinit(console)

at $print(console)

Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-09 Thread Daniel Haviv
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:

14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI'
on port 4040.
14/12/09 06:54:24 INFO ui.SparkUI: Started SparkUI at http://hdname:4040
14/12/09 06:54:25 INFO impl.TimelineClientImpl: Timeline service address:
http://0.0.0.0:8188/ws/v1/timeline/
java.lang.NoClassDefFoundError:
org/codehaus/jackson/map/deser/std/StdDeserializer
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

Any idea why ?

Thanks,
Daniel


Submit application to spark on mesos cluster

2014-12-09 Thread Han JU
Hi,

I have a little problem in submitting our application to a mesos cluster.

Basically the Mesos cluster is configured and I'm able to have spark-shell
working correctly. Then I tried to launch our application jar (an uber jar built
with sbt assembly, with all deps):

bin/spark-submit --master mesos://10.192.222.232:5050 --jars
$ADDITIONAL_JARS --class my.package.BenchmarkDriver
file:///home/jclouds/spark-benchmark-assembly-0.1.0-SNAPSHOT.jar
 --application_configs

So you can see I follow the documentation: I provide the application jar in
file://... format and I make sure that the application jar is available at
this path on each worker of the cluster. I also provide the application jar
with --jars. However, there's always a ClassNotFoundException in the worker:

  java.lang.ClassNotFoundException: my.package.UrlLink

or, if I tried custom kryo serializer:

  org.apache.spark.SparkException: Failed to invoke
my.package.CustomKryoRegistrator

I've tried using hdfs://... for the application jar, but it seems to be ignored
completely by spark-submit.

I'm using spark 1.1.1 on hadoop 2.4.

Any suggestions? How should I submit the application jar?
-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: Programmatically running spark jobs using yarn-client

2014-12-09 Thread Aniket Bhatnagar
Thanks Akhil. I was wondering why it isn't able to find the class even
though it exists in the same class loader as SparkContext. As a
workaround, I used the following code to add all dependent jars in a
Play Framework application to the Spark context.

@tailrec
private def addClassPathJars(sparkContext: SparkContext, classLoader: ClassLoader): Unit = {
  classLoader match {
    case urlClassLoader: URLClassLoader => {
      urlClassLoader.getURLs.foreach(classPathUrl => {
        if (classPathUrl.toExternalForm.endsWith(".jar")) {
          LOGGER.debug(s"Added $classPathUrl to spark context $sparkContext")
          sparkContext.addJar(classPathUrl.toExternalForm)
        } else {
          LOGGER.debug(s"Ignored $classPathUrl while adding to spark context $sparkContext")
        }
      })
    }
    case _ => LOGGER.debug(s"Ignored class loader $classLoader as it does not subclass URLClassLoader")
  }
  if (classLoader.getParent != null) {
    addClassPathJars(sparkContext, classLoader.getParent)
  }
}
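For completeness, I call it from the driver roughly like this (just a sketch; sparkContext is the context the application has already created):

// register every jar visible on the current thread's class loader chain
addClassPathJars(sparkContext, Thread.currentThread().getContextClassLoader)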

On Mon Dec 08 2014 at 21:39:42 Akhil Das ak...@sigmoidanalytics.com wrote:

 How are you submitting the job? You need to create a jar of your code (sbt
 package will give you one inside target/scala-*/projectname-*.jar) and then
 use it while submitting. If you are not using spark-submit then you can
 simply add this jar to spark by
 sc.addJar("/path/to/target/scala-*/projectname-*.jar")

 Thanks
 Best Regards

 On Mon, Dec 8, 2014 at 7:23 PM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am trying to create (yet another) spark as a service tool that lets you
 submit jobs via REST APIs. I think I have nearly gotten it to work baring a
 few issues. Some of which seem already fixed in 1.2.0 (like SPARK-2889) but
 I have hit the road block with the following issue.

 I have created a simple spark job as following:

 class StaticJob {
   import SparkContext._
   override def run(sc: SparkContext): Result = {
     val array = Range(1, 1000).toArray
     val rdd = sc.parallelize(array)
     val paired = rdd.map(i => (i % 1, i)).sortByKey()
     val sum = paired.countByKey()
     SimpleResult(sum)
   }
 }

 When I submit this job programmatically, it gives me a class not found
 error:

 2014-12-08 05:41:18,421 [Result resolver thread-0] [warn]
 o.a.s.s.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0,
 localhost.localdomain): java.lang.ClassNotFoundException:
 com.blah.server.examples.StaticJob$$anonfun$1
 java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 java.security.AccessController.doPrivileged(Native Method)
 java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 java.lang.Class.forName0(Native Method)
 java.lang.Class.forName(Class.java:270)

 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)

 I decompiled the StaticJob$$anonfun$1 class and it seems to point to the
 closure 'rdd.map(i => (i % 1, i))'. I am not sure why this is happening.
 Any help will be greatly appreciated.





Re: Saving Data only if Dstream is not empty

2014-12-09 Thread Gerard Maas
We have a similar case in which we don't want to save data to Cassandra if
the data is empty.
In our case, we filter the initial DStream to process messages that go to a
given table.

To do so, we're using something like this:

dstream.foreachRDD { (rdd, time) =>
  tables.foreach { table =>
    val filteredRdd = rdd.filter(record => /* predicate to assign records to tables */)
    filteredRdd.cache
    if (filteredRdd.count > 0) {
      filteredRdd.saveAsFoo(...) // we do saveToCassandra here; you could
                                 // do saveAsTextFile(s"$path/$time") instead
    }
    filteredRdd.unpersist
  }
}

Using the 'time' parameter you can build a unique name based on the
timestamp for the saveAsTextFile(filename) call, which is what
DStream.saveAsTextFiles(...) gives you. (So it boils down to what Sean
said... you implement the saveAs yourself.)
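If you only need to skip empty batches (without the per-table split), a lighter variant is to probe the RDD first, something like this rough sketch:

// 'path' here is whatever base output directory you use
dstream.foreachRDD { (rdd, time) =>
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile(s"$path/${time.milliseconds}")
  }
}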

-kr, Gerard.
@maasg

On Tue, Dec 9, 2014 at 1:56 PM, Sean Owen so...@cloudera.com wrote:

 I don't believe you can do this unless you implement the save to HDFS
 logic yourself. To keep the semantics consistent, these saveAs*
 methods will always output a file per partition.

 On Mon, Dec 8, 2014 at 11:53 PM, Hafiz Mujadid hafizmujadi...@gmail.com
 wrote:
  Hi Experts!
 
  I want to save DStream to HDFS only if it is not empty such that it
 contains
  some kafka messages to be stored. What is an efficient way to do this.
 
 var data = KafkaUtils.createStream[Array[Byte], Array[Byte],
  DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
  StorageLevel.MEMORY_ONLY).map(_._2)
 
 
  val streams = data.window(Seconds(interval*4),
  Seconds(interval*2)).map(x => new String(x))
  //streams.foreachRDD(rdd => rdd.foreach(println))
 
  //what condition can be applied here to store only non empty DStream
  streams.saveAsTextFiles(sink, msg)
  Thanks
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Saving-Data-only-if-Dstream-is-not-empty-tp20587.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark on YARN memory utilization

2014-12-09 Thread Denny Lee
Thanks Sandy!
On Mon, Dec 8, 2014 at 23:15 Sandy Ryza sandy.r...@cloudera.com wrote:

 Another thing to be aware of is that YARN will round up containers to the
 nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults
 to 1024.

 -Sandy

 On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee denny.g@gmail.com wrote:

 Got it - thanks!

 On Sat, Dec 6, 2014 at 14:56 Arun Ahuja aahuj...@gmail.com wrote:

 Hi Denny,

  This is due to the spark.yarn.memoryOverhead parameter. Depending on what
  version of Spark you are on the default may differ, but it should
  be the larger of 1024mb per executor or .07 * executorMemory.

 When you set executor memory, the yarn resource request is
 executorMemory + yarnOverhead.

 - Arun

 On Sat, Dec 6, 2014 at 4:27 PM, Denny Lee denny.g@gmail.com wrote:

 This is perhaps more of a YARN question than a Spark question but i was
 just curious to how is memory allocated in YARN via the various
 configurations.  For example, if I spin up my cluster with 4GB with a
 different number of executors as noted below

  4GB executor-memory x 10 executors = 46GB  (4GB x 10 = 40 + 6)
  4GB executor-memory x 4 executors = 19GB (4GB x 4 = 16 + 3)
  4GB executor-memory x 2 executors = 10GB (4GB x 2 = 8 + 2)

 The pattern when observing the RM is that there is a container for each
 executor and one additional container.  From the basis of memory, it looks
 like there is an additional (1GB + (0.5GB x # executors)) that is allocated
 in YARN.

 Just wondering why is this  - or is this just an artifact of YARN
 itself?

 Thanks!






RE: Learning rate or stepsize automation

2014-12-09 Thread Bui, Tri
Thanks!  Will try it out.

From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Monday, December 08, 2014 5:13 PM
To: Bui, Tri
Cc: user@spark.apache.org
Subject: Re: Learning rate or stepsize automation

Hi Bui,

Please use BFGS based solvers...For BFGS you don't have to specify step size 
since the line search will find sufficient decrease each time...

Regularization you still have to do grid search...it's not possible to automate 
that but on master you will find nice ways to automate grid search...
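For example, for classification the L-BFGS based trainer looks roughly like this (a sketch with the Spark 1.1-era MLlib API; for other models you plug the LBFGS optimizer in the same way, and regParam still comes from your grid search):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// 'training' is an RDD[LabeledPoint] you already have
def trainWithLBFGS(training: RDD[LabeledPoint]) = {
  // no stepSize to set here; the line search inside L-BFGS takes care of it
  new LogisticRegressionWithLBFGS().run(training)
}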

Thanks.
Deb


On Mon, Dec 8, 2014 at 3:04 PM, Bui, Tri 
tri@verizonwireless.com.invalidmailto:tri@verizonwireless.com.invalid
 wrote:
Hi,

Is there any way to auto calculate the optimum learning rate or stepsize via 
MLLIB for SGD ?

Thx
tri



Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
Hi Cristovao,

I have seen a very similar issue that I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
I think your main issue here is somewhat similar, in that the MapWrapper
Scala class is not registered. This gets registered by the Twitter
chill-scala AllScalaRegistrar class that you are currently not using.

As far as I understand, in order to use Avro with Spark, you also have to
use Kryo. This means you have to use the Spark KryoSerializer. This in turn
uses Twitter chill. I posted the basic code that I am using here:

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

Maybe there is a simpler solution to your problem but I am not that much of
an expert yet. I hope this helps.
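Concretely, the kind of registrator I mean looks roughly like this (a sketch only: MyAvroRecord is a placeholder for your generated Avro specific record class, and the chill-avro serializer name is from the 0.4.0 API as far as I recall, so treat it as an assumption):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}
import com.twitter.chill.avro.AvroSerializer

// register the Avro record class with a chill-avro serializer
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyAvroRecord],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroRecord])
  }
}

// and point Spark at it
val conf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)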

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Hi Simone,

 thanks but I don't think that's it.
 I've tried several libraries within the --jar argument. Some do give what
 you said. But other times (when I put the right version I guess) I get the
 following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd since I am reading a Avro I wrote...with the same piece of
 code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with AVRO. If you are using Hadoop 2,
 you will need to get the hadoop 2 version of avro-mapred. In Maven you
 would do this with the <classifier>hadoop2</classifier> tag.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught
 exception
 in thread Thread[Executor task launch worker-0,5,main]
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 

Re: reg JDBCRDD code

2014-12-09 Thread Akhil Das
Hi Deepa,

In Scala, You will do something like
https://gist.github.com/akhld/ccafb27f098163bea622

With Java API's it will be something like
https://gist.github.com/akhld/0d9299aafc981553bc34
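In case the gist link goes stale, the Scala version boils down to roughly this (a sketch; the table, columns, JDBC URL and credentials are made up):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext("local[2]", "JdbcRddExample")

val rows = new JdbcRDD(
  sc,
  () => {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
  },
  // the query must carry exactly two '?' placeholders for the partition bounds
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  1, 100, 3, // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))

rows.collect().foreach(println)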



Thanks
Best Regards

On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com
wrote:

 Hi All,
 I am new to Spark. I tried to connect to MySQL using Spark and want to write
 the code in Java, but I am
 getting a runtime exception. I guess that the issue is with the Function0
 and Function1 objects being passed to JdbcRDD.

 I tried my level best and attached the code; can you please help us fix
 the issue?



 Thanks
 Deepa

 =-=-=
 Notice: The information contained in this e-mail
 message and/or attachments to it may contain
 confidential or privileged information. If you are
 not the intended recipient, any dissemination, use,
 review, distribution, printing or copying of the
 information contained in this e-mail message
 and/or attachments to it are strictly prohibited. If
 you have received this communication in error,
 please notify us by reply e-mail or telephone and
 immediately and permanently delete the message
 and any attachments. Thank you



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
You can use this Maven dependency:

<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>chill-avro</artifactId>
    <version>0.4.0</version>
</dependency>

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Thanks for the reply!

 I've in fact tried your code, but I lack the Twitter chill package and I
 cannot find it online. So I am now trying this:
 http://spark.apache.org/docs/latest/tuning.html#data-serialization . But
 in case I can't do it, could you tell me where to get that Twitter package
 you used?

 Thanks

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 09 December 2014 16:42
 *To:* Cristovao Jose Domingues Cordeiro; user

 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   Hi Cristovao,

 I have seen a very similar issue that I have posted about in this thread:

 http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
  I think your main issue here is somewhat similar, in that the MapWrapper
 Scala class is not registered. This gets registered by the Twitter
 chill-scala AllScalaRegistrar class that you are currently not using.

  As far as I understand, in order to use Avro with Spark, you also have
 to use Kryo. This means you have to use the Spark KryoSerializer. This in
 turn uses Twitter chill. I posted the basic code that I am using here:


 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

  Maybe there is a simpler solution to your problem but I am not that much
 of an expert yet. I hope this helps.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
 cristovao.corde...@cern.ch wrote:

  Hi Simone,

 thanks but I don't think that's it.
 I've tried several libraries within the --jar argument. Some do give what
 you said. But other times (when I put the right version I guess) I get the
 following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd since I am reading a Avro I wrote...with the same piece of
 code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with AVRO. If you are using Hadoop 2,
 you will need to get the hadoop 2 version of avro-mapred. In Maven you
 would do this with the <classifier>hadoop2</classifier> tag.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0
 (TID 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at 

Re: PySpark elasticsearch question

2014-12-09 Thread Mohamed Lrhazi
Thanks Nick... still no luck.

If I use ?q=somerandomchars&fields=title,_source

I get an exception about an empty collection, which seems to indicate it is
actually using the supplied es.query, but somehow when I do rdd.take(1) or
take(10), all I get is the id and an empty dict... maybe
something to do with how my index is set up in ES?

In [19]: es_rdd.take(4)
14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at
PythonRDD.scala:300
14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at
PythonRDD.scala:300) with 1 output partitions (allowLocal=true)
14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at
PythonRDD.scala:300)
14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List()
14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List()
14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at
RDD at PythonRDD.scala:43), which has no missing parents
14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with
curMem=1979220, maxMem=278302556
14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in
memory (estimated size 4.7 KB, free 263.5 MB)
14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage
18 (PythonRDD[30] at RDD at PythonRDD.scala:43)
14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks
14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0 (TID
19, localhost, ANY, 24823 bytes)
14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19)
14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit
[node=[VKgl4LAgRZyFaSopAWQL5Q/rap-es2-12|141.161.88.237:9200],shard=2]
14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id...
14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init = 284,
finish = 0
14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed (
141.161.88.237:9200); selected next node [141.161.88.233:9200]
14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19).
1886 bytes result sent to driver
14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0 (TID
19) in 316 ms on localhost (1/1)
14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks
have all completed, from pool
14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at
PythonRDD.scala:300) finished in 0.324 s
14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at
PythonRDD.scala:300, took 0.337848207 s
Out[19]:
[(u'en_20040726_fbis_116728340038', {}),
 (u'en_20040726_fbis_116728320448', {}),
 (u'en_20040726_fbis_116728330192', {}),
 (u'en_20040726_fbis_116728330145', {})]

In [20]:



On Tue, Dec 9, 2014 at 10:18 AM, Nick  wrote:

 try es.query something like ?q=*&fields=title,_source for a match-all
 query. You need the q=*, which is actually the query part of the query.

 On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi 
 mohamed.lrh...@georgetown.edu wrote:

 Hello,

 Following a couple of tutorials, I cant seem to get pysprak to get any
 fields from ES other than the document id?

 I tried like so:

 es_rdd =
 sc.newAPIHadoopRDD(inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat,keyClass=org.apache.hadoop.io.NullWritable,valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable,conf={
 es.resource : en_2004/doc,es.nodes:rap-es2.uis,es.query :
 ?fields=title,_source })

 es_rdd.take(1)

 Always shows:

 Out[13]: [(u'en_20040726_fbis_116728340038', {})]

 How does one get more fields?


 Thanks,
 Mohamed.





Re: PySpark elasticsearch question

2014-12-09 Thread Mohamed Lrhazi
found a format that worked, kind of accidentally:

"es.query" : '{"query":{"match_all":{}},"fields":["title","_source"]}'

Thanks,
Mohamed.


On Tue, Dec 9, 2014 at 11:27 AM, Mohamed Lrhazi 
mohamed.lrh...@georgetown.edu wrote:

 Thanks Nick... still no luck.

 If I use ?q=somerandomcharsfields=title,_source

 I get an exception about empty collection, which seems to indicate it is
 actually using the supplied es.query, but somehow when I do rdd.take(1)
 or take(10), all I get is the id and an empty dict, apparently... maybe
 something to do how my index is setup in ES ?

 In [19]: es_rdd.take(4)
 14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at
 PythonRDD.scala:300
 14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at
 PythonRDD.scala:300) with 1 output partitions (allowLocal=true)
 14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at
 PythonRDD.scala:300)
 14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List()
 14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List()
 14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at
 RDD at PythonRDD.scala:43), which has no missing parents
 14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with
 curMem=1979220, maxMem=278302556
 14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in
 memory (estimated size 4.7 KB, free 263.5 MB)
 14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage
 18 (PythonRDD[30] at RDD at PythonRDD.scala:43)
 14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks
 14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0
 (TID 19, localhost, ANY, 24823 bytes)
 14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19)
 14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit
 [node=[VKgl4LAgRZyFaSopAWQL5Q/rap-es2-12|141.161.88.237:9200],shard=2]
 14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id...
 14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init =
 284, finish = 0
 14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed (
 141.161.88.237:9200); selected next node [141.161.88.233:9200]
 14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19).
 1886 bytes result sent to driver
 14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0
 (TID 19) in 316 ms on localhost (1/1)
 14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose
 tasks have all completed, from pool
 14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at
 PythonRDD.scala:300) finished in 0.324 s
 14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at
 PythonRDD.scala:300, took 0.337848207 s
 Out[19]:
 [(u'en_20040726_fbis_116728340038', {}),
  (u'en_20040726_fbis_116728320448', {}),
  (u'en_20040726_fbis_116728330192', {}),
  (u'en_20040726_fbis_116728330145', {})]

 In [20]:



 On Tue, Dec 9, 2014 at 10:18 AM, Nick  wrote:

 try es.query something like ?q=*fields=title,_source for a match all
 query. you need the q=* which is actually the query part of the query

 On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi 
 mohamed.lrh...@georgetown.edu wrote:

 Hello,

 Following a couple of tutorials, I cant seem to get pysprak to get any
 fields from ES other than the document id?

 I tried like so:

 es_rdd =
 sc.newAPIHadoopRDD(inputFormatClass=org.elasticsearch.hadoop.mr.EsInputFormat,keyClass=org.apache.hadoop.io.NullWritable,valueClass=org.elasticsearch.hadoop.mr.LinkedMapWritable,conf={
 es.resource : en_2004/doc,es.nodes:rap-es2.uis,es.query :
 ?fields=title,_source })

 es_rdd.take(1)

 Always shows:

 Out[13]: [(u'en_20040726_fbis_116728340038', {})]

 How does one get more fields?


 Thanks,
 Mohamed.






Re: NoSuchMethodError: writing spark-streaming data to cassandra

2014-12-09 Thread m.sarosh
Hi,

@Gerard - Thanks, I added one more dependency and set
conf.set("spark.cassandra.connection.host", "localhost").

Now I am able to create a connection, but the data is not getting inserted
into the Cassandra table,
and the logs show it getting connected and then getting
disconnected the next second.
The full code, the logs and the dependencies are below:

public class SparkStream {
    static int key = 0;

    public static void main(String args[]) throws Exception {

        if (args.length != 3) {
            System.out.println("parameters not given properly");
            System.exit(1);
        }

        Logger.getLogger("org").setLevel(Level.OFF);
        Logger.getLogger("akka").setLevel(Level.OFF);
        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        String[] topic = args[2].split(",");
        for (String t : topic) {
            topicMap.put(t, new Integer(3));
        }

        /* Connection to Spark */
        SparkConf conf = new SparkConf();
        conf.set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext("local[4]", "SparkStream", conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(5000));

        /* connection to cassandra */
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        System.out.println("+++cassandra connector created");

        /* Receive Kafka streaming inputs */
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
        System.out.println("+streaming Connection done!+++");

        /* Create DStream */
        JavaDStream<TestTable> data = messages.map(new Function<Tuple2<String, String>, TestTable>() {
            public TestTable call(Tuple2<String, String> message) {
                return new TestTable(new Integer(++key), message._2());
            }
        });
        System.out.println("JavaDStream<TestTable> created");

        /* Write to cassandra */
        javaFunctions(data).writerBuilder("testkeyspace", "test_table",
                mapToRow(TestTable.class)).saveToCassandra();

        jssc.start();
        jssc.awaitTermination();
    }
}

class TestTable implements Serializable {
    Integer key;
    String value;

    public TestTable() {}

    public TestTable(Integer k, String v) {
        key = k;
        value = v;
    }

    public Integer getKey() {
        return key;
    }

    public void setKey(Integer k) {
        key = k;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String v) {
        value = v;
    }

    public String toString() {
        return MessageFormat.format("TestTable'{'key={0}, value={1}'}'", key, value);
    }
}

The log is:
+++cassandra connector created
+streaming Connection done!+++
JavaDStreamTestTable created
14/12/09 12:07:33 INFO core.Cluster: New Cassandra host 
localhost/127.0.0.1:9042 added
14/12/09 12:07:33 INFO cql.CassandraConnector: Connected to Cassandra cluster: 
Test Cluster
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 
127.0.0.1 (datacenter1)
14/12/09 12:07:33 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 
127.0.0.1 (datacenter1)
14/12/09 12:07:34 INFO cql.CassandraConnector: Disconnected from Cassandra 
cluster: Test Cluster

14/12/09 12:07:45 INFO core.Cluster: New Cassandra host 
localhost/127.0.0.1:9042 added
14/12/09 12:07:45 INFO cql.CassandraConnector: Connected to Cassandra cluster: 
Test Cluster
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 
127.0.0.1 (datacenter1)
14/12/09 12:07:45 INFO cql.LocalNodeFirstLoadBalancingPolicy: Adding host 
127.0.0.1 (datacenter1)
14/12/09 12:07:46 INFO cql.CassandraConnector: Disconnected from Cassandra 
cluster: Test Cluster

The POM.xml dependencies are:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.1.0</version>
</dependency>

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.1</version>
</dependency>

<dependency>
    <groupId>com.msiops.footing</groupId>
    <artifactId>footing-tuple</artifactId>
    <version>0.2</version>
</dependency>

<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>2.1.3</version>
</dependency>



Thanks and Regards,

Md. Aiman Sarosh.
Accenture Services Pvt. Ltd.
Mob #:  (+91) - 9836112841.

From: Gerard Maas gerard.m...@gmail.com
Sent: Tuesday, December 9, 2014 4:39 PM
To: Sarosh, M.
Cc: spark users
Subject: Re: NoSuchMethodError: writing spark-streaming data to 

pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-09 Thread Gautham
I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5
r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I do
sc.textFile to load data from a number of gz files, it does not progress as
fast as expected. When I log in to a child node and run top, I see only 4
threads at 100% CPU, with the remaining 28 cores idle. This is not an issue
when processing the strings after loading, when all the cores are used to
process the data.

Please help me with this? What setting can be changed to get the CPU usage
back up to full?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-12-09 Thread Michael Armbrust
Not yet unfortunately.  You could cast the timestamp to a long if you don't
need nanosecond precision.

Here is a related JIRA: https://issues.apache.org/jira/browse/SPARK-4768
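For example, something along these lines (a sketch going through HiveQL, where CAST(... AS BIGINT) yields epoch seconds; the file and column names are made up):

val events = hiveContext.jsonFile("events.json")
events.registerTempTable("events")
// cast the timestamp down to a long before writing
val casted = hiveContext.sql("SELECT id, CAST(event_ts AS BIGINT) AS event_ts_sec FROM events")
casted.saveAsParquetFile("/tmp/events.parquet")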

On Mon, Dec 8, 2014 at 11:27 PM, ZHENG, Xu-dong dong...@gmail.com wrote:

 I meet the same issue. Any solution?

 On Wed, Nov 12, 2014 at 2:54 PM, tridib tridib.sama...@live.com wrote:

 Hi Friends,
 I am trying to save a json file to parquet. I got error Unsupported
 datatype TimestampType.
 Is not parquet support date? Which parquet version does spark uses? Is
 there
 any work around?


 Here the stacktrace:

 java.lang.RuntimeException: Unsupported datatype TimestampType
 at scala.sys.package$.error(package.scala:27)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
 at scala.Option.getOrElse(Option.scala:120)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2$$anonfun$3.apply(ParquetTypes.scala:320)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:319)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
 at scala.Option.getOrElse(Option.scala:120)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
 at

 org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
 at

 org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:151)
 at

 org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:130)
 at

 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
 at

 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at

 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at

 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at

 org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
 at

 org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)

 Thanks  Regards
 Tridib




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-save-to-Parquet-file-Unsupported-datatype-TimestampType-tp18691.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 郑旭东
 ZHENG, Xu-dong



PySpark and UnsupportedOperationException

2014-12-09 Thread Mohamed Lrhazi
While trying simple examples of PySpark code, I systematically get these
failures when I try this. I don't see any prior exceptions in the output.
How can I debug further to find the root cause?


es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "en_2014/doc",
        "es.nodes": "rap-es2",
        "es.query": '{"query":{"match_all":{}},"fields":["title"], "size": 100}'
    }
)


titles=es_rdd.map(lambda d: d[1]['title'][0])
counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
1)).reduceByKey(add)


output = counts.collect()



...
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped
from memory (free 274984768)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391
dropped from memory (free 275148159)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92
14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91
14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91
14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391
dropped from memory (free 275311550)
14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91
14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID
72)
java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at java.util.AbstractMap.putAll(AbstractMap.java:273)
at
org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373)
at
org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322)
at
org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299)
at
org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
at
scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
at
scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
14/12/09 19:27:30 INFO TaskSetManager: Starting task 2.0 in stage 67.0 (TID
74, localhost, ANY, 26266 bytes)
14/12/09 19:27:30 INFO Executor: Running task 2.0 in stage 67.0 (TID 74)
14/12/09 19:27:30 WARN TaskSetManager: Lost task 0.0 in stage 67.0 (TID 72,
localhost): java.lang.UnsupportedOperationException:


Re: PhysicalRDD problem?

2014-12-09 Thread Michael Armbrust

 val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
 existingSchemaRDD.schema)


This line is throwing away the logical information about existingSchemaRDD
and thus Spark SQL can't know how to push down projections or predicates
past this operator.

Can you describe in more detail the problems that you see if you don't do this
reapplication of the schema?


Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-09 Thread Michael Armbrust
You might also try out the recently added support for views.
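For example, a view that renames the Pig-generated columns might look like this (a sketch, assuming a HiveContext's sql is in scope):

// expose the Pig "sorted::" columns under clean names through a view
sql("""
  CREATE VIEW pmt_clean AS
  SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts
  FROM pmt
""")
sql("SELECT cre_ts FROM pmt_clean LIMIT 1").collect()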

On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ah... I see. Thanks for pointing it out.

 Then it means we cannot mount external table using customized column
 names. hmm...

 Then the only option left is to use a subquery to add a bunch of column
 aliases. I'll try it later.

 Thanks,
 Jianshi

 On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com
 wrote:

 This is by hive's design.  From the Hive documentation:

 The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.



 On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is after renaming the column names, the value of the columns
 became all NULL.

 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found doing this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
 later broadcast join and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Very interesting, the line doing drop table will throws an exception.
 After removing it all works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 Here's the solution I got after talking with Liancheng:

 1) using backquote `..` to wrap up all illegal characters

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

 val ddl_13 = s"""
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
   """.stripMargin

 sql(ddl_13)

 2) create a new Schema and do applySchema to generate a new
 SchemaRDD, had to drop and register table

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
   s.name.replaceAll(".*?::", ""))))
 sql(s"drop table $name")
 applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang 
 jianshi.hu...@gmail.com wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet
 files for Spark to consume.

 However, due to Pig's limitation, the generated schema contains
 Pig's identifier

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
 sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work, I also tried removing the identifier
 sorted::, but the resulting rows contain only nulls.

 Any idea how to create a table in HiveContext from these Parquet
 files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



yarn log on EMR

2014-12-09 Thread Tyson

Dear all,

I would like to run a simple spark job on EMR with yarn.

My job is as follows:

public void EMRRun() {
    SparkConf sparkConf = new SparkConf().setAppName("RunEMR").setMaster("yarn-cluster");
    sparkConf.set("spark.executor.memory", "13000m");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    System.out.println(ctx.appName());

    List<Integer> list = new LinkedList<Integer>();
    for (int i = 0; i < 1; i++) {
        list.add(i);
    }

    JavaRDD<Integer> listRDD = ctx.parallelize(list);
    List<Integer> results = listRDD.collect();

    for (Integer i : results) {
        System.out.println(i);
    }

    ctx.stop();
}

public static void main(String[] args) {
    SparkTest sp = new SparkTest();
    sp.EMRRun();
}


On EMR I run the spark with spark-submit with the following:

./spark-submit --class 
com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest 
--master yarn-cluster --executor-memory 512m --num-executors 10 
/home/hadoop/MLyBigData.jar


After that finished I tried to see yarn log, but I got this:
 yarn logs -applicationId application_1418123020170_0032
14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at 
/172.31.3.155:9022

Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032
Log aggregation has not completed or is not enabled.

But I modified the yarn-site.xml as:
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>30</value></property>

I use AMI version of 3.2.3, spark 1.1.0 on hadoop 2.4

Any suggestions how can I see the logs of the yarn?
Thanks,
Istvan


Caching RDDs with shared memory - bug or feature?

2014-12-09 Thread insperatum
If all RDD elements within a partition contain pointers to a single shared
object, Spark persists as expected when the RDD is small. However, if the
RDD is more than *200 elements* then Spark reports requiring much more
memory than it actually does. This becomes a problem for large RDDs, as
Spark refuses to persist even though it can. Is this a bug or is there a
feature that I'm missing?
Cheers, Luke

val n = ???
class Elem(val s: Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
    val sharedArray = Array.ofDim[Int](10000000)   // Should require ~40MB
    (1 to n).toIterator.map(_ => new Elem(sharedArray))
}).cache().count()   // force computation

For n = 100:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.1 MB, free 898.7 MB)
For n = 200:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 898.7 MB)
For n = 201:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 860.2 MB)
For n = 5000: MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 781.3 MB so far)

Note: For medium-sized n (where n > 200 but Spark can still cache), the actual
application memory still stays where it should - Spark just seems to vastly
overreport how much memory it's using.
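
Not an answer to why the estimate jumps at 200 elements, but one way to sidestep it (a sketch, assuming the shared array is read-only and a SparkContext sc is available): keep the big buffer in a broadcast variable so the cached elements themselves stay tiny:

// Hypothetical reworking of the snippet above: the ~40MB buffer is shipped once
// per executor via a broadcast variable instead of being referenced by every element.
val n = 5000
val shared = sc.broadcast(Array.ofDim[Int](10000000))
case class ElemRef(idx: Int)                            // cached elements carry no big payload
val rdd = sc.parallelize(1 to n).map(i => ElemRef(i)).cache()
rdd.map(e => shared.value(e.idx % shared.value.length)).count()   // tasks read the shared array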



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-RDDs-with-shared-memory-bug-or-feature-tp20596.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark shell and hive context problem

2014-12-09 Thread minajagi
Hi I'm working on Spark that comes with CDH 5.2.0

I'm trying to get a hive context in the shell and I'm running into problems
that I don't understand.

I have added hive-site.xml to the conf folder under /usr/lib/spark/conf as
indicated elsewhere


Here is what I see.Pls help

---
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to
term hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
error: 
 while compiling: console
during phase: erasure
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: 

  last tree to typer: Apply(value $outer)
  symbol: value $outer (flags: method synthetic stable
expandedname triedcooking)
   symbol definition: val $outer(): $iwC.$iwC.type
 tpe: $iwC.$iwC.type
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC
- class $read - package $line9
  context owners: class $iwC - class $iwC - class $iwC - class $iwC
- class $read - package $line9

== Enclosing template or block ==

ClassDef( // class $iwC extends Serializable
  0
  $iwC
  []
  Template( // val local $iwC: notype, tree.tpe=$iwC
java.lang.Object, scala.Serializable // parents
ValDef(
  private
  _
  tpt
  empty
)
// 5 statements
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
  method triedcooking
  init
  []
  // 1 parameter list
  ValDef( // $outer: $iwC.$iwC.$iwC.type

$outer
tpt // tree.tpe=$iwC.$iwC.$iwC.type
empty
  )
  tpt // tree.tpe=$iwC
  Block( // tree.tpe=Unit
Apply( // def init(): Object in class Object, tree.tpe=Object
  $iwC.super.init // def init(): Object in class Object,
tree.tpe=()Object
  Nil
)
()
  )
)
ValDef( // private[this] val hiveCtx:
org.apache.spark.sql.hive.HiveContext
  private local triedcooking
  hiveCtx 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  Apply( // def init(sc: org.apache.spark.SparkContext):
org.apache.spark.sql.hive.HiveContext in class HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
new org.apache.spark.sql.hive.HiveContext.init // def init(sc:
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in
class HiveContext, tree.tpe=(sc:
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
Apply( // val sc(): org.apache.spark.SparkContext,
tree.tpe=org.apache.spark.SparkContext
 
$iwC.this.$line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
// val sc(): org.apache.spark.SparkContext,
tree.tpe=()org.apache.spark.SparkContext
  Nil
)
  )
)
DefDef( // val hiveCtx(): org.apache.spark.sql.hive.HiveContext
  method stable accessor
  hiveCtx
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  $iwC.this.hiveCtx  // private[this] val hiveCtx:
org.apache.spark.sql.hive.HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
)
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
  protected synthetic paramaccessor triedcooking
  $outer 
  tpt // tree.tpe=$iwC.$iwC.$iwC.type
  empty
)
DefDef( // val $outer(): $iwC.$iwC.$iwC.type
  method synthetic stable expandedname triedcooking
  $line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer
  []
  List(Nil)
  tpt // tree.tpe=Any
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type,
tree.tpe=$iwC.$iwC.$iwC.type
)
  )
)

== Expanded type of tree ==

ThisType(class $iwC)

uncaught exception during compilation:
scala.reflect.internal.Types$TypeError
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature
in HiveContext.class refers to term conf
in value org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.
[y/n]




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-and-hive-context-problem-tp20597.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 

Re: spark shell and hive context problem

2014-12-09 Thread Marcelo Vanzin
Hello,

In CDH 5.2 you need to manually add Hive classes to the classpath of
your Spark job if you want to use the Hive integration. Also, be aware
that since Spark 1.1 doesn't really support the version of Hive
shipped with CDH 5.2, this combination is to be considered extremely
experimental.

On Tue, Dec 9, 2014 at 2:07 PM, minajagi chetan.v.minaj...@jpmorgan.com wrote:
 Hi I'm working on Spark that comes with CDH 5.2.0

 I'm trying to get a hive context in the shell and I'm running into problems
 that I don't understand.

 I have added hive-site.xml to the conf folder under /usr/lib/spark/conf as
 indicated elsewhere


 Here is what I see.Pls help

 ---
 scala import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.hive.HiveContext

 scala val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
 error: bad symbolic reference. A signature in HiveContext.class refers to
 term hive
 in package org.apache.hadoop which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling
 HiveContext.class.
 error:
  while compiling: console
 during phase: erasure
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args:

   last tree to typer: Apply(value $outer)
   symbol: value $outer (flags: method synthetic stable
 expandedname triedcooking)
symbol definition: val $outer(): $iwC.$iwC.type
  tpe: $iwC.$iwC.type
symbol owners: value $outer - class $iwC - class $iwC - class $iwC
 - class $read - package $line9
   context owners: class $iwC - class $iwC - class $iwC - class $iwC
 - class $read - package $line9

 == Enclosing template or block ==

 ClassDef( // class $iwC extends Serializable
   0
   $iwC
   []
   Template( // val local $iwC: notype, tree.tpe=$iwC
 java.lang.Object, scala.Serializable // parents
 ValDef(
   private
   _
   tpt
   empty
 )
 // 5 statements
 DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
   method triedcooking
   init
   []
   // 1 parameter list
   ValDef( // $outer: $iwC.$iwC.$iwC.type

 $outer
 tpt // tree.tpe=$iwC.$iwC.$iwC.type
 empty
   )
   tpt // tree.tpe=$iwC
   Block( // tree.tpe=Unit
 Apply( // def init(): Object in class Object, tree.tpe=Object
   $iwC.super.init // def init(): Object in class Object,
 tree.tpe=()Object
   Nil
 )
 ()
   )
 )
 ValDef( // private[this] val hiveCtx:
 org.apache.spark.sql.hive.HiveContext
   private local triedcooking
   hiveCtx 
   tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
   Apply( // def init(sc: org.apache.spark.SparkContext):
 org.apache.spark.sql.hive.HiveContext in class HiveContext,
 tree.tpe=org.apache.spark.sql.hive.HiveContext
 new org.apache.spark.sql.hive.HiveContext.init // def init(sc:
 org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in
 class HiveContext, tree.tpe=(sc:
 org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
 Apply( // val sc(): org.apache.spark.SparkContext,
 tree.tpe=org.apache.spark.SparkContext

 $iwC.this.$line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
 // val sc(): org.apache.spark.SparkContext,
 tree.tpe=()org.apache.spark.SparkContext
   Nil
 )
   )
 )
 DefDef( // val hiveCtx(): org.apache.spark.sql.hive.HiveContext
   method stable accessor
   hiveCtx
   []
   List(Nil)
   tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
   $iwC.this.hiveCtx  // private[this] val hiveCtx:
 org.apache.spark.sql.hive.HiveContext,
 tree.tpe=org.apache.spark.sql.hive.HiveContext
 )
 ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
   protected synthetic paramaccessor triedcooking
   $outer 
   tpt // tree.tpe=$iwC.$iwC.$iwC.type
   empty
 )
 DefDef( // val $outer(): $iwC.$iwC.$iwC.type
   method synthetic stable expandedname triedcooking
   $line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer
   []
   List(Nil)
   tpt // tree.tpe=Any
   $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type,
 tree.tpe=$iwC.$iwC.$iwC.type
 )
   )
 )

 == Expanded type of tree ==

 ThisType(class $iwC)

 uncaught exception during compilation:
 scala.reflect.internal.Types$TypeError
 scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature
 in HiveContext.class refers to term conf
 in value org.apache.hadoop.hive which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when 

spark shell session crashes when trying to obtain hive context

2014-12-09 Thread minajagi
Hi I'm working on Spark that comes with CDH 5.2.0

I'm trying to get a hive context in the shell and I'm running into problems
that I don't understand.

I have added hive-site.xml to the conf folder under /usr/lib/spark/conf as
indicated elsewhere


Here is what I see.Pls help

---
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to
term hive
in package org.apache.hadoop which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
error: 
 while compiling: console
during phase: erasure
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: 

  last tree to typer: Apply(value $outer)
  symbol: value $outer (flags: method synthetic stable
expandedname triedcooking)
   symbol definition: val $outer(): $iwC.$iwC.type
 tpe: $iwC.$iwC.type
   symbol owners: value $outer - class $iwC - class $iwC - class $iwC
- class $read - package $line9
  context owners: class $iwC - class $iwC - class $iwC - class $iwC
- class $read - package $line9

== Enclosing template or block ==

ClassDef( // class $iwC extends Serializable
  0
  $iwC
  []
  Template( // val local $iwC: notype, tree.tpe=$iwC
java.lang.Object, scala.Serializable // parents
ValDef(
  private
  _
  tpt
  empty
)
// 5 statements
DefDef( // def init(arg$outer: $iwC.$iwC.$iwC.type): $iwC
  method triedcooking
  init
  []
  // 1 parameter list
  ValDef( // $outer: $iwC.$iwC.$iwC.type

$outer
tpt // tree.tpe=$iwC.$iwC.$iwC.type
empty
  )
  tpt // tree.tpe=$iwC
  Block( // tree.tpe=Unit
Apply( // def init(): Object in class Object, tree.tpe=Object
  $iwC.super.init // def init(): Object in class Object,
tree.tpe=()Object
  Nil
)
()
  )
)
ValDef( // private[this] val hiveCtx:
org.apache.spark.sql.hive.HiveContext
  private local triedcooking
  hiveCtx 
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  Apply( // def init(sc: org.apache.spark.SparkContext):
org.apache.spark.sql.hive.HiveContext in class HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
new org.apache.spark.sql.hive.HiveContext.init // def init(sc:
org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in
class HiveContext, tree.tpe=(sc:
org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext
Apply( // val sc(): org.apache.spark.SparkContext,
tree.tpe=org.apache.spark.SparkContext
 
$iwC.this.$line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$iwC$$$outer().$line9$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc
// val sc(): org.apache.spark.SparkContext,
tree.tpe=()org.apache.spark.SparkContext
  Nil
)
  )
)
DefDef( // val hiveCtx(): org.apache.spark.sql.hive.HiveContext
  method stable accessor
  hiveCtx
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext
  $iwC.this.hiveCtx  // private[this] val hiveCtx:
org.apache.spark.sql.hive.HiveContext,
tree.tpe=org.apache.spark.sql.hive.HiveContext
)
ValDef( // protected val $outer: $iwC.$iwC.$iwC.type
  protected synthetic paramaccessor triedcooking
  $outer 
  tpt // tree.tpe=$iwC.$iwC.$iwC.type
  empty
)
DefDef( // val $outer(): $iwC.$iwC.$iwC.type
  method synthetic stable expandedname triedcooking
  $line9$$read$$iwC$$iwC$$iwC$$iwC$$$outer
  []
  List(Nil)
  tpt // tree.tpe=Any
  $iwC.this.$outer  // protected val $outer: $iwC.$iwC.$iwC.type,
tree.tpe=$iwC.$iwC.$iwC.type
)
  )
)

== Expanded type of tree ==

ThisType(class $iwC)

uncaught exception during compilation:
scala.reflect.internal.Types$TypeError
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature
in HiveContext.class refers to term conf
in value org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.
[y/n]




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-session-crashes-when-trying-to-obtain-hive-context-tp20598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, 

implement query to sparse vector representation in spark

2014-12-09 Thread Huang,Jin
I know quite a lot about machine learning, but new to scala and spark. Got 
stuck due to Spark API, so please advise.

I have a txt file where each line is formatted like this:

#label \t   # query, a string of words, delimited by space
1  wireless amazon kindle

2  apple iPhone 5

1  kindle fire 8G

2  apple iPad
The first field is the label, the second field is the query string. My plan is to split the
data into label and feature, transform the string into a sparse vector using the
built-in Word2Vec (I assume it builds a bag-of-words dictionary first),
then train a classifier with SVMWithSGD.

object QueryClassification {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Query Classification").setMaster("local")
    val sc = new SparkContext(conf)
    val input = sc.textFile("spark_data.txt")

    val word2vec = new Word2Vec()

    val parsedData = input.map { line =>
      val parts = line.split("\t")

      ## How to write code here? I need to parse into feature vector
      ## properly and then apply word2vec function after the map
      LabeledPoint(parts(0).toDouble, )
    }

    ## * is the item I got from parsing parts(1) above
    word2vec.fit(*)

    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)
  }
}
Thanks a lot
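
For what it's worth, a minimal sketch of one way to fill in that gap, swapping MLlib's HashingTF (a plain hashed bag-of-words featurizer, available since Spark 1.1) in place of Word2Vec, and remapping the 1/2 labels to the 0/1 labels SVMWithSGD expects. The feature count and file path are assumptions, and sc is the context created above:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

val tf = new HashingTF(10000)                      // hashed bag-of-words features
val parsedData = sc.textFile("spark_data.txt").map { line =>
  val parts = line.split("\t")
  LabeledPoint(parts(0).toDouble - 1.0,            // remap labels 1/2 -> 0/1
    tf.transform(parts(1).split(" ").toSeq))
}.cache()

val model = SVMWithSGD.train(parsedData, 20)       // 20 iterations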

equivalent to sql in

2014-12-09 Thread dizzy5112
I have an RDD I want to filter, and for a single term all works well, e.g.:

dataRDD.filter(x => x._2 == "apple")

How can I use multiple values? For example, if I wanted to filter my RDD to
take out apples and oranges and pears, without chaining conditions, this could
get long-winded as there may be quite a few. Can you filter using a set or a
list?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: equivalent to sql in

2014-12-09 Thread Malte
This is more a scala specific question. I would look at the List contains
implementation



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599p20600.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: query classification using spark Mlib

2014-12-09 Thread sparkEbay
the format is bad, the question link is here 
http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib
http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/query-classification-using-spark-Mlib-tp20601p20602.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-12-09 Thread Judy Nash
To report back how I ultimately solved this issue, in case it helps someone else:

1) Check each jar on the classpath and make sure the jars are listed in order of
their Guava class version (i.e. spark-assembly needs to be listed before Hadoop 2.4,
because spark-assembly has Guava 14 and Hadoop 2.4 has Guava 11). This may require
updating compute-classpath.sh to get the ordering right.

2) If the other jars use a higher version, bump Spark's Guava library to the higher
version. Guava is supposed to be very backward compatible.

Hope this helps. 

-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: Tuesday, December 2, 2014 11:35 AM
To: Judy Nash
Cc: Patrick Wendell; Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
Subject: Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on 
Guava

On Tue, Dec 2, 2014 at 11:22 AM, Judy Nash judyn...@exchange.microsoft.com 
wrote:
 Any suggestion on how can user with custom Hadoop jar solve this issue?

You'll need to include all the dependencies for that custom Hadoop jar to the 
classpath. Those will include Guava (which is not included in its original form 
as part of the Spark dependencies).



 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Sunday, November 30, 2014 11:06 PM
 To: Judy Nash
 Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava

 Thanks Judy. While this is not directly caused by a Spark issue, it is likely 
 other users will run into this. This is an unfortunate consequence of the way 
 that we've shaded Guava in this release, we rely on byte code shading of 
 Hadoop itself as well. And if the user has their own Hadoop classes present 
 it can cause issues.

 On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash judyn...@exchange.microsoft.com 
 wrote:
 Thanks Patrick and Cheng for the suggestions.

 The issue was Hadoop common jar was added to a classpath. After I removed 
 Hadoop common jar from both master and slave, I was able to bypass the error.
 This was caused by a local change, so no impact on the 1.2 release.
 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Wednesday, November 26, 2014 8:17 AM
 To: Judy Nash
 Cc: Denny Lee; Cheng Lian; u...@spark.incubator.apache.org
 Subject: Re: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava

 Just to double check - I looked at our own assembly jar and I confirmed that 
 our Hadoop configuration class does use the correctly shaded version of 
 Guava. My best guess here is that somehow a separate Hadoop library is 
 ending up on the classpath, possible because Spark put it there somehow.

 tar xvzf spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 cd org/apache/hadoop/
 javap -v Configuration | grep Precond

 Warning: Binary file Configuration contains 
 org.apache.hadoop.conf.Configuration

#497 = Utf8   
 org/spark-project/guava/common/base/Preconditions

#498 = Class  #497 //
 org/spark-project/guava/common/base/Preconditions

#502 = Methodref  #498.#501//
 org/spark-project/guava/common/base/Preconditions.checkArgument:(ZL
 j
 ava/lang/Object;)V

 12: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLj
 a
 va/lang/Object;)V

 50: invokestatic  #502// Method
 org/spark-project/guava/common/base/Preconitions.checkArgument:(ZLj
 a
 va/lang/Object;)V

 On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hi Judy,

 Are you somehow modifying Spark's classpath to include jars from 
 Hadoop and Hive that you have running on the machine? The issue 
 seems to be that you are somehow including a version of Hadoop that 
 references the original guava package. The Hadoop that is bundled in 
 the Spark jars should not do this.

 - Patrick

 On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash 
 judyn...@exchange.microsoft.com wrote:
 Looks like a config issue. I ran spark-pi job and still failing 
 with the same guava error

 Command ran:

 .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class 
 org.apache.spark.examples.SparkPi --master 
 spark://headnodehost:7077 --executor-memory 1G --num-executors 1 
 .\lib\spark-examples-1.2.1-SNAPSHOT-hadoop2.4.0.jar 100



 Had used the same build steps on spark 1.1 and had no issue.



 From: Denny Lee [mailto:denny.g@gmail.com]
 Sent: Tuesday, November 25, 2014 5:47 PM
 To: Judy Nash; Cheng Lian; u...@spark.incubator.apache.org


 Subject: Re: latest Spark 1.2 thrift server fail with 
 NoClassDefFoundError on Guava



 To determine if this is a Windows vs. other configuration, can you 
 just try to call the Spark-class.cmd SparkSubmit without actually 
 referencing the Hadoop or Thrift server classes?





 On Tue Nov 25 2014 at 5:42:09 PM Judy Nash 
 judyn...@exchange.microsoft.com
 wrote:

 I 

Fwd: Please add us to the Spark users page

2014-12-09 Thread Abhik Majumdar
Hi,

My name is Abhik Majumdar and I am a co-founder of Vidora Corp. We use
Spark at Vidora to power our machine learning stack and we are requesting
to be included on your Powered by Spark page:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Here is the information you requested:

*organization name:* Vidora

*URL:* http://www.vidora.com

*a list of which Spark components you are using:* Spark Core, MLlib, Spark
Streaming.

*a short description of your use case:* Vidora personalizes the online
experiences for content companies and provides a platform to tailor and
adapt their consumer experiences to each of their users. Our machine
learning stack, running on Spark, is able to completely personalize the
entire webpage or mobile app for any kind of content with the objective of
optimizing the metric that the customer cares about.


Please let me know if there is any additional information we can provide.

Thanks
Abhik


Abhik Majumdar, Co-Founder Vidora
Website : www.vidora.com
E-mail : ab...@vidora.com
Follow us on Twitter https://twitter.com/#!/vidoracorp or LinkedIn
http://www.linkedin.com/company/vidora
--


RE: equivalent to sql in

2014-12-09 Thread Mohammed Guller
Option 1:
dataRDD.filter(x => (x._2 == "apple") || (x._2 == "orange"))

Option 2:
val fruits = Set("apple", "orange", "pear")
dataRDD.filter(x => fruits.contains(x._2))
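
If the set of values gets large, one variation (a sketch, assuming the same dataRDD of pairs and a SparkContext sc) is to broadcast it, so each executor receives a single copy instead of one copy per task:

val fruitsBC = sc.broadcast(Set("apple", "orange", "pear"))
dataRDD.filter(x => fruitsBC.value.contains(x._2))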

Mohammed


-Original Message-
From: dizzy5112 [mailto:dave.zee...@gmail.com] 
Sent: Tuesday, December 9, 2014 2:16 PM
To: u...@spark.incubator.apache.org
Subject: equivalent to sql in

I have an RDD I want to filter, and for a single term all works well, e.g.:
dataRDD.filter(x => x._2 == "apple")

How can I use multiple values? For example, if I wanted to filter my RDD to take
out apples and oranges and pears, without chaining conditions, this could get
long-winded as there may be quite a few. Can you filter using a set or a list?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/equivalent-to-sql-in-tp20599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: dockerized spark executor on mesos?

2014-12-09 Thread Venkat Subramanian
We have dockerized Spark Master and worker(s) separately and are using it in
our dev environment. We don't use Mesos though, running it in Standalone
mode, but adding Mesos should not be that difficult I think.

Regards

Venkat



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/dockerized-spark-executor-on-mesos-tp20276p20603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Cluster getting a null pointer error

2014-12-09 Thread Eric Tanner
I have set up a cluster on AWS and am trying a really simple hello world
program as a test.  The cluster was built using the ec2 scripts that come
with Spark.  Anyway, I have output the error message (using --verbose)
below.  The source code is further below that.

Any help would be greatly appreciated.

Thanks,

Eric

*Error code:*

r...@ip-xx.xx.xx.xx ~]$ ./spark/bin/spark-submit  --verbose  --class
com.je.test.Hello --master spark://xx.xx.xx.xx:7077
 Hello-assembly-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using properties file: /root/spark/conf/spark-defaults.conf
Adding default property: spark.executor.memory=5929m
Adding default property:
spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
Adding default property:
spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
Using properties file: /root/spark/conf/spark-defaults.conf
Adding default property: spark.executor.memory=5929m
Adding default property:
spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
Adding default property:
spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
Parsed arguments:
  master  spark://xx.xx.xx.xx:7077
  deployMode  null
  executorMemory  5929m
  executorCores   null
  totalExecutorCores  null
  propertiesFile  /root/spark/conf/spark-defaults.conf
  extraSparkPropertiesMap()
  driverMemorynull
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   com.je.test.Hello
  primaryResource file:/root/Hello-assembly-1.0.jar
  namecom.je.test.Hello
  childArgs   []
  jarsnull
  verbose true

Default properties from /root/spark/conf/spark-defaults.conf:
  spark.executor.extraLibraryPath - /root/ephemeral-hdfs/lib/native/
  spark.executor.memory - 5929m
  spark.executor.extraClassPath - /root/ephemeral-hdfs/conf


Using properties file: /root/spark/conf/spark-defaults.conf
Adding default property: spark.executor.memory=5929m
Adding default property:
spark.executor.extraClassPath=/root/ephemeral-hdfs/conf
Adding default property:
spark.executor.extraLibraryPath=/root/ephemeral-hdfs/lib/native/
Main class:
com.je.test.Hello
Arguments:

System properties:
spark.executor.extraLibraryPath - /root/ephemeral-hdfs/lib/native/
spark.executor.memory - 5929m
SPARK_SUBMIT - true
spark.app.name - com.je.test.Hello
spark.jars - file:/root/Hello-assembly-1.0.jar
spark.executor.extraClassPath - /root/ephemeral-hdfs/conf
spark.master - spark://xxx.xx.xx.xxx:7077
Classpath elements:
file:/root/Hello-assembly-1.0.jar

*Actual Error:*
Exception in thread main java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


*Source Code:*
package com.je.test


import org.apache.spark.{SparkConf, SparkContext}

class Hello {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf(true) //.set("spark.cassandra.connection.host", "xxx.xx.xx.xxx")
    val sc = new SparkContext("spark://xxx.xx.xx.xxx:7077", "Season", conf)

    println("Hello World")

  }
}
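
One thing worth double-checking (an observation, not a confirmed diagnosis): main is defined on a class rather than an object, so the assembled jar has no static main method, and spark-submit's reflective invocation of main can then surface as exactly this kind of NullPointerException inside SparkSubmit. A minimal sketch of the object form:

package com.je.test

import org.apache.spark.{SparkConf, SparkContext}

object Hello {  // an object gives the JVM a static main entry point
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
    val sc = new SparkContext("spark://xxx.xx.xx.xxx:7077", "Season", conf)
    println("Hello World")
    sc.stop()
  }
}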


Reading Yarn log on EMR

2014-12-09 Thread Nagy István

Dear all,

I would like to run a simple spark job on EMR with yarn.

My job is the follows:

public void EMRRun() {
    SparkConf sparkConf = new SparkConf().setAppName("RunEMR").setMaster("yarn-cluster");
    sparkConf.set("spark.executor.memory", "13000m");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    System.out.println(ctx.appName());

    List<Integer> list = new LinkedList<Integer>();
    for (int i = 0; i < 1; i++) {
        list.add(i);
    }

    JavaRDD<Integer> listRDD = ctx.parallelize(list);
    List<Integer> results = listRDD.collect();

    for (Integer i : results) {
        System.out.println(i);
    }

    ctx.stop();
}

public static void main(String[] args) {
    SparkTest sp = new SparkTest();
    sp.EMRRun();
}


On EMR I run the spark with spark-submit with the following:

./spark-submit --class 
com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest 
--master yarn-cluster --executor-memory 512m --num-executors 10 
/home/hadoop/MLyBigData.jar


After that finished I tried to see yarn log, but I got this:
 yarn logs -applicationId application_1418123020170_0032
14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at 
/172.31.3.155:9022

Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032
Log aggregation has not completed or is not enabled.

But I modified the yarn-site.xml as:
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>30</value></property>

I use AMI version of 3.2.3, spark 1.1.0 on hadoop 2.4

Any suggestions how can I see the logs of the yarn?
Thanks,
Istvan
attachment: nistvan.vcf
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Reading Yarn log on EMR

2014-12-09 Thread Adam Diaz
https://issues.apache.org/jira/browse/YARN-321

There is not a generic application history server yet. The current one
works for MR.

On Tue, Dec 9, 2014 at 4:48 PM, Nagy István tyson...@gmail.com wrote:

  Dear all,

 I would like to run a simple spark job on EMR with yarn.

 My job is the follows:

 public void EMRRun() {
 SparkConf sparkConf = new 
 SparkConf().setAppName(RunEMR).setMaster(yarn-cluster);
 sparkConf.set(spark.executor.memory, 13000m);JavaSparkContext ctx = 
 new JavaSparkContext(sparkConf);System.out.println(ctx.appName());
 ListInteger list = new LinkedListInteger();for (int i =0; i1; 
 i++){
 list.add(i);}

 JavaRDDInteger listRDD = ctx.parallelize(list);ListInteger 
 results = listRDD.collect();for (Integer i : results){
 System.out.println(i);}

 ctx.stop();}
 public static void main(String[] args) {
 SparkTest sp = new SparkTest();sp.EMRRun();}


 On EMR I run the spark with spark-submit with the following:

 ./spark-submit --class
 com.collokia.ml.stackoverflow.usertags.browserhistory.sparkTestJava.SparkTest
 --master yarn-cluster --executor-memory 512m --num-executors 10
 /home/hadoop/MLyBigData.jar

 After that finished I tried to see yarn log, but I got this:
  yarn logs -applicationId application_1418123020170_0032
 14/12/09 20:29:26 INFO client.RMProxy: Connecting to ResourceManager at /
 172.31.3.155:9022
 Logs not available at /tmp/logs/hadoop/logs/application_1418123020170_0032
 Log aggregation has not completed or is not enabled.

 But I modified the yarn-site.xml as:

 propertynameyarn.log-aggregation-enable/namevaluetrue/value/property

 propertynameyarn.log-aggregation.retain-seconds/namevalue-1/value/property

 propertynameyarn.log-aggregation.retain-check-interval-seconds/namevalue30/value/property

 I use AMI version of 3.2.3, spark 1.1.0 on hadoop 2.4

 Any suggestions how can I see the logs of the yarn?
 Thanks,
 Istvan


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Workers keep dying on EC2 Spark cluster: PriviledgedActionException

2014-12-09 Thread Jeff Schecter
Hi Spark users,

I've been attempting to get flambo
https://github.com/yieldbot/flambo/blob/develop/README.md, a Clojure
library for Spark, working with my codebase. After getting things to build
with this very simple interface:

(ns sharknado.core
  (:require [flambo.conf :as conf]
[flambo.api :as spark]))

(defn configure [master-url app-name]
  (- (conf/spark-conf)
  (conf/master master-url)
  (conf/app-name app-name)))

(defn get-context [master-url app-name]
  (spark/spark-context (configure master-url app-name)))



I run in the lein repl:

(use 'sharknado.core)
(def cx (get-context "spark://MASTER-URL.compute-1.amazonaws.com:7077"
                     "flambo-test"))


This connects to the master and successfully creates an app; however, the
app's workers all die after several seconds.

It looks like user Saiph Kappa had similar problems about a month ago.
Someone suggested that the cluster and submitted spark application might be
using different versions of Spark; that's definitely not the case here.
I've tried with both 1.1.0 and 1.1.1 on both ends.

With Spark 1.1.0, after all workers die, the application exits.

With spark 1.1.1, after each worker dies, another is automatically created;
at the moment the app detail screen in the UI is showing 150 exited and 5
running workers.

Anyone have any ideas? Example trace from a worker below.

Thanks,

Jeff

14/12/10 01:22:09 INFO executor.CoarseGrainedExecutorBackend:
Registered signal handlers for [TERM, HUP, INT]
14/12/10 01:22:10 INFO spark.SecurityManager: Changing view acls to: root,Jeff
14/12/10 01:22:10 INFO spark.SecurityManager: Changing modify acls to: root,Jeff
14/12/10 01:22:10 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(root, Jeff); users with modify permissions: Set(root,
Jeff)
14/12/10 01:22:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/12/10 01:22:10 INFO Remoting: Starting remoting
14/12/10 01:22:10 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050]
14/12/10 01:22:10 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://driverPropsFetcher@ip-address.ec2.internal:49050]
14/12/10 01:22:10 INFO util.Utils: Successfully started service
'driverPropsFetcher' on port 49050.
14/12/10 01:22:40 ERROR security.UserGroupInformation:
PriviledgedActionException as:Jeff
cause:java.util.concurrent.TimeoutException: Futures timed out after
[30 seconds]
Exception in thread main
java.lang.reflect.UndeclaredThrowableException: Unknown exception in
doAs
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException:
java.util.concurrent.TimeoutException: Futures timed out after [30
seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out
after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
... 7 more


Can HiveContext be used without using Hive?

2014-12-09 Thread Manoj Samel
From 1.1.1 documentation, it seems one can use HiveContext instead of
SQLContext without having a Hive installation. The benefit is richer SQL
dialect.

Is my understanding correct ?

Thanks


Re: Can HiveContext be used without using Hive?

2014-12-09 Thread Michael Armbrust
That is correct. If you have not configured Hive, the HiveContext will create
an embedded metastore in the current directory.
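
A minimal sketch of what that looks like in practice (Spark 1.1.x API, assuming only an existing SparkContext sc and no hive-site.xml on the classpath):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)   // first use creates an embedded Derby metastore
                                    // (typically ./metastore_db) in the working directory
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveCtx.sql("SELECT COUNT(*) FROM src").collect().foreach(println)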

On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 From 1.1.1 documentation, it seems one can use HiveContext instead of
 SQLContext without having a Hive installation. The benefit is richer SQL
 dialect.

 Is my understanding correct ?

 Thanks





Re: Can HiveContext be used without using Hive?

2014-12-09 Thread Anas Mosaad
In that case, what should be the behavior of saveAsTable?
On Dec 10, 2014 4:03 AM, Michael Armbrust mich...@databricks.com wrote:

 That is correct.  It the hive context will create an embedded metastore in
 the current directory if you have not configured hive.

 On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 From 1.1.1 documentation, it seems one can use HiveContext instead of
 SQLContext without having a Hive installation. The benefit is richer SQL
 dialect.

 Is my understanding correct ?

 Thanks






RE: Can HiveContext be used without using Hive?

2014-12-09 Thread Cheng, Hao
It works exactly like Create Table As Select (CTAS) in Hive.
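
A minimal sketch of that behavior (assuming a HiveContext hiveCtx and an existing table src):

// saveAsTable materializes the query result as a managed table in the metastore,
// much like CREATE TABLE ... AS SELECT in Hive.
val small = hiveCtx.sql("SELECT key, value FROM src WHERE key < 10")
small.saveAsTable("src_small")
hiveCtx.sql("SELECT * FROM src_small").collect().foreach(println)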

Cheng Hao

From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Wednesday, December 10, 2014 11:59 AM
To: Michael Armbrust
Cc: Manoj Samel; user@spark.apache.org
Subject: Re: Can HiveContext be used without using Hive?


In that case, what should be the behavior of saveAsTable?
On Dec 10, 2014 4:03 AM, Michael Armbrust 
mich...@databricks.commailto:mich...@databricks.com wrote:
That is correct.  It the hive context will create an embedded metastore in the 
current directory if you have not configured hive.

On Tue, Dec 9, 2014 at 5:51 PM, Manoj Samel 
manojsamelt...@gmail.commailto:manojsamelt...@gmail.com wrote:
From 1.1.1 documentation, it seems one can use HiveContext instead of 
SQLContext without having a Hive installation. The benefit is richer SQL 
dialect.

Is my understanding correct ?

Thanks





Mllib error

2014-12-09 Thread amin mohebbi
I'm trying to build a very simple Scala standalone app using MLlib, but I
get the following error when trying to build the program: "object mllib is not a
member of package org.apache.spark".
Please note I just migrated from 1.0.2 to 1.1.1.
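
A common cause of that error is that the spark-mllib artifact is missing from the build; a minimal build.sbt sketch (artifact versions assumed to match the 1.1.1 migration):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.1.1" % "provided",
  // MLlib ships as a separate artifact and must be declared explicitly
  "org.apache.spark" %% "spark-mllib" % "1.1.1" % "provided"
)
// drop the "provided" scope if you run the app with sbt run instead of spark-submit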
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
Hi Michael,

I think I have found the exact problem in my case. I see that we have
written something like the following in Analyzer.scala:

  // TODO: pass this in as a parameter.

  val fixedPoint = FixedPoint(100)


and


Batch(Resolution, fixedPoint,

  ResolveReferences ::

  ResolveRelations ::

  ResolveSortReferences ::

  NewRelationInstances ::

  ImplicitGenerate ::

  StarExpansion ::

  ResolveFunctions ::

  GlobalAggregates ::

  UnresolvedHavingClauseAttributes ::

  TrimGroupingAliases ::

  typeCoercionRules ++

  extendedRules : _*),

Perhaps in my case it reaches the 100 iterations and breaks out of the while
loop in RuleExecutor.scala, and thus doesn't resolve all the attributes.

Exception in my logs :-

14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
for batch Resolution

14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
context with path [] threw exception [Servlet execution threw an exception]
with root cause

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
DOWN_BYTESHTTPSUBCR#6567, tree:

'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
DOWN_BYTESHTTPSUBCR#6567]

...

...

...

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)

at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)

at
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)

at
org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)

at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)

at org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)

at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)

 at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)


I think the solution here is to make the FixedPoint constructor argument
configurable/parameterized (it is also marked as a TODO). Do we have a plan to do
this in the 1.2 release? Otherwise I can take this up as a task myself if you want
(since this is very crucial for our release).


Thanks

-Nitin

On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com
wrote:

 val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
 existingSchemaRDD.schema)


 This line is throwing away the logical information about existingSchemaRDD
 and thus Spark SQL can't know how to push down projections or predicates
 past this operator.

 Can you describe more the problems that you see if you don't do this
 reapplication of the schema.




-- 
Regards
Nitin Goyal


Stack overflow Error while executing spark SQL

2014-12-09 Thread Jishnu Prathap
Hi,
I am getting a stack overflow error:
Exception in main java.lang.StackOverflowError
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

Exact line where exception occurs.
sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

The complete code is from github

https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala
https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala
  

import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Examine the collected tweets and trains a model based on them.
 */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    /*if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        " <tweetInput> <outputModelDir> <numClusters> <numIterations>")
      System.exit(1)
    }
    */
    val outputModelDir = "C:\\MLModel"
    val tweetInput = "C:\\MLInput"
    val numClusters = 10
    val numIterations = 20

    //val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters), Utils.IntParam(numIterations)) = args

    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Pretty print some of the tweets.
    val tweets = sc.textFile(tweetInput)
    println("Sample JSON Tweets---")
    for (tweet <- tweets.take(5)) {
      println(gson.toJson(jsonParser.parse(tweet)))
    }

    val tweetTable = sqlContext.jsonFile(tweetInput).cache()
    tweetTable.registerTempTable("tweetTable")
    println("--Tweet table Schema---")
    tweetTable.printSchema()
    println("Sample Tweet Text-")

    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

    println("--Sample Lang, Name, text---")
    sqlContext.sql("SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000").collect().foreach(println)
    println("--Total count by languages Lang, count(*)---")
    sqlContext.sql("SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25").collect.foreach(println)

    println("--- Training the model and persist it")
    val texts = sqlContext.sql("SELECT text from tweetTable").map(_.head.toString)
    // Cache the vectors RDD since it will be used for all the KMeans iterations.
    val vectors = texts.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.

    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)

    val some_tweets = texts.take(100)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}

Any Help would be appreciated..



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stack-overflow-Error-while-executing-spark-SQL-tp20604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reg JDBCRDD code

2014-12-09 Thread Deepa Jayaveer
Thanks Akhil, but it is expecting Function1 instead of Function. I tried
writing a new class implementing Function1 but
got an error. Can you please help us get it resolved?

JDBCRDD is created as
JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1,
getResultset, ClassTag$.MODULE$.apply(String.class));


overridden  'apply' method in Function1 
public String apply(ResultSet arg0) {

 String ss = null;

try {
ss = (String) ((java.sql.ResultSet) arg0).getString(1);
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

System.out.println(ss);

return ss;
// TODO Auto-generated method stub
}

Error log
Exception in thread main org.apache.spark.SparkException: Job aborted 
due to stage failure: Task 0.0:0 failed 1 times, most recent failure: 
Exception failure in TID 0 on host localhost: java.sql.SQLException: 
Parameter index out of range (1 > number of parameters, which is 0).
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
com.mysql.jdbc.PreparedStatement.checkBounds(
PreparedStatement.java:3709)
com.mysql.jdbc.PreparedStatement.setInternal(
PreparedStatement.java:3693)
com.mysql.jdbc.PreparedStatement.setInternal(
PreparedStatement.java:3735)
com.mysql.jdbc.PreparedStatement.setLong(
PreparedStatement.java:3751)

Thanks 
Deepa



From:   Akhil Das ak...@sigmoidanalytics.com
To: Deepa Jayaveer deepa.jayav...@tcs.com
Cc: user@spark.apache.org user@spark.apache.org
Date:   12/09/2014 09:30 PM
Subject:Re: reg JDBCRDD code



Hi Deepa,

In Scala, You will do something like 
https://gist.github.com/akhld/ccafb27f098163bea622 

With Java API's it will be something like 
https://gist.github.com/akhld/0d9299aafc981553bc34



Thanks
Best Regards

On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com 
wrote:
Hi All, 
am new to Spark.  I tried to connect to Mysql using Spark.  want to write 
a code in Java but 
getting runtime exception. I guess that the issue is with the function0 
and function1 objects being passed in JDBCRDD . 

I tried my level best and attached the code, can you please help us to fix 
the issue. 



Thanks 
Deepa
=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySprak and UnsupportedOperationException

2014-12-09 Thread Mohamed Lrhazi
somehow I now get a better error trace, I believe for the same root
issue... any advice of how to narrow this down further highly appreciated:


...
14/12/10 07:15:03 ERROR PythonRDD: Python worker exited unexpectedly
(crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File /spark/python/pyspark/worker.py, line 75, in main
command = pickleSer._read_with_length(infile)
  File /spark/python/pyspark/serializers.py, line 146, in
_read_with_length
length = read_int(stream)
  File /spark/python/pyspark/serializers.py, line 464, in read_int
raise EOFError
EOFError

at
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
at
org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)




On Tue, Dec 9, 2014 at 2:32 PM, Mohamed Lrhazi 
mohamed.lrh...@georgetown.edu wrote:

 While trying simple examples of PySpark code, I systematically get these
 failures when I try this.. I dont see any prior exceptions in the output...
 How can I debug further to find root cause?


 es_rdd = sc.newAPIHadoopRDD(
     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
     keyClass="org.apache.hadoop.io.NullWritable",
     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
     conf={
         "es.resource": "en_2014/doc",
         "es.nodes": "rap-es2",
         "es.query": '{"query":{"match_all":{}},"fields":["title"], "size": 100}'
     }
 )


 titles=es_rdd.map(lambda d: d[1]['title'][0])
 counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
 1)).reduceByKey(add)


 output = counts.collect()



 ...
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448
 dropped from memory (free 274984768)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391
 dropped from memory (free 275148159)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391
 dropped from memory (free 275311550)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91
 14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID
 72)
 java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at java.util.AbstractMap.putAll(AbstractMap.java:273)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at
 scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
 at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
 at
 scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
 at
 scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
 14/12/09 19:27:30 INFO 

Re: reg JDBCRDD code

2014-12-09 Thread Akhil Das
Try changing this line

JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1,
getResultset, ClassTag$.MODULE$.apply(String.class));

to

JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 100, 1,
getResultset, ClassTag$.MODULE$.apply(String.class));

Here:

0   - lower bound
100 - upper bound
1   - number of partitions, I believe.
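
Also, JdbcRDD binds the lower and upper bound into the query itself, so the SQL
needs two '?' placeholders (e.g. WHERE id >= ? AND id <= ?). The "Parameter index
out of range" failure in your log comes from PreparedStatement.setLong, which is
what happens when those placeholders are missing. Below is a minimal Scala sketch
of the same call (hypothetical "people" table with a numeric "id" column, and
hypothetical MySQL URL/credentials). From Java you reach the same constructor by
extending scala.runtime.AbstractFunction0/AbstractFunction1, marking them
Serializable, and overriding only apply().

import java.sql.DriverManager

import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object JdbcRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("JdbcRDD sketch").setMaster("local[2]"))

    val rdd = new JdbcRDD(
      sc,
      () => {
        // connection factory, invoked on the executors
        Class.forName("com.mysql.jdbc.Driver")
        DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "pass")
      },
      // the two '?' are mandatory: JdbcRDD fills them with each partition's id range
      "SELECT name FROM people WHERE id >= ? AND id <= ?",
      1,    // lower bound of id
      100,  // upper bound of id
      2,    // number of partitions
      rs => rs.getString(1))  // row mapper, the Scala equivalent of your Function1

    rdd.collect().foreach(println)
    sc.stop()
  }
}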


Thanks
Best Regards

On Wed, Dec 10, 2014 at 12:45 PM, Deepa Jayaveer deepa.jayav...@tcs.com
wrote:

 Thanks Akhil, but it is expecting a Function1 instead of a Function. I tried
 writing a new class implementing Function1 but
 got an error. Can you please help us get it resolved?

 The JdbcRDD is created as
 JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1,
 getResultset, ClassTag$.MODULE$.apply(String.class));


 The overridden 'apply' method in Function1:

 public String apply(ResultSet arg0) {
     String ss = null;
     try {
         ss = (String) ((java.sql.ResultSet) arg0).getString(1);
     } catch (SQLException e) {
         // TODO Auto-generated catch block
         e.printStackTrace();
     }
     System.out.println(ss);
     return ss;
 }

 Error log:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0.0:0 failed 1 times, most recent failure:
 Exception failure in TID 0 on host localhost: java.sql.SQLException:
 Parameter index out of range (1 > number of parameters, which is 0).
 com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
 com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
 com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982)
 com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
 com.mysql.jdbc.PreparedStatement.checkBounds(PreparedStatement.java:3709)
 com.mysql.jdbc.PreparedStatement.setInternal(PreparedStatement.java:3693)
 com.mysql.jdbc.PreparedStatement.setInternal(PreparedStatement.java:3735)
 com.mysql.jdbc.PreparedStatement.setLong(PreparedStatement.java:3751)

 Thanks
 Deepa



 From:Akhil Das ak...@sigmoidanalytics.com
 To:Deepa Jayaveer deepa.jayav...@tcs.com
 Cc:user@spark.apache.org user@spark.apache.org
 Date:12/09/2014 09:30 PM
 Subject:Re: reg JDBCRDD code
 --



 Hi Deepa,

 In Scala, you will do something like
 https://gist.github.com/akhld/ccafb27f098163bea622

 With the Java API it will be something like
 https://gist.github.com/akhld/0d9299aafc981553bc34



 Thanks
 Best Regards

 On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com wrote:
 Hi All,
 I am new to Spark. I tried to connect to MySQL using Spark. I want to write
 the code in Java but
 am getting a runtime exception. I guess the issue is with the Function0
 and Function1 objects being passed into the JdbcRDD.

 I tried my level best and attached the code; can you please help us fix
 the issue?



 Thanks
 Deepa


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: reg JDBCRDD code

2014-12-09 Thread Deepa Jayaveer
Hi Akhil,
I am getting the same error. I guess the issue is with the Function1
implementation.
Is it enough to override the apply method in the Function1 class?
Thanks
Deepa





From:   Akhil Das ak...@sigmoidanalytics.com
To: Deepa Jayaveer deepa.jayav...@tcs.com
Cc: user@spark.apache.org user@spark.apache.org
Date:   12/10/2014 12:55 PM
Subject:Re: reg JDBCRDD code



Try changing this line 

JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1, 
getResultset, ClassTag$.MODULE$.apply(String.class)); 

to 

JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 100, 1, 
getResultset, ClassTag$.MODULE$.apply(String.class)); 

Here:
0   - lower bound
100 - upper bound
1   - number of partitions, I believe.

Thanks
Best Regards

On Wed, Dec 10, 2014 at 12:45 PM, Deepa Jayaveer deepa.jayav...@tcs.com 
wrote:
Thanks Akhil, but it is expecting a Function1 instead of a Function. I tried
writing a new class implementing Function1 but
got an error. Can you please help us get it resolved?

JDBCRDD is created as 
JdbcRDD rdd = new JdbcRDD(sc, getConnection, sql, 0, 0, 1, 
getResultset, ClassTag$.MODULE$.apply(String.class)); 


overridden  'apply' method in Function1 
public String apply(ResultSet arg0) { 

 String ss = null; 

try { 
ss = (String) ((java.sql.ResultSet) arg0).getString(1); 
} catch (SQLException e) { 
// TODO Auto-generated catch block 
e.printStackTrace(); 
} 

System.out.println(ss); 

return ss; 
// TODO Auto-generated method stub 
} 

Error log 
Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure: Task 0.0:0 failed 1 times, most recent failure:
Exception failure in TID 0 on host localhost: java.sql.SQLException:
Parameter index out of range (1 > number of parameters, which is 0).
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073) 
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987) 
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:982) 
com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927) 
com.mysql.jdbc.PreparedStatement.checkBounds(
PreparedStatement.java:3709) 
com.mysql.jdbc.PreparedStatement.setInternal(
PreparedStatement.java:3693) 
com.mysql.jdbc.PreparedStatement.setInternal(
PreparedStatement.java:3735) 
com.mysql.jdbc.PreparedStatement.setLong(
PreparedStatement.java:3751) 

Thanks 
Deepa



From:Akhil Das ak...@sigmoidanalytics.com 
To:Deepa Jayaveer deepa.jayav...@tcs.com 
Cc:user@spark.apache.org user@spark.apache.org 
Date:12/09/2014 09:30 PM 
Subject:Re: reg JDBCRDD code 




Hi Deepa, 

In Scala, You will do something like 
https://gist.github.com/akhld/ccafb27f098163bea622  

With Java API's it will be something like 
https://gist.github.com/akhld/0d9299aafc981553bc34 



Thanks 
Best Regards 

On Tue, Dec 9, 2014 at 6:39 PM, Deepa Jayaveer deepa.jayav...@tcs.com 
wrote: 
Hi All, 
I am new to Spark. I tried to connect to MySQL using Spark. I want to write
the code in Java but
am getting a runtime exception. I guess the issue is with the Function0
and Function1 objects being passed into the JdbcRDD.

I tried my level best and attached the code; can you please help us fix
the issue?



Thanks 
Deepa 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org 




Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
I see that somebody had already raised a PR for this but it hasn't been
merged.

https://issues.apache.org/jira/browse/SPARK-4339

Can we merge this in the next 1.2 RC?

Thanks
-Nitin


On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal nitin2go...@gmail.com wrote:

 Hi Michael,

 I think I have found the exact problem in my case. I see that we have
 written something like the following in Analyzer.scala:

   // TODO: pass this in as a parameter.
   val fixedPoint = FixedPoint(100)

 and

   Batch("Resolution", fixedPoint,
     ResolveReferences ::
     ResolveRelations ::
     ResolveSortReferences ::
     NewRelationInstances ::
     ImplicitGenerate ::
     StarExpansion ::
     ResolveFunctions ::
     GlobalAggregates ::
     UnresolvedHavingClauseAttributes ::
     TrimGroupingAliases ::
     typeCoercionRules ++
     extendedRules : _*),

 Perhaps in my case it reaches the 100 iterations and breaks out of the while
 loop in RuleExecutor.scala, and thus doesn't resolve all the attributes.

 Exception in my logs:

 14/12/10 04:45:28 INFO HiveContext$$anon$4: Max iterations (100) reached
 for batch Resolution

 14/12/10 04:45:28 ERROR [Sql]: Servlet.service() for servlet [Sql] in
 context with path [] threw exception [Servlet execution threw an exception]
 with root cause

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
 DOWN_BYTESHTTPSUBCR#6567, tree:

 'Project ['T1.SP AS SP#6566,'T1.DOWN_BYTESHTTPSUBCR AS
 DOWN_BYTESHTTPSUBCR#6567]

 ...

 ...

 ...

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)

 at
 org.apache.spark.sql.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:86)

 at org.apache.spark.sql.CacheManager$class.writeLock(CacheManager.scala:67)

 at
 org.apache.spark.sql.CacheManager$class.cacheQuery(CacheManager.scala:85)

 at org.apache.spark.sql.SQLContext.cacheQuery(SQLContext.scala:50)

  at org.apache.spark.sql.SchemaRDD.cache(SchemaRDD.scala:490)


 I think the solution here is to make the FixedPoint constructor argument
 configurable/parameterized (as the TODO already suggests). Do we have a plan to
 do this in the 1.2 release? Or I can take this up as a task myself if you
 want (since this is very crucial for our release).
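
 A minimal sketch (not an actual patch, and assuming the current 1.2-era
 constructor signature inside org.apache.spark.sql.catalyst.analysis) of what
 passing the cap in could look like, with the default preserving today's
 behaviour:

 // Sketch only: the iteration cap becomes a constructor argument instead of
 // the hard-coded FixedPoint(100); everything else stays as it is.
 class Analyzer(
     catalog: Catalog,
     registry: FunctionRegistry,
     caseSensitive: Boolean,
     maxIterations: Int = 100)
   extends RuleExecutor[LogicalPlan] with HiveTypeCoercion {

   // The fixed point now honours the caller-supplied cap.
   val fixedPoint = FixedPoint(maxIterations)

   // ... batches unchanged:
   // Batch("Resolution", fixedPoint, ResolveReferences :: ResolveRelations :: ...)
 }

 Callers that hit the cap on deeply nested plans could then construct their
 analyzer with a larger value, or the value could later be wired up to a
 SQLConf setting.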


 Thanks

 -Nitin

 On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich...@databricks.com
 wrote:

 val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
 existingSchemaRDD.schema)


 This line is throwing away the logical information about
 existingSchemaRDD and thus Spark SQL can't know how to push down
 projections or predicates past this operator.

 Can you describe in more detail the problems you see if you don't do this
 reapplication of the schema?
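
 As a sketch of the difference (reusing the T1/SP/DOWN_BYTESHTTPSUBCR names from
 your stack trace; existingSchemaRDD and sqlContext are the ones in your code):

 // Opaque to the optimizer: the round trip through applySchema wraps the data
 // in a bare physical-RDD node and hides existingSchemaRDD's logical plan.
 val opaque = sqlContext.applySchema(existingSchemaRDD, existingSchemaRDD.schema)
 opaque.registerTempTable("T1")

 // Keeps the logical plan visible: register (and cache) the SchemaRDD itself,
 // so projections and filters in later queries can be pushed down past it.
 existingSchemaRDD.registerTempTable("T1")
 existingSchemaRDD.cache()

 sqlContext.sql("SELECT SP, DOWN_BYTESHTTPSUBCR FROM T1")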




 --
 Regards
 Nitin Goyal




-- 
Regards
Nitin Goyal


Re: PySpark and UnsupportedOperationException

2014-12-09 Thread Davies Liu
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi
mohamed.lrh...@georgetown.edu wrote:
 While trying simple examples of PySpark code, I systematically get these
 failures when I try this... I don't see any prior exceptions in the output.
 How can I debug further to find the root cause?


 es_rdd = sc.newAPIHadoopRDD(
     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
     keyClass="org.apache.hadoop.io.NullWritable",
     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
     conf={
         "es.resource": "en_2014/doc",
         "es.nodes": "rap-es2",
         "es.query": '{"query":{"match_all":{}},"fields":["title"], "size": 100}'
     }
 )


 titles=es_rdd.map(lambda d: d[1]['title'][0])
 counts = titles.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
 1)).reduceByKey(add)
 output = counts.collect()



 ...
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 93
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_93
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_93 of size 2448 dropped
 from memory (free 274984768)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 93
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 92
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_92
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_92 of size 163391
 dropped from memory (free 275148159)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 92
 14/12/09 19:27:20 INFO BlockManager: Removing broadcast 91
 14/12/09 19:27:20 INFO BlockManager: Removing block broadcast_91
 14/12/09 19:27:20 INFO MemoryStore: Block broadcast_91 of size 163391
 dropped from memory (free 275311550)
 14/12/09 19:27:20 INFO ContextCleaner: Cleaned broadcast 91
 14/12/09 19:27:30 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID
 72)
 java.lang.UnsupportedOperationException
 at java.util.AbstractMap.put(AbstractMap.java:203)
 at java.util.AbstractMap.putAll(AbstractMap.java:273)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:373)
 at

It looks like it's a bug in ElasticSearch (EsInputFormat).

 org.elasticsearch.hadoop.mr.EsInputFormat$WritableShardRecordReader.setCurrentValue(EsInputFormat.java:322)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.next(EsInputFormat.java:299)
 at
 org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.nextKeyValue(EsInputFormat.java:227)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at
 scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:913)
 at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
 at
 scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
 at
 scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
 at

This means that the task failed while reading the data from EsInputFormat
to feed the Python mapper.

 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
 at
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
 14/12/09 19:27:30 INFO TaskSetManager: Starting task 2.0 in stage 67.0 (TID
 74, localhost, ANY, 26266 bytes)
 14/12/09 19:27:30 INFO Executor: Running task 2.0 in stage 67.0 (TID 74)
 14/12/09 19:27:30 WARN TaskSetManager: Lost task 0.0 in stage 67.0 (TID 72,
 localhost): java.lang.UnsupportedOperationException:

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Stackoverflow error

2014-12-09 Thread Jishnu Prathap
Hi,

Unfortunately I am also getting the same error.
Has anybody solved it?
Exception in thread "main" java.lang.StackOverflowError
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

The exact line where the exception occurs:
sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)
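
One workaround I have seen suggested (it may or may not help here, and it does
not address the underlying cause) is to run the offending statement on a thread
with a larger stack, since the scala.util.parsing.combinator parsers in the SQL
parser recurse deeply. A sketch, using the sqlContext and tweetTable defined in
the code below:

// Workaround sketch only: give the thread running the query a 32 MB stack.
val sqlThread = new Thread(null, new Runnable {
  override def run(): Unit = {
    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10")
      .collect()
      .foreach(println)
  }
}, "sql-with-bigger-stack", 32L * 1024 * 1024)
sqlThread.start()
sqlThread.join()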

The complete code is from github

https://github.com/databricks/reference-apps/blob/master/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Examine the collected tweets and trains a model based on them.
 */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    /*
    if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        " <tweetInput> <outputModelDir> <numClusters> <numIterations>")
      System.exit(1)
    }
    */
    val outputModelDir = "C:\\MLModel"
    val tweetInput = "C:\\MLInput"
    val numClusters = 10
    val numIterations = 20

    // val Array(tweetInput, outputModelDir, Utils.IntParam(numClusters),
    //   Utils.IntParam(numIterations)) = args

    val conf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Pretty print some of the tweets.
    val tweets = sc.textFile(tweetInput)
    println("Sample JSON Tweets---")
    for (tweet <- tweets.take(5)) {
      println(gson.toJson(jsonParser.parse(tweet)))
    }

    val tweetTable = sqlContext.jsonFile(tweetInput).cache()
    tweetTable.registerTempTable("tweetTable")
    println("--Tweet table Schema---")
    tweetTable.printSchema()

    println("Sample Tweet Text-")
    sqlContext.sql("SELECT text FROM tweetTable LIMIT 10").collect().foreach(println)

    println("--Sample Lang, Name, text---")
    sqlContext.sql("SELECT user.lang, user.name, text FROM tweetTable LIMIT 1000").collect().foreach(println)

    println("--Total count by languages Lang, count(*)---")
    sqlContext.sql("SELECT user.lang, COUNT(*) as cnt FROM tweetTable GROUP BY user.lang ORDER BY cnt DESC LIMIT 25").collect.foreach(println)

    println("--- Training the model and persist it")
    val texts = sqlContext.sql("SELECT text from tweetTable").map(_.head.toString)
    // Cache the vectors RDD since it will be used for all the KMeans iterations.
    val vectors = texts.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.

    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)

    val some_tweets = texts.take(100)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Stackoverflow-error-tp12086p20605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org