Re: Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-18 Thread Ian Wilkinson
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4.
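
For reference, a minimal sketch of the shape that worked (assumes elasticsearch-spark 2.1.0.Beta3 on the classpath; the "index/type" resource and the match_all query body are placeholders, not the actual values):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val sparkConf = new SparkConf().setAppName("Campaigns")
sparkConf.set("es.nodes", "es_cluster:9200")
sparkConf.set("es.nodes.discovery", "false")
val sc = new SparkContext(sparkConf)

// esRDD takes the resource ("index/type") plus a query-dsl JSON string
val query = """{ "query": { "match_all": {} } }"""   // placeholder query body
val campaigns = sc.esRDD("index/type", query)        // placeholder resource
campaigns.count()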


 On Dec 3, 2014, at 3:31 PM, Ian Wilkinson ia...@me.com wrote:
 
 Hi,
 
 I'm trying the Elasticsearch support for Spark (2.1.0.Beta3).
 
 In the following I provide the query (as query dsl):
 
 
 import org.apache.spark.{SparkConf, SparkContext}
 import org.elasticsearch.spark._
 
 object TryES {
   val sparkConf = new SparkConf().setAppName("Campaigns")
   sparkConf.set("es.nodes", "es_cluster:9200")
   sparkConf.set("es.nodes.discovery", "false")
   val sc = new SparkContext(sparkConf)
 
   def main(args: Array[String]) {
     val query = """{
       "query": {
         ...
       }
     }"""
 
     val campaigns = sc.esRDD(resource, query)
     campaigns.count()
   }
 }
 
 
 However, when I submit this (using spark-1.1.0-bin-hadoop2.4),
 I am experiencing the following exception:
 
 14/12/03 14:55:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
 14/12/03 14:55:27 INFO scheduler.DAGScheduler: Failed to run count at TryES.scala:...
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot open stream for resource {
   "query": {
   ...
   }
 }
 
 
 Is the query dsl supported with esRDD, or am I missing something
 more fundamental?
 
 Huge thanks,
 ian


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-03 Thread Ian Wilkinson
Hi,

I'm trying the Elasticsearch support for Spark (2.1.0.Beta3).

In the following I provide the query (as query dsl):


import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object TryES {
  val sparkConf = new SparkConf().setAppName("Campaigns")
  sparkConf.set("es.nodes", "es_cluster:9200")
  sparkConf.set("es.nodes.discovery", "false")
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) {
    val query = """{
      "query": {
        ...
      }
    }"""

    val campaigns = sc.esRDD(resource, query)
    campaigns.count()
  }
}


However, when I submit this (using spark-1.1.0-bin-hadoop2.4),
I am experiencing the following exception:

14/12/03 14:55:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/12/03 14:55:27 INFO scheduler.DAGScheduler: Failed to run count at TryES.scala:...
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot open stream for resource {
  "query": {
  ...
  }
}


Is the query dsl supported with esRDD, or am I missing something
more fundamental?

Huge thanks,
ian
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: DynamoDB input source

2014-07-21 Thread Ian Wilkinson
)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:132)
at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:767)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/07/20 23:56:05 INFO Remoting: Remoting shut down
14/07/20 23:56:05 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.


On 4 Jul 2014, at 18:04, Nick Pentreath nick.pentre...@gmail.com wrote:

 Interesting - I would have thought they would make that available publicly. 
 Unfortunately, unless you can use Spark on EMR, I guess your options are to 
 hack it by spinning up an EMR cluster and getting the JAR, or maybe fall back 
 to using boto and rolling your own :(
 
 
 
 On Fri, Jul 4, 2014 at 9:28 AM, Ian Wilkinson ia...@me.com wrote:
 Trying to discover source for the DynamoDBInputFormat.
 Not appearing in:
 
 - https://github.com/aws/aws-sdk-java
 - https://github.com/apache/hive
 
 Then came across 
 http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb.
 Unsure whether this represents the latest situation…
 
 ian 
 
 
 On 4 Jul 2014, at 16:58, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 I should qualify by saying there is boto support for DynamoDB - but not for 
 the InputFormat. You could roll your own python-based connection, but this 
 involves figuring out how to split the data in Dynamo; the InputFormat takes 
 care of this, so it should be the easier approach.
 —
 Sent from Mailbox
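
(A rough sketch of the "roll your own" splitting described above, in Scala for illustration; it assumes the AWS Java SDK for DynamoDB on the classpath, and the table name and segment count are hypothetical. DynamoDB's parallel-scan parameters give natural split points: one Spark partition per scan segment.)

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.ScanRequest
import scala.collection.JavaConverters._

val totalSegments = 8  // hypothetical: one scan segment per partition
val items = sc.parallelize(0 until totalSegments, totalSegments).flatMap { segment =>
  val client = new AmazonDynamoDBClient()  // uses the default credential chain
  val request = new ScanRequest()
    .withTableName("my_table")             // hypothetical table name
    .withSegment(segment)
    .withTotalSegments(totalSegments)
  client.scan(request).getItems.asScala    // NOTE: ignores pagination (LastEvaluatedKey)
}
items.count()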
 
 
 On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson ia...@me.com wrote:
 
 Excellent. Let me get browsing on this.
 
 
 Huge thanks,
 ian
 
 
 On 4 Jul 2014, at 16:47, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 No boto support for that. 
 
 In master there is Python support for loading Hadoop InputFormats. Not sure 
 if it will be in 1.0.1 or 1.1.
 
 In the master docs, under the programming guide, there are instructions; and 
 under the examples project there are pyspark examples of using Cassandra and 
 HBase. These should hopefully give you enough to get started. 
 
 Depending on how easy it is to use the DynamoDB format, you may have to 
 write a custom converter (see the mentioned examples for more details).
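
(A sketch of the kind of converter mentioned above, mirroring the Cassandra/HBase converters in the Spark examples project; the DynamoDBItemWritable class name is an assumption based on the EMR DynamoDB connector, not a verified API.)

import org.apache.spark.api.python.Converter
import org.apache.hadoop.dynamodb.DynamoDBItemWritable  // assumed class name

// Renders each DynamoDB item as a string so pyspark can consume it.
class DynamoDBItemToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String =
    obj.asInstanceOf[DynamoDBItemWritable].getItem.toString  // crude, for illustration
}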
 
 Sent from my iPhone
 
 On 4 Jul 2014, at 08:38, Ian Wilkinson ia...@me.com wrote:
 
 Hi Nick,
 
 I’m going to be working with python primarily. Are you aware of
 comparable boto support?
 
 ian
 
 On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 You should be able to use DynamoDBInputFormat (I think this should be 
 part of AWS libraries for Java) and create a HadoopRDD from that.
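
(A sketch of this suggestion, assuming the DynamoDBInputFormat from the EMR Hive/DynamoDB connector jar is on the classpath; the package, key/value classes, and config keys below are assumptions based on that connector, not verified.)

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.dynamodb.DynamoDBItemWritable      // assumed class
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat  // assumed class

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "my_table")  // assumed key; hypothetical table
jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")  // assumed key

val rows = sc.hadoopRDD(jobConf,
  classOf[DynamoDBInputFormat],
  classOf[Text],
  classOf[DynamoDBItemWritable])
rows.count()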
 
 
 On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote:
 Hi,
 
 I noticed mention of DynamoDB as input source in
 http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
 
 Unfortunately, Google is not coming to my rescue on finding
 further mention for this support.
 
 Any pointers would be well received.
 
 Big thanks,
 ian
 
 
 
 
 
 



Problem running Spark shell (1.0.0) on EMR

2014-07-16 Thread Ian Wilkinson
Hi,

I’m trying to run the Spark (1.0.0) shell on EMR and encountering a classpath 
issue.
I suspect I’m missing something gloriously obvious, but so far it is eluding me.

I launch the EMR Cluster (using the aws cli) with:

aws emr create-cluster --name "Test Cluster" \
  --ami-version 3.0.3 \
  --no-auto-terminate \
  --ec2-attributes KeyName=... \
  --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                    InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium \
  --region eu-west-1

then,

$ aws emr ssh --cluster-id ... --key-pair-file ... --region eu-west-1

On the master node, I then launch the shell with:

[hadoop@ip-... spark]$ ./bin/spark-shell

and try performing:

scala> val logs = sc.textFile("s3n://.../")

this produces:

14/07/16 12:40:35 WARN storage.BlockManager: Putting block broadcast_0 failed
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
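
(In case it helps with diagnosis: HashFunction.hashInt(int) only exists from Guava 12 onwards, so an older Guava appearing earlier on the classpath would produce exactly this NoSuchMethodError. One quick check, pasted into the same shell, to see which jar the class is actually loaded from:)

scala> classOf[com.google.common.hash.HashFunction].getProtectionDomain.getCodeSource.getLocation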


Any help mighty welcome,
ian



Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Hi Nick,

I’m going to be working with python primarily. Are you aware of
comparable boto support?

ian

On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote:

 You should be able to use DynamoDBInputFormat (I think this should be part of 
 AWS libraries for Java) and create a HadoopRDD from that.
 
 
 On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote:
 Hi,
 
 I noticed mention of DynamoDB as input source in
 http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
 
 Unfortunately, Google is not coming to my rescue on finding
 further mention for this support.
 
 Any pointers would be well received.
 
 Big thanks,
 ian
 



Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Trying to discover source for the DynamoDBInputFormat.
Not appearing in:

- https://github.com/aws/aws-sdk-java
- https://github.com/apache/hive

Then came across 
http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb.
Unsure whether this represents the latest situation…

ian 


On 4 Jul 2014, at 16:58, Nick Pentreath nick.pentre...@gmail.com wrote:

 I should qualify by saying there is boto support for DynamoDB - but not for 
 the InputFormat. You could roll your own python-based connection, but this 
 involves figuring out how to split the data in Dynamo; the InputFormat takes 
 care of this, so it should be the easier approach.
 —
 Sent from Mailbox
 
 
 On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson ia...@me.com wrote:
 
 Excellent. Let me get browsing on this.
 
 
 Huge thanks,
 ian
 
 
 On 4 Jul 2014, at 16:47, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 No boto support for that. 
 
 In master there is Python support for loading Hadoop InputFormats. Not sure 
 if it will be in 1.0.1 or 1.1.
 
 In the master docs, under the programming guide, there are instructions; and 
 under the examples project there are pyspark examples of using Cassandra and 
 HBase. These should hopefully give you enough to get started. 
 
 Depending on how easy it is to use the DynamoDB format, you may have to 
 write a custom converter (see the mentioned examples for more details).
 
 Sent from my iPhone
 
 On 4 Jul 2014, at 08:38, Ian Wilkinson ia...@me.com wrote:
 
 Hi Nick,
 
 I’m going to be working with python primarily. Are you aware of
 comparable boto support?
 
 ian
 
 On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 You should be able to use DynamoDBInputFormat (I think this should be part 
 of AWS libraries for Java) and create a HadoopRDD from that.
 
 
 On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote:
 Hi,
 
 I noticed mention of DynamoDB as input source in
 http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
 
 Unfortunately, Google is not coming to my rescue on finding
 further mention for this support.
 
 Any pointers would be well received.
 
 Big thanks,
 ian