Hi guys,
I get Kryo exceptions of the type "unregistered class id" and "cannot cast
to class" when the locality level of the tasks goes beyond LOCAL.
However I get no Kryo exceptions during shuffling operations.
If the locality level never goes beyond LOCAL everything works fine.
Is there a speci
Could you post the stack trace?
Best Regards,
Shixiong Zhu
2014-12-16 23:21 GMT+08:00 richiesgr :
>
> Hi
>
> This time I need expert.
> On 1.1.1 and only in cluster (standalone or EC2)
> when I use this code :
>
> countersPublishers.foreachRDD(rdd => {
> rdd.foreachPartition(partitionRecords
I also got the same problem..
2014-12-09 22:58 GMT+08:00 Daniel Haviv :
>
> Hi,
> I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
> get the following exception:
>
> 14/12/09 06:54:24 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/12/
Did you try something like:
//Get the last hour
val d = (System.currentTimeMillis() - 3600 * 1000)
val ex = "abc_" + d.toString().substring(0,7) + "*.json"
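If that glob matches your file layout, it could be passed straight to sc.textFile, which accepts glob patterns. A minimal sketch (the directory below is an assumption, not from the original thread):
// Hypothetical input directory; "ex" is the pattern built above
val lastHour = sc.textFile("hdfs:///user/foo/myfiles/" + ex)
println(lastHour.count())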
[image: Inline image 1]
Thanks
Best Regards
On Wed, Dec 17, 2014 at 5:05 AM, durga wrote:
>
> Hi All,
>
> I need help with regex in my sc
I got the key point. The problem is in sc.sequenceFile. From the API description,
"the RDD will create many references to the same object", so I revised the code
from "sessions.getBytes" to "sessions.getBytes.clone",
and it seems to work.
Thanks.
Jerry,
On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj wrote:
>
> Another problem with the DSL:
>
> t1.where('term == "dmin").count() returns zero.
Looks like you need ===:
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
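A minimal sketch of the fix, assuming t1 is the SchemaRDD from the earlier mail:
// '=== builds a Catalyst equality expression; 'term == "dmin" is plain Scala
// equality between a Symbol and a String, which is always false, so the filter drops every row
val dminCount = t1.where('term === "dmin").count()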
Tobias
Hi,
I have a Spark cluster using standalone mode. The Spark Master is
configured in High Availability mode.
Now I am going to upgrade Spark from 1.0 to 1.1, but don't want to
interrupt the currently running jobs.
(1) Is there any way to perform a rolling upgrade (while a job is running)?
(2) If not, whe
Gautham,
How many gz files do you have? Maybe the reason is that a gz file is
compressed and can't be split for processing by MapReduce. A single gz
file can only be processed by a single mapper, so the CPU threads can't be
fully utilized.
-Original Message-
From: Gau
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero. But
sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700,
which I know is correct from the data. Is there something wrong with how
I'm using the DSL?
Thanks
On 17/12/14 11:13 am, Jerry Raj wrote:
Hi, Shuai,
How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return false. If you do have such a FileInputFormat, you can simply pass it as a
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
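For reference, a minimal sketch of such an input format and one way to wire it up via sc.newAPIHadoopFile (the class name and path are made up, not from the original poster's code):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each file becomes exactly one split, so one task processes a whole file
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}

// Hypothetical path; keys are byte offsets, values are lines
val perFile = sc.newAPIHadoopFile("/data/input", classOf[WholeFileTextInputFormat],
  classOf[LongWritable], classOf[Text])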
Hi there
I also got an exception when running the SparkPi example on YARN
Spark version: spark-1.1.1-bin-hadoop2.4
My environment: Hortonworks HDP 2.2
My command:
./bin/spark-submit --master yarn-cluster --class
org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
Output logs:
14/12/17 14:06:32 IN
Releases come roughly every 3 months, so you should expect it around March if the
pace stays steady.
2014-12-16 22:56 GMT-05:00 Marco Shaw :
>
> When it is ready.
>
>
>
> > On Dec 16, 2014, at 11:43 PM, 张建轶 wrote:
> >
> > Hi !
> >
> > when will the spark 1.3.0 be released?
> > I want to use new LDA featu
Hi,
I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I
have two tables (backed by Parquet files) and I need to do a join across
them using a common field (user_id). This works fine using standard SQL
but not using the language-integrated DSL neither
t1.join(t2, on = 't1.us
Actually there were still fetch failures. However, after I upgraded Spark to
1.1.1, this error did not occur again.
Thanks,
Mars
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: December 16, 2014 17:52
To: Ma,Xi
Cc: u...@spark.incubator.apache.org
Subject: Re: Reply: Fetch Failed caused job failed.
RDD.persist() can be useful here.
On 11 December 2014 at 14:34, ankits [via Apache Spark User List] <
ml-node+s1001560n20613...@n3.nabble.com> wrote:
>
> I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too
> fast. How can i inspect the size of RDD in memory and get more informa
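A minimal sketch of persisting and then inspecting the cached size from the driver (the RDD name is a placeholder; getRDDStorageInfo is a developer API, and the same numbers appear on the web UI's Storage tab):
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()  // run an action so the cache actually gets populated

// Rough in-memory footprint per cached RDD
sc.getRDDStorageInfo.foreach(i => println(i.name + ": " + i.memSize + " bytes"))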
When it is ready.
> On Dec 16, 2014, at 11:43 PM, 张建轶 wrote:
>
> Hi !
>
> when will the spark 1.3.0 be released?
> I want to use new LDA feature.
> Thank
> you!
Hi !
when will the spark 1.3.0 be released?
I want to use new LDA feature.
Thank you!
Thanks for the clarifications. I misunderstood what the number on UI meant.
On Mon, Dec 15, 2014 at 7:00 PM, Sean Owen wrote:
> I believe this corresponds to the 0.6 of the whole heap that is
> allocated for caching partitions. See spark.storage.memoryFraction on
> http://spark.apache.org/docs/l
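For reference, that fraction is set in spark-defaults.conf (0.6 is the default; the value below is only an example):
spark.storage.memoryFraction   0.6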
Hi
Recently I have had some problems with RDD behavior. It's about the
"RDD.first" and "RDD.toArray" methods when an RDD only has one element.
I get different results from the different methods on a one-element RDD
where I should get the same result. I will give more detail after the code.
Wow. I just realized what was happening and it's all my fault. I have a
library method that I wrote that presents the RDD, and I was actually
repartitioning it myself.
I feel pretty dumb. Sorry about that.
I've been trying to figure out how to use Spark to do a simple aggregation
without repartitioning and essentially creating fully instantiated
intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an
Iterator[String] and calling ag
Hi All,
I need help with regex in my sc.textFile()
I have lots of files with epoch millisecond timestamps.
ex:abc_1418759383723.json
Now I need to consume last one hour files using the epoch time stamp as
mentioned above.
I tried a couple of options; nothing seems to work for me.
If any on
Are you reading the file from your driver (main / master) program?
Is your file in a distributed system like HDFS, available to all your nodes?
It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
"Transformations" are lazy
Hi Harry,
Thanks for your response.
I'm working in scala. When I do a "count" call it expands the RDD in the
count (since it's an action). You can see the call stack that results in
the failure of the job here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial write
Your Spark is trying to load a Hadoop library, "winutils.exe", which you
don't have on your Windows machine:
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
a
hey igor!
a few ways to work around this depending on the level of exception-handling
granularity you're willing to accept:
1) use mapPartitions() to wrap the entire partition handling code in a
try/catch -- this is fairly coarse-grained, however, and will fail the
entire partition.
2) modify your
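For option 1 above, a minimal sketch (rawRdd and parseRecord are hypothetical names, not from the original mail):
val parsed = rawRdd.mapPartitions { iter =>
  try {
    // toList forces evaluation inside the try; a lazy iterator would escape the catch
    iter.map(parseRecord).toList.iterator
  } catch {
    case e: Exception => Iterator.empty  // any failure drops the whole partition
  }
}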
Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the "streaming"
behaviour you wish for, but I could be wrong. Also, are you in pyspark or
scala Spark?
I had created https://issues.apache.org/jira/browse/SPARK-4866, it
will be fixed by https://github.com/apache/spark/pull/3714.
Thank you for reporting this.
Davies
On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu wrote:
> It's a bug, could you file a JIRA for this? thanks!
>
> On Tue, Dec 16, 2014
I've been running a job in local mode using --master local[*] and I've
noticed that, for some reason, exceptions appear to get eaten- as in, I
don't see them. If i debug in my IDE, I'll see that an exception was thrown
if I step through the code but if I just run the application, it appears
everyth
It's a bug, could you file a JIRA for this? thanks!
On Tue, Dec 16, 2014 at 5:49 AM, sahanbull wrote:
>
> Hi Guys,
>
> Im running a spark cluster in AWS with Spark 1.1.0 in EC2
>
> I am trying to convert a an RDD with tuple
>
> (u'string', int , {(int, int): int, (int, int): int})
>
> to a schema
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s
On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas wrote:
> Hi Jeniba,
>
> The second part of this meetup recording has a very good answer to your
> question. TD explains the current behavior and the on-going work in Spark
> S
Hi All,
My application loads 1000 files, each from 200 MB to a few GB, and combines
them with other data to do calculations.
Some pre-calculation must be done at the file level; after that, the
results need to be combined for further calculation.
In Hadoop, it is simple because I can turn off
Wow, really weird. My intuition is the same as everyone else's: some
unprintable character. Here are a couple more debugging tricks I've used in
the past:
//set up an accumulator to catch the bad rows as a side-effect
val nBadRows = sc.accumulator(0)
val nGoodRows = sc.accumulator(0)
val badRows =
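One possible continuation of that sketch (not the original author's code): "lines" is whatever RDD holds the raw rows, and looksValid is a placeholder for the actual parse check.
val badRows = lines.filter { row =>
  val ok = looksValid(row)            // hypothetical validity check
  if (ok) nGoodRows += 1 else nBadRows += 1
  !ok                                  // keep only the rows that failed
}
badRows.take(10).foreach(println)      // inspect a sample of the offending rows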
Hi,
I am trying to run the StreamingLinearRegression example on a single node
without Hadoop installed. I keep getting the following error and cannot find any
documentation about the issue:
/14/12/16 13:26:50 ERROR JobScheduler: Error running job streaming job
141875801 ms.1
org.apache.spark.SparkEx
I am having the same issue, and it still does not update for me. I am trying
to execute the example by using bin/run-example
Thanks Cheng. Tried it out and saw the InMemoryColumnarTableScan word in the
physical plan.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, December 12, 2014 11:37 PM
To: Judy Nash; user@spark.apache.org
Subject: Re: Spark SQL API Doc & IsCached as SQL command
There isn’t a SQL st
I have tested my python script by using the pyspark shell. I run into an
error because of memory limits on the name node.
I am wondering how I run the script on Spark on YARN. I am not familiar with
this at all.
Any help would be greatly appreciated.
Thanks,
--
MAGNE+IC
Sam Flint |
I think accumulators do exactly what you want.
(Scala syntax below, I'm just not familiar with the Java equivalent ...)
val f1counts = sc.accumulator(0)
val f2counts = sc.accumulator(0)
val f3counts = sc.accumulator(0)
textfile.foreach { s =>
  if (f1matches) f1counts += 1
  ...
}
Note that y
Hi all,
I am trying to run a simple Spark_Pi application through Yarn from Java code. I
have the Spark_Pi class and everything works fine if I run on Spark. However,
when I set master to "yarn-client" and set yarn mode to true, I keep getting
exceptions. I suspect this has something to do with
Is there a way to get Spark to NOT repartition/shuffle/expand a
sc.textFile(fileUri) when the URI is a gzipped file?
Expanding a gzipped file should be thought of as a "transformation" and not
an "action" (if the analogy is apt). There is no need to fully create and
fill out an intermediate RDD wit
Hi guys
I am hoping someone might have a clue on why this is happening. Otherwise I
will have to delve into the YARN module's source code to better understand the
issue.
On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar
wrote:
> I am running spark 1.1.0 on AWS EMR and I am running a batch job that
>
Also, this may be related to this issue
https://issues.apache.org/jira/browse/SPARK-3885.
Further, to clarify, data is being written to Hadoop on the data nodes.
Would really appreciate any help. Thanks!
From: "Ganelin, Ilya" <ilya.gane...@capitalone.com>
Date: Tuesday, December 16, 2
Hi all – I’m running a long running batch-processing job with Spark through
Yarn. I am doing the following
Batch Process
val resultsArr =
sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]())
InMemoryArray.forEach{
1) Using a thread pool, generate callable jobs that operate
Nvm. I'm going to post another question since this has to do with the way
spark handles sc.textFile with a file://.gz
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri) where the fileUri is to a
".gz" file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect t
Hi Manas,
There is a small patch needed for HDP2.2. You can refer to this PR
https://github.com/apache/spark/pull/3409
There are some other issues compiling against hadoop2.6. But we will fully
support it very soon. You can ping me, if you want.
Thanks.
Zhan Zhang
On Dec 12, 2014, at 11:38 AM
Okay,
I have an rdd that I want to run an aggregate over but it insists on
spilling to disk even though I structured the processing to only require a
single pass.
In other words, I can do all of my processing one entry in the rdd at a time
without persisting anything.
I set rdd.persist(StorageLe
To ask a related question, if I use Zookeeper for table locking, will this
affect all attempts to access the Hive tables (including those from my Spark
applications) or only those made through the Thriftserver? In other words, does
Zookeeper provide concurrency for the Hive metastore in general
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd
I'm using CDH5 that was installed via Cloudera Manager.
Does it matter?
Thanks,
Daniel
> On 16 Dec 2014, at 19:18, Akhil Das wrote:
>
> Can you open the file bin/spark-class and then put an echo $CLASSPATH below
> the place where they exports it and see what are the contents?
>
>> On 16 De
Completely different from the one I set:
Classpath is
::/tmp/spark/spark-branch-1.1/conf:/tmp/spark/spark-branch-1.1/assembly/target/scala-2.10/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar:/tmp/spark/spark-branch-1.1/lib_managed/jars/datanucleus-core-3.2.2.jar:/tmp/spark/spark-branch-1.1/lib_mana
Can you open the file bin/spark-class, put an echo $CLASSPATH below
the place where it is exported, and see what the contents are?
On 16 Dec 2014 22:46, "Daniel Haviv" wrote:
> I've added every jar in the lib dir to my classpath and still no luck:
>
>
> CLASSPATH=/tmp/spark/spark-branch-1
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core-3.2.2.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-rdbms-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/spar
And this is how my classpath looks like
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
::/home/akhld/mobi/localcluster/spark-1/conf:/home/akhld/mobi/localcluster/spark-1/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus
Same here...
# jar tf lib/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar | grep
SparkSubmit.class
org/apache/spark/deploy/SparkSubmit.class
On Tue, Dec 16, 2014 at 6:50 PM, Akhil Das
wrote:
>
> This is how it looks on my machine.
>
> [image: Inline image 1]
>
> Thanks
> Best Regards
>
> On
It seems to be slightly related to this:
https://issues.apache.org/jira/browse/SPARK-1340
But in this case, it's not the Task that is failing but the entire executor
where the Kafka Receiver resides.
2014-12-16 16:53 GMT+00:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:
>
> Dear spark
Dear spark community,
We were testing a spark failure scenario where the executor that is running
a Kafka Receiver dies.
We are running our streaming jobs on top of mesos and we killed the mesos
slave that was running the executor ; a new executor was created on another
mesos-slave but according
This is how it looks on my machine.
[image: Inline image 1]
Thanks
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv wrote:
>
> That's the first thing I tried... still the same error:
> hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
> hdfs@ams-rsrv01:~$ cd /tmp/spa
Given that the error is
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
...this usually means there is a Hadoop version problem.
But in particular it's
https://issues.apache.org/jira/browse/SPARK-3039 which affects
as
It turns out that this happens when the checkpoint is set to a local directory
path. I have opened JIRA SPARK-4862 for Spark Streaming to output a better
error message.
Thanks,
Aniket
On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar
wrote:
> I am using spark 1.1.0 running a streaming job that uses u
Hi All,
I saw some help online about forcing avro-mapred to hadoop2 using
classifiers.
Now my configuration is this:
val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
However I still get java.lang.IncompatibleClassChangeError. I think I am
not building sp
Take a look at CombineFileInputFormat. Repartition or coalesce could
introduce shuffle I/O overhead.
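This is not from the original thread, but assuming Hadoop 2.x, a minimal sketch of reading with the combine format (path and split size are assumptions):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Pack many small files into fewer splits without shuffling; cap each split at 64 MB
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
val combined = sc.newAPIHadoopFile("/user/foo/myfiles/*",
  classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text]).map(_._2.toString)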
On Dec 16, 2014 7:09 AM, "bethesda" wrote:
> Thank you! I had known about the small-files problem in HDFS but didn't
> realize that it affected sc.textFile().
>
>
>
> --
> View this message in
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark/spark-branch-1.1$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleu
You could try using zipWithIndex (links below to API docs). For example, in
python:
items = ['a', 'b', 'c']
items2 = sc.parallelize(items)
print(items2.first())
items3 = items2.map(lambda x: (x, x + "!"))
print(items3.first())
items4 = items3.zipWithIndex()
print(items4.first())
items5=items4.map(lamb
You would do:
rdd.zipWithIndex
This gives you an RDD[(Original, Int)] where the second element is the index.
To have an (index, original) tuple, you will need to map that previous RDD to
the desired shape:
rdd.zipWithIndex.map(_.swap)
-kr, Gerard.
On Tue, Dec 16, 2014 at 4:12 PM, bethe
Hi
This time I need expert.
On 1.1.1 and only in cluster (standalone or EC2)
when I use this code :
countersPublishers.foreachRDD(rdd => {
rdd.foreachPartition(partitionRecords => {
partitionRecords.foreach(record => {
//dbActorUpdater ! updateDBMessage(record)
println(
I think this is sort of a newbie question, but I've checked the api closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of Tuples where the first Tuple
value is an incremented Int e.g. 1,2,3 ... and the second value of the Tuple
is the original RDD record? I'm tryi
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
I am using spark 1.1.0 running a streaming job that uses updateStateByKey
and then (after a bunch of maps/flatMaps) does a foreachRDD to save data in
each RDD by making HTTP calls. The issue is that each time I attempt to
save the RDD (using foreach on RDD), it gives me the following exception:
or
Do you actually need spark streaming per se for your use case? If you're
just trying to read data out of kafka into hbase, would something like this
non-streaming rdd work for you:
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
Note th
Hi Demi,
Thanks for sharing.
What we usually do is let the driver read the configuration for the job and
pass the config object to the actual job as a serializable object. That
avoids the need for a centralized config-sharing point that needs to be
accessed from the workers, as you have define
Hi Guys,
I'm running a spark cluster in AWS with Spark 1.1.0 in EC2
I am trying to convert an RDD with a tuple
(u'string', int , {(int, int): int, (int, int): int})
to a schema rdd using the schema:
fields = [StructField('field1',StringType(),True),
StructField('field2',Intege
Hi,
As you have 1,000 files, the RDD created by textFile will have 1,000
partitions. That is normal. In fact, following the same principle as HDFS, it is
better to store data in a smaller number of larger files.
You can use data.coalesce(10) to solve this problem (it reduces the number of
par
Creating an RDD from a wildcard like this:
val data = sc.textFile("/user/foo/myfiles/*")
Will create 1 partition for each file found. 1000 files = 1000 partitions.
A task is a job stage (defined as a sequence of transformations) applied to
a partition, so 1000 partitions = 1000 tasks per stage.
Y
Hi,
How many clients and how many products do you have?
Cheers
Gen
jaykatukuri wrote
> Hi all, I am running into an out of memory error while running ALS using
> MLLIB on a reasonably small data set consisting of around 6 Million
> ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap
Try to repartition the data like:
val data = sc.textFile("/user/foo/myfiles/*").repartition(100)
Since the file size is small it shouldn't be a problem.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 6:21 PM, bethesda wrote:
>
> Our job is creating what appears to be an inordinate number of v
sc.textFile uses a Hadoop input format. Hadoop input formats by default
create one task per file, and they are not very suitable for many very
small files. Can you turn your 1000 files into one larger text file?
Otherwise maybe try:
val data = sc.textFile("/user/foo/myfiles/*").coalesce(100)
On
I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running
PySpa
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our os inode and file limits. Rather than continually
upping those limits, we are seeking to understand whether our real problem
is that too many tasks are running, perhaps because we are mis-configured
I agree that this is not a trivial task, as in this approach the Kafka acks
will be done by the Spark tasks; that means a pluggable way to ack your
input data source, i.e. changes in core.
From my limited experience with Kafka + Spark, what I've seen is that if Spark
tasks take longer than the batc
Hi Jeniba,
The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the on-going work in Spark
Streaming to fix HA.
https://www.youtube.com/watch?v=jcJq3ZalXD8
-kr, Gerard.
On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson <
jeniba.j
Print the CLASSPATH and make sure the Spark assembly jar is there in the
classpath.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv wrote:
>
> Hi,
> I've built spark successfully with maven but when I try to run spark-shell
> I get the following errors:
>
> Spark assembly has be
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by: j
You can try to decrease the rank value.
Hi
Recently I have had some problems with RDD behavior. It's about the
"RDD.first" and "RDD.toArray" methods when an RDD only has one element. I can't get
the correct element from the RDD. I will give more detail after the code.
My code was as follows:
//get an rdd with just one row RDD[(Long,A
Hi,
I'm currently having this issue as well, while using Spark with Akka 2.3.4.
Did Debasish's suggestion work for you?
I've already built Spark with hadoop 2.3.0 but still having the error.
tnks,
Rod
Make sure ulimit is set system wide across the cluster (ulimit -a). Also
reduce the number of partitions to a smaller number (say 200-500) to get
rid of the "too many open files" error.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 3:54 PM, aecc wrote:
>
> Hi guys,
>
> It happens to me quite often that
Hi guys,
It happens to me quite often that when the locality level of a task goes
further than LOCAL (NODE, RACK, etc), I get some of the following
exceptions: "too many files open", "encountered unregistered class id",
"cannot cast X to Y".
I do not get any exceptions during shuffling (which mea
Hi all,
I understand that parquet allows for schema versioning automatically in the
format; however, I'm not sure whether Spark supports this.
I'm saving a SchemaRDD to a parquet file, registering it as a table, then
doing an insertInto with a SchemaRDD with an extra column.
The second SchemaRDD
So the fetch failure error is gone? Can you paste the code that you are
executing? What is the size of the data and your cluster setup?
Thanks
Best Regards
On Tue, Dec 16, 2014 at 3:16 PM, Ma,Xi wrote:
>
> Hi Das,
>
>
>
> Thanks for your advice.
>
>
>
> I'm not sure what's the usage of setting
Hi Das,
Thanks for your advice.
I'm not sure what the effect of setting memoryFraction to 1 would be. I've tried to
rerun the test with the following parameters in spark-defaults.conf, but it
failed again:
spark.rdd.compress true
spark.akka.frameSize 50
spark.storage.memoryFraction 0.8
spark.core.
Hi I am trying to filter a large table with 3 columns. My goal is to filter
this bigtable using multiple clauses. I filtered bigtable 3 times, but the first
filtering took about 50 seconds to complete whereas the second and third
filter transformations took about 5 seconds. I wonder if it is because of
Hi,
There is ongoing work on model export:
https://www.github.com/apache/spark/pull/3062
For now, since the linear regression model is serializable you can save it as an
object file:
sc.parallelize(Seq(model)).saveAsObjectFile("path")
then
val model = sc.objectFile[LinearRegressionModel]("path").first
model.predict(...)
Hi Jai,
Refer this doc and make sure your network is not blocking
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
Also make sure you are using the same version of spark in both places (the
one on the cluster, and t
A workaround trick has been found and put in the ticket
https://issues.apache.org/jira/browse/SPARK-4854. Hope this is useful.
Hi I am trying to filter a large table with 3 columns. Spark SQL might be a
good choice but I want to do it without SQL. The goal is to filter bigtable
with multiple clauses. I filtered bigtable 3 times, but the first filtering takes
about 50 seconds while the second and third filter transformations took about
Hi,
Currently I am trying to count on a document with multiple filters.
Let say, here is my document:
//user field1 field2 field3
user1 0 0 1
user2 0 1 0
user3 0 0 0
I want to count on user.log for some filters like this:
Filter1: field1 == 0 & field2 == 0
Filter2: field1 == 0 & field3 == 1
Filte
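In case it helps, a minimal Scala sketch of computing several such counts in one pass (the path, separator, and column order are assumptions, not from the original mail):
// Assumes whitespace-separated data rows only (no header): user field1 field2 field3
val rows = sc.textFile("hdfs:///logs/user.log").map(_.trim.split("\\s+"))

val (filter1, filter2) = rows.aggregate((0L, 0L))(
  (acc, f) => {
    val (f1, f2, f3) = (f(1).toInt, f(2).toInt, f(3).toInt)
    (acc._1 + (if (f1 == 0 && f2 == 0) 1L else 0L),
     acc._2 + (if (f1 == 0 && f3 == 1) 1L else 0L))
  },
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
println("Filter1: " + filter1 + ", Filter2: " + filter2)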