Hi guys,
I get Kryo exceptions of the type "unregistered class id" and "cannot cast
to class" when the locality level of the tasks goes beyond LOCAL.
However I get no Kryo exceptions during shuffling operations.
If the locality level never goes beyond LOCAL everything works fine.
Is there a speci
Could you post the stack trace?
Best Regards,
Shixiong Zhu
2014-12-16 23:21 GMT+08:00 richiesgr :
>
> Hi
>
> This time I need expert.
> On 1.1.1 and only in cluster (standalone or EC2)
> when I use this code :
>
> countersPublishers.foreachRDD(rdd => {
> rdd.foreachPartition(partitionRecords
I also got the same problem..
2014-12-09 22:58 GMT+08:00 Daniel Haviv :
>
> Hi,
> I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
> get the following exception:
>
> 14/12/09 06:54:24 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 14/12/
Did you try something like:
//Get the last hour
val d = (System.currentTimeMillis() - 3600 * 1000)
val ex = "abc_" + d.toString().substring(0,7) + "*.json"
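If that glob matches your file layout, it could be passed straight to sc.textFile, which accepts glob patterns. A minimal sketch (the directory below is an assumption, not from the original thread):
// Hypothetical input directory; "ex" is the pattern built above
val lastHour = sc.textFile("hdfs:///user/foo/myfiles/" + ex)
println(lastHour.count())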
[image: Inline image 1]
Thanks
Best Regards
On Wed, Dec 17, 2014 at 5:05 AM, durga wrote:
>
> Hi All,
>
> I need help with regex in my sc
I got the key point. The problem is in sc.sequenceFile. From the API description,
"the RDD will create many references to the same object", so I revised the code
from "sessions.getBytes" to "sessions.getBytes.clone",
and it seems to work.
Thanks.
Jerry,
On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj wrote:
>
> Another problem with the DSL:
>
> t1.where('term == "dmin").count() returns zero.
Looks like you need ===:
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
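A minimal sketch of the fix, assuming t1 is the SchemaRDD from the earlier mail:
// '=== builds a Catalyst equality expression; 'term == "dmin" is plain Scala
// equality between a Symbol and a String, which is always false, so the filter drops every row
val dminCount = t1.where('term === "dmin").count()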
Tobias
Hi,
I have a Spark cluster using standalone mode. The Spark Master is
configured in High Availability mode.
Now I am going to upgrade Spark from 1.0 to 1.1, but don't want to
interrupt the currently running jobs.
(1) Is there any way to perform a rolling upgrade (while a job is running)?
(2) If not, whe
Gautham,
How many gz files do you have? Maybe the reason is that a gz file is
compressed and can't be split for processing by MapReduce. A single gz
file can only be processed by a single mapper, so the CPU threads can't be
fully utilized.
-Original Message-
From: Gau
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero. But
sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700,
which I know is correct from the data. Is there something wrong with how
I'm using the DSL?
Thanks
On 17/12/14 11:13 am, Jerry Raj wrote:
Hi, Shuai,
How did you turn off the file split in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return false. If you do have such a FileInputFormat, you can simply pass it as a
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
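For reference, a minimal sketch of such an input format and one way to wire it up via sc.newAPIHadoopFile (the class name and path are made up, not from the original poster's code):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each file becomes exactly one split, so one task processes a whole file
class WholeFileTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}

// Hypothetical path; keys are byte offsets, values are lines
val perFile = sc.newAPIHadoopFile("/data/input", classOf[WholeFileTextInputFormat],
  classOf[LongWritable], classOf[Text])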
Hi there
I also got an exception when running the SparkPi example on YARN
Spark version: spark-1.1.1-bin-hadoop2.4
My environment: Hortonworks HDP 2.2
My command:
./bin/spark-submit --master yarn-cluster --class
org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
Output logs:
14/12/17 14:06:32 IN
Releases come roughly every 3 months, so you should expect it around March if the
pace stays steady.
2014-12-16 22:56 GMT-05:00 Marco Shaw :
>
> When it is ready.
>
>
>
> > On Dec 16, 2014, at 11:43 PM, 张建轶 wrote:
> >
> > Hi !
> >
> > when will the spark 1.3.0 be released?
> > I want to use new LDA featu
Hi,
I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I
have two tables (backed by Parquet files) and I need to do a join across
them using a common field (user_id). This works fine using standard SQL
but not using the language-integrated DSL neither
t1.join(t2, on = 't1.us
Actually there were still fetch failures. However, after I upgraded Spark to
1.1.1, this error did not occur again.
Thanks,
Mars
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: December 16, 2014 17:52
To: Ma,Xi
Cc: u...@spark.incubator.apache.org
Subject: Re: Reply: Fetch Failed caused job failed.
RDD.persist() can be useful here.
On 11 December 2014 at 14:34, ankits [via Apache Spark User List] <
ml-node+s1001560n20613...@n3.nabble.com> wrote:
>
> I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too
> fast. How can i inspect the size of RDD in memory and get more informa
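A minimal sketch of persisting and then inspecting the cached size from the driver (the RDD name is a placeholder; getRDDStorageInfo is a developer API, and the same numbers appear on the web UI's Storage tab):
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()  // run an action so the cache actually gets populated

// Rough in-memory footprint per cached RDD
sc.getRDDStorageInfo.foreach(i => println(i.name + ": " + i.memSize + " bytes"))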
When it is ready.
> On Dec 16, 2014, at 11:43 PM, 张建轶 wrote:
>
> Hi !
>
> when will the spark 1.3.0 be released?
> I want to use new LDA feature.
> Thank
> you!
Hi !
when will the spark 1.3.0 be released?
I want to use new LDA feature.
Thank you!
Thanks for the clarifications. I misunderstood what the number on UI meant.
On Mon, Dec 15, 2014 at 7:00 PM, Sean Owen wrote:
> I believe this corresponds to the 0.6 of the whole heap that is
> allocated for caching partitions. See spark.storage.memoryFraction on
> http://spark.apache.org/docs/l
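For reference, that fraction is set in spark-defaults.conf (0.6 is the default; the value below is only an example):
spark.storage.memoryFraction   0.6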
Hi
Recently I have had some problems with RDD behavior. It's about the
"RDD.first" and "RDD.toArray" methods when an RDD only has one element.
I get different results from the different methods on a one-element RDD
where I should get the same result. I will give more detail after the code.
Wow. I just realized what was happening and it's all my fault. I have a
library method that I wrote that presents the RDD, and I was actually
repartitioning it myself.
I feel pretty dumb. Sorry about that.
I've been trying to figure out how to use Spark to do a simple aggregation
without repartitioning and essentially creating fully instantiated
intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an
Iterator[String] and calling ag
Hi All,
I need help with regex in my sc.textFile()
I have lots of files with epoch millisecond timestamps.
ex:abc_1418759383723.json
Now I need to consume last one hour files using the epoch time stamp as
mentioned above.
I tried a couple of options; nothing seems to work for me.
If any on
Are you reading the file from your driver (main / master) program?
Is your file in a distributed system like HDFS, available to all your nodes?
It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
"Transformations" are lazy
Hi Harry,
Thanks for your response.
I'm working in scala. When I do a "count" call it expands the RDD in the
count (since it's an action). You can see the call stack that results in
the failure of the job here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial write
Your Spark is trying to load a Hadoop library, "winutils.exe", which you
don't have on your Windows machine:
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
a
hey igor!
a few ways to work around this depending on the level of exception-handling
granularity you're willing to accept:
1) use mapPartitions() to wrap the entire partition handling code in a
try/catch -- this is fairly coarse-grained, however, and will fail the
entire partition.
2) modify your
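For option 1 above, a minimal sketch (rawRdd and parseRecord are hypothetical names, not from the original mail):
val parsed = rawRdd.mapPartitions { iter =>
  try {
    // toList forces evaluation inside the try; a lazy iterator would escape the catch
    iter.map(parseRecord).toList.iterator
  } catch {
    case e: Exception => Iterator.empty  // any failure drops the whole partition
  }
}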
Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the "streaming"
behaviour you wish for, but I could be wrong. Also, are you in pyspark or
scala Spark?
I had created https://issues.apache.org/jira/browse/SPARK-4866, it
will be fixed by https://github.com/apache/spark/pull/3714.
Thank you for reporting this.
Davies
On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu wrote:
> It's a bug, could you file a JIRA for this? thanks!
>
> On Tue, Dec 16, 2014
I've been running a job in local mode using --master local[*] and I've
noticed that, for some reason, exceptions appear to get eaten- as in, I
don't see them. If i debug in my IDE, I'll see that an exception was thrown
if I step through the code but if I just run the application, it appears
everyth
It's a bug, could you file a JIRA for this? thanks!
On Tue, Dec 16, 2014 at 5:49 AM, sahanbull wrote:
>
> Hi Guys,
>
> Im running a spark cluster in AWS with Spark 1.1.0 in EC2
>
> I am trying to convert a an RDD with tuple
>
> (u'string', int , {(int, int): int, (int, int): int})
>
> to a schema
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s
On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas wrote:
> Hi Jeniba,
>
> The second part of this meetup recording has a very good answer to your
> question. TD explains the current behavior and the on-going work in Spark
> S
Hi All,
My application loads 1000 files, each from 200 MB to a few GB, and combines
them with other data to do calculations.
Some pre-calculation must be done at the file level; after that, the
results need to be combined for further calculation.
In Hadoop, it is simple because I can turn off
Wow, really weird. My intuition is the same as everyone else's: some
unprintable character. Here are a couple more debugging tricks I've used in
the past:
//set up an accumulator to catch the bad rows as a side-effect
val nBadRows = sc.accumulator(0)
val nGoodRows = sc.accumulator(0)
val badRows =
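One possible continuation of that sketch (not the original author's code): "lines" is whatever RDD holds the raw rows, and looksValid is a placeholder for the actual parse check.
val badRows = lines.filter { row =>
  val ok = looksValid(row)            // hypothetical validity check
  if (ok) nGoodRows += 1 else nBadRows += 1
  !ok                                  // keep only the rows that failed
}
badRows.take(10).foreach(println)      // inspect a sample of the offending rows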
Hi,
I am trying to run the StreamingLinearRegression example on a single node
without Hadoop installed. I keep getting the following error and cannot find any
documentation about the issue:
/14/12/16 13:26:50 ERROR JobScheduler: Error running job streaming job
141875801 ms.1
org.apache.spark.SparkEx
I am having the same issue, and it still does not update for me. I am trying
to execute the example by using bin/run-example
Thanks Cheng. Tried it out and saw the InMemoryColumnarTableScan word in the
physical plan.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, December 12, 2014 11:37 PM
To: Judy Nash; user@spark.apache.org
Subject: Re: Spark SQL API Doc & IsCached as SQL command
There isn’t a SQL st
I have tested my python script by using the pyspark shell. I run into an
error because of memory limits on the name node.
I am wondering how I run the script on Spark on YARN. I am not familiar with
this at all.
Any help would be greatly appreciated.
Thanks,
--
MAGNE+IC
Sam Flint |
I think accumulators do exactly what you want.
(Scala syntax below, I'm just not familiar with the Java equivalent ...)
val f1counts = sc.accumulator(0)
val f2counts = sc.accumulator(0)
val f3counts = sc.accumulator(0)
textfile.foreach { s =>
  if (f1matches) f1counts += 1
  ...
}
Note that y
Hi all,
I am trying to run a simple Spark_Pi application through Yarn from Java code. I
have the Spark_Pi class and everything works fine if I run on Spark. However,
when I set master to "yarn-client" and set yarn mode to true, I keep getting
exceptions. I suspect this has something to do with
Is there a way to get Spark to NOT repartition/shuffle/expand a
sc.textFile(fileUri) when the URI is a gzipped file?
Expanding a gzipped file should be thought of as a "transformation" and not
an "action" (if the analogy is apt). There is no need to fully create and
fill out an intermediate RDD wit
Hi guys
I am hoping someone might have a clue on why this is happening. Otherwise I
will have to delve into the YARN module's source code to better understand the
issue.
On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar
wrote:
> I am running spark 1.1.0 on AWS EMR and I am running a batch job that
>
Also, this may be related to this issue
https://issues.apache.org/jira/browse/SPARK-3885.
Further, to clarify, data is being written to Hadoop on the data nodes.
Would really appreciate any help. Thanks!
From: "Ganelin, Ilya" <ilya.gane...@capitalone.com>
Date: Tuesday, December 16, 2
Hi all – I’m running a long running batch-processing job with Spark through
Yarn. I am doing the following
Batch Process
val resultsArr =
sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]())
InMemoryArray.forEach{
1) Using a thread pool, generate callable jobs that operate
Nvm. I'm going to post another question since this has to do with the way
spark handles sc.textFile with a file://.gz
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri) where the fileUri is to a
".gz" file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect t
Hi Manas,
There is a small patch needed for HDP2.2. You can refer to this PR
https://github.com/apache/spark/pull/3409
There are some other issues compiling against hadoop2.6. But we will fully
support it very soon. You can ping me, if you want.
Thanks.
Zhan Zhang
On Dec 12, 2014, at 11:38 AM
Okay,
I have an rdd that I want to run an aggregate over but it insists on
spilling to disk even though I structured the processing to only require a
single pass.
In other words, I can do all of my processing one entry in the rdd at a time
without persisting anything.
I set rdd.persist(StorageLe
To ask a related question, if I use Zookeeper for table locking, will this
affect all attempts to access the Hive tables (including those from my Spark
applications) or only those made through the Thriftserver? In other words, does
Zookeeper provide concurrency for the Hive metastore in general
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd
I'm using CDH5 that was installed via Cloudera Manager.
Does it matter?
Thanks,
Daniel
> On 16 Dec 2014, at 19:18, Akhil Das wrote:
>
> Can you open the file bin/spark-class and then put an echo $CLASSPATH below
> the place where they exports it and see what are the contents?
>
>> On 16 De
Completely different from the one I set:
Classpath is
::/tmp/spark/spark-branch-1.1/conf:/tmp/spark/spark-branch-1.1/assembly/target/scala-2.10/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar:/tmp/spark/spark-branch-1.1/lib_managed/jars/datanucleus-core-3.2.2.jar:/tmp/spark/spark-branch-1.1/lib_mana
Can you open the file bin/spark-class, put an echo $CLASSPATH below
the place where it is exported, and see what the contents are?
On 16 Dec 2014 22:46, "Daniel Haviv" wrote:
> I've added every jar in the lib dir to my classpath and still no luck:
>
>
> CLASSPATH=/tmp/spark/spark-branch-1
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core-3.2.2.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-rdbms-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/spar
And this is how my classpath looks like
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
::/home/akhld/mobi/localcluster/spark-1/conf:/home/akhld/mobi/localcluster/spark-1/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus
Same here...
# jar tf lib/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar | grep
SparkSubmit.class
org/apache/spark/deploy/SparkSubmit.class
On Tue, Dec 16, 2014 at 6:50 PM, Akhil Das
wrote:
>
> This is how it looks on my machine.
>
> [image: Inline image 1]
>
> Thanks
> Best Regards
>
> On
It seems to be slightly related to this:
https://issues.apache.org/jira/browse/SPARK-1340
But in this case, it's not the Task that is failing but the entire executor
where the Kafka Receiver resides.
2014-12-16 16:53 GMT+00:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:
>
> Dear spark
Dear spark community,
We were testing a spark failure scenario where the executor that is running
a Kafka Receiver dies.
We are running our streaming jobs on top of mesos and we killed the mesos
slave that was running the executor ; a new executor was created on another
mesos-slave but according
This is how it looks on my machine.
[image: Inline image 1]
Thanks
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv wrote:
>
> That's the first thing I tried... still the same error:
> hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
> hdfs@ams-rsrv01:~$ cd /tmp/spa
Given that the error is
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
...this usually means there is a Hadoop version problem.
But in particular it's
https://issues.apache.org/jira/browse/SPARK-3039 which affects
as
It turns out that this happens when the checkpoint is set to a local directory
path. I have opened JIRA SPARK-4862 for Spark Streaming to output a better
error message.
Thanks,
Aniket
On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar
wrote:
> I am using spark 1.1.0 running a streaming job that uses u
Hi All,
I saw some help online about forcing avro-mapred to hadoop2 using
classifiers.
Now my configuration is this:
val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
However I still get java.lang.IncompatibleClassChangeError. I think I am
not building sp
Take a look at CombineFileInputFormat. Repartition or coalesce could
introduce shuffle I/O overhead.
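This is not from the original thread, but assuming Hadoop 2.x, a minimal sketch of reading with the combine format (path and split size are assumptions):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Pack many small files into fewer splits without shuffling; cap each split at 64 MB
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
val combined = sc.newAPIHadoopFile("/user/foo/myfiles/*",
  classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text]).map(_._2.toString)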
On Dec 16, 2014 7:09 AM, "bethesda" wrote:
> Thank you! I had known about the small-files problem in HDFS but didn't
> realize that it affected sc.textFile().
>
>
>
> --
> View this message in
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark/spark-branch-1.1$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleu
You could try using zipWithIndex (links below to API docs). For example, in
python:
items = ['a', 'b', 'c']
items2 = sc.parallelize(items)
print(items2.first())
items3 = items2.map(lambda x: (x, x + "!"))
print(items3.first())
items4 = items3.zipWithIndex()
print(items4.first())
items5=items4.map(lamb
You would do:
rdd.zipWithIndex
This gives you an RDD[(Original, Int)] where the second element is the index.
To have an (index, original) tuple, you will need to map that previous RDD to
the desired shape:
rdd.zipWithIndex.map(_.swap)
-kr, Gerard.
On Tue, Dec 16, 2014 at 4:12 PM, bethe
Hi
This time I need expert.
On 1.1.1 and only in cluster (standalone or EC2)
when I use this code :
countersPublishers.foreachRDD(rdd => {
rdd.foreachPartition(partitionRecords => {
partitionRecords.foreach(record => {
//dbActorUpdater ! updateDBMessage(record)
println(
I think this is sort of a newbie question, but I've checked the api closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of Tuples where the first Tuple
value is an incremented Int e.g. 1,2,3 ... and the second value of the Tuple
is the original RDD record? I'm tryi
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
I am using spark 1.1.0 running a streaming job that uses updateStateByKey
and then (after a bunch of maps/flatMaps) does a foreachRDD to save data in
each RDD by making HTTP calls. The issue is that each time I attempt to
save the RDD (using foreach on RDD), it gives me the following exception:
or
Do you actually need spark streaming per se for your use case? If you're
just trying to read data out of kafka into hbase, would something like this
non-streaming rdd work for you:
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
Note th
Hi Demi,
Thanks for sharing.
What we usually do is let the driver read the configuration for the job and
pass the config object to the actual job as a serializable object. That
avoids the need for a centralized config-sharing point that needs to be
accessed from the workers, as you have define
Hi Guys,
I'm running a spark cluster in AWS with Spark 1.1.0 in EC2
I am trying to convert an RDD with a tuple
(u'string', int , {(int, int): int, (int, int): int})
to a schema rdd using the schema:
fields = [StructField('field1',StringType(),True),
StructField('field2',Intege
Hi,
As you have 1,000 files, the RDD created by textFile will have 1,000
partitions. That is normal. In fact, following the same principle as HDFS, it is
better to store data in a smaller number of larger files.
You can use data.coalesce(10) to solve this problem (it reduces the number of
par
Creating an RDD from a wildcard like this:
val data = sc.textFile("/user/foo/myfiles/*")
Will create 1 partition for each file found. 1000 files = 1000 partitions.
A task is a job stage (defined as a sequence of transformations) applied to
a partition, so 1000 partitions = 1000 tasks per stage.
Y
Hi,
How many clients and how many products do you have?
Cheers
Gen
jaykatukuri wrote
> Hi all, I am running into an out of memory error while running ALS using
> MLLIB on a reasonably small data set consisting of around 6 Million
> ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap
Try to repartition the data like:
val data = sc.textFile("/user/foo/myfiles/*").repartition(100)
Since the file size is small it shouldn't be a problem.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 6:21 PM, bethesda wrote:
>
> Our job is creating what appears to be an inordinate number of v
sc.textFile uses a Hadoop input format. Hadoop input formats by default
create one task per file, and they are not very suitable for many very
small files. Can you turn your 1000 files into one larger text file?
Otherwise maybe try:
val data = sc.textFile("/user/foo/myfiles/*").coalesce(100)
On
I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running
PySpa
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our os inode and file limits. Rather than continually
upping those limits, we are seeking to understand whether our real problem
is that too many tasks are running, perhaps because we are mis-configured
I agree that this is not a trivial task, as in this approach the Kafka acks
will be done by the Spark tasks; that means a pluggable way to ack your
input data source, i.e. changes in core.
From my limited experience with Kafka + Spark, what I've seen is that if Spark
tasks take longer than the batc
Hi Jeniba,
The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the on-going work in Spark
Streaming to fix HA.
https://www.youtube.com/watch?v=jcJq3ZalXD8
-kr, Gerard.
On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson <
jeniba.j
Print the CLASSPATH and make sure the Spark assembly jar is there in the
classpath.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv wrote:
>
> Hi,
> I've built spark successfully with maven but when I try to run spark-shell
> I get the following errors:
>
> Spark assembly has be
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by: j
You can try to decrease the rank value.
Hi
Recently I have had some problems with RDD behavior. It's about the
"RDD.first" and "RDD.toArray" methods when an RDD only has one element. I can't get
the correct element from the RDD. I will give more detail after the code.
My code was as follows:
//get an rdd with just one row RDD[(Long,A
Hi,
I'm currently having this issue as well, while using Spark with Akka 2.3.4.
Did Debasish's suggestion work for you?
I've already built Spark with hadoop 2.3.0 but still having the error.
tnks,
Rod
Make sure ulimit is set system wide across the cluster (ulimit -a). Also
reduce the number of partitions to a smaller number (say 200-500) to get
rid of the "too many open files" error.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 3:54 PM, aecc wrote:
>
> Hi guys,
>
> It happens to me quite often that
Hi guys,
It happens to me quite often that when the locality level of a task goes
further than LOCAL (NODE, RACK, etc), I get some of the following
exceptions: "too many files open", "encountered unregistered class id",
"cannot cast X to Y".
I do not get any exceptions during shuffling (which mea
Hi all,
I understand that parquet allows for schema versioning automatically in the
format; however, I'm not sure whether Spark supports this.
I'm saving a SchemaRDD to a parquet file, registering it as a table, then
doing an insertInto with a SchemaRDD with an extra column.
The second SchemaRDD
So the fetch failure error is gone? Can you paste the code that you are
executing? What is the size of the data and your cluster setup?
Thanks
Best Regards
On Tue, Dec 16, 2014 at 3:16 PM, Ma,Xi wrote:
>
> Hi Das,
>
>
>
> Thanks for your advice.
>
>
>
> I'm not sure what's the usage of setting
Hi Das,
Thanks for your advice.
I'm not sure what the effect of setting memoryFraction to 1 would be. I've tried to
rerun the test with the following parameters in spark-defaults.conf, but it
failed again:
spark.rdd.compress true
spark.akka.frameSize 50
spark.storage.memoryFraction 0.8
spark.core.
Hi I am trying to filter a large table with 3 columns. My goal is to filter
this bigtable using multiple clauses. I filtered bigtable 3 times, but the first
filtering took about 50 seconds to complete whereas the second and third
filter transformations took about 5 seconds. I wonder if it is because of
Hi,
There is ongoing work on model export:
https://www.github.com/apache/spark/pull/3062
For now, since the linear regression model is serializable you can save it as an
object file:
sc.parallelize(Seq(model)).saveAsObjectFile("path")
then
val model = sc.objectFile[LinearRegressionModel]("path").first
model.predict(...)
Hi Jai,
Refer this doc and make sure your network is not blocking
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
Also make sure you are using the same version of spark in both places (the
one on the cluster, and t
A workaround trick has been found and put in the ticket
https://issues.apache.org/jira/browse/SPARK-4854. Hope this is useful.
Hi I am trying to filter a large table with 3 columns. Spark SQL might be a
good choice but I want to do it without SQL. The goal is to filter bigtable
with multiple clauses. I filtered bigtable 3 times, but the first filtering takes
about 50 seconds while the second and third filter transformations took about
Hi,
Currently I am trying to count on a document with multiple filters.
Let say, here is my document:
//user field1 field2 field3
user1 0 0 1
user2 0 1 0
user3 0 0 0
I want to count on user.log for some filters like this:
Filter1: field1 == 0 & field2 == 0
Filter2: field1 == 0 & field3 == 1
Filte
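In case it helps, a minimal Scala sketch of computing several such counts in one pass (the path, separator, and column order are assumptions, not from the original mail):
// Assumes whitespace-separated data rows only (no header): user field1 field2 field3
val rows = sc.textFile("hdfs:///logs/user.log").map(_.trim.split("\\s+"))

val (filter1, filter2) = rows.aggregate((0L, 0L))(
  (acc, f) => {
    val (f1, f2, f3) = (f(1).toInt, f(2).toInt, f(3).toInt)
    (acc._1 + (if (f1 == 0 && f2 == 0) 1L else 0L),
     acc._2 + (if (f1 == 0 && f3 == 1) 1L else 0L))
  },
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
println("Filter1: " + filter1 + ", Filter2: " + filter2)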