If you want to force materialization, use .count().
Also, if you can, simply don't unpersist anything unless you really need to free
the memory.
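A minimal sketch of that pattern (the name 'rdd' is a placeholder, not from the original message):

val cached = rdd.cache()
cached.count()        // an action forces evaluation, so the data is actually cached
// ... reuse 'cached' in later jobs ...
// cached.unpersist() // only if you really need to free the memory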
—
Sent from Mailbox
On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
BTW, it is possible that rdd.first()
Well i was able to get it to work by running spark over mesos. But it looks
like a bug while running spark alone.
Hi,
I'm facing a similar problem.
According to: http://tachyon-project.org/Running-Spark-on-Tachyon.html
in order to allow the Tachyon client to connect to the Tachyon master in HA mode, you
need to pass 2 system properties:
-Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
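As a hedged aside, one way such a system property can be set from application code before the Tachyon client is initialized (many deployments pass it as a -D launcher option instead; which mechanism applies here is an assumption):

// Assumption: set the property programmatically before any Tachyon client code runs.
System.setProperty("tachyon.zookeeper.address",
  "zookeeperHost1:2181,zookeeperHost2:2181")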
I used these options to show the GC timings: -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Following is the output I got on the standard output :
4.092: [GC 4.092: [ParNew: 274752K->27199K(309056K), 0.0421460 secs]
274752K->27199K(995776K), 0.0422720 secs] [Times: user=0.33 sys=0.11,
http://stackoverflow.com/questions/895444/java-garbage-collection-log-messages
http://stackoverflow.com/questions/16794783/how-to-read-a-verbosegc-output
I think this will help in understanding the logs.
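As an aside, a hedged sketch of one way to forward such GC flags to executor JVMs via spark.executor.extraJavaOptions (the application name below is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: forward GC-logging flags to the executor JVMs.
val conf = new SparkConf()
  .setAppName("gc-logging-example")   // made-up name
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)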
On Wed, Jun 11, 2014 at 12:53 PM, nilmish nilmish@gmail.com wrote:
I used these
Hi,
I am trying to get a sense of the number of streams we can process in parallel
on a Spark Streaming cluster (Hadoop YARN).
Is there any benchmark for this?
We need a large number of streams (original + transformed) to be processed in
parallel.
The number is approximately 30.
Thanks Tobias for replying.
The problem was that I had to provide the dependency jars' paths to the
StreamingContext within the code. Providing all the jar paths resolved
my problem. Refer to the code snippet below:
JavaStreamingContext ssc = new JavaStreamingContext(args[0],
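Since the snippet above is cut off, here is a hedged Scala sketch of the same idea, passing jar paths to the streaming context constructor; the master URL, app name, batch interval, and jar paths are placeholders, not the original values:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// All values below are placeholders for illustration only.
val ssc = new StreamingContext(
  "spark://master:7077",                     // master URL
  "MyStreamingApp",                          // app name
  Seconds(10),                               // batch interval
  System.getenv("SPARK_HOME"),
  Seq("/path/to/dependency-1.jar", "/path/to/dependency-2.jar"))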
Any suggestions from anyone?
Thanks
Bijoy
On Tue, Jun 10, 2014 at 11:46 PM, bijoy deb bijoy.comput...@gmail.com
wrote:
Hi all,
I have built Shark 0.9.1 using sbt with the command below:
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly
My Hadoop cluster also has version
These stack traces come from the stuck node? Looks like it's waiting on
data in BlockFetcherIterator. Waiting for data from another node. But you
say all other nodes were done? Very curious.
Maybe you could try turning on debug logging, and try to figure out what
happens in BlockFetcherIterator (
Hi,
I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY
give weird results.
To reproduce, type the following commands in spark-shell connecting to a
standalone server:
case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import
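The repro above is cut off; a hedged reconstruction of a query of that shape (the table name, data, and query are assumptions, not the original commands):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Foo(k: String, v: Int)
// Assumed data and query, for illustration only.
val rows = sc.parallelize(List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)))
rows.registerAsTable("foo")
sql("SELECT k, COUNT(*) FROM foo GROUP BY k").collect().foreach(println)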
Hi All,
I have to normalize a set of values in the range 0-500 to the [0-1] range.
Is there any util method in MLBase to normalize a large set of data?
BR,
Aslan
That’s a good catch, but I think it’s suggested to use HiveContext currently.
( https://github.com/apache/spark/tree/master/sql)
Catalyst$ sbt/sbt hive/console
case class Foo(k: String, v: Int)
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++
List.fill(300)(Foo("c", 3))
About more succeeded tasks than total tasks:
- This can happen if you have enabled speculative execution. Some
partitions can get processed multiple times.
- More commonly, the result of the stage may be used in a later
calculation, and has to be recalculated. This happens if some of the
results
It looks like this was due to another executor on a different node closing
the connection on its side. I found the entries below in the remote side's
logs.
Can anyone comment on why one ConnectionManager would close its connection
to another node and what could be tuned to avoid this? It did not
Daniel,
Thanks for the explanation.
On Wed, Jun 11, 2014 at 8:57 AM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
About more succeeded tasks than total tasks:
- This can happen if you have enabled speculative execution. Some
partitions can get processed multiple times.
- More
I'd try rerunning with master. It is likely you are running into SPARK-1994
https://issues.apache.org/jira/browse/SPARK-1994.
Michael
On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY
give weird
Hi,
I am currently using Spark 1.0 locally on Windows 7. I would like to use
classes from an external jar in the spark-shell. I followed the instructions in:
http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E
I
Ah, not that it should matter, but I'm on Linux and you seem to be on
Windows... maybe there is something weird going on with the Windows
launcher?
On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin van...@cloudera.com wrote:
Just tried this and it worked fine for me:
./bin/spark-shell --jars
Hello Spark/PMML enthusiasts,
It's pretty trivial to integrate the JPMML-Evaluator library with Spark. In
brief, take the following steps in your Spark application code:
1) Create a Java Map (arguments) that represents the input data record.
You need to specify a key-value mapping for every
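As a hedged sketch of step 1 only (field names and values are made up, and the JPMML evaluation calls themselves are not shown):

import scala.collection.JavaConverters._

// Made-up field names/values; one argument map per input record.
val arguments: java.util.Map[String, Any] =
  Map[String, Any]("age" -> 35, "income" -> 50000.0).asJava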
Are you able to import any class from your jars within spark-shell?
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Wednesday, June 11, 2014 9:36 PM
To: user@spark.apache.org
Subject: Re: Adding external jar to spark-shell classpath in spark 1.0
Ah, not that it
Could you elaborate on this? I don’t have an application, I just use spark
shell.
From: Andrew Or [mailto:and...@databricks.com]
Sent: Wednesday, June 11, 2014 9:40 PM
To: user@spark.apache.org
Subject: Re: Adding external jar to spark-shell classpath in spark 1.0
This is a known issue:
The error is saying that your client libraries are older than what
your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7).
Try double-checking that your build is actually using that version
(e.g., by looking at the hadoop jar files in lib_managed/jars).
On Wed, Jun 11, 2014 at 2:07 AM, bijoy
Using MEMORY_AND_DISK_SER to persist the input RDD[Rating] seems to work
right for me now. I'm testing on a larger dataset and will see how it goes.
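For reference, a minimal sketch of that storage level (the name 'ratings' is a placeholder for the input RDD[Rating]):

import org.apache.spark.storage.StorageLevel

ratings.persist(StorageLevel.MEMORY_AND_DISK_SER)
ratings.count()   // materialize once so later iterations reuse the persisted data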
On Wed, Jun 11, 2014 at 9:56 AM, Neville Li neville@gmail.com wrote:
Does cache eviction affect disk storage level too? I tried cranking up
Hello, I was wondering if we could add our organization to the Powered by
Spark page. The information is:
Name: Vistar Media
URL: www.vistarmedia.com
Description: Location technology company enabling brands to reach on-the-go
consumers.
Let me know if you need anything else.
Thanks!
Derek
I rearranged my code to do a reduceByKey, which I think is working. I also
don't think the problem was that updateState call, but something else;
unfortunately I changed a lot while looking for this issue, so I'm not sure what
the actual fix was, but I think it's working now.
On Wed, Jun
Is there a way in the Apache Spark Kafka Utils to specify an offset to
start reading? Specifically, from the start of the queue, or failing that,
a specific point?
Hi Aslan,
Currently, we don't have a utility function to do this. However, you
can easily implement it with another map transformation. I'm working
on this feature now, and there will be a couple of different
normalization options users can choose from.
Sincerely,
DB Tsai
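A minimal sketch of the map-based normalization suggested above, assuming the values are Doubles already known to lie in [0, 500] ('values' is a placeholder RDD[Double]):

// Min is 0 here, so min-max scaling reduces to dividing by the max.
val maxValue = 500.0
val normalized = values.map(v => v / maxValue)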
Hello,
You're absolutely right: the syntax you're using returns the json4s
value objects, not native types like Int, Long, etc. Fix that problem and
then everything else (filters) will work as you expect. This is a short
snippet of a larger example: [1]
val lines =
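Since the snippet from [1] is cut off here, a hedged illustration (not the original example) of extracting a native type from a json4s value so numeric filters behave as expected:

import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

// Illustration only: pull out a native Int instead of a json4s JValue.
val json = parse("""{"count": 42}""")
val count = (json \ "count").extract[Int]   // Int, not JInt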
Alright, added you.
Matei
On Jun 11, 2014, at 1:28 PM, Derek Mansen de...@vistarmedia.com wrote:
Hello, I was wondering if we could add our organization to the Powered by
Spark page. The information is:
Name: Vistar Media
URL: www.vistarmedia.com
Description: Location technology company
Could you try to click on that RDD and see the storage info per
partition? I tried continuously caching RDDs, so new ones kick old
ones out when there is not enough memory. I saw similar glitches, but
the storage info per partition was correct. If you find a way to
reproduce this error, please
Thanks. We've run into timeout issues at scale as well. We were able to
work around them by setting the following JVM options:
-Dspark.akka.askTimeout=300
-Dspark.akka.timeout=300
-Dspark.worker.timeout=300
NOTE: these JVM options *must* be set on worker nodes (and not just the
driver/master) for
I tried everything including sudo, but it still did not work using the local
directory.
However, I finally got it working by getting the history server to log into
hdfs.
I first created a directory in hdfs like the following:
./ephemeral-hdfs/bin/hadoop fs -mkdir /spark_logs
Then I started the
What about rainbow tables?
http://en.wikipedia.org/wiki/Rainbow_table
M.
2014-06-12 2:41 GMT+02:00 DB Tsai dbt...@stanford.edu:
I think creating the samples in the search space within an RDD will be
too expensive, and the amount of data will probably be larger than any
cluster.
However, you
Xiangrui, clicking the RDD link gives the same message, saying only
96 of 100 partitions are cached. The disk/memory usage is the same, which
is far below the limit.
Is this what you wanted me to check, or is it another issue?
On Wed, Jun 11, 2014 at 4:38 PM, Xiangrui Meng men...@gmail.com wrote:
combineByKey is designed for when your return type from the aggregation is
different from the values being aggregated (e.g. you group together objects),
and it should allow you to modify the leftmost argument of each function
(mergeCombiners, mergeValue, etc) and return that instead of
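A hedged illustration of that pattern, where the combiner type (a mutable buffer) differs from the value type (Int) and the leftmost argument is modified and returned:

import scala.collection.mutable.ArrayBuffer

// Illustration only: collect Int values per key into a mutable buffer.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.combineByKey(
  (v: Int) => ArrayBuffer(v),                                        // createCombiner
  (buf: ArrayBuffer[Int], v: Int) => { buf += v; buf },              // mergeValue: mutate and return the left arg
  (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => { b1 ++= b2; b1 }) // mergeCombiners
grouped.collect()   // e.g. Array(("a", ArrayBuffer(1, 2)), ("b", ArrayBuffer(3)))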
Yes, I mean the RDD would just have elements to define partitions or
ranges within the search space, not the actual hashes. It's really just
using the RDD as a control structure rather than a real data set.
As you noted, we don't need to store any hashes. We just need to check them
as they
Hi all,
Can I use the spark-shell programmatically in my Spark application (in Java or Scala)?
Because I want to convert Scala lines to a string array and run them automatically in my application.
For example,
for (line <- lines) {
  // run this line in Spark