Hi,
Just wondering if you've seen the following error when reading from Kafka:
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
at kafka.utils.Log4jController$.init(Log4jController.scala:29)
at
Hi,
I'm trying to get hold of the SparkContext from a HiveContext or
StreamingContext. I have two pieces of code, one in core Spark and one in Spark
Streaming: plain Spark with Hive, which gives me the context, and Spark Streaming
code with Hive, which prints null. Please help me figure out how to make this
code
Hi, Ryan
We have run into similar errors, and increasing executor memory solved it. Though
I am not sure about the detailed reason, it might be worth a try.
On Wed, Oct 29, 2014 at 1:34 PM, Ryan Williams [via Apache Spark User List]
ml-node+s1001560n17605...@n3.nabble.com wrote:
My job is failing
-- Forwarded message --
From: Chengi Liu chengi.liu...@gmail.com
Date: Tue, Oct 28, 2014 at 11:23 PM
Subject: Re: sampling in spark
To: Davies Liu dav...@databricks.com
Any suggestions.. Thanks
On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
Is
And the scala way of doing it would be:
val sc = new SparkContext(conf)
sc.addJar("/full/path/to/my/application/jar/myapp.jar")
On Wed, Oct 29, 2014 at 1:44 AM, Shailesh Birari sbir...@wynyardgroup.com
wrote:
Yes, this is doable.
I am submitting the Spark job using
JavaSparkContext
Yes, we shade Akka to change its protobuf version (if I am not wrong). Yes,
binary compatibility with other Akka modules is compromised. One thing you
can try is to use Akka from org.spark-project.akka; I have not tried this and am
not sure if it's going to help you, but maybe you could exclude the akka
1. Download
https://dl.bintray.com/sbt/native-packages/sbt/0.13.6/sbt-0.13.6.zip
2. Extract
3. export PATH=$PATH:/path/to/sbt/bin
Now you can do all the sbt commands (sbt package etc.)
Thanks
Best Regards
On Tue, Oct 28, 2014 at 9:49 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
sbt
You need to set the following jar (cassandra-connector
http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.1.0-alpha3/spark-cassandra-connector_2.10-1.1.0-alpha3.jar)
in the classpath like:
What are you trying to do? Connecting to a remote cluster from your local
Windows Eclipse environment? Just make sure you meet the following:
1. Set spark.driver.host to your local IP (where you run your code; it
should be accessible from the cluster; see the sketch after this list)
2. Make sure no firewall/router
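For point 1, something like the following might work (a rough sketch only; the
master URL, IP and port are placeholders, not values from this thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://your-cluster-master:7077")  // placeholder master URL
  .setAppName("RemoteDriverTest")
  .set("spark.driver.host", "192.168.1.10") // the local machine running the driver, reachable from the cluster
  .set("spark.driver.port", "51000")        // optional: pin the port if a firewall is in the way
val sc = new SparkContext(conf)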
Send it to user-unsubscr...@spark.apache.org. Read the community page
https://spark.apache.org/community.html for more info
Thanks
Best Regards
On Wed, Oct 29, 2014 at 3:32 AM, Ricky Thomas ri...@truedash.io wrote:
I don't think accumulators come into play here. Use foreachPartition,
not mapPartitions.
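A rough sketch of the difference (openConnection() and conn.write() are
hypothetical stand-ins for whatever external store you are adding elements to):
mapPartitions is a lazy transformation that must return an iterator, while
foreachPartition is an action run purely for its side effects.

rdd.foreachPartition { partition =>
  val conn = openConnection()                      // hypothetical: one connection per partition
  partition.foreach(record => conn.write(record))  // side effect per element
  conn.close()
}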
On Wed, Oct 29, 2014 at 12:43 AM, Flavio Pompermaier
pomperma...@okkam.it wrote:
Sorry but I wasn't able to code my stuff using accumulators as you suggested
:(
In my use case I have to add elements to
Looks like the kafka jar that you are using isn't compatible with your
scala version.
Thanks
Best Regards
On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen har...@nexgate.com wrote:
Hi,
Just wondering if you've seen the following error when reading from Kafka:
ERROR ReceiverTracker:
Call RDD.toLocalIterator()?
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html
On Wed, Oct 29, 2014 at 4:15 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, ALL
I have an RDD[T]; can I use it like an iterator?
That means I can compute every element of this RDD lazily.
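A small sketch of what the toLocalIterator suggestion amounts to (not from the
original message): the iterator pulls one partition at a time to the driver, so
you never hold the whole RDD in driver memory the way collect() would.

val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
val it: Iterator[Int] = rdd.toLocalIterator
it.take(5).foreach(println)  // later partitions are computed only when the iterator reaches them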
Thanks! How do I find out which Kafka jar to use for scala 2.10.4?
—
Sent from Mailbox
On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Looks like the kafka jar that you are using isn't compatible with your
scala version.
Thanks
Best Regards
On Wed, Oct 29,
I am using kafka_2.10-1.1.0.jar on Spark 1.1.0
—
Sent from Mailbox
On Wed, Oct 29, 2014 at 12:31 AM, null har...@nexgate.com wrote:
Thanks! How do I find out which Kafka jar to use for scala 2.10.4?
—
Sent from Mailbox
On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das ak...@sigmoidanalytics.com
Hi
I've written Spark Streaming code which streams data from Flume to Kafka,
which is received by Spark.
I've used updateStateByKey, and then for each RDD I'm saving them into
Cassandra and Elasticsearch (by calling 2 different methods).
The above parts are working fine when the streaming job is
Thank you Holden, it works!
infile = sc.wholeTextFiles(sys.argv[1])
rdd = sc.parallelize(infile.collect())
rdd.saveAsSequenceFile(sys.argv[2])
Csaba
2014-10-28 17:56 GMT+01:00 Holden Karau hol...@pigscanfly.ca:
Hi Csaba,
It sounds like the API you are looking for is sc.wholeTextFiles :)
RDD.toLocalIterator() is the suitable solution.
But I doubt whether it conforms to the design principles of Spark and RDDs.
All RDD transformations are lazily computed until they end with some action.
2014-10-29 15:28 GMT+08:00 Sean Owen so...@cloudera.com:
Call RDD.toLocalIterator()?
Hi,
I have cloned sparked as:
git clone g...@github.com:apache/spark.git
cd spark
sbt/sbt compile
Apparently http://repo.maven.apache.org/maven2 is no longer valid.
See the error further below.
Is this correct?
And what should it be changed to?
Everything seems to go smoothly until:
Are you trying to compile the master branch ? Can you try branch-1.1 ?
On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com
wrote:
Hi,
I have cloned sparked as:
git clone g...@github.com:apache/spark.git
cd spark
sbt/sbt compile
Apparently
Hi,
My Hive is 0.13.1; how can I make Spark 1.1.0 run on Hive 0.13? Please advise.
Or, any news about when Spark 1.1.0 on Hive 0.13.1 will be available?
Regards
Arthur
Hi Ryan - I've been fighting the exact same issue for well over a month now. I
initially saw the issue in 1.0.2 but it persists in 1.1.
Jerry - I believe you are correct that this happens during a pause on
long-running jobs on a large data set. Are there any parameters that you
suggest tuning
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0,
and related PRs are already merged or being merged to the master branch.
On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
Hi,
My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise.
Or, any news
Hi,
Thanks for your update. Any idea when will Spark 1.2 be GA?
Regards
Arthur
On 29 Oct, 2014, at 8:22 pm, Cheng Lian lian.cs@gmail.com wrote:
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and
related PRs are already merged or being merged to the master
Yes, the data is getting processed. I printed the data and the RDD count. The
point where data is getting saved is not invoked. I am using the same class
and jar for submitting by both methods. The only difference is that in one case I am
launching from Tomcat and in the other directly from PuTTY.
Hi All
I am using Spark Streaming with Kafka streaming, running 24/7.
My code is something like:
JavaDStream<String> data = messages.map(new MapData());
JavaPairDStream<String, Iterable<String>> records = data.mapToPair(new
dataPair()).groupByKey(100);
records.print();
JavaPairDStream<String, Double>
The Spark application UI shows that one of the executors is at CANNOT FIND ADDRESS
in the Aggregated Metrics by Executor table (columns: Executor ID, Address,
Task Time, Total Tasks, Failed Tasks, Succeeded Tasks, Input, Shuffle Read, Shuffle
We used both Spray and Akka. To avoid compatibility issues, we used Spark's
shaded Akka version. It works for us. This is the 1.1.0 branch; I have not tried it
with the master branch.
Chester
Sent from my iPad
On Oct 28, 2014, at 11:48 PM, Prashant Sharma scrapco...@gmail.com wrote:
Yes we shade akka to
I see this when I start a worker and then try to start it again forgetting
it's already running (I don't use start-slaves, I start the slaves
individually with start-slave.sh). All this is telling you is that there is
already a running process on that machine. You can see it if you do a ps
I am relatively new to Spark processing. I am using the Spark Java API to process
data. I am having trouble processing a data set that I don't think is
significantly large. It is joining datasets that are around 3-4 GB each
(around 12 GB of data).
The workflow is:
x=RDD1.KeyBy(x).partitionBy(new
CANNOT FIND ADDRESS occurs when your executor has crashed. I'd look
further down where it shows each task and see if you see any tasks failed.
Then you can examine the error log of that executor and see why it died.
On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi ami.khande...@gmail.com wrote:
Since Spark holds data structures on the heap (and by default tries to work
with all data in memory) and is written in Scala, seeing lots of Scala
Tuple2 is not unexpected. How do these numbers relate to your data size?
On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi,
I wanted
Sometime after Nov. 15:
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
On Wed, Oct 29, 2014 at 5:28 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
Thanks for your update. Any idea when will Spark 1.2 be GA?
Regards
Arthur
On 29 Oct, 2014, at 8:22 pm,
Thanks... hmm, it seems to be a timeout issue, perhaps? Not sure what
is causing it or how to debug it.
I see the following error message...
14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9
akka.pattern.AskTimeoutException: Timed out
at
I developed the spark-xml-utils library because we have a large amount of XML
in big datasets and I felt this data could be better served by providing some
helpful xml utilities. This includes the ability to filter documents based on
an xpath/xquery expression, return specific nodes for an
Without the second line, it will be much faster:
infile = sc.wholeTextFiles(sys.argv[1])
infile.saveAsSequenceFile(sys.argv[2])
On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany rag...@gmail.com wrote:
Thank you Holden, it works!
infile = sc.wholeTextFiles(sys.argv[1])
rdd =
Hey all,
Just thought I'd share this with the list in case anyone else would
benefit. Currently working on a proper integration of PySpark and
DataStax's new Cassandra-Spark connector, but that's ongoing.
In the meanwhile, I've basically updated the cassandra_inputformat.py and
Nice!
- Helena
@helenaedelson
On Oct 29, 2014, at 12:01 PM, Mike Sukmanowsky mike.sukmanow...@gmail.com
wrote:
Hey all,
Just thought I'd share this with the list in case any one else would benefit.
Currently working on a proper integration of PySpark and DataStax's new
Hi all,
I followed the guide here:
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
But got this error:
Exception in thread main java.lang.NoClassDefFoundError:
com/amazonaws/auth/AWSCredentialsProvider
Would you happen to know what dependency or jar is needed?
Harold
add these to your dependencies:
"io.netty" % "netty" % "3.6.6.Final"
and add exclude("io.netty", "netty-all") to the end of the spark and hadoop dependencies
reference: https://spark-project.atlassian.net/browse/SPARK-1138
I am using Spark 1.1 so the akka issue is already fixed
Hi TD,
Thanks a lot for the comprehensive answer.
I think this explanation deserves some place in the Spark Streaming tuning
guide.
-kr, Gerard.
On Thu, Oct 23, 2014 at 11:41 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Hey Gerard,
This is a very good question!
*TL;DR: *The
Hi again,
After getting through several dependencies, I finally got to this
non-dependency type error:
Exception in thread main java.lang.NoSuchMethodError:
I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline.
I have a SchemaRDD with 11 records in 1 partition. We'll call this daily.
After a fairly basic join of these two tables
JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch,
daily.version FROM
Thanks, guys. Michael Armbrust also suggested the same two approaches.
I believe “getAs[Date]” is available only in 1.2 branch and I have Spark 1.1,
so I am using row(i).asInstanceOf[Date], which works.
Mohammed
From: Shixiong Zhu [mailto:zsxw...@gmail.com]
Sent: Tuesday, October 28, 2014
Good idea, will do for 1.2 release.
On Oct 29, 2014 9:50 AM, Gerard Maas gerard.m...@gmail.com wrote:
Hi TD,
Thanks a lot for the comprehensive answer.
I think this explanation deserves some place in the Spark Streaming tuning
guide.
-kr, Gerard.
On Thu, Oct 23, 2014 at 11:41 PM,
Hi Tathagata. I actually had a few minor improvements to Spark Streaming
in SPARK-4040. Possibly I could weave this in with my PR?
On Wed, Oct 29, 2014 at 1:59 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Good idea, will do for 1.2 release.
On Oct 29, 2014 9:50 AM, Gerard Maas
Apparently Spark does require Hadoop even if you do not intend to use Hadoop.
Is there a workaround for the below error I get when creating the SparkContext
in Scala?
I will note that I didn't have this problem yesterday when creating the Spark
context in Java as part of the getting started
QQ - did you download the Spark 1.1 binaries that included the Hadoop one?
Does this happen if you're using the Spark 1.1 binaries that do not include
the Hadoop jars?
On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub ronalday...@live.com wrote:
Apparently Spark does require Hadoop even if you do not
Can you try setting the following while creating the sparkContext and see
if the issue still exists?
.set("spark.core.connection.ack.wait.timeout", "900")
.set("spark.akka.frameSize", "50")
.set("spark.akka.timeout", "900")
Looks like your executor is stuck on GC Pause.
Thanks
Best Regards
On
OK. After reading some documentation, it would appear the issue is the default
number of partitions for a join (200).
After doing something like the following, I was able to change the value.
From: Darin McBeath ddmcbe...@yahoo.com.INVALID
To: User user@spark.apache.org
Sent: Wednesday,
Sorry, hit the send key a bit too early.
Anyway, this is the code I set.
sqlContext.sql("set spark.sql.shuffle.partitions=10");
From: Darin McBeath ddmcbe...@yahoo.com
To: Darin McBeath ddmcbe...@yahoo.com; User user@spark.apache.org
Sent: Wednesday, October 29, 2014 2:47 PM
Subject: Re:
Well, I got past this problem, and the fix was in my own email. I did
download the one with Hadoop, since it was among the only ones you don't have to
compile from source, along with CDH and MapR. It worked yesterday because I added
1.1.0 as a Maven dependency from the repository. I just did the
Assume in my executor I say
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.kryo.registrator",
"com.lordjoe.distributed.hydra.HydraKryoSerializer");
sparkConf.set("mysparc.data", "Some user Data");
sparkConf.setAppName("Some App");
Now
1) Are there default
I haven't tried this myself yet, but this sounds relevant:
https://github.com/apache/spark/pull/2535
Will be giving this a try today or so, will report back.
On Wednesday, October 29, 2014, Harold Nguyen har...@nexgate.com wrote:
Hi again,
After getting through several dependencies, I
We are working on more helpful error messages, but in the meantime let me
explain how to read this output.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'p.name,'p.age, tree:
Project ['p.name,'p.age]
Filter ('location.number = 2300)
Join Inner,
I am not sure about that.
Can you try a Spray version built with 2.2.x along with Spark 1.1 and include
the Akka dependencies in your project’s sbt file?
Mohammed
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Tuesday, October 28, 2014 8:58 PM
To: Mohammed Guller
Cc: user
Subject:
I have been searching and have not found a solution as to how one might query
on dates stored as UTC milliseconds from the epoch. The schema I have
pulled in from a NoSQL datasource (JSON from MongoDB) has the target date
as:
|-- dateCreated: struct (nullable = true)
|    |-- $date: long
cf. https://issues.apache.org/jira/browse/SPARK-2356
On Wed, Oct 29, 2014 at 7:31 PM, Ron Ayoub ronalday...@live.com wrote:
Apparently Spark does require Hadoop even if you do not intend to use
Hadoop. Is there a workaround for the below error I get when creating the
SparkContext in Scala?
I
Hi all,
How do I convert a DStream to a string ?
For instance, I want to be able to:
val myword = words.filter(word => word.startsWith("blah"))
And use myword in other places, like tacking it onto (key, value) pairs,
like so:
val pairs = words.map(word => (myword + "_" + word, 1))
Thanks for any help,
The documentation at
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
describes the union() method as
Return a new DStream by unifying data of another DStream with this
DStream.
Can somebody provide a clear definition of what unifying means in
Greetings!
This might be a documentation issue as opposed to a coding issue, in that
perhaps the correct answer is don't do that, but as this is not obvious, I am
writing.
The following code produces output most would not expect:
package misc
import org.apache.spark.SparkConf
import
The union function simply returns a DStream with the elements from both.
This is the same behavior as when we call union on RDDs :) (You can think
of union as similar to the union operator on sets except without the unique
element restrictions).
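A minimal sketch of union on two DStreams (not from this thread; hosts and
ports are placeholders, and both streams must carry the same element type):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(sc, Seconds(10))
val streamA: DStream[String] = ssc.socketTextStream("hostA", 9999)
val streamB: DStream[String] = ssc.socketTextStream("hostB", 9999)
// each batch of the unified stream contains the elements of both input batches;
// duplicates are kept (no set semantics)
val both: DStream[String] = streamA.union(streamB)
both.print()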
On Wed, Oct 29, 2014 at 3:15 PM, spr
I am processing a log file, from each line of which I want to extract the
zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had
hoped to be able to index the Array for elements 0 and 4, but Arrays appear
not to support vector indexing. I'm not finding a way to extract and
On Wed, Oct 29, 2014 at 3:29 PM, spr s...@yarcdata.com wrote:
I am processing a log file, from each line of which I want to extract the
zeroth and 4th elements (and an integer 1 for counting) into a tuple. I
had
hoped to be able to index the Array for elements 0 and 4, but Arrays appear
not
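One common way to write this, as a sketch (assuming lines is an RDD[String] of
whitespace-separated log lines; not code from the thread):

val tuples = lines.map { line =>
  val fields = line.split("\\s+")
  (fields(0), fields(4), 1)  // Scala Arrays are indexed with (), not []
}

// or, pattern matching on the first five fields
val tuples2 = lines.map(_.split("\\s+") match {
  case Array(f0, _, _, _, f4, _*) => (f0, f4, 1)
})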
Hi all,
We’re organizing a meetup on Nov 6th at our office in downtown SF that might
be of interest to the Spark community. We will be discussing our experience
building our first production Spark-based application.
More details and sign up info here:
What would it mean to make a DStream into a String? It's inherently a
sequence of things over time, each of which might be a string but
which are usually RDDs of things.
On Wed, Oct 29, 2014 at 11:15 PM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
How do I convert a DStream to a string ?
I need more precision to understand. If the elements of one DStream/RDD are
(String) and the elements of the other are (Time, Int), what does union
mean? I'm hoping for (String, Time, Int) but that appears optimistic. :)
Do the elements have to be of homogeneous type?
Holden Karau wrote
Hi Sean,
I'd just like to take the first word of every line, and use it as a
variable for later. Is there a way to do that?
Here's the gist of what I want to do:
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test",
Map("test" -> 10)).map(_._2)
val words = lines.flatMap(_.split(" "))
On Wed, Oct 29, 2014 at 3:39 PM, spr s...@yarcdata.com wrote:
I need more precision to understand. If the elements of one DStream/RDD
are
(String) and the elements of the other are (Time, Int), what does union
mean? I'm hoping for (String, Time, Int) but that appears optimistic. :)
It
Good catch! If you'd like, you can send a pull request changing the files in
docs/ to do this (see
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark),
otherwise maybe open an issue on
Sure, that code looks like it does sort of what you describe, but it's
mixed up in a few ways. It looks like you only want to operate on
words that start with SECRETWORD, but then you are prepending acct and
_ in the code while expecting something appended in the result. You
also seem like you want
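A sketch of the idea (not the poster's actual code): the "first word" is a
per-line value rather than a single driver-side String, so compute it inside
the transformation, assuming lines is the DStream[String] from the question.

val pairs = lines.map { line =>
  val firstWord = line.split(" ")(0)  // first word of this particular line
  (firstWord + "_" + line, 1)
}
pairs.print()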
I tried using shapeless HLists as data storage for data inside spark.
Unsurprisingly, it failed. The deserialization isn't well-defined because of
all the implicits used by shapeless. How could I make it work?
Sample Code:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import
I want several RDDs (which are the result of my program's operations on
existing RDDs) to match the partitioning of an existing RDD, since they will
be joined together in the end. Do I understand correctly that I would
benefit from using a custom partitioner that would be applied to all RDDs?
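A rough sketch of that approach (rddA, rddB and the key/transform functions are
hypothetical stand-ins, not from the original post): share one partitioner
object across the keyed RDDs so the final join can reuse the partitioning.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(200)
val a = rddA.keyBy(keyOfA).partitionBy(part)
val b = rddB.keyBy(keyOfB).partitionBy(part)
val derived = a.mapValues(transformA)  // mapValues preserves the partitioner
val joined = b.join(derived)           // both sides already use part, so no extra shuffle of either side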
Looks like a missing class issue? What makes you think it's serialization?
Shapeless does indeed have a lot of helper classes that get sucked in and
are not serializable. See here:
https://groups.google.com/forum/#!topic/shapeless-dev/05_DXnoVnI4
and for a project that uses shapeless in spark
I started my ec2 spark cluster with
./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10
launch mycluster
I see the additional volumes attached but they do not seem to be set up for
hdfs.
How can I check if they are being utilized on all workers,
and how can I get all workers
Hi,
I am using Spark 1.1.0. I have the following SQL statement where I am trying
to count the number of UIDs that are in the tusers table but not in the
device table.
val users_with_no_device = sql_cxt.sql("SELECT COUNT(u_uid) FROM tusers
WHERE tusers.u_uid NOT IN (SELECT d_uid FROM device)")
I
jira created with comments/references to this discussion:
https://issues.apache.org/jira/browse/SPARK-4144
On Tue, Aug 19, 2014 at 4:47 PM, Xiangrui Meng men...@gmail.com wrote:
No. Please create one but it won't be able to catch the v1.1 train.
-Xiangrui
On Tue, Aug 19, 2014 at 4:22 PM,
Thanks. By setting the driver host to the Windows machine and specifying some ports (like
driver, fileserver, broadcast, etc.) it worked perfectly. I needed to specify
those ports as not all ports are open on my machine.
For the driver host name, I was assuming Spark should get it, as in the case of
Linux we are not
Hi, I'm new to Spark and am facing a peculiar problem.
I'm writing a simple Java driver program where I'm creating a key/value data
structure and collecting it once created. The problem I'm facing is that
when I increase the iterations of a for loop which creates the ArrayList of
Long values
Hey all – not writing to necessarily get a fix but more to get an understanding
of what’s going on internally here.
I wish to take a cross-product of two very large RDDs (using cartesian), the
product of which is well in excess of what can be stored on disk. Clearly that
is intractable, thus
Hi,
I am trying to get my Spark application to run on YARN and by now I have
managed to build a fat jar as described on
http://markmail.org/message/c6no2nyaqjdujnkq (which is the only really
usable manual on how to get such a jar file). My code runs fine using sbt
test and sbt run, but when
Hi again,
On Thu, Oct 30, 2014 at 11:50 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Exception in thread main java.lang.NoClassDefFoundError:
com/typesafe/scalalogging/slf4j/Logger
It turned out scalalogging
Hello Steve.
1) When you call new SparkConf you should get an object with the default
config values. You can reference the Spark configuration and tuning pages
for details on what those are.
2) Yes. Properties set in this configuration will be pushed down to worker
nodes actually executing the
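A small sketch of point 1 (not from the reply itself; "mysparc.data" is just
the custom key from the question):

import org.apache.spark.SparkConf

val conf = new SparkConf()  // starts from defaults plus any spark.* system properties
  .setAppName("Some App")
  .set("mysparc.data", "Some user Data")
println(conf.get("mysparc.data"))                 // "Some user Data"
println(conf.getOption("spark.executor.memory"))  // None unless it was set somewhere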
Hi Yana,
In my case I did not start any Spark worker. However, Shark was definitely
running. Do you think that might be a problem?
I will take a look.
Thank you,
From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To:
You may use -
select count(u_uid) from tusers a left outer join device b on (a.u_uid =
b.d_uid) where b.d_uid is null
On Wed, Oct 29, 2014 at 5:45 PM, SK skrishna...@gmail.com wrote:
Hi,
I am using Spark 1.1.0. I have the following SQL statement where I am
trying
to count the number of
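As a sketch (not from the thread), the suggested rewrite run through the same
SQLContext and table/column names as in the question:

val users_with_no_device = sql_cxt.sql(
  """SELECT COUNT(a.u_uid)
    |FROM tusers a LEFT OUTER JOIN device b ON (a.u_uid = b.d_uid)
    |WHERE b.d_uid IS NULL""".stripMargin)
users_with_no_device.collect().foreach(println)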
I'm running into this error when I attempt to launch spark-shell passing in
the algebird-core jar:
~~
$ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import HyperLogLog._
scala>
Hi All, I have my sparse data in libsvm format.
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
"mllib/data/sample_libsvm_data.txt")
I am running linear regression. Let us say that my data has the following entry:
1 1:0 4:1
I think it will assume 0 for indices 2 and 3, right? I would
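A small sketch of how such a line is generally read back (not from the
question; LibSVM indices are one-based in the file, and any index not listed is
treated as 0.0 in the resulting sparse vector):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
// "1 1:0 4:1" becomes label 1.0 with a sparse feature vector of size 4:
// one-based indices 1 and 4 map to zero-based 0 and 3, values 0.0 and 1.0
examples.take(1).foreach(println)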
Thanks Koert. These numbers indeed tie back to our data and algorithms.
Would going the Scala route save some memory, as the Java API creates a
wrapper Tuple2 for all pair functions?
On Wednesday, October 29, 2014, Koert Kuipers ko...@tresata.com wrote:
since spark holds data structures on heap
Thanks Akhil.
So the worker spark node doesn't need access to metastore to run Hive
queries? If yes, which component accesses the metastore?
For Hive, the Hive-cli accesses the metastore before submitting M/R jobs.
Thanks,
Ken