Right now DIMSUM is meant to be used for tall and skinny matrices, and so
columnSimilarities() returns similar columns, not rows. We are working on
adding an efficient row similarity as well, tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
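For reference, a minimal sketch of columnSimilarities() on a tall-and-skinny matrix (assuming an existing SparkContext sc; the data is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are observations; similarities are computed between columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))
val mat = new RowMatrix(rows)

val exact  = mat.columnSimilarities()     // brute force over all column pairs
val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling with a threshold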
Reza
On Mon, Apr 6, 2015 at 6:08
Thanks for your replies. I solved the problem with this code:
val weathersRDD = sc.textFile(csvfilePath).map { line =>
  // strip quotes, then split on commas
  val Array(dayOfdate, minDeg, maxDeg, meanDeg) =
    line.replaceAll("\"", "").trim.split(",")
  Tuple2(dayOfdate.substring(0, 7), (minDeg.toInt, maxDeg.toInt, meanDeg.toInt))
}
Trying to bump up the rank of the question.
Can someone point to an example on GitHub?
..Manas
On Fri, Apr 3, 2015 at 9:39 AM, manasdebashiskar manasdebashis...@gmail.com
wrote:
Hi experts,
I am trying to write unit tests for my Spark application, which fails with
I'd make sure you're selecting the correct columns. If not that, then your
input data might be corrupt.
CCing user to keep it on the user list.
On Mon, Apr 6, 2015 at 6:53 AM, Sergio Jiménez Barrio drarse.a...@gmail.com
wrote:
Hi!
I had tried your solution, and I saw that the first row is
The DataFrame API should be perfectly helpful in this case.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
A code snippet would look like:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import
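For example, a minimal sketch along those lines in 1.3 (the Record case class and column names are illustrative):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

case class Record(key: Int, value: String)
val df = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).toDF()
df.registerTempTable("records")
sqlContext.sql("SELECT key, value FROM records WHERE key = 1").show()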
Please attach the full stack trace. -Xiangrui
On Mon, Apr 6, 2015 at 12:06 PM, Jay Katukuri jkatuk...@apple.com wrote:
Hi all,
I got a runtime error while running the ALS.
Exception in thread "main" java.lang.NoSuchMethodError:
I have created a custom receiver to fetch records pertaining to a specific
query from Elasticsearch and have implemented streaming RDD transformations to
process the data generated by the receiver.
The final RDD is a sorted list of name-value pairs, and I want to read the top
20 results.
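Roughly, what I am after is something like this (a minimal sketch; names are illustrative):

sortedStream.foreachRDD { rdd =>
  // take() brings only the first 20 elements back to the driver
  rdd.take(20).foreach { case (name, value) => println(s"$name -> $value") }
}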
Cc'ing Chris Fregly, who wrote the Kinesis integration. Maybe he can help.
On Mon, Apr 6, 2015 at 9:23 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com
wrote:
Hi all,
I am wondering, has anyone on this list been able to successfully
implement Spark on top of Kinesis?
Best,
Vadim
On
There are no workers registered with the Spark Standalone master! That is
the crux of the problem. :)
Follow the instructions properly -
https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
Especially make sure the conf/slaves file has the intended workers listed.
TD
On Mon,
In 1.2.1 I was persisting a set of Parquet files as a table for use by the
spark-sql CLI later on. There was a post here
http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-tt16297.html#a16311
by
Michael Armbrust that provides a nice little helper method for dealing
In this code, inside foreach I am getting a "Task not serializable" exception:
@SuppressWarnings("serial")
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, final
JavaSparkContext jsc) throws IOException {
log.info("Company matcher started");
//final JavaSparkContext jsc = getSparkContext();
In HiveQL, you should be able to express this as:
SELECT ... FROM table GROUP BY m['SomeKey']
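A minimal end-to-end sketch of that through a HiveContext (the class, table, and key names are illustrative):

import org.apache.spark.sql.hive.HiveContext

case class A(m: Map[String, Long], value: Long)

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = sc.parallelize(Seq(
  A(Map("SomeKey" -> 1L), 10L),
  A(Map("SomeKey" -> 1L), 20L)
)).toDF()
df.registerTempTable("table_a")

hiveContext.sql(
  "SELECT m['SomeKey'], SUM(value) FROM table_a GROUP BY m['SomeKey']").show()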
On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I have a case class like this:
case class A(
m: Map[Long, Long],
...
)
and constructed a DataFrame from
Thanks. I’ll look into it. But the JSON string I push via the receiver goes through
a series of transformations before it ends up in the final RDD. I need to take
care to ensure that this magic value propagates all the way down to the last
one that I’m iterating on.
Currently, I’m calling “stop
I'll add that I don't think there is a convenient way to do this in the
Column API ATM, but would welcome a JIRA for adding it :)
On Mon, Apr 6, 2015 at 1:45 PM, Michael Armbrust mich...@databricks.com
wrote:
In HiveQL, you should be able to express this as:
SELECT ... FROM table GROUP BY
Hello Sparkers,
I kept getting this error:
java.lang.ClassCastException: scala.Tuple2 cannot be cast to
org.apache.spark.mllib.regression.LabeledPoint
I have tried the following to convert v._1 to double:
Method 1:
(if (v._1 > 0) 1d else 0d)
Method 2:
def bool2Double(b: Boolean): Double = {
  if (b) 1d else 0d
}
You could have your receiver send a magic value when it is done. I discuss
this Spark Streaming pattern in my presentation "Spark Gotchas and
Anti-Patterns". In the PDF version, it's slides 34-36:
http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language
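For illustration, a rough sketch of that pattern (the receiver, its data source, and the sentinel string are all made up):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class QueryReceiver(query: String)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  val Sentinel = "__DONE__"  // the "magic value"; must not collide with real data

  def onStart(): Unit = {
    new Thread("query-receiver") {
      override def run(): Unit = {
        fetchResults(query).foreach(r => store(r))  // push real records
        store(Sentinel)                             // then signal completion downstream
      }
    }.start()
  }

  def onStop(): Unit = {}

  // Placeholder for the actual data source.
  private def fetchResults(q: String): Iterator[String] = Iterator.empty
}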
What's the advantage of killing an application for lack of resources?
I think the rationale behind killing an app based on executor failures is
that, if we see a lot of them in a short span of time, it means there's
probably something going wrong in the app or on the cluster.
On Wed, Apr 1, 2015
Hey Todd,
In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet
is no longer public, so the above no longer works.
This was probably just a typo, but to be clear,
spark.sql.hive.convertMetastoreParquet is still a supported option and
should work. You are correct that
Here is the command that I have used:
spark-submit --class packagename.ALSNew --num-executors 100 --master yarn
ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path
Btw, I could run the old ALS in the mllib package.
On Apr 6, 2015, at 12:32 PM, Xiangrui Meng men...@gmail.com wrote:
I hit the same issue again. This time I tried to return the object and it
failed with "Task not serializable". Below is the code;
here VendorRecord is serializable.
private static JavaRDD<VendorRecord>
getVendorDataToProcess(JavaSparkContext sc) throws IOException {
return sc
Yes, I’m using updateStateByKey and it works. But then I need to perform
further computation on this stateful RDD (see code snippet below). I perform
foreach on the final RDD and get the top 10 records. I just don't want the
foreach to be performed every time a new batch is received. Only when
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If
that is the case, you need to cast them explicitly:
rdd.map { case (label, features) => LabeledPoint(label, features) }
-Xiangrui
On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact joannenetw...@gmail.com wrote:
Hello Sparkers,
Hi all,
I got a runtime error while running the ALS.
Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
The error that I am getting is at the following code:
val ratings =
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF
to do the mapping.
val labelToIndex = udf { ... }
featureDF.withColumn("f3_dummy", labelToIndex(col("f3")))
See instructions here
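A minimal sketch of what such a UDF could look like in 1.3 (the category-to-index mapping and column names are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Illustrative mapping; the real categories would come from the data.
val categories = Map("male" -> 0.0, "female" -> 1.0)
val labelToIndex = udf { (label: String) => categories.getOrElse(label, -1.0) }

val withDummy = featureDF.withColumn("f3_dummy", labelToIndex(col("f3")))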
Hi,
Here is the stack trace:
Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
at ALSNew$.main(ALSNew.scala:35)
at ALSNew.main(ALSNew.scala)
at
We support sparse vectors in MLlib, which recognizes MLlib's sparse
vector and SciPy's csc_matrix with a single column. You can create an RDD
of sparse vectors for your data and save/load them to/from Parquet
format using DataFrames. Sparse matrix support will be added in 1.4.
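For example, a rough sketch of saving and loading sparse vectors through a DataFrame (assuming an existing SQLContext; sizes and paths are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import sqlContext.implicits._

// Mostly-zero feature vectors, stored sparsely as (index, value) pairs.
val vectors = sc.parallelize(Seq(
  (0L, Vectors.sparse(10000, Seq((3, 1.0), (42, 2.0)))),
  (1L, Vectors.sparse(10000, Seq((7, 1.0))))
))

val df = vectors.toDF("id", "features")
df.saveAsParquetFile("tweet_features.parquet")
val loaded = sqlContext.parquetFile("tweet_features.parquet")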
-Xiangrui
On Mon,
So ALSNew.scala is your own application; did you submit it with
spark-submit or spark-shell? The correct command should look like:
spark-submit --class your.package.name.ALSNew ALSNew.jar [options]
Please check the documentation:
http://spark.apache.org/docs/latest/submitting-applications.html
-Xiangrui
So you want to sort based on the total count of all the records
received through the receiver? In that case, you have to combine all the counts
using updateStateByKey (
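A minimal sketch of that pattern (the DStream name and types are illustrative; updateStateByKey also needs a checkpoint directory set on the StreamingContext):

// perBatchCounts: DStream[(String, Long)] of counts seen in each batch
val updateTotals: (Seq[Long], Option[Long]) => Option[Long] =
  (newCounts, state) => Some(newCounts.sum + state.getOrElse(0L))

val runningTotals = perBatchCounts.updateStateByKey(updateTotals)

runningTotals.foreachRDD { rdd =>
  // recompute the top 20 by total count on every batch
  rdd.top(20)(Ordering.by[(String, Long), Long](_._2)).foreach(println)
}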
Thanks a lot. That means Spark does not support nested RDDs?
If I pass the JavaSparkContext that also won't work. I mean passing the
SparkContext is not possible since it's not serializable.
I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and I
need to return the potential matches for
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Thanks a lot. That means Spark does not support nested RDDs?
If I pass the JavaSparkContext that also won't work. I mean passing the
SparkContext is not possible since it's not serializable.
That's right. RDDs don't nest.
Hi all,
Has anyone else noticed very slow times caching a Parquet file? It
takes 14 s per 235 MB (1 block) uncompressed, node-local Parquet file
on M2 EC2 instances. Or are my expectations way off...
Cheers,
Christian
--
Christian Perez
Silicon Valley Data Science
Data Analyst
You could certainly build a connector, but it seems like you would want
support for pushing down aggregations to get the benefits of Druid. There
are only experimental interfaces for doing so today, but it sounds like a
pretty cool project.
On Mon, Apr 6, 2015 at 2:23 PM, Paolo Platter
Hi all,
I'm new to Spark and wondering if it's appropriate to use for some image
processing tasks on pretty sizable (~1 GB) images.
Here is an example use case. Amazon recently put the entire Landsat8
archive in S3:
http://aws.amazon.com/public-data-sets/landsat/
I have a bunch of GDAL based
Thanks for the info, Michael. Is there a reason to do so, as opposed to
shipping out the bytecode and loading it via the classloader? Is it more
complex? I can imagine caching to be effective for repeated queries, but not
when the subsequent queries are different.
On Mon, Apr 6, 2015 at 2:41 PM,
Hi all,
is there anyone using Spark SQL + Parquet who has benchmarked storing
Parquet files on HDFS versus on CFS (Cassandra File System)?
Which storage can improve the performance of Spark SQL + Parquet?
Thanks
Paolo
Hi,
I'm curious as to how Spark does code generation for SQL queries.
Following through the code, I saw that an expression is parsed and compiled
into a class using the Scala reflection toolbox. However, it's unclear to me
whether the actual bytecode is generated on the master or on each of the
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Thanks a lot. That means Spark does not support nested RDDs?
If I pass the JavaSparkContext that also won't work. I mean passing the
SparkContext
The compilation happens in parallel on all of the machines, so it's not
really clear that there is a win to generating it on the driver and
shipping it from a latency perspective. However, really I just took the
easiest path that didn't require more bytecode extracting / shipping
machinery.
On
Hi,
Do you think it is possible to build an integration between Druid and Spark,
using the Data Sources API?
Is anyone investigating this kind of solution?
I think that Spark SQL could fill the gap left by Druid's lack of a complete SQL
layer. It could be a great OLAP solution.
WDYT?
Paolo Platter
It is generated and cached on each of the executors.
On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
I'm curious as to how Spark does code generation for SQL queries.
Following through the code, I saw that an expression is parsed and
compiled into a class using
The log instance won't be serializable, because it will have a file
handle to write to. Try defining another static method outside
matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
might not be serializable either, but you didn't provide it. If it holds a
database
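Not the original code, but a small Scala sketch of the same idea, keeping the logger behind an object (effectively static) so the closure never captures it (matchRdd is the RDD from the original Java snippet):

import org.apache.log4j.Logger

object MatchLogging {
  // Initialized lazily on each executor via the classloader, never shipped.
  lazy val log = Logger.getLogger(getClass)
  def error(msg: String): Unit = log.error(msg)
}

// Inside the job: the closure only references the object method.
matchRdd.foreach(record => MatchLogging.error("failed to match: " + record))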
Do you think you are seeing a regression from 1.2? Also, are you caching
nested data or flat rows? The in-memory caching is not really designed for
nested data and so performs pretty slowly here (it's just falling back to
Kryo and even then there are some locking issues).
If so, would it be
Sure, will do. I may not be able to get to it until next week, but will let you
know if I am able to crack the code.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 5:52 PM
To: Mohammed Guller
Cc: pawan kumar; user@spark.apache.org
Subject: Re: Tableau +
My application is running Spark in local mode and I have a Spark Streaming
Listener as well as a Custom Receiver. When the receiver is done fetching all
documents, it invokes “stop” on itself.
I see the StreamingListener getting a callback on “onReceiverStopped” where I
stop the streaming
My hunch is that this behavior was introduced by a patch to start shading
Jetty in Spark 1.3: https://issues.apache.org/jira/browse/SPARK-3996.
Note that Spark's *MetricsSystem* class is marked as *private[spark]* and
thus isn't intended to be interacted with directly by users. It's not
super
Hi, I am trying to pull data from an MS SQL Server. I have tried using the
spark.sql.jdbc data source:
CREATE TEMPORARY TABLE c
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
dbtable "Customer"
);
But it shows java.sql.SQLException: No suitable driver found
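One common cause is that the SQL Server JDBC driver jar is not on the classpath of the driver and executors; passing it explicitly usually helps (the jar path is illustrative):

spark-sql --jars /path/to/sqljdbc4.jar --driver-class-path /path/to/sqljdbc4.jar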
Hi, I'm currently using GraphX for some analysis and have come up against a bit of a
hurdle. If I use my test dataset of 20 nodes and about 30 links it runs really
quickly. I have two other datasets I use, one of 10 million links and one of
20 million. When I create my graphs it seems to work okay and I can get
Hello Manish,
you can take a look at the spark-notebook build; it's a bit tricky to get
rid of some clashes, but at least you can refer to this build for ideas.
Long story short, I have stripped Akka out of the Play deps.
ref:
https://github.com/andypetrella/spark-notebook/blob/master/build.sbt
Are you expecting to receive 1 to 100 values in your second program?
An RDD is just an abstraction; you would need to do something like:
num.foreach(x => send(x))
Thanks
Best Regards
On Mon, Apr 6, 2015 at 1:56 AM, raggy raghav0110...@gmail.com wrote:
For a class project, I am trying to utilize 2 spark
Hey Akhil,
Thanks for your response! No, I am not expecting to receive the values
themselves. I am just trying to receive the RDD object on my second Spark
application. However, I get an NPE when I try to use the object within my
second program. Would you know how I can properly send the RDD
I am also hitting the same problem. I deploy and run Spark (version 1.3.0) in local
mode. When I run a simple app that counts the lines of a file, the console
prints TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
Thank you so much for your reply.
We would like to provide a tool to the user to convert a binary file to a
file in Avro/Parquet format on their own computer. The tool will parse the binary
file in Python and convert the data to Parquet. (BTW, can we append to a
Parquet file?) The issue is that we do not
We had a few sessions at Sigmoid; you could go through the meetup page for
details:
http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/
On 6 Apr 2015 18:01, Abhideep Chakravarty
abhideep.chakrava...@mindtree.com wrote:
Hi all,
We are here planning to setup a Spark learning
Please send email to user-subscr...@spark.apache.org
On Mon, Apr 6, 2015 at 6:52 AM, 林晨 bewit...@gmail.com wrote:
Hi ,
In a Spark web application the RDD is generated every time a client sends a
query request. Is there any way to build the RDD once and run queries
again and again against it on an active SparkContext?
Thanks,
Siddharth Ubale,
Synchronized Communications
#43, Velankani Tech Park, Block No. II,
bq. I need to know on what all databases
You can access HBase using Spark.
Cheers
On Mon, Apr 6, 2015 at 5:59 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
We had a few sessions at Sigmoid; you could go through the meetup page for
details:
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041`.
After building Spark with the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests package
I try to run the Pi example on YARN with the following command:
export
I am trying to call Row.create(object[]) similarly to what's shown in this
programming guide
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
, but the create() method is no longer recognized. I tried to look up the
documentation for the Row
Hi,
I am trying to build this project
https://github.com/databricks/learning-spark with mvn package. This should
work out of the box, but unfortunately it doesn't. In fact, I get the
following error:
mvn pachage -X
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.7.0_76, vendor:
(This mailing list concerns Spark itself rather than the book about
Spark. Your question is about building code that isn't part of Spark,
so the right place to ask is
https://github.com/databricks/learning-spark. You have a typo in
"pachage", but I assume that's just your typo in this email.)
On
Hi all,
We are planning to set up a Spark learning session series. I need all of
your input to create a TOC for this program, i.e. what to cover if we
start from the basics, and how far we should go to cover all the aspects of
Spark in detail.
Also, I need to know on what all
The example below illustrates how to use the DIMSUM algorithm to calculate
the similarity between each pair of rows and output row pairs with cosine
similarity that is not less than a threshold.
From the scaladoc of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:
* To create a new Row, use [[RowFactory.create()]] in Java or
[[Row.apply()]] in Scala.
*
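For example (the values are illustrative):

import org.apache.spark.sql.Row

val r = Row(1, "alice", true)            // Scala: Row.apply
// Java equivalent: Row r = RowFactory.create(1, "alice", true);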
Cheers
On Mon, Apr 6, 2015 at 7:23 AM, ARose ashley.r...@telarix.com wrote:
I am trying to call Row.create(object[])
I searched the code base but didn't find the RowFactory class.
Pardon me.
On Mon, Apr 6, 2015 at 7:39 AM, Ted Yu yuzhih...@gmail.com wrote:
From scaladoc
of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala :
* To create a new Row, use [[RowFactory.create()]] in Java or
[[Row.apply()]]
The Row class was mistakenly left undocumented in 1.3.0;
you can check the 1.3.1 API doc:
http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/api/scala/index.html#org.apache.spark.sql.Row
Best,
--
Nan Zhu
http://codingcat.me
On Monday, April 6, 2015 at 10:23 AM, ARose wrote:
I am trying to
Hi, Ted
It’s here:
https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java
Best,
--
Nan Zhu
http://codingcat.me
On Monday, April 6, 2015 at 10:44 AM, Ted Yu wrote:
I searched code base but didn't
Hi
I have a class as described above.
case class weatherCond(dayOfdate: String, minDeg: Int, maxDeg: Int, meanDeg:
Int)
I am reading the data from a CSV file and I put this data into the weatherCond
class with this code:
val weathersRDD = sc.textFile("weather.csv").map { line =>
val
I'm trying to apply Spark to an NLP problem that I'm working on. I have nearly
4 million tweets and I have converted them into word vectors. They are pretty
sparse because each message has just dozens of words but the vocabulary has
tens of thousands of words.
These vectors should be loaded each
Hi folks, I currently have a DF that has a factor variable -- say gender.
I am hoping to use the RandomForest algorithm on this data and it appears
that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
features need to be double-encoded.
I see
Thanks Nan.
I was searching for RowFactory.scala
Cheers
On Mon, Apr 6, 2015 at 7:52 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, Ted
It’s here:
https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java
Hi all,
I am wondering, has anyone on this list been able to successfully implement
Spark on top of Kinesis?
Best,
Vadim
On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com
wrote:
Hi all,
Below is the output that I am getting. My Kinesis stream has 1 shard, and
Somewhat agree on subclassing and its issues. It looks like the alternative
in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for
this? If not, I'll file one.
Thanks!
-neelesh
On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das t...@databricks.com wrote:
The challenge of
If you're going to do it this way, I would output dayOfdate.substring(0,7),
i.e. the month part, and instead of weatherCond, you can use
(month, (minDeg, maxDeg, meanDeg)) -- i.e. a pair RDD. So weathersRDD:
RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in
multiple Spark
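A rough sketch of that reduceByKey step, assuming weathersRDD: RDD[(String, (Double, Double, Double))] keyed by month; the mean is carried as a (sum, count) pair so the reduce stays associative:

val monthly = weathersRDD
  .mapValues { case (min, max, mean) => (min, max, mean, 1L) }
  .reduceByKey { case ((min1, max1, sum1, n1), (min2, max2, sum2, n2)) =>
    (math.min(min1, min2), math.max(max1, max2), sum1 + sum2, n1 + n2)
  }
  .mapValues { case (min, max, sum, n) => (min, max, sum / n) }  // back to a mean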
Interesting, I see 0 cores in the UI?
- *Cores:* 0 Total, 0 Used
On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote:
What does the Spark Standalone UI at port 8080 say about number of cores?
On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com