Sent: May 08, 2015 3:15 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Hash Partitioning and Dataframes
What are you trying to accomplish? Internally Spark SQL will add Exchange
operators to make sure that data is partitioned correctly for joins and
aggregations. If you
Any convenient tool to do this [sparse vector product] in Spark?
Unfortunately, it seems that there are very few operations defined for sparse
vectors. I needed to add some, and ended up converting them to (dense) numpy
vectors and doing the addition on those.
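The dense-conversion workaround described above can be sketched without a Spark cluster. In PySpark you would call `SparseVector.toArray()` to get the dense numpy array; here plain `(size, indices, values)` triples stand in for the mllib SparseVectors, so everything below is ordinary numpy:

```python
import numpy as np

def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense numpy
    array, mirroring what mllib's SparseVector.toArray() returns."""
    dense = np.zeros(size)
    dense[np.array(indices)] = values
    return dense

# Two sparse vectors of length 5, as (size, indices, values) triples.
a = sparse_to_dense(5, [0, 3], [1.0, 2.0])
b = sparse_to_dense(5, [3, 4], [4.0, 8.0])

# Once dense, summing is plain numpy addition.
total = a + b
print(total.tolist())  # [1.0, 0.0, 0.0, 6.0, 8.0]
```

The cost is obviously memory proportional to the full vector length, which is why this is only a workaround for the missing sparse arithmetic, not a substitute for it.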
Best regards,
Ron
From: Xi
Hi,
Sorry to ask this, but how do I compute the sum of 2 (or more) mllib
SparseVectors in Python?
Thanks,
Ron
Short story: I want to write some parquet files so they are pre-partitioned by
the same key. Then, when I read them
back in, joining the two tables on that key should be about as fast as things
can be done.
Can I do that, and if so, how? I don't see how to control the partitioning of a
SQL
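The property being asked for here can be illustrated without Spark at all: if both tables are partitioned with the same hash function and the same partition count, equal keys always land in the same-numbered partition, so a join never has to move rows between partitions. A minimal pure-Python sketch of that invariant (not Spark's actual HashPartitioner, though the `hash(key) % numPartitions` scheme is the same idea; integer keys are used because Python string hashing is randomized per process):

```python
def hash_partition(pairs, num_partitions):
    """Split (key, value) pairs into partitions by hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

left = hash_partition([(1, "a"), (2, "b"), (3, "c")], 4)
right = hash_partition([(2, "x"), (3, "y")], 4)

# Because both sides used the same partitioner, every match is local:
# key k on the left can only pair with key k in the same-numbered partition.
joined = []
for lp, rp in zip(left, right):
    joined += [(k, v, v2) for k, v in lp for k2, v2 in rp if k == k2]
print(joined)  # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Whether Spark SQL can be told that parquet files on disk already satisfy this invariant is exactly the open question in the thread; the sketch only shows why it would pay off.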
Thanks for the info Andy. A big help.
One thing - I think you can figure out which document is responsible for which
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string] for each document and split that
into one RDD for each column.
The values in the doc_string RDD
Hi all,
I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and idf RDD[Vector]s, using the code
below.
But how do I get this back to words and their counts and tf-idf values for
presentation?
val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
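One way around the "getting back to words" problem is to keep an explicit term dictionary rather than relying on hashed indices (HashingTF is one-way: a hash bucket cannot be mapped back to a word). A pure-Python sketch of the same computation over a tiny corpus, keeping word-to-score mappings throughout; the smoothed idf formula used here, log((N + 1) / (df + 1)), is similar in spirit to MLlib's, though the exact constants may differ:

```python
import math
from collections import Counter

docs = [["spark", "sql", "join"],
        ["spark", "mllib", "tfidf"],
        ["parquet", "sql"]]

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(doc):
    """Return {word: tf-idf score} for one document, using a smoothed idf
    so words are preserved as dict keys for presentation."""
    tf = Counter(doc)
    return {t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf}

scores = tf_idf(docs[0])
# A term unique to one document outscores a term appearing in two.
assert scores["join"] > scores["sql"]
```

For presentation this gives you `{word: score}` directly, at the cost of building and broadcasting a vocabulary, which is what HashingTF avoids.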
Indeed it did. Thanks!
Ron
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, November 14, 2014 9:53 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: filtering a SchemaRDD
If I use row[6] instead of row[text] I get what I am looking for. However
Hi all,
I have a SchemaRDD that Is loaded from a file. Each Row contains 7 fields, one
of which holds the text for a sentence from a document.
# Load sentence data table
sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
sentenceRDD.take(3)
Out[20]: [Row(annotID=118,
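The index-versus-name confusion above is easy to reproduce with a plain namedtuple, which behaves much like a Row: positional indexing always works, while name access needs attribute syntax, not a bare identifier. (This is generic Python behavior, not the SchemaRDD API itself; the field names are taken from the Row output above.)

```python
from collections import namedtuple

Row = namedtuple("Row", ["annotID", "text"])
row = Row(annotID=118, text="Some sentence.")

print(row[1])    # positional access, like row[6] in the mail above
print(row.text)  # named access via the attribute
# row[text] would raise NameError: without quotes, Python looks for
# a variable called `text`, not a field of the Row.
```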
Hi all,
Assume I have read the lines of a text file into an RDD:
textFile = sc.textFile("SomeArticle.txt")
Also assume that the sentence breaks in SomeArticle.txt were done by machine
and have some errors, such as the break at Fig. in the sample text below.
Index Text
N...as shown
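The neighbor-access problem above (repairing machine-made sentence breaks such as the one at "Fig.") can be sketched in plain Python. On an RDD the usual trick is zipWithIndex plus a join on index + 1 to pair each line with its successor, but the merge rule itself is simple; the abbreviation list here is an assumption for illustration:

```python
# Line endings that should NOT be treated as sentence breaks (illustrative).
NON_TERMINAL = ("Fig.", "Eq.", "et al.")

def rejoin(sentences):
    """Merge each sentence ending in a known abbreviation into its
    right-hand neighbor, undoing bad machine-made breaks."""
    out = []
    for s in sentences:
        if out and out[-1].endswith(NON_TERMINAL):
            out[-1] = out[-1] + " " + s  # glue onto the previous sentence
        else:
            out.append(s)
    return out

broken = ["Results are shown in Fig.", "3 below.", "They confirm the trend."]
print(rejoin(broken))
# ['Results are shown in Fig. 3 below.', 'They confirm the trend.']
```

The sequential loop works because each merge only looks one element back; distributing it is what requires the index-join (or per-partition handling of boundary elements) discussed in the thread.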
On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Thanks Xiangrui, that looks very helpful.
Best regards,
Ron
-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, September 03, 2014 1:19 PM
To: Daniel, Ronald (ELS-SDG)
Cc: Victor Tso-Guillen; user@spark.apache.org
Subject: Re: Accessing neighboring
Just to point out that the benchmark you point to has Redshift running on HDD
machines instead of SSD, and it is still faster than Shark in all but one case.
Like Gary, I'm also interested in replacing something we have on Redshift with
Spark SQL, as it will give me much greater capability to
From: Gary Malouf [mailto:malouf.g...@gmail.com]
Sent: Wednesday, August 06, 2014 1:17 PM
To: Nicholas Chammas
Cc: Daniel, Ronald (ELS-SDG); user@spark.apache.org
Subject: Re: Regarding tooling/performance vs RedShift
Also, regarding something like redshift not having MLlib built in, much
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are
Strings holding the contents of those (UTF-8) files, but NOT split into lines.
Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB?
Similarly, can such a thing be registered as a table so that I can
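What this describes is essentially SparkContext.wholeTextFiles, which yields one (path, full contents) record per file, never split into lines. On limits: each value becomes a single JVM String, and JVM arrays cap out near 2^31 elements, so a single multi-gigabyte file cannot be one record; the comfortable size below that depends on executor memory. A local sketch of the record shape (pure Python, writing its own temporary files):

```python
import os
import tempfile

def whole_text_files(directory):
    """Return (path, entire-file-contents) pairs for every file in
    `directory`, mimicking wholeTextFiles: one record per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            records.append((path, f.read()))  # whole file, not split into lines
    return records

tmp = tempfile.mkdtemp()
for name, body in [("a.txt", "line1\nline2\n"), ("b.txt", "whole file\n")]:
    with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
        f.write(body)

contents = [text for _, text in whole_text_files(tmp)]
print(contents)  # ['line1\nline2\n', 'whole file\n']
```

Registering such a pair collection as a table should then just be a matter of naming the two columns; whether that scales to the file sizes asked about is the open question.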