Hi all,
Can somebody point me to the implementation of predict() in
LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the
class LogisticRegressionModel, but where is predict()?
Thanks & Regards, Meethu M
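For what it's worth, in the 1.x MLlib source predict() is not declared on LogisticRegressionModel itself but inherited from the parent class GeneralizedLinearModel (org.apache.spark.mllib.regression), which delegates to the subclass's predictPoint(). A small usage sketch, with made-up training data and assuming an existing SparkContext named sc:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny made-up training set, just to get a model object.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

val model = LogisticRegressionWithSGD.train(training, 10)

// predict() comes from GeneralizedLinearModel and calls predictPoint() internally.
val label = model.predict(Vectors.dense(1.0, 0.5))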
try:
mvn test -pl sql -DwildcardSuites=org.apache.spark.sql -Dtest=none
On 12 Nov 2015, at 03:13, weoccc wrote:
Hi,
I am wondering how to run unit test for specific spark component only.
mvn test -DwildcardSuites="org.apache.spark.sql.*"
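In case it helps, the same mechanism also narrows things down to a single suite; a sketch, assuming the module paths and suite names below (-DwildcardSuites matches any suite whose fully-qualified name starts with the given value):

build/mvn test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.DataFrameSuite
build/mvn test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite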
We have done this by blocking but without using BlockMatrix. We used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your block? How much memory are you giving to the executors?
I assume you are running on YARN; if so, you would want to make sure your
This looks suspiciously like a Thrift transport unmarshalling problem, THRIFT-2660.
Spark 1.5 uses Hive 1.2.1; it should have the relevant Thrift JAR too.
Otherwise, you could play with Thrift JAR versions yourself; maybe it will
work, maybe not...
On 13 Nov 2015, at 00:29, Yana Kadiyska
Hi
I'm looking for some benchmarks on joining data frames where most of the
data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is
still in RDBMS. I am only looking at the very first join before any caching
happens, and I assume there will be loss of parallelization because
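I don't have benchmark numbers, but for reference, a minimal sketch of the kind of join being described, assuming a JDBC-accessible reference table and made-up table/column names (the JDBC read is a single partition by default, which is usually the parallelism bottleneck on the first join):

import org.apache.spark.sql.SQLContext

// Assume `sc` is an existing SparkContext.
val sqlContext = new SQLContext(sc)

// Large fact data already in HDFS as parquet.
val facts = sqlContext.read.parquet("hdfs:///data/facts.parquet")

// Small reference table still in the RDBMS, read over JDBC.
val ref = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/refdb")
  .option("dbtable", "reference_table")
  .load()

// The very first join, before any caching.
val joined = facts.join(ref, facts("ref_id") === ref("id"))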
Hi,
I know a task can fail 2 times and only the 3rd failure breaks the entire job.
I am fine with this number of attempts.
What I would like is that, after a task has been tried 3 times, the job
continues with the other tasks.
The job can be marked as "failed", but I want all tasks to run.
Please see my use case.
I read a hadoop
Hello,
We have been using Spark at Elsevier Labs for a while now. Would love to be
added to the “Powered By Spark” page.
Organization Name: Elsevier Labs
URL: http://labs.elsevier.com
Spark components: Spark Core, Spark SQL, MLLib, GraphX.
Use Case: Building Machine Reading Pipeline, Knowledge
Please have a look at http://spark.apache.org/docs/1.4.0/tuning.html
You may also want to use the latest build of JDK 7/8 and use G1GC instead.
I saw considerable reductions in GC time just by doing that.
The rest of the tuning parameters are explained well in the link above.
Best Regards,
Gaurav
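For reference, a sketch of how G1GC is typically enabled for the executors and driver via spark-submit; the application class, jar, and pause target below are placeholders, not recommendations:

spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.MyApp my-app.jar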
Unless you change maxRatePerPartition, a batch is going to contain all of
the offsets from the last known processed to the highest available.
Offsets are not time-based, and Kafka's time-based api currently has very
poor granularity (it's based on filesystem timestamp of the log segment).
There's
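For completeness, a sketch of capping how many records per partition end up in each batch, assuming the direct Kafka stream; the rate value is arbitrary:

import org.apache.spark.SparkConf

// At most 1000 records per second per Kafka partition are pulled into a batch.
val conf = new SparkConf()
  .setAppName("kafka-rate-limited")   // name is a placeholder
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")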
Hi Sab,
Thanks for your response. We’re thinking of trying a bigger cluster, because we
just started with 2 nodes. What we really want to know is whether the code will
scale up with larger matrices and more nodes. I’d be interested to hear how
large a matrix multiplication you managed to do?
Just an update that the Kinesis checkpointing works well with orderly and
kill -9 driver shutdowns when there are fewer than 4 shards. We use 20+.
I created a case with Amazon support since it is the AWS kinesis getRecords
API which is hanging.
Regards,
Heji
On Thu, Nov 12, 2015 at 10:37 AM,
So we tried reading a SequenceFile in Spark and realized that all our records
had ended up becoming identical.
Then one of us found this:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for
each record, directly caching the returned RDD or directly passing it to an
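A common workaround is to materialize a copy of each record before caching or collecting; a sketch, assuming Text keys and values (adjust to your Writable types) and a placeholder path:

import org.apache.hadoop.io.Text

// Convert the reused Writables to immutable values before caching.
val records = sc.sequenceFile("hdfs:///path/to/data", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
  .cache()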
Have you used any partitioned columns when writing as json or parquet?
On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar wrote:
> yes I was expecting that too because of all the metadata generation and
> compression. But I have not seen performance this bad for other parquet
> files
Hi,
I am using Spark 1.5.2 and I noticed the existence of the class
org.apache.spark.sql.columnar.ColumnStatisticsSchema. How can I use it to
calculate column statistics for a DataFrame?
Thanks,
This is very well explained.
Thank you
Hi all,
We're running Spark 1.5.0 on EMR 4.1.0 in AWS and consuming from Kinesis.
We saw the following exception today - it killed the Spark "step":
org.apache.spark.SparkException: Could not read until the end sequence
number of the range
We guessed it was because our Kinesis stream didn't
Hey Friends,
I am trying to use sqlContext.write.parquet() to write a DataFrame to parquet
files. I have the following questions.
1. Number of partitions
The default number of partitions seems to be 200. Is there any way other
than using df.repartition(n) to change this number? I was told
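On question 1, the 200 most likely comes from the shuffle partition default rather than from the parquet writer; a sketch of changing it, assuming that setting is indeed what drives your output partition count:

// Default is 200; this controls the number of partitions produced by shuffles
// (joins, aggregations), which often ends up as the number of output files.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")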
In R, it's easy to split a data set into training, cross-validation, and test
sets. Is there something like this in spark.ml? I am using Python as of now.
My real problem is that I want to randomly select a relatively small data set to
do some initial data exploration. It's not clear to me how, using Spark, I
Hi Everyone
Is there any difference in performance btw the following two joins?
val r1: RDD[(String, String)] = ???
val r2: RDD[(String, String)] = ???
val partNum = 80
val partitioner = new HashPartitioner(partNum)
// Join 1
val res1 =
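The snippet is cut off above, but assuming the two variants being compared are joining with an explicit partitioner versus pre-partitioning both sides, a sketch of what they would look like:

// Join 1: pass the partitioner directly to join.
val res1 = r1.join(r2, partitioner)

// Join 2: partition both sides up front, then join.
val r1p = r1.partitionBy(partitioner).cache()
val r2p = r2.partitionBy(partitioner).cache()
val res2 = r1p.join(r2p)

For a single join the two should behave similarly; pre-partitioning (and caching) mainly pays off when the same RDDs are joined or grouped repeatedly, since the shuffle is not repeated.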
Python does not support libraries as tarballs, so PySpark may not support
that either.
On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi wrote:
> Hi,
>
> Pyspark/spark-submit offers a --py-files handle to distribute python code
> for execution. Currently(version 1.5) only zip
You forgot to create a SparkContext instance:
from pyspark import SparkContext
sc = SparkContext()
On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson
wrote:
> I am having a heck of a time getting Ipython notebooks to work on my 1.5.1
> AWS cluster I created using
I searched the code base and looked at:
https://spark.apache.org/docs/latest/running-on-yarn.html
I didn't find mapred.max.map.failures.percent or its counterpart.
FYI
On Fri, Nov 13, 2015 at 9:05 AM, Nicolae Marasoiu <
nicolae.maras...@adswizz.com> wrote:
> Hi,
>
>
> I know a task can fail 2
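For the Spark side of this, the number of attempts per task is governed by spark.task.maxFailures (default 4); as far as I can tell there is no direct counterpart to mapred.max.map.failures.percent that lets a job succeed while ignoring permanently failed tasks. A sketch of raising the attempt count:

import org.apache.spark.SparkConf

// Allow each task to be attempted up to 8 times before the job is failed.
val conf = new SparkConf()
  .setAppName("fault-tolerant-job")   // name is a placeholder
  .set("spark.task.maxFailures", "8")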
I'm not sure what you mean? I didn't do anything specifically to partition
the columns
On Nov 14, 2015 00:38, "Davies Liu" wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote:
> > I'm writing a ~100 Gb
Hi,
I have an RDD which crashes the driver when being collected. I want to
send the data on its partitions out to S3 without bringing it back to the
driver. I tried calling rdd.foreachPartition, but the data that gets sent has
not gone through the chain of transformations that I need. It's the
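A sketch of writing straight from the executors, assuming finalRdd is whatever the end of the transformation chain produces and the bucket name is a placeholder; the key point is to call the action on the last RDD in the chain, not an earlier one:

// Nothing is collected on the driver; each executor writes its own partitions.
finalRdd.map(_.toString).saveAsTextFile("s3n://my-bucket/output/")

// If a custom upload client is needed, create it inside foreachPartition so it
// is never serialized from the driver:
finalRdd.foreachPartition { records =>
  // build the client here, then iterate over `records` and upload them
}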
Do you have partitioned columns?
On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote:
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way
I'm using Spark to read in data from many files and write it back out in
Parquet format for ease of use later on. Currently, I'm using this code:
val fnamesRDD = sc.parallelize(fnames,
  math.ceil(fnames.length.toFloat / numfilesperpartition).toInt)
val results =
Hi,
I am trying a simple file streaming example using
Spark Streaming (spark-streaming_2.10, version 1.5.1):

import org.apache.spark.SparkConf;

public class DStreamExample {
  public static void main(final String[] args) {
    final SparkConf sparkConf = new SparkConf();
    sparkConf.setAppName("SparkJob");
Hi Gaurav,
Your graph can be saved to graph databases like Neo4j or Titan through
their drivers, which eventually save it to disk.
BR,
Todd
Gaurav Kumar
gauravkuma...@gmail.com> wrote on Friday, 13 November 2015 at 22:08:
> Hi,
>
> I was wondering how to save a graph to disk and load it back again. I know
> how
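If staying inside Spark is an option, another common approach (there is no single built-in call that saves the whole Graph) is to persist the vertex and edge RDDs and rebuild the Graph on load; a sketch, assuming String attributes, an existing SparkContext sc, and made-up paths:

import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Save: write vertices and edges out as object files (paths are placeholders).
graph.vertices.saveAsObjectFile("hdfs:///graphs/g1/vertices")
graph.edges.saveAsObjectFile("hdfs:///graphs/g1/edges")

// Load: read them back and reconstruct the graph.
val vertices = sc.objectFile[(VertexId, String)]("hdfs:///graphs/g1/vertices")
val edges = sc.objectFile[Edge[String]]("hdfs:///graphs/g1/edges")
val loaded = Graph(vertices, edges)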
The RDD has a takeSample method where you can supply the flag for sampling
with or without replacement as well as the number of elements to sample
(the related sample method takes a fraction instead).
On Nov 14, 2015 2:51 AM, "Andy Davidson"
wrote:
> In R, its easy to split a data set into training, crossValidation, and
> test set. Is there
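On the original question: besides takeSample on RDDs, DataFrames have randomSplit and sample, which cover both the train/CV/test split and grabbing a small exploratory subset; a sketch, assuming a DataFrame named df and arbitrary weights/fractions:

// Split into training / cross-validation / test sets.
val Array(train, cv, test) = df.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)

// Sample roughly 1% of the rows (without replacement) for initial exploration.
val small = df.sample(withReplacement = false, fraction = 0.01, seed = 42L)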
Hi,
The BlockMatrix multiplication should be much more efficient on the current
master (and will be available with Spark 1.6). Could you please give that a
try if you have the chance?
Thanks,
Burak
On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
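For anyone trying this, a minimal sketch of a BlockMatrix multiply; the entries and block sizes are made up and `sc` is an existing SparkContext:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build two matrices from coordinate entries and convert them to BlockMatrix.
val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 1, 3.0), MatrixEntry(1, 0, 4.0)))

// Block size of 1024 x 1024 chosen arbitrarily for illustration.
val blockA = new CoordinateMatrix(entriesA).toBlockMatrix(1024, 1024).cache()
val blockB = new CoordinateMatrix(entriesB).toBlockMatrix(1024, 1024).cache()

val product = blockA.multiply(blockB)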
I guess this is not related to SparkR. It seems that Spark can't pick up the
hostname/IP address of the RM.
Make sure you have correctly set the YARN_CONF_DIR env var and have configured
the address of the RM in yarn-site.xml.
From: Amit Behera [mailto:amit.bd...@gmail.com]
Sent: Friday, November 13, 2015 9:38 PM
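In case it helps, the usual yarn-site.xml entry Spark needs in order to locate the RM is just the hostname, with YARN_CONF_DIR pointing at the directory containing that file; the value below is a placeholder:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>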
Never mind; when I switched to Spark 1.5.0, my code works as written and is
pretty fast! Looking at some Parquet related Spark jiras, it seems that
Parquet is known to have some memory issues with buffering and writing, and
that at least some were resolved in Spark 1.5.0.
Tip: jump straight to 1.5.2; it has some key bug fixes.
Sent from my phone
> On Nov 13, 2015, at 10:02 PM, AlexG wrote:
>
> Never mind; when I switched to Spark 1.5.0, my code works as written and is
> pretty fast! Looking at some Parquet related Spark jiras, it seems that
Hi,
I am using the Spark Streaming checkpointing mechanism and reading the data
from Kafka. The window duration for my application is 2 hrs with a sliding
interval of 15 minutes.
So, my batches run at the following intervals...
09:45
10:00
10:15
10:30 and so on
Suppose my running batch dies at 09:55
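For recovering after a dead driver at, say, 09:55, the usual pattern is StreamingContext.getOrCreate with a checkpoint directory so the window state is rebuilt from the checkpoint; a sketch, with the directory, app name, and batch interval made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("windowed-kafka-job")
  val ssc = new StreamingContext(conf, Minutes(15))
  ssc.checkpoint("hdfs:///checkpoints/windowed-kafka-job")
  // set up the Kafka input stream and the 2 hr window / 15 min slide here
  ssc
}

// On restart this rebuilds the context and pending window state from the checkpoint.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/windowed-kafka-job", createContext _)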
Hi,
I am facing an issue while integrating Spark with Spring.
I am getting "java.lang.IllegalStateException: Cannot deserialize
BeanFactory with id" errors for all beans. I have tried a few solutions
available on the web. Please help me out to solve this issue.
Few details:
Java : 8
Spark : 1.5.1
Hi,
The off-heap memory usage of the 3 Spark executor processes keeps increasing
constantly until the boundaries of the physical RAM are hit. This happened two
weeks ago, at which point the system came to a grinding halt because it was
unable to spawn new processes. At such a moment restarting
Hi All,
Just started understanding / getting hands-on with Spark,
Streaming and MLlib. We are in the design phase and need suggestions on the
training data storage requirement.
Batch Layer: Our core systems generate data which we will be using as batch
data, currently SQL
I am using Spark 1.4 and my application is spending a lot of time in GC, around
60-70% of the time for each task.
I am using the parallel GC.
Can somebody please help as soon as possible?
Thanks,
Renu
The reserved cores are to prevent starvation, so that user B can run jobs
when user A's job is already running and using almost all of the cluster.
You can change your scheduler configuration to use more cores.
Regards
Sab
On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote:
>
Hi,
I was wondering how to save a graph to disk and load it back again. I know
how to save vertices and edges to disk and construct the graph from them,
but I'm not sure if there's any method to save the graph itself to disk.
Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91
EMR 4.1.0 + Spark 1.5.0 + YARN Resource Allocation
http://stackoverflow.com/q/33488869/1366507?sem=2