Hi Disha,
This is a good question. We plan to elaborate on it in our talk at the upcoming
Spark Summit. Fewer workers mean less compute power; more workers mean more
communication overhead. So there exists an optimal number of workers for
solving an optimization problem with batch gradient given
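As a toy illustration of that trade-off (the cost model and all constants below are made up for the sketch, not from the talk), per-iteration time can be modeled as a compute term that shrinks with more workers plus a communication term that grows with them:

```python
# Toy model: per-iteration time with N workers is
#   t(N) = compute / N  (perfectly parallel work)
#        + comm * N     (communication overhead grows with workers).
# The minimum is near N* = sqrt(compute / comm); constants are illustrative.
def iteration_time(n_workers, compute=100.0, comm=1.0):
    return compute / n_workers + comm * n_workers

# With compute=100 and comm=1 the best worker count is sqrt(100/1) = 10.
best = min(range(1, 51), key=iteration_time)
print(best, iteration_time(best))  # 10 20.0
```

Fewer workers than the optimum waste time computing; more waste time communicating.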
Hi Joseph,
There seems to be no improvement if I run it with more partitions or bigger
depth:
N = 6 Avg time: 13.49157910868
N = 7 Avg time: 8.929480508
N = 8 Avg time: 14.50712347198
N = 9 Avg time: 13.85487164533
Depth = 3
N = 2 Avg time: 8.85389534633
N = 5 Avg time: 15.99157492466
Thanks for everyone's patience with this email thread. I have fixed my
environmental problem and my tests run cleanly now. This seems to be a
problem that afflicts modern JVMs on Mac OS X (and maybe other Unix
variants). The following can happen on these platforms:
InetAddress.getLocalHost().is
bq. Access is denied
Please check the permissions of the path mentioned.
On Thu, Oct 15, 2015 at 3:45 PM, Annabel Melongo <
melongo_anna...@yahoo.com.invalid> wrote:
> I was trying to build a cloned version of Spark on my local machine using
> the command:
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.v
I was trying to build a cloned version of Spark on my local machine using the
command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
clean package. However, I got the error: [ERROR] Failed to execute goal
org.apache.maven.plugins:maven-shade-plugin:2.4.1:shade (default) on
>
> In Hive, the ambiguous name can be resolved by using the table name as a
> prefix, but it seems DataFrames don't support it (I mean the DataFrame API
> rather than Spark SQL)
You can do the same using pure DataFrames.
Seq((1,2)).toDF("a", "b").registerTempTable("y")
Seq((1,4)).toDF("a", "b").register
Rick,
Try setting the environment variable SPARK_LOCAL_IP=127.0.0.1 in your
conf/spark-env.sh (if not done yet) ...
Regards,
- Steve
From: Richard Hillegas
Sent: Thursday, October 15, 2015 1:50 PM
To: Richard Hillegas
Cc: Dev
Subject: Re: Network-related environmental problem when running JDB
Continuing this lively conversation with myself (hopefully this archived
thread may be useful to someone else in the future):
I set the following environment variable as recommended by this page:
http://stackoverflow.com/questions/29906686/failed-to-bind-to-spark-master-using-a-remote-cluster-wit
For the record, I get the same error when I simply try to boot the spark
shell:
bash-3.2$ bin/spark-shell
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging
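Those log4j warnings just mean no configuration file was found on the classpath. A minimal sketch of a `conf/log4j.properties` that gives log4j a console appender (adapted from the template Spark ships as `conf/log4j.properties.template`; the exact pattern string is illustrative):

```properties
# Send everything at INFO and above to stderr
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```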
Hi Alexander,
Thanks for your reply. Actually, I am working with a modified version of the
actual MNIST dataset (maximum samples = 8.2M):
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. I
have been running different-sized versions (1, 10, 50, 1M, 8M
samples) on di
I am seeing what look like environmental errors when I try to run a test on
a clean local branch which has been sync'd to the head of the development
trunk. I would appreciate advice about how to debug or hack around this
problem. For the record, the test ran cleanly last week. This is the
experi
My apologies for mixing up what was being referred to in that case! :)
Mark.
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14629.html
Sent from the Apache Spark Developers List mailing list a
To clarify, we're asking about the *spark.sql.tungsten.enabled* flag, which
was introduced in Spark 1.5 and enables Project Tungsten optimizations in
Spark SQL. This option is set to *true* by default in Spark 1.5+ and exists
primarily to allow users to disable the new code paths if they encounter
Are you referring to spark.shuffle.manager=tungsten-sort? If so, we saw the
default value as still being the regular sort, and since it was only
first introduced in 1.5, we were actually waiting a bit to see if anyone
ENABLED it as opposed to DISABLING it, since it's disabled by default! :)
I rec
Hi, I made a clustering algorithm in Scala/Spark during my internship. I
would like to contribute to MLlib, but I don't know how. I am doing my best
to follow these instructions:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuideli
OK, it turns out I was using the wrong LinearRegressionModel, which was in
package org.apache.spark.ml.regression.
On Thu, Oct 15, 2015 at 3:23 PM, Fazlan Nazeem wrote:
> This is the API doc for LinearRegressionModel. It does not implement
> PMMLExportable
>
> https://spark.apache.org/docs/lat
This is the API doc for LinearRegressionModel. It does not implement
PMMLExportable
https://spark.apache.org/docs/latest/api/java/index.html
On Thu, Oct 15, 2015 at 3:11 PM, canan chen wrote:
> The method toPMML is in trait PMMLExportable
>
> LinearRegressionModel has this trait, you should be
The method toPMML is in trait PMMLExportable
LinearRegressionModel has this trait; you should be able to call
LinearRegressionModel#toPMML
On Thu, Oct 15, 2015 at 5:25 PM, Fazlan Nazeem wrote:
> Hi
>
> I am trying to export a LinearRegressionModel in PMML format. According to
> the followi
Hi
I am trying to export a LinearRegressionModel in PMML format. According to
the following resource[1] PMML export is supported for
LinearRegressionModel.
[1] https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
But there is *no* *toPMML* method in the *LinearRegressionModel* class alt
Hi Arijit,
my understanding is the following:
RDD actions will at some point call the runJob method of a SparkContext.
That runJob method calls the clean method, which in turn calls
ClosureCleaner.clean, which removes unneeded objects from closures and also
checks whether they are serializable.
The ne
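That serializability check can be sketched in miniature with Python's pickle module (this is an analogy to illustrate the idea, not Spark's actual Scala implementation; the helper name is made up):

```python
# Sketch: before shipping a closure to workers, try to serialize it and
# fail fast with a clear error, like Spark's "Task not serializable".
import pickle

def ensure_serializable(obj):
    try:
        pickle.dumps(obj)
    except Exception as e:
        raise ValueError(f"Task not serializable: {e}")
    return obj

ensure_serializable((1, "ok"))  # plain data passes the check

# A lambda is not picklable, so the check rejects it up front, the same
# way Spark rejects closures that capture non-serializable state.
```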
If DataFrame aspires to be more than a vehicle for SQL, then I think it
would be a mistake to allow multiple columns with the same name; it is very
confusing. Pandas indeed allows this, and it has led to many bugs. R does
not allow it for data.frame (it renames the duplicate names).
I would consider a CSV with duplicate
True. As long as we can ensure the correct messages are printed out, users
can correct their apps easily. For example: Reference 'name' is ambiguous,
could be: name#1, name#5.;
Thanks,
Xiao Li
2015-10-14 23:58 GMT-07:00 Reynold Xin :
> That could break a lot of applications. In particular, a lot
That could break a lot of applications. In particular, a lot of input data
sources (CSV, JSON) don't have clean schemas and can have duplicate column
names.
For the case of join, maybe a better solution is to ask for a left/right
prefix/suffix in the user code, similar to what pandas does.
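A minimal sketch of that suffixing idea in pure Python (the helper below is hypothetical, not an existing Spark API; `suffixes` mirrors the convention of pandas' `merge(..., suffixes=("_x", "_y"))`):

```python
# After a join, rename clashing column names by appending a suffix to
# each side, so every output column name is unambiguous.
def suffix_duplicates(left_cols, right_cols, suffixes=("_x", "_y")):
    clashes = set(left_cols) & set(right_cols)
    left = [c + suffixes[0] if c in clashes else c for c in left_cols]
    right = [c + suffixes[1] if c in clashes else c for c in right_cols]
    return left + right

# Joining two tables that both have a column "a":
print(suffix_duplicates(["a", "b"], ["a", "c"]))  # ['a_x', 'b', 'a_y', 'c']
```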
On Wed,