Re: Problem embedding GaussianMixtureModel in a closure
Hi Yanbo,

thanks for the info. Is it likely to change in the (near :) ) future? The ability to call this function only on local data (i.e. not inside an RDD operation) seems to be a rather serious limitation.

cheers,
Tomasz

On 02.01.2016 09:45, Yanbo Liang wrote:

Hi Tomasz,

The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some of the MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called in map().

Cheers
Yanbo

2016-01-01 4:12 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl>:

Dear All,

I'm trying to implement a procedure that iteratively updates an RDD using results from GaussianMixtureModel.predictSoft. In order to avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:

###
for i in xrange(10):
    gmm = GaussianMixture.train(rdd, 2)

    def getSafePredictor(unsafeGMM):
        return lambda x: \
            (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)

    safePredictor = getSafePredictor(gmm)
    predictionsRDD = (labelledpointrddselectedfeatsNansPatched
                      .map(safePredictor))
    print predictionsRDD.take(1)
    # (... rest of code - update rdd with results from predictionsRDD)
###

Unfortunately this ends with:

###
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
###

Any idea why I'm getting this behaviour? My expectation was that the GMM should be a "simple" object without a SparkContext in it. I'm using Spark 1.5.2.

Thanks,
Tomasz

ps As a workaround I'm currently doing

def getSafeGMM(unsafeGMM):
    return lambda x: unsafeGMM.predictSoft(x)

safeGMM = getSafeGMM(gmm)
predictionsRDD = \
    safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))

which works fine. If possible, I would like to avoid this approach, since it would require building another closure over gmm.gaussians later in my code.
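One way around this limitation is to pull the fitted parameters out of the model on the driver and compute the soft assignments directly on the workers. Below is a minimal sketch, assuming numpy and scipy are available on the executors and that `rdd` holds LabeledPoints; it recomputes the responsibilities r_k(x) = w_k N(x | mu_k, sigma_k) / sum_j w_j N(x | mu_j, sigma_j) itself instead of calling MLlib's predictSoft, so exact numerical agreement with MLlib is not guaranteed:

import numpy as np
from scipy.stats import multivariate_normal

# Extract plain parameters on the driver. These are ordinary numpy
# structures, so the closure below never captures the JVM-backed model
# (and therefore never needs a SparkContext on the workers).
weights = list(gmm.weights)
components = [(g.mu.toArray(), g.sigma.toArray()) for g in gmm.gaussians]

def soft_predict(x):
    # Responsibilities: r_k is proportional to w_k * N(x | mu_k, sigma_k).
    lik = np.array([w * multivariate_normal.pdf(x, mean=mu, cov=sigma)
                    for w, (mu, sigma) in zip(weights, components)])
    return lik / lik.sum()

predictionsRDD = rdd.map(lambda lp: (soft_predict(lp.features.toArray()), lp))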
Problem embedding GaussianMixtureModel in a closure
Dear All,

I'm trying to implement a procedure that iteratively updates an RDD using results from GaussianMixtureModel.predictSoft. In order to avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:

###
for i in xrange(10):
    gmm = GaussianMixture.train(rdd, 2)

    def getSafePredictor(unsafeGMM):
        return lambda x: \
            (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)

    safePredictor = getSafePredictor(gmm)
    predictionsRDD = (labelledpointrddselectedfeatsNansPatched
                      .map(safePredictor))
    print predictionsRDD.take(1)
    # (... rest of code - update rdd with results from predictionsRDD)
###

Unfortunately this ends with:

###
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
###

Any idea why I'm getting this behaviour? My expectation was that the GMM should be a "simple" object without a SparkContext in it. I'm using Spark 1.5.2.

Thanks,
Tomasz

ps As a workaround I'm currently doing

def getSafeGMM(unsafeGMM):
    return lambda x: unsafeGMM.predictSoft(x)

safeGMM = getSafeGMM(gmm)
predictionsRDD = \
    safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))

which works fine. If possible, I would like to avoid this approach, since it would require building another closure over gmm.gaussians later in my code.
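The workaround from the ps can be extended to also carry the component means without building another closure over gmm.gaussians: read the means once on the driver as plain arrays, then call predictSoft on the whole features RDD. A sketch under the assumption that predictSoft applied to an RDD keeps rows in order, so results can be paired back with the inputs (`labelled_rdd` stands in for the original RDD name):

# Read the means once, on the driver; plain numpy arrays are safe to
# close over later, unlike the model itself.
mus = [g.mu.toArray() for g in gmm.gaussians]

# predictSoft invoked on a whole RDD goes through the driver-side model,
# so no SparkContext is needed inside any task closure.
features = labelled_rdd.map(lambda lp: lp.features)
soft = gmm.predictSoft(features)

# Attach the means to each soft-assignment vector.
predictionsRDD = soft.map(lambda probs: (probs, mus))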
Re: Union of many RDDs taking a long time
Hi Matt,

is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do lots of unnecessary shuffles. Also, for a really large number of inputs this approach can lead to stack overflow errors due to too many nested RDD.union calls. A safer approach is to call union on the SparkContext once, as soon as you have all the RDDs ready. In Python it looks this way:

rdds = []
for i in xrange(cnt):
    rdd = ...
    rdds.append(rdd)
finalRDD = sparkContext.union(rdds)

HTH,
Tomasz

On 18.06.2015 at 02:53, Matt Forbes wrote:

I have multiple input paths, each containing data that needs to be mapped in a slightly different way into a common data structure. My approach boils down to:

RDD<T> rdd = null;
for (Configuration conf : configurations) {
    RDD<T> nextRdd = loadFromConfiguration(conf);
    rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
    rdd = rdd.coalesce(nextRdd.partitions().size());
}

Now, for a small number of inputs there doesn't seem to be a problem, but the full set, which is about 60 sub-RDDs coming in at around 500MM total records, takes a very long time to construct. For a simple load-then-count example job, it takes 13 minutes total, of which the count() task accounts for only 2 minutes. Is there something I should be doing differently here? If you can't tell, this is in Java, so my RDD is probably some mess of nested wrapped RDDs, but I'm not sure if that would be the real issue.

Thanks,
Matt
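For comparison, here is the same flat-union pattern as a minimal PySpark sketch, with coalesce deferred to a single call at the end; `load_from_configuration`, `configurations`, and the partition target are hypothetical stand-ins for whatever the job actually needs:

from pyspark import SparkContext

sc = SparkContext(appName="flat-union-sketch")

# Build every input RDD first; no union or coalesce inside the loop.
# load_from_configuration is a hypothetical per-input loader.
rdds = [load_from_configuration(conf) for conf in configurations]

# One flat union through the SparkContext avoids deeply nested
# UnionRDD wrappers; coalesce is then applied exactly once.
target = sum(r.getNumPartitions() for r in rdds)
finalRDD = sc.union(rdds).coalesce(target)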
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Hi,

thanks for the answer, I'll open a ticket. In the meantime I have found a workaround. The recipe is the following:

1. Create a new account/group on all machines (let's call it sparkuser). Run Spark from this account.

2. Add your user to the sparkuser group.

3. If you decide to write an RDD/parquet file under a directory workdir, you need to execute the following (just once, before running spark-submit):

chgrp sparkuser workdir
chmod g+s workdir
setfacl -d -m g::rwx workdir

(the first two steps can also be replaced by "newgrp sparkuser", but this way all your files will be created with the sparkuser group)

Then calls like

rdd.saveAsPickleFile(workdir + "/somename")

work just fine. The above solution has one serious problem - any other user from the sparkuser group will be able to overwrite your saved data.

cheers,
Tomasz

On 20.05.2015 at 23:08, Davies Liu wrote:

Could you file a JIRA for this? The executor should run under the user who submitted the job, I think.

On Wed, May 20, 2015 at 2:40 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Thanks for the suggestion. I have tried playing with it; sc.sparkUser() gives me the expected user name, but it doesn't solve the problem. From a quick search through the Spark code it seems to me that this setting is effective only for YARN and Mesos. I think a workaround for the problem could be using --deploy-mode cluster (not 100% convenient, since it disallows any interactive work), but this is not supported for Python-based programs.

Cheers,
Tomasz

On 20.05.2015 at 10:57, Iulian Dragoș wrote:

You could try setting `SPARK_USER` to the user under which your workers are running. I couldn't find many references to this variable, but at least Yarn and Mesos take it into account when spawning executors. Chances are that standalone mode also does it.

iulian

On Wed, May 20, 2015 at 9:29 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE
Re: saveAsTextFile() part- files are missing
Hi,

it looks like you are writing to a local filesystem. Could you try writing to a location visible to all nodes (master and workers), e.g. an NFS share?

HTH,
Tomasz

On 21.05.2015 at 17:16, rroxanaioana wrote:

Hello!
I just started with Spark. I have an application which counts words in a file (a 1 MB file). The file is stored locally. I loaded the file using native code and then created the RDD from it.

JavaRDD<String> rddFromFile = context.parallelize(myFile, 2);
JavaRDD<String> words = rddFromFile.flatMap(...);
JavaPairRDD<String, Integer> pairs = words.mapToPair(...);
JavaPairRDD<String, Integer> counter = pairs.reduceByKey(...);
counter.saveAsTextFile("file:///root/output");
context.close();

I have one master and 2 slaves. I run the program from the master node. The output directory is created on the master node and on the 2 worker nodes. On the master node I have only one file, _SUCCESS (empty), and on the worker nodes I have _temporary directories. I printed the counter at the console; the result seems OK. What am I doing wrong?

Thank you!
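The same idea in a minimal PySpark sketch (the file path and HDFS URL below are hypothetical) - the point is that saveAsTextFile must target storage every executor can reach, e.g. HDFS or a network share mounted on all nodes, because each executor writes its own part- files:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-shared-output")

# Read the local file on the driver, then parallelize, as in the original post.
words = sc.parallelize(open("/root/myfile.txt").read().split(), 2)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# A URI visible to all nodes; each executor writes its part- file here.
counts.saveAsTextFile("hdfs://namenode:8020/user/roxana/output")

sc.stop()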
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
##

[2]
##
15/05/19 14:45:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, wn23023.cis.gov.pl): java.io.IOException: Mkdirs failed to create file:/mnt/lustre/bigdata/med_home/tmp/test18/namesAndAges.parquet2/_temporary/0/_temporary/attempt_201505191445_0009_r_00_0
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906
Re: Multi user setup and saving a DataFrame / RDD to a network exported file system
Thanks for the suggestion. I have tried playing with it; sc.sparkUser() gives me the expected user name, but it doesn't solve the problem. From a quick search through the Spark code it seems to me that this setting is effective only for YARN and Mesos. I think a workaround for the problem could be using --deploy-mode cluster (not 100% convenient, since it disallows any interactive work), but this is not supported for Python-based programs.

Cheers,
Tomasz

On 20.05.2015 at 10:57, Iulian Dragoș wrote:

You could try setting `SPARK_USER` to the user under which your workers are running. I couldn't find many references to this variable, but at least Yarn and Mesos take it into account when spawning executors. Chances are that standalone mode also does it.

iulian

On Wed, May 20, 2015 at 9:29 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Hi,

thanks for the answer. The rights are

drwxr-xr-x 3 tfruboes all 5632 05-19 15:40 test19EE/

I have tried setting the rights to 777 for this directory prior to execution. This does not get propagated down the chain, i.e. the directory created as a result of the save call (namesAndAges.parquet2 in the path in the dump [1] below) is created with drwxr-xr-x rights (owned by the user submitting the job, i.e. tfruboes). The temp directories created inside namesAndAges.parquet2/_temporary/0/ (e.g. task_201505200920_0009_r_01) are owned by root, again with drwxr-xr-x access rights.

Cheers,
Tomasz

On 19.05.2015 at 23:56, Davies Liu wrote:

It surprises me, could you list the owner information of /mnt/lustre/bigdata/med_home/tmp/test19EE/ ?

On Tue, May 19, 2015 at 8:15 AM, Tomasz Fruboes <tomasz.frub...@fuw.edu.pl> wrote:

Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690
Multi user setup and saving a DataFrame / RDD to a network exported file system
Dear Experts,

we have a spark cluster (standalone mode) in which the master and workers are started from the root account. Everything runs correctly up to the point when we try operations such as

dataFrame.select("name", "age").save(ofile, "parquet")

or

rdd.saveAsPickleFile(ofile)

where ofile is a path on a network-exported filesystem (visible on all nodes; in our case this is Lustre, I guess on NFS the effect would be similar).

Unsurprisingly, the temp files created on the workers are owned by root, which then leads to a crash (see [1] below). Is there a solution/workaround for this (e.g. controlling the file creation mode of the temporary files)?

Cheers,
Tomasz

ps I've tried to google this problem - there are a couple of similar reports, but no clear answer/solution found.

ps2 For completeness - running the master/workers as a regular user solves the problem only for that user. For other users submitting to this master, the result is given in [2] below.

[0] Cluster details: master/workers on CentOS 6.5, Spark 1.3.1 prebuilt for Hadoop 2.4 (same behaviour for the 2.6 build)

[1]
##
File "/mnt/home/tfruboes/2015.05.SparkLocal/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_01/part-r-2.parquet; isDirectory=false; length=534; replication=1; blocksize=33554432; modification_time=1432042832000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-2.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
    at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:43)
    at org.apache.spark.sql.parquet.ParquetRelation2.insert(newParquet.scala:690)
    at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:129)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
##

[2]
##
15/05/19 14:45:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, wn23023.cis.gov.pl): java.io.IOException: Mkdirs failed to create file:/mnt/lustre/bigdata/med_home/tmp/test18/namesAndAges.parquet2/_temporary/0/_temporary/attempt_201505191445_0009_r_00_0
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:667)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:689)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at