Hi Chen,
Please see the bug I filed at
https://issues.apache.org/jira/browse/SPARK-2984 with the
FileNotFoundException on _temporary directory issue.
Andrew
On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote:
Not sure which stalled HDFS client issue you're referring to,
This is how I used to do it:
// Create a list of jars
List<String> jars = Lists.newArrayList(
    "/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar",
    "ADD-All-The-Jars-Here"
);
It sounds like your data does not all have the same dimension? That's
a decent guess. Have a look at the assertions in this method.
On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.) y...@ford.com wrote:
I am trying to train a KMeans model with sparse vectors in Spark 1.0.1.
When I run the
// assuming Spark 1.0
Hi Baoqiang,
In my experience for the standalone cluster you need to set
SPARK_WORKER_DIR not SPARK_LOCAL_DIRS to control where shuffle files are
written. I think this is a documentation issue that could be improved, as
Actually the program hangs just by calling dataAllRDD.count(). I suspect
creating the RDD is not successful when its elements are too big. When nY =
3000, dataAllRDD.count() works (each element of dataAll = 3000*400*64 bits =
9.6 MB), but when nY = 4000, it hangs (4000*400*64 bits = 12.8 MB).
-incubator, +user
So, are you trying to transpose your data?
val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10))).repartition(2)
First you could pair each value with its position in its list:
val withIndex = rdd.flatMap(_.zipWithIndex)
then group by that position, and discard the
Hi,
when I run some spark application on my local machine using spark-submit:
$SPARK_HOME/bin/spark-submit --driver-memory 1g <class> <jar>
When I try to interrupt the computation with Ctrl-C, it interrupts the current
stage, but then it waits and exits after around 5 min, and sometimes doesn't
exit at all, and the
The long lineage causes a long/deep Java object tree (DAG of RDD objects),
which needs to be serialized as part of the task creation. When
serializing, the whole object DAG needs to be traversed, leading to the
StackOverflowError.
TD
On Mon, Aug 11, 2014 at 7:14 PM, randylu randyl...@gmail.com
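TD's point can be reproduced outside Spark: any sufficiently deep object chain blows the stack when walked recursively, which is what serializing a long lineage effectively does. A minimal sketch in plain Scala (the `Node` type and `depth` function are made up for illustration, not Spark internals):

```scala
// Illustration only (plain Scala, no Spark): a deep parent chain,
// like the object graph behind a very long RDD lineage.
final case class Node(parent: Option[Node])

// Walking the chain recursively, as Java serialization effectively does,
// needs one stack frame per link.
def depth(n: Node): Int = n.parent match {
  case Some(p) => 1 + depth(p)
  case None    => 0
}

// A million links: far deeper than a default JVM stack can traverse.
val deep = (1 to 1000000).foldLeft(Node(None))((n, _) => Node(Some(n)))
val overflowed =
  try { depth(deep); false }
  catch { case _: StackOverflowError => true }
```

The usual remedy in Spark, mentioned in threads like this one, is to call `rdd.checkpoint()` periodically so the lineage is truncated instead of growing without bound.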
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if it is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
didn't use the right number of
For linear models, the constructors are now public. You can save the
weights to HDFS, then load the weights back and use the constructor to
create the model. -Xiangrui
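The save-and-rebuild cycle Xiangrui describes can be sketched with a stand-in model class (local files instead of HDFS; the class name and one-number-per-line file layout here are assumptions for illustration, not the MLlib API):

```scala
import java.nio.file.{Files, Paths}

// Stand-in for an MLlib linear model: the real public constructors
// similarly take the weights plus an intercept.
final case class LinearModel(weights: Array[Double], intercept: Double)

// Save: intercept first, then one weight per line
// (on a cluster you would write to HDFS instead of the local FS).
def save(path: String, m: LinearModel): Unit =
  Files.write(Paths.get(path), (m.intercept +: m.weights.toSeq).mkString("\n").getBytes)

// Load: parse the numbers back and call the constructor.
def load(path: String): LinearModel = {
  val nums = new String(Files.readAllBytes(Paths.get(path))).split("\n").map(_.toDouble)
  LinearModel(nums.tail, nums.head)
}
```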
On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote:
Hello:
I want to know, if I use history data to
Thanks for your answer.
Yes, I want to transpose data.
At this point, I have one more question.
I tested it with
RDD1
List(1, 2, 3, 4, 5)
List(6, 7, 8, 9, 10)
List(11, 12, 13, 14, 15)
List(16, 17, 18, 19, 20)
And the result is...
ArrayBuffer(11, 1, 16, 6)
ArrayBuffer(2, 12, 7, 17)
I was hoping I could make the system behave as a blocking queue: if the
output is too slow, buffers (storage space for RDDs) fill up, then block
instead of dropping existing RDDs, until the input itself blocks (slows down
its consumption).
On a side note I was wondering: is there the same
Sure, just add .toList.sorted in there. Putting it together in one big
expression:
val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10)))
val result =
rdd.flatMap(_.zipWithIndex).groupBy(_._2).values.map(_.map(_._1).toList.sorted)
List(2, 7)
List(1, 6)
List(4, 9)
List(3, 8)
List(5, 10)
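The same pipeline can be checked locally on plain Scala collections, no cluster needed (a sketch; the only addition is sorting the groups for a deterministic order, since RDD output order is not guaranteed, as the unordered result above shows):

```scala
// Local analogue of:
// rdd.flatMap(_.zipWithIndex).groupBy(_._2).values.map(_.map(_._1).toList.sorted)
val rows = List(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10))
val transposed = rows
  .flatMap(_.zipWithIndex)      // pair every value with its column index
  .groupBy(_._2)                // one group per column index
  .toList.sortBy(_._1)          // deterministic order (RDDs don't guarantee one)
  .map { case (_, pairs) => pairs.map(_._1).sorted }
// transposed: List(List(1, 6), List(2, 7), List(3, 8), List(4, 9), List(5, 10))
```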
spark.speculation was not set; is there any speculative execution on the tachyon side?
tachyon-env.sh only changed following
export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export
More interesting: if spark-shell is started on the master node (test01),
then
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")
14/08/12 11:42:06 INFO : initialize(tachyon://...
...
...
14/08/12 11:42:06 INFO : File does not exist:
I figured it out. My indices parameters for the sparse vector were messed up. It
was a good lesson for me:
When using Vectors.sparse(int size, int[] indices, double[] values) to
generate a vector, size is the size of the whole vector, not just the number of
elements with values. The indices
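The sizing rule just described can be illustrated with a toy stand-in for Vectors.sparse (a hypothetical helper, not the MLlib API):

```scala
// Toy stand-in for Vectors.sparse(size, indices, values):
// size is the length of the whole vector; indices mark where the
// given values sit; every other position is 0.0.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "one index per value")
  require(indices.forall(i => i >= 0 && i < size), "indices must lie in [0, size)")
  val dense = Array.fill(size)(0.0)
  for ((i, v) <- indices.zip(values)) dense(i) = v
  dense
}
// sparseToDense(5, Array(1, 3), Array(2.0, 4.0))
//   == Array(0.0, 2.0, 0.0, 4.0, 0.0)
```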
Hi, TD. Thanks very much! I got it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tp5649p11980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi All,
I have a question regarding LDA topic modeling. I need to do topic modeling
on ad data.
Does MLBase support LDA topic modeling?
Or any stable, tested LDA implementation on Spark?
BR,
Aslan
Hi Xiangrui,
Thanks for your reply!
Yes, our data is very sparse, but RDD.repartition invokes
RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has
the same effect as #2, right?
As for CombineInputFormat, although I haven't tried it, it sounds like it
will combine multiple
1. Be careful: HDFS is better for large files, not bunches of small files.
2. If that's really what you want, roll your own.
def writeLines(iterator: Iterator[(String, String)]) = {
val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
try {
while
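A completed version of that per-key writer might look like this (a sketch; the original message is truncated here, so the loop body and the key-based file naming are assumptions):

```scala
import java.io.{BufferedWriter, FileWriter}
import scala.collection.mutable

// Write each (key, line) pair to a file named after the key.
// Intended for use as rdd.foreachPartition(writeLines).
def writeLines(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      // one writer per key, opened lazily; "<key>.txt" is an assumed naming scheme
      val w = writers.getOrElseUpdate(key, new BufferedWriter(new FileWriter(key + ".txt", true)))
      w.write(line)
      w.newLine()
    }
  } finally {
    writers.values.foreach(_.close()) // flush and close everything, even on error
  }
}
```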
Yin helped me with that, and I appreciate the on-list follow-up. A few
questions: Why is this the case? I guess, does building it with the
thriftserver add much more time/size to the final build? It seems that
unless this is documented well, people will miss it and this situation will
occur. Why would we
Thanks for putting this together, Andrew.
On Tue, Aug 12, 2014 at 2:11 AM, Andrew Ash and...@andrewash.com wrote:
Hi Chen,
Please see the bug I filed at
https://issues.apache.org/jira/browse/SPARK-2984 with the
FileNotFoundException on _temporary directory issue.
Andrew
On Mon, Aug
Hello.
Is there an integration of Spark (PySpark) with Anaconda?
I googled a lot and didn't find relevant information.
Could you please point me to a tutorial or simple example?
Thanks in advance
Oleg.
An on-list follow-up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks
promising, as it has Spark as one of the platforms used.
Bests,
-Monir
From: Mozumder, Monir
Sent: Monday, August 11, 2014 7:18 PM
To: user@spark.apache.org
Subject: Benchmark on physical Spark cluster
I am trying to
Sorry, I missed #2. My suggestion is the same as #2. You need to set a
bigger numPartitions to avoid hitting integer bound or 2G limitation,
at the cost of increased shuffle size per iteration. If you use a
CombineInputFormat and then cache, it will try to give you roughly the
same size per
Hive pulls in a ton of dependencies that we were afraid would break
existing spark applications. For this reason all hive submodules are
optional.
On Tue, Aug 12, 2014 at 7:43 AM, John Omernik j...@omernik.com wrote:
Yin helped me with that, and I appreciate the onlist followup. A few
I have a class defining an inner static class (a map function). The inner
class tries to refer to a variable instantiated in the outer class, which
results in a NullPointerException. Sample code follows:
class SampleOuterClass {
private static ArrayList<String> someVariable;
Are there any other workarounds that could be used to pass in the values
from someVariable to the transformation function?
On Tue, Aug 12, 2014 at 10:48 AM, Sean Owen so...@cloudera.com wrote:
I don't think static members are going to be serialized in the
closure? The instance of Parse
You could create a copy of the variable inside your Parse class;
that way it would be serialized with the instance you create when
calling map() below.
On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri sunny.k...@gmail.com wrote:
Are there any other workarounds that could be used to pass in the
We are probably still the minority, but our analytics platform based on
Spark + HDFS does not have map/reduce installed. I'm wondering if there is
a distcp equivalent that leverages Spark to do the work.
Our team is trying to find the best way to do cross-datacenter replication
of our HDFS data
That should work. Gonna give it a try. Thanks !
On Tue, Aug 12, 2014 at 11:01 AM, Marcelo Vanzin van...@cloudera.com
wrote:
You could create a copy of the variable inside your Parse class;
that way it would be serialized with the instance you create when
calling map() below.
On Tue, Aug
Hello spark aficionados,
We upgraded from spark 1.0.0 to 1.0.1 when the new release came out and
started noticing some weird errors. Even a simple operation like
reduceByKey or count on an RDD gets stuck in cluster mode. This issue
does not occur with spark 1.0.0 (in cluster or local mode) or
Hi ,
Today I created a table with 3 regions and 2 jobtrackers, but the
spark job is still taking a lot of time.
I also noticed one thing: the memory of the client was increasing
linearly. Is it that the spark job first brings the complete data into
memory?
On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu
Hi,
I'm trying to run the streaming.NetworkWordCount example on a Mesos cluster
(0.19.1). I am able to run the SparkPi examples on my mesos cluster and can run
the streaming example in local[n] mode, but no luck so far with the Spark
streaming examples running on my mesos cluster.
I don't particularly see
Hi Jenny,
Have you copied hive-site.xml to spark/conf directory? If not, can you put
it in conf/ and try again?
Thanks,
Yin
On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com wrote:
Thanks Yin!
here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't
Good question; I don't know of one but I believe people at Cloudera had some
thoughts of porting Sqoop to Spark in the future, and maybe they'd consider
DistCP as part of this effort. I agree it's missing right now.
Matei
On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)
I tried to access worker info from the spark context, but it seems the spark
context does not expose such an API. The reason for doing this: it seems the
spark context itself does not have logic to detect whether its workers are
dead, so I'd like to add such logic myself.
BTW, it seems the spark web UI
Understood, thank you.
Small files are a problem; I am considering processing the data before putting
it in HDFS.
On Tue, Aug 12, 2014 at 9:37 PM, Fengyun RAO raofeng...@gmail.com wrote:
1. Be careful: HDFS is better for large files, not bunches of small files.
2. If that's really what you want, roll
Hi Yin,
hive-site.xml was copied to spark/conf and the same as the one under
$HIVE_HOME/conf.
through the hive cli, I don't see any problem. But for spark in yarn-cluster
mode, I am not able to switch to a database other than the default one; in
yarn-client mode, it works fine.
Thanks!
Jenny
On
Hi, sorry for the delay. Would you have yarn available to test? Given
the discussion in SPARK-2878, this might be a different incarnation of
the same underlying issue.
The option in Yarn is spark.yarn.user.classpath.first
On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom dan...@wibidata.com wrote:
I'm
Uh, for some reason I don't seem to automatically reply to the list any
more.
Here is again my message to Tom.
-- Forwarded message --
Tom,
On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek minnesota...@gmail.com wrote:
This is a back-to-basics question. How do we know when Spark
Hi,
On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed xia.s...@gmail.com wrote:
I don't particularly see any errors in my logs, either on the console or on the
slaves. I see the slave downloads the spark-1.0.2-bin-hadoop1.tgz file and
unpacks it as well. Mesos Master shows quite a lot of Tasks created and
You should try watching this video:
https://www.youtube.com/watch?v=sPhyePwo7FA. For more details, please
search the archives; I had the same kind of question and other people
helped me solve the problem.
On Tue, Aug 12, 2014 at 12:36 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com
wrote:
Have
In MLlib, I found the method to train a matrix factorization model to predict
the taste of a user. In this function there are some parameters, such as
lambda and rank; I cannot find the best values for these parameters or
how to optimize them. Could you please give me some recommendations?
--
Sometimes workers are dead, but the spark context does not know it and still
sends jobs.
On Tuesday, August 12, 2014 7:14 PM, Stanley Shi s...@pivotal.io wrote:
Why do you need to detect the worker status in the application? Your application
generally doesn't need to know where it is executed.
Actually, if you search the spark mail archives you will find many similar
topics. At this time, I just want to manage it myself.
On Tuesday, August 12, 2014 8:46 PM, Stanley Shi s...@pivotal.io wrote:
This seems a bug, right? It's not the user's responsibility to manage the
workers.
Hi Cheng,
I also hit some issues when I tried to start the ThriftServer based on a build
from the master branch (I could successfully run it from the branch-1.0-jdbc
branch). Below is my build command:
./make-distribution.sh --skip-java-test -Phadoop-2.4 -Phive -Pyarn
-Dyarn.version=2.4.0
Just found that this is because the lines below in make-distribution.sh don't work:
if [ $SPARK_HIVE == true ]; then
cp $FWDIR/lib_managed/jars/datanucleus*.jar $DISTDIR/lib/
fi
There is no definition of $SPARK_HIVE in make-distribution.sh. I should set
it explicitly.
On Wed, Aug 13, 2014 at 1:10