Re: saveAsTextFiles file not found exception

2014-08-12 Thread Andrew Ash
Hi Chen, Please see the bug I filed at https://issues.apache.org/jira/browse/SPARK-2984 with the FileNotFoundException on _temporary directory issue. Andrew On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote: Not sure which stalled HDFS client issue you're referring to,

Re: ClassNotFound exception on class in uber.jar

2014-08-12 Thread Akhil Das
This is how I used to do it: *// Create a list of jars* List<String> jars = Lists.newArrayList("/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar", ADD-All-The-Jars-Here);
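For reference, a minimal Scala sketch of the same idea on the SparkConf API; the master URL and jar path below are placeholders to replace with your own uber jar:

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical master URL and jar path -- point these at your own cluster and assembly
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("UberJarExample")
  .setJars(Seq("/path/to/my-app-uber.jar"))   // shipped to the executors
val sc = new SparkContext(conf)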

Re: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Sean Owen
It sounds like your data does not all have the same dimension; that's a decent guess. Have a look at the assertions in this method. On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.) y...@ford.com wrote: I am trying to train a KMeans model with sparse vectors with Spark 1.0.1. When I run the
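A quick way to confirm that guess, sketched in Scala and assuming `data` is the RDD[Vector] passed to KMeans.train; all vectors should report a single, identical size:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// assuming data: RDD[Vector] is the KMeans training input
def distinctDimensions(data: RDD[Vector]): Array[Int] =
  data.map(_.size).distinct().collect()
// anything other than a one-element array means mixed dimensions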

Re: set SPARK_LOCAL_DIRS issue

2014-08-12 Thread Andrew Ash
// assuming Spark 1.0 Hi Baoqiang, In my experience, for a standalone cluster you need to set SPARK_WORKER_DIR, not SPARK_LOCAL_DIRS, to control where shuffle files are written. I think this is a documentation issue that could be improved, as

Re: Parallelizing a task makes it freeze

2014-08-12 Thread sparkuser2345
Actually the program hangs just by calling dataAllRDD.count(). I suspect creating the RDD is not successful when its elements are too big. When nY = 3000, dataAllRDD.count() works (each element of dataAll = 3000*400*64 bits = 9.6 MB), but when nY = 4000, it hangs (4000*400*64 bits = 12.8 MB).

Re: Transform RDD[List]

2014-08-12 Thread Sean Owen
-incubator, +user So, are you trying to transpose your data? val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10))).repartition(2) First you could pair each value with its position in its list: val withIndex = rdd.flatMap(_.zipWithIndex) then group by that position, and discard the

Killing spark app problem

2014-08-12 Thread Grzegorz Białek
Hi, when I run some Spark application on my local machine using spark-submit: $SPARK_HOME/bin/spark-submit --driver-memory 1g class jar When I want to interrupt the computation with ctrl-c, it interrupts the current stage but then waits and exits after around 5 min, and sometimes doesn't exit at all, and the

Re: java.lang.StackOverflowError when calling count()

2014-08-12 Thread Tathagata Das
The long lineage causes a long/deep Java object tree (DAG of RDD objects), which needs to be serialized as part of task creation. When serializing, the whole object DAG needs to be traversed, leading to the StackOverflowError. TD On Mon, Aug 11, 2014 at 7:14 PM, randylu randyl...@gmail.com
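A common way to cap the lineage is periodic checkpointing. A minimal sketch, assuming an iterative job that keeps re-deriving the same RDD and a hypothetical HDFS checkpoint directory:

// assuming sc is the SparkContext and the loop grows a long lineage
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical path

var rdd = sc.parallelize(1 to 1000000)
for (i <- 1 to 200) {
  rdd = rdd.map(_ + 1)
  if (i % 20 == 0) {
    rdd.checkpoint()   // truncates the lineage once materialized
    rdd.count()        // forces the checkpoint to be written
  }
}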

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of
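For reference, a minimal sketch of the two RDD-side options mentioned, assuming `lines` is the freshly loaded text RDD; repartition is simply coalesce with shuffle enabled:

// assuming lines = sc.textFile(path) with an unsuitable number of partitions
val shuffled  = lines.repartition(64)               // full shuffle, evenly sized partitions
val coalesced = lines.coalesce(64, shuffle = false) // no shuffle, but may hurt locality/balance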

Re: How to save mllib model to hdfs and reload it

2014-08-12 Thread Xiangrui Meng
For linear models, the constructors are now public. You can save the weights to HDFS, then load the weights back and use the constructor to create the model. -Xiangrui On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote: hello: I want to know,if I use history data to
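A rough sketch of what that could look like for logistic regression, assuming Spark 1.0.x with the public constructor; the HDFS paths are hypothetical and object files are used purely for illustration:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

// save: persist the weights and intercept (paths are placeholders)
sc.parallelize(Seq(model.weights), 1).saveAsObjectFile("hdfs:///models/lr/weights")
sc.parallelize(Seq(model.intercept), 1).saveAsObjectFile("hdfs:///models/lr/intercept")

// reload: read them back and rebuild the model with the public constructor
val weights   = sc.objectFile[Vector]("hdfs:///models/lr/weights").first()
val intercept = sc.objectFile[Double]("hdfs:///models/lr/intercept").first()
val restored  = new LogisticRegressionModel(weights, intercept)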

Re: Transform RDD[List]

2014-08-12 Thread Kevin Jung
Thanks for your answer. Yes, I want to transpose data. At this point, I have one more question. I tested it with RDD1 List(1, 2, 3, 4, 5) List(6, 7, 8, 9, 10) List(11, 12, 13, 14, 15) List(16, 17, 18, 19, 20) And the result is... ArrayBuffer(11, 1, 16, 6) ArrayBuffer(2, 12, 7, 17)

RE: [spark-streaming] kafka source and flow control

2014-08-12 Thread Gwenhael Pasquiers
I was hoping I could make the system behave like a blocking queue: if the output is too slow, the buffers (storage space for RDDs) fill up, then block instead of dropping existing RDDs, until the input itself blocks (slows down its consumption). On a side note I was wondering: is there the same

Re: Transform RDD[List]

2014-08-12 Thread Sean Owen
Sure, just add .toList.sorted in there. Putting it together in one big expression: val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10))) val result = rdd.flatMap(_.zipWithIndex).groupBy(_._2).values.map(_.map(_._1).toList.sorted) List(2, 7) List(1, 6) List(4, 9) List(3, 8) List(5, 10)
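Spelled out as a plain Scala snippet (same logic, under the assumption that the inner lists all have equal length and each grouped column fits in memory):

val rdd = sc.parallelize(List(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10)))

val transposed = rdd
  .flatMap(_.zipWithIndex)        // pair each value with its column position
  .groupBy(_._2)                  // group values by position
  .values
  .map(_.map(_._1).toList.sorted) // drop the index, sort for a stable order

transposed.collect().foreach(println)
// e.g. List(1, 6), List(2, 7), List(3, 8), List(4, 9), List(5, 10) (group order may vary)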

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
spark.speculation was not set; is there any speculative execution on the Tachyon side? In tachyon-env.sh only the following was changed: export TACHYON_MASTER_ADDRESS=test01.zala #export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020 export

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
More interesting is that if spark-shell is started on the master node (test01), then parquetFile.saveAsParquetFile(tachyon://test01.zala:19998/parquet_tablex) gives 14/08/12 11:42:06 INFO : initialize(tachyon://... ... ... 14/08/12 11:42:06 INFO : File does not exist:

RE: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Ge, Yao (Y.)
I figured it out. My indices parameters for the sparse vector were messed up. It was a good lesson for me: when using Vectors.sparse(int size, int[] indices, double[] values) to generate a vector, size is the size of the whole vector, not just the number of elements with values. The indices
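A small illustration of that point, sketched in Scala; both calls below describe the same 5-dimensional vector:

import org.apache.spark.mllib.linalg.Vectors

// size is the full dimensionality (5); indices are the positions of the non-zero entries
val sparse = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
val dense  = Vectors.dense(0.0, 2.0, 0.0, 4.0, 0.0)
// sparse and dense both represent the vector [0, 2, 0, 4, 0]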

Re: java.lang.StackOverflowError when calling count()

2014-08-12 Thread randylu
hi, TD. Thanks very much! I got it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tp5649p11980.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

LDA in MLBase

2014-08-12 Thread Aslan Bekirov
Hi All, I have a question regarding LDA topic modeling. I need to do topic modeling on ad data. Does MLBase support LDA topic modeling? Or is there any stable, tested LDA implementation on Spark? BR, Aslan

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread ZHENG, Xu-dong
Hi Xiangrui, Thanks for your reply! Yes, our data is very sparse, but RDD.repartition invokes RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has the same effect as #2, right? As for CombineInputFormat, although I haven't tried it, it sounds like it will combine multiple

Fwd: how to split RDD by key and save to different path

2014-08-12 Thread Fengyun RAO
1. Be careful: HDFS is better for large files, not bunches of small files. 2. If that's really what you want, roll your own. def writeLines(iterator: Iterator[(String, String)]) = { val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map try { while
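Filling that sketch out a bit, assuming an RDD[(String, String)] named `pairs` and a hypothetical HDFS base path; one writer per key is kept per partition:

import java.io.{BufferedWriter, OutputStreamWriter}
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable

val basePath = "hdfs:///tmp/split-by-key"   // hypothetical output root

pairs.foreachPartition { iterator =>        // pairs: RDD[(String, String)]
  val fs = FileSystem.get(new Configuration())
  val writers = mutable.HashMap.empty[String, BufferedWriter]
  try {
    for ((key, line) <- iterator) {
      val writer = writers.getOrElseUpdate(key, new BufferedWriter(
        new OutputStreamWriter(fs.create(new Path(s"$basePath/$key/part-${UUID.randomUUID}")))))
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())   // close every per-key writer
  }
}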

Re: Spark SQL JDBC

2014-08-12 Thread John Omernik
Yin helped me with that, and I appreciate the on-list follow-up. A few questions: Why is this the case? I guess, does building it with the thrift server add much more time/size to the final build? It seems that unless this is documented well, people will miss it and this situation will occur; why would we

Re: saveAsTextFiles file not found exception

2014-08-12 Thread Chen Song
Thanks for putting this together, Andrew. On Tue, Aug 12, 2014 at 2:11 AM, Andrew Ash and...@andrewash.com wrote: Hi Chen, Please see the bug I filed at https://issues.apache.org/jira/browse/SPARK-2984 with the FileNotFoundException on _temporary directory issue. Andrew On Mon, Aug

Re: anaconda and spark integration

2014-08-12 Thread Oleg Ruchovets
Hello. Is there an integration of Spark (PySpark) with Anaconda? I googled a lot and didn't find relevant information. Could you please point me to a tutorial or a simple example? Thanks in advance, Oleg.

RE: Benchmark on physical Spark cluster

2014-08-12 Thread Mozumder, Monir
An on-list follow up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks promising as it has spark as one of the platforms used. Bests, -Monir From: Mozumder, Monir Sent: Monday, August 11, 2014 7:18 PM To: user@spark.apache.org Subject: Benchmark on physical Spark cluster I am trying to

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger numPartitions to avoid hitting integer bound or 2G limitation, at the cost of increased shuffle size per iteration. If you use a CombineInputFormat and then cache, it will try to give you roughly the same size per

Re: Spark SQL JDBC

2014-08-12 Thread Michael Armbrust
Hive pulls in a ton of dependencies that we were afraid would break existing spark applications. For this reason all hive submodules are optional. On Tue, Aug 12, 2014 at 7:43 AM, John Omernik j...@omernik.com wrote: Yin helped me with that, and I appreciate the onlist followup. A few

Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
I have a class defining an inner static class (map function). The inner class tries to refer to the variable instantiated in the outer class, which results in a NullPointerException. Sample code follows: class SampleOuterClass { private static ArrayList<String> someVariable;

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
Are there any other workarounds that could be used to pass in the values from *someVariable* to the transformation function? On Tue, Aug 12, 2014 at 10:48 AM, Sean Owen so...@cloudera.com wrote: I don't think static members are going to be serialized in the closure? The instance of Parse

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Marcelo Vanzin
You could create a copy of the variable inside your Parse class; that way it would be serialized with the instance you create when calling map() below. On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri sunny.k...@gmail.com wrote: Are there any other workarounds that could be used to pass in the
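A minimal Scala analogue of that workaround (the original code is Java, so this is only illustrative); the value is copied into a small serializable class instead of being reached through the outer class:

// assuming someVariable: Seq[String] is initialized on the driver and lines: RDD[String]
class Parse(val tokens: Seq[String]) extends Serializable {
  def apply(line: String): Boolean = tokens.exists(line.contains)
}

val parse = new Parse(someVariable)    // copy taken here, on the driver
val matching = lines.filter(parse(_))  // only the small Parse instance is serialized with the task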

DistCP - Spark-based

2014-08-12 Thread Gary Malouf
We are probably still the minority, but our analytics platform based on Spark + HDFS does not have map/reduce installed. I'm wondering if there is a distcp equivalent that leverages Spark to do the work. Our team is trying to find the best way to do cross-datacenter replication of our HDFS data
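There is no ready-made tool that I know of, but a naive version can be sketched with the Hadoop FileSystem API; the cluster URIs and the flat source layout below are assumptions for illustration:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val src = "hdfs://dc1-nn:8020/data"   // hypothetical source directory
val dst = "hdfs://dc2-nn:8020/data"   // hypothetical destination directory

// list the source files on the driver, then copy them in parallel on the executors
val files = FileSystem.get(new URI(src), new Configuration())
  .listStatus(new Path(src)).filter(_.isFile).map(_.getPath.toString)

sc.parallelize(files, files.length).foreach { file =>
  val conf  = new Configuration()
  val srcFs = FileSystem.get(new URI(src), conf)
  val dstFs = FileSystem.get(new URI(dst), conf)
  FileUtil.copy(srcFs, new Path(file), dstFs, new Path(dst, new Path(file).getName), false, conf)
}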

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
That should work. Gonna give it a try. Thanks ! On Tue, Aug 12, 2014 at 11:01 AM, Marcelo Vanzin van...@cloudera.com wrote: You could create a copy of the variable inside your Parse class; that way it would be serialized with the instance you create when calling map() below. On Tue, Aug

Jobs get stuck at reduceByKey stage with spark 1.0.1

2014-08-12 Thread Shivani Rao
Hello spark aficionados, We upgraded from spark 1.0.0 to 1.0.1 when the new release came out and started noticing some weird errors. Even a simple operation like reduceByKey or count on an RDD gets stuck in cluster mode. This issue does not occur with spark 1.0.0 (in cluster or local mode) or

Re: Spark Hbase job taking long time

2014-08-12 Thread Amit Singh Hora
Hi, today I created a table with 3 regions and 2 job trackers, but the Spark job is still taking a lot of time. I also noticed that the client's memory was increasing linearly; is the Spark job first bringing the complete data into memory? On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu

Spark Streaming example on your mesos cluster

2014-08-12 Thread Zia Syed
Hi, I'm trying to run the streaming.NetworkWordCount example on a Mesos cluster (0.19.1). I am able to run the SparkPi examples on my Mesos cluster and can run the streaming example in local[n] mode, but no luck so far with the Spark streaming examples running on my Mesos cluster. I don't particularly see

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Yin Huai
Hi Jenny, Have you copied hive-site.xml to spark/conf directory? If not, can you put it in conf/ and try again? Thanks, Yin On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao linlin200...@gmail.com wrote: Thanks Yin! here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't

Re: DistCP - Spark-based

2014-08-12 Thread Matei Zaharia
Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)

how to access workers from spark context

2014-08-12 Thread S. Zhou
I tried to access worker info from the Spark context, but it seems the Spark context does not expose such an API. The reason for doing that is that the Spark context itself does not seem to have logic to detect whether its workers are dead, so I would like to add such logic myself. BTW, it seems the Spark web UI

Re: how to split RDD by key and save to different path

2014-08-12 Thread 诺铁
Understood, thank you. Small files are a problem; I am considering processing the data before putting it in HDFS. On Tue, Aug 12, 2014 at 9:37 PM, Fengyun RAO raofeng...@gmail.com wrote: 1. be careful, HDFS are better for large files, not bunches of small files. 2. if that's really what you want, roll

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Jenny Zhao
Hi Yin, hive-site.xml was copied to spark/conf and is the same as the one under $HIVE_HOME/conf. Through the Hive CLI, I don't see any problem, but for Spark in yarn-cluster mode I am not able to switch to a database other than the default one; in yarn-client mode it works fine. Thanks! Jenny On

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-12 Thread Marcelo Vanzin
Hi, sorry for the delay. Would you have yarn available to test? Given the discussion in SPARK-2878, this might be a different incarnation of the same underlying issue. The option in Yarn is spark.yarn.user.classpath.first On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom dan...@wibidata.com wrote: I'm

Fwd: Task closures and synchronization

2014-08-12 Thread Tobias Pfeiffer
Uh, for some reason I don't seem to automatically reply to the list any more. Here is again my message to Tom. -- Forwarded message -- Tom, On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek minnesota...@gmail.com wrote: This is a back-to-basics question. How do we know when Spark

Re: Spark Streaming example on your mesos cluster

2014-08-12 Thread Tobias Pfeiffer
Hi, On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed xia.s...@gmail.com wrote: I don't particularly see any errors in my logs, either on the console or on the slaves. I see the slave downloads the spark-1.0.2-bin-hadoop1.tgz file and unpacks it as well. Mesos Master shows quite a lot of tasks created and

Re: Mllib : Save SVM model to disk

2014-08-12 Thread Hoai-Thu Vuong
You should try watching this video: https://www.youtube.com/watch?v=sPhyePwo7FA. For more details, please search the archives; I had the same kind of question and others helped me solve the problem. On Tue, Aug 12, 2014 at 12:36 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com wrote: Have

training recsys model

2014-08-12 Thread Hoai-Thu Vuong
In MLlib, I found the method to train a matrix factorization model to predict users' tastes. This function has parameters such as lambda and rank; I cannot find the best values for these parameters or how to optimize them. Could you please give me some recommendations? --
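There is no single best value; the usual approach is to grid-search rank and lambda against a held-out set. A minimal sketch, assuming `training` and `validation` are RDD[Rating]s you have already split:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ranks   = Seq(10, 20, 50)
val lambdas = Seq(0.01, 0.1, 1.0)
val numIterations = 10

// evaluate RMSE on the validation set for each (rank, lambda) pair
val best = (for (rank <- ranks; lambda <- lambdas) yield {
  val model = ALS.train(training, rank, numIterations, lambda)
  val predictions = model.predict(validation.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val rmse = math.sqrt(
    validation.map(r => ((r.user, r.product), r.rating)).join(predictions)
      .map { case (_, (actual, predicted)) => (actual - predicted) * (actual - predicted) }
      .mean())
  ((rank, lambda), rmse)
}).minBy(_._2)
// best._1 holds the (rank, lambda) with the lowest validation RMSE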

Re: how to access workers from spark context

2014-08-12 Thread S. Zhou
Sometimes workers are dead but the Spark context does not know it and still sends jobs. On Tuesday, August 12, 2014 7:14 PM, Stanley Shi s...@pivotal.io wrote: Why do you need to detect the worker status in the application? Your application generally doesn't need to know where it is executed.

Re: how to access workers from spark context

2014-08-12 Thread S. Zhou
Actually, if you search the Spark mail archives you will find many similar topics. For now, I just want to manage it myself. On Tuesday, August 12, 2014 8:46 PM, Stanley Shi s...@pivotal.io wrote: This seems like a bug, right? It's not the user's responsibility to manage the workers.

Re: Spark SQL JDBC

2014-08-12 Thread ZHENG, Xu-dong
Hi Cheng, I also meet some issues when I try to start ThriftServer based a build from master branch (I could successfully run it from the branch-1.0-jdbc branch). Below is my build command: ./make-distribution.sh --skip-java-test -Phadoop-2.4 -Phive -Pyarn -Dyarn.version=2.4.0

Re: Spark SQL JDBC

2014-08-12 Thread ZHENG, Xu-dong
Just found that this is because the lines below in make-distribution.sh don't work: if [ $SPARK_HIVE == true ]; then cp $FWDIR/lib_managed/jars/datanucleus*.jar $DISTDIR/lib/ fi There is no definition of $SPARK_HIVE in make-distribution.sh; I should set it explicitly. On Wed, Aug 13, 2014 at 1:10