Re: submit query to spark cluster using spark-sql

2014-10-24 Thread tridib
Figured it out. spark-sql --master spark://sparkmaster:7077 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/submit-query-to-spark-cluster-using-spark-sql-tp17182p17183.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-24 Thread Akhil Das
You can open the application UI (that runs on port 4040) and see how much memory is being allocated to the executors from the Executors and Environment tabs. Thanks Best Regards On Wed, Oct 22, 2014 at 9:55 PM, Holden Karau hol...@pigscanfly.ca wrote: Hi Michael Campbell, Are you deploying against yarn

Re: hive timestamp column always returns null

2014-10-24 Thread Akhil Das
Try doing a *cat -v your_data | head -n3* and make sure you don't have any ^M at the end of the lines. Also, rows 2 and 3 of your data don't contain any spaces. Thanks Best Regards On Thu, Oct 23, 2014 at 9:23 AM, tridib tridib.sama...@live.com wrote: Hello Experts, I created a table

Re: how to run a dev spark project without fully rebuilding the fat jar ?

2014-10-24 Thread Akhil Das
You can use the --jars option to submit multiple jars with spark-submit, so you can simply rebuild just the jar that you have modified. Thanks Best Regards On Thu, Oct 23, 2014 at 11:16 AM, Mohit Jaggi mohitja...@gmail.com wrote: I think you can give a list of jars - not just one - to

How can I set the IP a worker use?

2014-10-24 Thread Theodore Si
Hi all, I have two network interface cards on one node: one is an Ethernet card, the other an InfiniBand HCA. The master has two IP addresses, let's say 1.2.3.4 (for the Ethernet card) and 2.3.4.5 (for the HCA). I can start the master with export SPARK_MASTER_IP='1.2.3.4';sbin/start-master.sh to let the master

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-24 Thread Akhil Das
Make sure the guava jar http://mvnrepository.com/artifact/com.google.guava/guava/12.0 is present in the classpath. Thanks Best Regards On Thu, Oct 23, 2014 at 2:13 PM, Stephen Boesch java...@gmail.com wrote: After having checked out from master/head the following error occurs when attempting

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Akhil Das
Try providing the level of parallelism parameter to your reduceByKey operation. Thanks Best Regards On Fri, Oct 24, 2014 at 3:44 AM, xuhongnever xuhongne...@gmail.com wrote: my code is here: from pyspark import SparkConf, SparkContext def Undirect(edge): vector =
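A minimal Scala sketch of the suggestion (the PySpark reduceByKey call takes the same optional partition count); the RDD type and the value 200 are illustrative:

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Pass an explicit partition count so the shuffle behind reduceByKey
    // runs with more parallelism than the input's default.
    def aggregate(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
      pairs.reduceByKey((a, b) => a + b, 200)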

Re: Memory requirement of using Spark

2014-10-24 Thread Akhil Das
You can use spark-sql for this use case, and you don't need 800G of memory (but of course if you are caching the whole data set in memory, then you would need it). You can persist the data with the MEMORY_AND_DISK_SER storage level if you don't want to bring the whole data set into memory, in this
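A one-line sketch of the persistence being described, pasteable into spark-shell; the standard constant in Spark is StorageLevel.MEMORY_AND_DISK_SER, and the input path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///path/to/input")   // placeholder path
    // Store partitions serialized in memory and spill to disk when they
    // don't fit, instead of recomputing them.
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)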

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu
On Thu, Oct 23, 2014 at 3:14 PM, xuhongnever xuhongne...@gmail.com wrote: my code is here: from pyspark import SparkConf, SparkContext def Undirect(edge): vector = edge.strip().split('\t') if(vector[0].isdigit()): return [(vector[0], vector[1])] return [] conf =

Re: How can I set the IP a worker use?

2014-10-24 Thread Akhil Das
Try using the --ip parameter http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually while starting the worker. like: spark-1.0.1/bin/spark-class org.apache.spark.deploy.worker.Worker --ip 1.2.3.4 spark://1.2.3.4:7077 Thanks Best Regards On Fri, Oct 24, 2014 at

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread lokeshkumar
Hi Joseph, Thanks for the help. I have tried this DecisionTree example with the latest spark code and it is working fine now. But how do we choose the maxBins for this model? Thanks Lokesh -- View this message in context:

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
I want to set up spark SQL to allow ad hoc querying over the last X days of processed data, where the data is processed through spark. This would also have to cache data (in memory only), so the approach I was thinking of was to build a layer that persists the appropriate RDDs and stores them in

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Jaonary Rabarisoa
The documentation says that we need to override the hashCode and equals methods, but it doesn't work without overriding them either. I get this error both in the REPL and in a standalone application On Fri, Oct 24, 2014 at 3:29 AM, Prashant Sharma scrapco...@gmail.com wrote: Are you doing this in REPL ? Then

Re: Problem packing spark-assembly jar

2014-10-24 Thread Sean Owen
I imagine this is a side effect of the change that was just reverted, related to publishing the effective pom? Sounds related, but I don't know. On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I'm trying to deploy the latest from master branch and having

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Gerard Maas
There's an issue in the way case classes are handled in the REPL, and you won't be able to use a case class as a key. See: https://issues.apache.org/jira/browse/SPARK-2620 BTW, case classes already implement equals and hashCode; there's no need to implement those again. Given that you already
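A minimal sketch of Gerard's point, assuming the key is defined in a compiled application rather than in the REPL (where SPARK-2620 bites); the class, field, and application names here are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    // A case class key: equals and hashCode come for free.
    case class SensorKey(device: String, channel: Int)

    object KeyedCountApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("KeyedCountApp").setMaster("local[*]"))  // local master for a quick test
        val readings = sc.parallelize(Seq(
          (SensorKey("a", 1), 10.0),
          (SensorKey("a", 1), 12.0),
          (SensorKey("b", 2), 7.0)))
        // reduceByKey relies on the key's hashCode/equals.
        readings.reduceByKey(_ + _).collect().foreach(println)
        sc.stop()
      }
    }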

Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-10-24 Thread htailor
Hi All, I am relatively new to Spark and am currently having trouble broadcasting large variables, ~500 MB in size. The broadcast fails with the error shown below and the memory usage on the hosts also blows up. Our hardware consists of 8 hosts (1 x 64 GB (driver) and 7 x 32 GB (workers)) and we

Measuring execution time

2014-10-24 Thread shahab
Hi, I just wonder if there is any built-in function to get the execution time for each of the jobs/tasks? In simple words, how can I find out how much time is spent on the loading/mapping/filtering/reducing parts of a job? I can see printouts in the logs, but since there is no clear presentation of the

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier
Hi, I'm running a job whose only task is to find files that cannot be read (sometimes our gz files are corrupted). With 1.0.x, this worked perfectly. Since 1.1.0, however, I'm getting an exception: scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)

Re: Memory requirement of using Spark

2014-10-24 Thread jian.t
Thanks Akhil. I searched for MEMORY_AND_DISK_SER trying to figure out how it works, and I cannot find any documentation on it. Do you have a link? If what MEMORY_AND_DISK_SER does is read from and write to disk with some memory caching, does that mean the output will be written to

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar
Hi all, I have written a job that reads data from HBase and writes to HDFS (fairly simple). While running the job, I noticed that a few of the tasks failed with the following error. Quick googling on the error suggests that it's an unexplained and perhaps intermittent error. What I am curious to

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
Just curious... Why would you not store the processed results in a regular relational database? Not sure what you meant by persisting the appropriate RDDs. Did you mean the output of your job will be RDDs? On 24 October 2014 13:35, ankits ankitso...@gmail.com wrote: I want to set up spark SQL to allow

Re: Problem packing spark-assembly jar

2014-10-24 Thread Yana Kadiyska
Thanks -- that was it. I could swear this had worked for me before, and indeed it's fixed this morning. On Fri, Oct 24, 2014 at 6:34 AM, Sean Owen so...@cloudera.com wrote: I imagine this is a side effect of the change that was just reverted, related to publishing the effective pom? sounds

spark-submit memory too large

2014-10-24 Thread marylucy
I use standalone Spark and set spark.driver.memory=5g, but the spark-submit process uses 57g of memory. Is this normal? How can I decrease it?

Re: How can I set the IP a worker use?

2014-10-24 Thread Theodore Si
I found this. So it seems that we should use -h or --host instead of -i and --ip. -i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h) -h HOST, --host HOST Hostname to listen on On 10/24/2014 3:35 PM, Akhil Das wrote: Try using the --ip parameter

Re: Measuring execution time

2014-10-24 Thread Reza Zadeh
The Spark UI has timing information. When running locally, it is at http://localhost:4040. Otherwise the URL of the UI is printed to the console when you start up the Spark shell or run a job. Reza On Fri, Oct 24, 2014 at 5:51 AM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if

Job cancelled because SparkContext was shut down - failures!

2014-10-24 Thread Sadhan Sood
Hi, Trying to run a query on spark-sql but it keeps failing with this error on the cli ( we are running spark-sql on a yarn cluster): org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at

Re: spark-submit memory too large

2014-10-24 Thread Sameer Farooqui
That does seem a bit odd. How many executors are running under this driver? Does the spark-submit process start out using ~60GB of memory right away, or does it start out smaller and slowly build up to that level? If so, how long does it take to get there? Also, which version of Spark are you

[Spark SQL] Setting variables

2014-10-24 Thread Yana Kadiyska
Hi all, I'm trying to set a pool for a JDBC session. I'm connecting to the thrift server via a JDBC client. My installation appears to be good (queries run fine), I can see the pools in the UI, but any attempt to set a variable (I tried spark.sql.shuffle.partitions and

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It does have support for caching using either CACHE TABLE tablename or CACHE TABLE tablename AS SELECT On Fri, Oct 24, 2014 at 1:05 AM, ankits ankitso...@gmail.com wrote: I want to set up spark SQL to allow ad hoc querying over the last X days of processed data, where the data is

Re: [Spark SQL] Setting variables

2014-10-24 Thread Michael Armbrust
You might be hitting: https://issues.apache.org/jira/browse/SPARK-4037 On Fri, Oct 24, 2014 at 11:32 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I'm trying to set a pool for a JDBC session. I'm connecting to the thrift server via JDBC client. My installation appears to be

Re: PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes
Maybe I'll add one more question. I think that the problem is user-related, so I would like to ask: under which user are Spark jobs run on the slaves? __ Hi, I am trying to implement a function for text preprocessing in PySpark. I have amazon

Under which user is the program run on slaves?

2014-10-24 Thread jan.zikes
Hi, I would like to ask under which user the Spark program is run on the slaves? My Spark is running on top of YARN. The reason I am asking is that I need to download data for the NLTK library; this data is downloaded for a specific Python user, and I am currently struggling with this.

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
Is there a way to cache certain (or the most recent) partitions of certain tables? On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust mich...@databricks.com wrote: It does have support for caching using either CACHE TABLE tablename or CACHE TABLE tablename AS SELECT On Fri, Oct 24, 2014 at

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It won't be transparent, but you can do something like: CACHE TABLE newData AS SELECT * FROM allData WHERE date ... and then query newData. On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Is there a way to cache certain (or most latest) partitions of certain
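A short sketch of how this could be issued from application code or spark-shell, assuming a HiveContext and a registered table named allData; the date literal is a placeholder:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext
    // Materialize only the recent slice in the in-memory columnar cache...
    hiveContext.sql("CACHE TABLE newData AS SELECT * FROM allData WHERE date >= '2014-10-01'")
    // ...and point the ad hoc queries at the cached table.
    hiveContext.sql("SELECT count(*) FROM newData").collect().foreach(println)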

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Sadhan Sood
That works perfectly. Thanks again, Michael. On Fri, Oct 24, 2014 at 3:10 PM, Michael Armbrust mich...@databricks.com wrote: It won't be transparent, but you can do something like: CACHE TABLE newData AS SELECT * FROM allData WHERE date ... and then query newData. On Fri, Oct 24, 2014 at

Function returning multiple Values - problem with using if-else

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, My map function will call the following function (inc), which should yield multiple values: def inc(x:Int, y:Int) = { if(condition) { for(i <- 0 to 7) yield (x, y+i) } else { for(k <- 0 to 24-y) yield (x, y+k) for(j <- 0 to y-16) yield (x+1, j) } } The if part works

Re: Function returning multiple Values - problem with using if-else

2014-10-24 Thread Sean Owen
This is just a Scala question really. Use ++ def inc(x:Int, y:Int) = { if (condition) { for(i <- 0 to 7) yield (x, y+i) } else { (for(k <- 0 to 24-y) yield (x, y+k)) ++ (for(j <- 0 to y-16) yield (x+1, j)) } } On Fri, Oct 24, 2014 at 8:52 PM, HARIPRIYA AYYALASOMAYAJULA
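A self-contained version of the fix, with a placeholder predicate standing in for the original condition, and a note on how it would typically be called from an RDD:

    // Concatenate the two yielded sequences with ++ so the else branch
    // returns a single collection, as Sean suggests.
    def inc(x: Int, y: Int): Seq[(Int, Int)] = {
      val condition = y < 16  // hypothetical condition, for illustration only
      if (condition) {
        for (i <- 0 to 7) yield (x, y + i)
      } else {
        (for (k <- 0 to 24 - y) yield (x, y + k)) ++
        (for (j <- 0 to y - 16) yield (x + 1, j))
      }
    }

    // Called from flatMap so all yielded pairs are flattened into the RDD:
    // val expanded = pairs.flatMap { case (x, y) => inc(x, y) }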

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread Joseph Bradley
Hi Lokesh, Glad the update fixed the bug. maxBins is a parameter you can tune based on your data. Essentially, a larger maxBins is potentially more accurate, but will run more slowly and use more memory. maxBins must be <= the training set size; I would say try some small values (4, 8, 16). If there
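A hedged sketch of trying a few maxBins values with the MLlib decision tree API (the Spark 1.1-style static trainClassifier); `training` is a placeholder RDD[LabeledPoint] for a binary problem with no categorical features, and the other parameters are illustrative:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    // numClasses = 2, no categorical features, "gini" impurity, maxDepth = 5;
    // only maxBins varies across the candidate values.
    def trainWithBins(training: RDD[LabeledPoint], maxBins: Int) =
      DecisionTree.trainClassifier(training, 2, Map[Int, Int](), "gini", 5, maxBins)

    // Compare a few small values on a held-out set, e.g.:
    // val models = Seq(4, 8, 16).map(b => b -> trainWithBins(training, b))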

Re: Function returning multiple Values - problem with using if-else

2014-10-24 Thread HARIPRIYA AYYALASOMAYAJULA
Thanks Sean! On Fri, Oct 24, 2014 at 3:04 PM, Sean Owen so...@cloudera.com wrote: This is just a Scala question really. Use ++ def inc(x:Int, y:Int) = { if (condition) { for(i <- 0 to 7) yield (x, y+i) } else { (for(k <- 0 to 24-y) yield (x, y+k)) ++ (for(j <- 0 to y-16)

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread xuhongnever
Thank you very much. Changing to groupByKey works; it runs much faster. By the way, could you give me some explanation of the following configurations? After reading the official explanation I'm still confused: what's the relationship between them? Is there any memory overlap between them?

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Ankur Dave
At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote: error: value partitionBy is not a member of org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID, org.apache.spark.graphx.Edge[ED])] Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit
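A sketch of the likely fix along the lines Ankur describes: the implicit that adds partitionBy (rddToPairRDDFunctions) lives in the SparkContext companion object in Spark 1.x, so it has to be imported, and generic code typically carries a ClassTag for the edge attribute as GraphX's own methods do. The method name below is illustrative:

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope
    import org.apache.spark.graphx.{Edge, PartitionID}
    import org.apache.spark.rdd.RDD

    // partitionBy comes from PairRDDFunctions via the imported implicit conversion.
    def repartitionEdges[ED: ClassTag](
        edges: RDD[(PartitionID, Edge[ED])],
        numPartitions: Int): RDD[(PartitionID, Edge[ED])] =
      edges.partitionBy(new HashPartitioner(numPartitions))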

Re: spark is running extremely slow with larger data set, like 2G

2014-10-24 Thread Davies Liu
On Fri, Oct 24, 2014 at 1:37 PM, xuhongnever xuhongne...@gmail.com wrote: Thank you very much. Changing to groupByKey works, it runs much more faster. By the way, could you give me some explanation of the following configurations, after reading the official explanation, i'm still confused,

Fwd: Saving very large data sets as Parquet on S3

2014-10-24 Thread Daniel Mahler
I am trying to convert some JSON logs to Parquet and save them on S3. In principle this is just import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") This

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Just wondering, any update on this? Is there a plan to integrate CJ's work with MLlib? I'm asking since the SVM implementation in MLlib did not give us good results and we had to resort to training our SVM classifier serially on the driver node with liblinear. Also, it looks like CJ Lin is coming

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you should try... Basically, following the LBFGS logistic regression, come up with an LBFGS-based linear SVM... About integrating TRON in MLlib, David already has a version of TRON in Breeze, but someone needs to validate it for linear SVM
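A heavily hedged sketch of what "an LBFGS-based linear SVM" could look like using MLlib's existing optimization primitives (HingeGradient + L2 updater + the LBFGS optimizer); this is not an official MLlib API, the parameter values are illustrative, and `training` is a placeholder RDD[LabeledPoint] with 0/1 labels:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{HingeGradient, LBFGS, SquaredL2Updater}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainLinearSvmLBFGS(training: RDD[LabeledPoint], regParam: Double) = {
      // LBFGS.optimize expects (label, features) pairs.
      val data = training.map(p => (p.label, p.features)).cache()
      val numFeatures = data.first()._2.size
      new LBFGS(new HingeGradient(), new SquaredL2Updater())
        .setRegParam(regParam)
        .setNumCorrections(10)
        .setConvergenceTol(1e-4)
        .optimize(data, Vectors.zeros(numFeatures))   // returns the weight vector
    }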

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On
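A sketch of the pattern Michael links to: starting the Thrift/JDBC server on top of an existing HiveContext from application code. It is experimental, the exact entry point may differ between Spark versions, and the table and query names are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val sc = new SparkContext(new SparkConf().setAppName("cached-jdbc-server"))
    val hiveContext = new HiveContext(sc)

    // Cache whatever the ad hoc queries should hit, then expose it over JDBC.
    hiveContext.sql("CACHE TABLE recentData AS SELECT * FROM allData WHERE date >= '2014-10-01'")
    HiveThriftServer2.startWithContext(hiveContext)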

Re: Spark using non-HDFS data on a distributed file system cluster

2014-10-24 Thread matan
Thanks Marcelo. Let me spin this towards a parallel trajectory then, as the title change implies. I think I will further read some of the articles at https://spark.apache.org/research.html, but basically, I understand Spark keeps the data in memory and only pulls from HDFS, or at most writes the

Re: Spark LIBLINEAR

2014-10-24 Thread k.tham
Oh, I've only seen SVMWithSGD, hadn't realized LBFGS was implemented. I'll try it out when I have time. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LIBLINEAR-tp5546p17240.html Sent from the Apache Spark User List mailing list archive at

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi, Are there Dockerfiles available that allow setting up a Docker Spark 1.1.0 cluster? Thanks, Josh

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Marek Wiewiorka
Hi, here you can find some info regarding 1.0: https://github.com/amplab/docker-scripts Marek 2014-10-24 23:38 GMT+02:00 Josh J joshjd...@gmail.com: Hi, Is there a dockerfiles available which allow to setup a docker spark 1.1.0 cluster? Thanks, Josh

Re: docker spark 1.1.0 cluster

2014-10-24 Thread Nicholas Chammas
Oh snap--first I've heard of this repo. Marek, We are having a discussion related to this on SPARK-3821 https://issues.apache.org/jira/browse/SPARK-3821 you may be interested in. Nick On Fri, Oct 24, 2014 at 5:50 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Hi, here you can find

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
We don't have SVMWithLBFGS, but you can check out how we implement LogisticRegressionWithLBFGS; we also deal with some condition-number-improving stuff in LogisticRegressionWithLBFGS, which improves performance dramatically. Sincerely, DB Tsai
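A minimal sketch of using the LBFGS-based logistic regression DB Tsai mentions (in MLlib as of Spark 1.1.0); `training` is a placeholder RDD[LabeledPoint]:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Trains a binary logistic regression model with the LBFGS optimizer;
    // feature standardization happens inside the implementation, as discussed
    // later in this thread.
    def train(training: RDD[LabeledPoint]) =
      new LogisticRegressionWithLBFGS().run(training)

    // val model = train(training); val p = model.predict(features)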

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for the condition number, what did you use? Diagonal preconditioning of the inverse of the B matrix? But then the B matrix keeps changing... did you condition it after every few iterations? Would it be possible to put that code in Breeze, since it would be very useful for conditioning other solvers as

Re: Saving very large data sets as Parquet on S3

2014-10-24 Thread Haoyuan Li
Daniel, Currently, having Tachyon will at least help on the input part in this case. Haoyuan On Fri, Oct 24, 2014 at 2:01 PM, Daniel Mahler dmah...@gmail.com wrote: I am trying to convert some json logs to Parquet and save them on S3. In principle this is just import org.apache.spark._

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
Oh, we just train the model in the standardized space, which helps the convergence of LBFGS. Then we convert the weights back to the original space, so the whole thing is transparent to users. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread Michael Armbrust
Usually when the SparkContext throws an NPE it means that it has been shut down due to some earlier failure. On Wed, Oct 22, 2014 at 5:29 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I got java.lang.NullPointerException. Please help! sqlContext.sql(select l_orderkey,

Re: Spark LIBLINEAR

2014-10-24 Thread DB Tsai
Yeah, column normalization. For some of the datasets, without doing this it will not converge. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Oct 24, 2014 at 3:46 PM, Debasish

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, My Steps: ### HIVE CREATE TABLE CUSTOMER ( C_CUSTKEY BIGINT, C_NAME VARCHAR(25), C_ADDRESS VARCHAR(40), C_NATIONKEY BIGINT, C_PHONE VARCHAR(15), C_ACCTBAL DECIMAL, C_MKTSEGMENT VARCHAR(10), C_COMMENT VARCHAR(117) ) row format serde 'com.bizo.hive.serde.csv.CSVSerde';

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, Added “l_linestatus” and it works, THANK YOU!! sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10").collect().foreach(println); 14/10/25 07:03:24 INFO DAGScheduler: Stage 12

Re: Workaround for SPARK-1931 not compiling

2014-10-24 Thread Arpit Kumar
Thanks a lot. Now it is working properly. On Sat, Oct 25, 2014 at 2:13 AM, Ankur Dave ankurd...@gmail.com wrote: At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote: error: value partitionBy is not a member of org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID,

How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-24 Thread Arpit Kumar
Hi all, I am using the GraphLoader class to load graphs from edge list files, but then I need to change the storage level of the graph to something other than MEMORY_ONLY. val graph = GraphLoader.edgeListFile(sc, fname, minEdgePartitions =
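One hedged option, assuming an upgrade to Spark 1.1.0 is possible: the loader there accepts storage-level parameters, whereas in 1.0.0 it caches edges at MEMORY_ONLY before user code can intervene. The partition count and storage levels below are placeholders, and `sc`/`fname` are the same values as in the poster's snippet:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    // Spark 1.1.0+: pick the storage level at load time instead of MEMORY_ONLY.
    val graph = GraphLoader.edgeListFile(sc, fname,
      minEdgePartitions = 16,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)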

Spark Worker node accessing Hive metastore

2014-10-24 Thread ken
Does a Spark worker node need access to Hive's metastore if part of a job contains Hive queries? Thanks, Ken -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-node-accessing-Hive-metastore-tp17255.html Sent from the Apache Spark User List