Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote: I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain

Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Cheng Lian
Currently there’s no convenient way to convert a |SchemaRDD|/|JavaSchemaRDD| back to an |RDD|/|JavaRDD| of some case class. But you can convert a |SchemaRDD|/|JavaSchemaRDD| into an |RDD[Row]|/|JavaRDD[Row]| using |schemaRdd.rdd| and |new JavaRDD[Row](schemaRdd.rdd)|. Cheng On 3/15/15 10:22 PM, Renato
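The conversion Cheng describes ends at |RDD[Row]|; getting back to a typed class is then a field-by-field map, e.g. `schemaRdd.rdd.map { row => Person(row.getString(0), row.getInt(1)) }`. The sketch below models that reconstruction on plain Scala collections (no Spark dependency), with `Person` as a hypothetical case class standing in for whatever class originally produced the SchemaRDD:

```scala
// Hypothetical case class originally used to build the SchemaRDD.
case class Person(name: String, age: Int)

// Stand-in for the RDD[Row] obtained via schemaRdd.rdd: each "row" is an
// untyped sequence of column values in schema order.
val rows: Seq[Seq[Any]] = Seq(Seq("alice", 30), Seq("bob", 25))

// Rebuild typed objects field by field, exactly the shape of the map one
// would run on schemaRdd.rdd with row.getString(0) / row.getInt(1).
val people: Seq[Person] =
  rows.map(r => Person(r(0).asInstanceOf[String], r(1).asInstanceOf[Int]))
```

The casts mirror the untyped accessors on `Row`; a schema mismatch surfaces as a runtime `ClassCastException`, which is exactly the type safety lost when going through a SchemaRDD.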

Re: Spark Streaming on Yarn Input from Flume

2015-03-15 Thread tarek_abouzeid
have you fixed this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-Yarn-Input-from-Flume-tp11755p22055.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: 1.3 release

2015-03-15 Thread Sean Owen
I think (I hope) it's because the generic builds just work. Even though these are of course distributed mostly verbatim in CDH5, with tweaks to be compatible with other stuff at the edges, the stock builds should be fine too. Same for HDP as I understand. The CDH4 build may work on some builds of

[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts, Is there a way to convert a JavaSchemaRDD (for instance, loaded from a Parquet file) back to a JavaRDD of a given case class? I read on StackOverflow[1] that I could do a select over the Parquet file and then get the fields out by reflection, but I guess that would be an

Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help. toDF() solved my first problem. And, the second issue was a non-issue, since the second example worked without any modification. David On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote: programmatically specifying Schema needs import

Saving Dstream into a single file

2015-03-15 Thread tarek_abouzeid
I am doing the word count example on a Flume stream and trying to save the output as text files in HDFS, but in the save directory I get multiple subdirectories, each containing small files. I wonder if there is a way to append to one large file instead of saving many small files, as I intend to
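One common workaround (an untested sketch, assuming a `DStream[(String, Long)]` named `wordCounts` and a hypothetical HDFS path) is to collapse each micro-batch to a single partition before writing, so every interval produces one part file instead of one per partition:

```scala
// Sketch: one output file per batch interval rather than one per partition.
// coalesce(1) funnels the (small) batch through a single writer task.
// Note: this does not append across intervals -- saveAsTextFile still
// creates one directory per batch; true HDFS append is not supported here.
wordCounts.foreachRDD { (rdd, time) =>
  rdd.coalesce(1)
     .saveAsTextFile(s"hdfs:///user/flume/wordcounts/batch-${time.milliseconds}")
}
```

To get a single large file, a periodic downstream compaction job (e.g. `sc.textFile("hdfs:///user/flume/wordcounts/*")` followed by a single-partition re-save) is usually simpler than fighting the per-batch layout.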

Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Sean Owen
I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely

Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
Ah, most interesting, thanks. So it seems sc.textFile(longFileList) has to read all metadata before starting the read, for partitioning purposes, so what you do is not use it? You create a task per file that reads one file (in parallel) per task without scanning for _all_ metadata. Can’t argue

Re: deploying Spark on standalone cluster

2015-03-15 Thread tarek_abouzeid
I was having a similar issue, though in Spark and Flume integration: I was getting a "failed to bind" error, but got it fixed by shutting down the firewall on both machines (make sure `service iptables status` reports the firewall stopped).

1.3 release

2015-03-15 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman

Null Pointer Exception due to mapVertices function in GraphX

2015-03-15 Thread James
I have got a NullPointerException in aggregateMessages on a graph which is the output of the mapVertices function of a graph. I found the problem is that the mapVertices function did not affect all the triplets of the graph. // Initial the graph, assign a counter to each vertex that contains the

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
In LBFGS version of logistic regression, the data is properly standardized, so this should not happen. Can you provide a copy of your dataset to us so we can test it? If the dataset can not be public, can you have just send me a copy so I can dig into this? I'm the author of LORWithLBFGS. Thanks.

Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
Yes I don't think this is entirely reliable in general. I would emit (label,features) pairs and then transform the values. In practice, this may happen to work fine in simple cases. On Sun, Mar 15, 2015 at 3:51 AM, kian.ho hui.kian.ho+sp...@gmail.com wrote: Hi, I was taking a look through the

Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Hi, Can anyone who has developed a recommendation engine suggest a possible software stack for such an application? I am new to recommendation engines; I just found Mahout and Spark MLlib, which are available. I am thinking of the software stack below. 1. The user is

Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread Cheng Lian
Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements are executed in different manners: 1. DDL statements (e.g. |CREATE TABLE|, |DROP TABLE|, etc.) and commands (e.g. |SET key = value|, |ADD FILE|, |ADD JAR|, etc.) In most cases, Spark SQL

Benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs Spark SQL

2015-03-15 Thread Slim Baltagi
Hi I would like to share with you my comments on Hortonworks' benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs 'Spark SQL'. Please check them in my related blog entry at http://goo.gl/K5mk0U Thanks Slim Baltagi Chicago, IL http://www.SparkBigData.com

Re: Problem connecting to HBase

2015-03-15 Thread HARIPRIYA AYYALASOMAYAJULA
Hello all, Thank you for your responses. I did try to include the zookeeper.znode.parent property in the hbase-site.xml. It still continues to give the same error. I am using Spark 1.2.0 and hbase 0.98.9. Could you please suggest what else could be done? On Fri, Mar 13, 2015 at 10:25 PM, Ted

Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change. Cheng On 2/28/15 8:13 AM, Deborah Siegel wrote: Hi Michael, Would you help me understand the apparent difference here.. The Spark 1.2.1 programming guide indicates: Note

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-15 Thread abhi
Thanks, It worked. -Abhi On Tue, Mar 3, 2015 at 5:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote: Do you have enough resource in your cluster? You can check your resource manager to see the usage. Yep, I can

Re: Writing wide parquet file in Spark SQL

2015-03-15 Thread Cheng Lian
This article by Ryan Blue should be helpful to understand the problem http://ingest.tips/2015/01/31/parquet-row-group-size/ The TL;DR is, you may decrease |parquet.block.size| to reduce memory consumption. Anyway, 100K columns is a really big burden for Parquet, but I guess your data should
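As a hedged sketch of Cheng's suggestion (the property name is from the linked article; the exact value to pick depends on your schema width and available heap), the Parquet row group size can be lowered through the Hadoop configuration before writing:

```scala
// Sketch, assuming a SparkContext `sc` is in scope: shrink Parquet row
// groups from the 128 MB default down to 16 MB to cut per-writer memory
// for very wide schemas. Smaller row groups trade some scan efficiency
// for lower memory pressure while writing.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
// Subsequent Parquet saves on this context pick up the smaller size.
```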

Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Nick, for your suggestions. On Sun, Mar 15, 2015 at 10:41 PM, Nick Pentreath nick.pentre...@gmail.com wrote: As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items its easy to get all the item vectors in memory so pre-computing is not too bad. Still,

Re: Streaming linear regression example question

2015-03-15 Thread Margus Roo
Hi again Tried the same examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala from 1.3.0 and getting in case testing file content is: (0.0,[3.0,4.0,3.0]) (0.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[5.0,6.0,6.0]) (6.0,[7.0,4.0,7.0])

Submitting spark application using Yarn Rest API

2015-03-15 Thread Srini Karri
Hi All, I am trying to submit a Spark application using the YARN REST API. I am able to submit the application, but the final status shows as 'UNDEFINED'. A couple of other observations: the user shows as Dr.who, and the application type is empty even though I specify it as Spark. Has anyone had this problem before? I

Re: Read Parquet file from scala directly

2015-03-15 Thread Cheng Lian
The parquet-tools code should be pretty helpful (although it's Java) https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command On 3/10/15 12:25 AM, Shuai Zheng wrote: Hi All, I have a lot of parquet files, and I try to open them directly

Re: From Spark web ui, how to prove the parquet column pruning working

2015-03-15 Thread Cheng Lian
Hey Yong, It seems that Hadoop `FileSystem` adds the size of a block to the metrics even if you only touch a fraction of it (reading Parquet metadata for example). This behavior can be verified by the following snippet: ```scala import org.apache.spark.sql.Row import

Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Nick Pentreath
As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items it's easy to get all the item vectors in memory, so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or

Re: Is there any problem in having a long opened connection to spark sql thrift server

2015-03-15 Thread Cheng Lian
It should be OK. If you encountered problems in having a long opened connection to the Thrift server, it should be a bug. Cheng On 3/9/15 6:41 PM, fanooos wrote: I have some applications developed using PHP and currently we have a problem in connecting these applications to spark sql thrift

Re: Problem connecting to HBase

2015-03-15 Thread Ted Yu
"org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided" -- there is no module in hbase 0.98.9 called hbase. But this would not be the root cause of the error. Most likely hbase-site.xml was not picked up, meaning this is a classpath issue. On Sun, Mar 15, 2015 at 10:04 AM, HARIPRIYA AYYALASOMAYAJULA

Re: Spark 1.2 – How to change Default (Random) port ….

2015-03-15 Thread Shailesh Birari
Hi SM, Apologies for the delayed response. No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0. Spark recently made the 1.3.0 release, so it might be fixed there. I am not planning to test it soon, maybe after some time. You can try it. Regards, Shailesh

Re: Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread bit1...@163.com
Thanks Cheng for the great explanation! bit1...@163.com From: Cheng Lian Date: 2015-03-16 00:53 To: bit1...@163.com; Wang, Daoyuan; user Subject: Re: Explanation on the Hive in the Spark assembly Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements

Re: RE: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks, Jerry. I got it that way. Just wanted to make sure whether there is some option for directly specifying the Tachyon version. fightf...@163.com From: Shao, Saisai Date: 2015-03-16 11:10 To: fightf...@163.com CC: user Subject: RE: Building spark over specified tachyon I think you could change the

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote: Hi, I am trying to run LogisticRegressionWithSGD on an RDD of LabeledPoints loaded using loadLibSVMFile: val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Sparkers, I am not able to run spark-sql on Spark. Please find the following error: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Regards, Sandeep.v

Re: Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'

2015-03-15 Thread Slim Baltagi
Hi The video recording of this talk titled Spark or Hadoop: is it an either-or proposition? at the Los Angeles Spark Users Group on March 12, 2015 is now available on youtube at this link: http://goo.gl/0iJZ4n Thanks Slim Baltagi http://www.SparkBigData.com

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread Ted Yu
Can you provide more information ? Such as: Version of Spark you're using Command line Thanks On Mar 15, 2015, at 9:51 PM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I couldn't able to run spark-sql on spark.Please find the following error Unable to instantiate

RE: Building spark over specified tachyon

2015-03-15 Thread Shao, Saisai
I think you could change the pom file under Spark project to update the Tachyon related dependency version and rebuild it again (in case API is compatible, and behavior is the same). I'm not sure is there any command you can use to compile against Tachyon version. Thanks Jerry From:

k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi, I am running k-means using Spark in local mode. My data set is about 30k records, and I set the k = 1000. The algorithm starts and finished 13 jobs according to the UI monitor, then it stopped working. The last log I saw was: [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner -

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rohit U
I checked the labels across the entire dataset and it looks like it has -1 and 1 (not the 0 and 1 I originally expected). I will try replacing the -1 with 0 and run it again. On Mon, Mar 16, 2015 at 12:51 AM, Rishi Yadav ri...@infoobjects.com wrote: ca you share some sample data On Sun, Mar
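The remapping Rohit describes is a one-liner; on the real RDD it would be `logistic.map(lp => LabeledPoint(if (lp.label < 0) 0.0 else lp.label, lp.features))`. A minimal sketch of just the label transform on plain doubles:

```scala
// LogisticRegressionWithSGD expects binary labels in {0.0, 1.0}; libSVM
// files often encode them as {-1, +1}. Map negatives to 0.0, keep the rest.
val rawLabels = Seq(-1.0, 1.0, 1.0, -1.0)
val fixedLabels = rawLabels.map(l => if (l < 0) 0.0 else l)
```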

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted, I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration files attached below. ERROR IN SPARK: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at

Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Hi, all. Noting that the current Spark releases are built with Tachyon 0.5.0: if we want to recompile Spark with Maven targeting a specific Tachyon version (say the most recent 0.6.0 release), how should that be done? What should the Maven compile command look like? Thanks, Sun.

Re: Trouble launching application that reads files

2015-03-15 Thread robert.tunney
I figured out how to use local files with file:// but not with either the persistent or ephemeral-hdfs

Re: Re: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks haoyuan. fightf...@163.com From: Haoyuan Li Date: 2015-03-16 12:59 To: fightf...@163.com CC: Shao, Saisai; user Subject: Re: RE: Building spark over specified tachyon Here is a patch: https://github.com/apache/spark/pull/4867 On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com

Re: Streaming linear regression example question

2015-03-15 Thread Jeremy Freeman
Hi Margus, thanks for reporting this. I’ve been able to reproduce it, and there does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, which I can hopefully include in 1.3.1. In the meantime, you can get the desired result using transform: model.trainOn(trainingData)

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted, Did you find any solution. Thanks Sandeep On Mon, Mar 16, 2015 at 10:44 AM, sandeep vura sandeepv...@gmail.com wrote: Hi Ted, I am using Spark -1.2.1 and hive -0.13.1 you can check my configuration files attached below. ERROR IN SPARK