Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote: Hi Noorul, Thank you for your suggestion. I tried that, but ran out of memory. I did some searching and found some suggestions that we should try to avoid rdd.union(
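
For illustration, a minimal sketch of the difference Mark points at, assuming an existing SparkContext `sc` and a collection `rdds: Seq[RDD[Int]]`:

```
// chaining RDD#union builds one UnionRDD per call and a very deep lineage
val chained = rdds.reduce(_ union _)

// SparkContext#union builds a single UnionRDD over all inputs at once
val flat = sc.union(rdds)
```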

Re: Fuzzy GroupBy

2015-03-26 Thread Sean Owen
The grouping is determined by the POJO's equals() method. You can also call groupBy() to group by some function of the POJOs. For example if you're grouping Doubles into nearly-equal bunches, you could group by their .intValue() On Thu, Mar 26, 2015 at 8:47 PM, Mihran Shahinian
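
A hypothetical example of the second suggestion, bucketing doubles by their integer part (assumes a SparkContext `sc`; names are placeholders):

```
val xs = sc.parallelize(Seq(1.1, 1.9, 2.3, 2.7, 3.0))
val buckets = xs.groupBy(d => d.toInt)   // RDD[(Int, Iterable[Double])]
```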

Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark, That's true, but neither way lets me combine the RDDs, so I have to avoid unions. Thanks, Yang On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com wrote: RDD#union is not the same thing as SparkContext#union On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen

Spark History Server : jobs link doesn't open

2015-03-26 Thread Roy
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2. The Jobs link on the Spark History Server doesn't open and shows the following message: HTTP ERROR: 500 Problem accessing /history/application_1425934191900_87572. Reason: Server Error

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Xi Shen
It is brought in by another dependency, so you do not need to specify it explicitly...I think this is what Ted meant. On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia mchett...@rocketfuelinc.com wrote: +spark-dev Yes, the dependencies are there. I guess my question is how come the build is

Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread Marcelo Vanzin
bcc: user@, cc: cdh-user@ I recommend using CDH's mailing list whenever you have a problem with CDH. That being said, you haven't provided enough info to debug the problem. Since you're using CM, you can easily go look at the History Server's logs and see what the underlying error is. On Thu,

RE: Date and decimal datatype not working

2015-03-26 Thread BASAK, ANANDA
Thanks all. I am installing Spark 1.3 now. Thought that I should better sync with the daily evolution of this new technology. So once I install that, I will try to use the Spark-CSV library. Regards Ananda From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, March 25, 2015 1:17 PM
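
A minimal sketch of loading a CSV file through the external spark-csv data source on Spark 1.3, assuming the package is on the classpath and an existing `sqlContext` (the path and options are placeholders):

```
val df = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "data.csv", "header" -> "true"))
```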

Re: Can't access file in spark, but can in hadoop

2015-03-26 Thread Ted Yu
Looks like the following assertion failed: Preconditions.checkState(storageIDsCount == locs.size()); locs is a List<DatanodeInfoProto>. Can you enhance the assertion to log more information? Cheers On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote: There seems to be a

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
How do I get the number of cores that I specified at the command line? I want to use spark.default.parallelism. I have 4 executors, each has 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior, the spark.default.parallelism value will be 4 * 8 = 32...I
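
A minimal sketch of setting the parallelism explicitly instead of relying on the value computed from executors and cores (the app name is a placeholder):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kmeans-job")
  .set("spark.default.parallelism", "32")   // 4 executors * 8 cores
val sc = new SparkContext(conf)
```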

Re: foreachRDD execution

2015-03-26 Thread Tathagata Das
Yes, that is the correct understanding. There are undocumented parameters that allow that, but I do not recommend using those :) TD On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: I have a simple and probably dumb question about foreachRDD. We are

Re: K Means cluster with spark

2015-03-26 Thread Xi Shen
Hi Sandeep, I followed the DenseKMeans example which comes with the Spark package. My total vectors are about 40k, and my k=500. All my code is written in Scala. Thanks, David On Fri, 27 Mar 2015 05:51 sandeep vura sandeepv...@gmail.com wrote: Hi Shen, I am also working on k means

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Hi Burak, My iterations are set to 500, but I think it should also stop once the centroids converge, right? My Spark is 1.2.0, running on 64-bit Windows. My data set is about 40k vectors, each with about 300 features, all normalised. All worker nodes have sufficient memory and disk space.
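
A minimal sketch of the training call with an explicit convergence threshold; whether setEpsilon is publicly settable in 1.2.0 is an assumption here:

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainModel(data: RDD[Vector]) =
  new KMeans()
    .setK(500)
    .setMaxIterations(500)
    .setEpsilon(1e-4)   // assumed available: centroid-movement threshold that ends training early
    .run(data)
```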

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
+spark-dev Yes, the dependencies are there. I guess my question is how come the build is succeeding in the mainline then, without adding these dependencies? On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at output from dependency:tree, servlet-api is brought in by

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
Oh, the job I talked about has run for more than 11 hrs without a result...it doesn't make sense. On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote: Hi Burak, My iterations are set to 500, but I think it should also stop once the centroids converge, right? My Spark is 1.2.0,

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Zhan Zhang
Thanks all for the quick response. Thanks. Zhan Zhang On Mar 26, 2015, at 3:14 PM, Patrick Wendell pwend...@gmail.com wrote: I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved:
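
A minimal sketch of the variant Patrick refers to, assuming a pair RDD `pairRdd: RDD[(K, V)]` and a value-only function `f`:

```
val mapped = pairRdd.mapPartitions(
  iter => iter.map { case (k, v) => (k, f(v)) },
  preservesPartitioning = true)   // keeps the existing partitioner on the result
```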

Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
Hi, We are trying to build spark 1.2 from source (tip of the branch-1.2 at the moment). I tried to build spark using the following command: mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package I encountered various missing class definition

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Ted Yu
Looking at output from dependency:tree, servlet-api is brought in by the following:
[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] |  +- org.antlr:antlr:jar:3.2:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] |  +-

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
will do! I've got to clear with my boss what I can post and in what manner, but I'll definitely do what I can to put some working code out into the world so the next person who runs into this brick wall can benefit from all this :-D

Recreating the Mesos/Spark paper's experiments

2015-03-26 Thread hbogert
Hi all, For my master's thesis I will be characterising the performance of two-level schedulers like Mesos, and after reading the paper https://www.cs.berkeley.edu/~alig/papers/mesos.pdf (where Spark is also introduced) I am wondering how some experiments and results came about. If this is not the

Re: Why k-means cluster hang for a long time?

2015-03-26 Thread Xi Shen
The code is very simple:

val data = sc.textFile("very/large/text/file") map { l =>
  // turn each line into a dense vector
  Vectors.dense(...)
}
// the resulting data set is about 40k vectors
KMeans.train(data, k = 5000, maxIterations = 500)

I just killed my application. In the log I found this:

Difference behaviour of DateType in SparkSQL between 1.2 and 1.3

2015-03-26 Thread Wush Wu
Dear all, I am trying to upgrade Spark from 1.2 to 1.3 and switch the existing API of creating SchemaRDD to DataFrame. After testing, I noticed that the following behavior has changed: ``` import java.sql.Date import com.bridgewell.SparkTestUtils import org.apache.spark.rdd.RDD import

Re: shuffle write size

2015-03-26 Thread Chen Song
Can anyone shed some light on this? On Tue, Mar 17, 2015 at 5:23 PM, Chen Song chen.song...@gmail.com wrote: I have a map reduce job that reads from three logs and joins them on some key column. The underlying data is protobuf messages in sequence files. Between mappers and reducers, the

Re: What is best way to run spark job in yarn-cluster mode from java program(servlet container) and NOT using spark-submit command.

2015-03-26 Thread Sandy Ryza
Creating a SparkContext and setting master as yarn-cluster unfortunately will not work. SPARK-4924 added APIs for doing this in Spark, but won't be included until 1.4. -Sandy On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Create SparkContext set master as

FetchFailedException during shuffle

2015-03-26 Thread Chen Song
Using Spark 1.3.0 on CDH 5.1.0, I ran into a fetch failed exception. I searched this email list but found nothing like this reported. What could be the reason for the error? org.apache.spark.shuffle.FetchFailedException: [EMPTY_INPUT] Cannot decompress empty stream at

Re: Missing an output location for shuffle. : (

2015-03-26 Thread 李铖
Here is the stack trace:

Re: WordCount example

2015-03-26 Thread Saisai Shao
Hi, Did you run the word count example in Spark local mode or another mode? In local mode you have to set local[n], where n >= 2. For other modes, make sure more than one core is available, because the receiver inside Spark Streaming runs as a long-running task, which will occupy at least one core.
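
A minimal streaming word count sketch with the master set accordingly (host and port are placeholders):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one core for the receiver, at least one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
```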

Re: HQL function Rollup and Cube

2015-03-26 Thread ๏̯͡๏
Did you manage to connect to the Hive metastore from Spark SQL? I copied the hive conf file into the Spark conf folder, but when I run show tables, or do select * from dw_bid (dw_bid is stored in Hive), it says table not found. On Thu, Mar 26, 2015 at 11:43 PM, Chang Lim chang...@gmail.com wrote: Solved.

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
I am now seeing this error. 15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw exception: FAILED: SemanticException Line 1:23 Invalid path ''examples/src/main/resources/kv1.txt'': No files matching path

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Try to give the complete path to the file kv1.txt. On 26 Mar 2015 11:48, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am now seeing this error. 15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw exception: FAILED: SemanticException Line 1:23 Invalid path

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Nick Pentreath
From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting these config options first and then call newAPIHadoopRDD? On
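
A rough sketch of that sequence against the Accumulo 1.6 MapReduce API, assuming an existing SparkContext `sc` (credentials, instance, ZooKeepers and table name are placeholders; verify the exact setters against your Accumulo version):

```
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("pass"))
AccumuloInputFormat.setZooKeeperInstance(job, "instanceName", "zk1:2181")
AccumuloInputFormat.setInputTableName(job, "myTable")
AccumuloInputFormat.setScanAuthorizations(job, new Authorizations())

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])
```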

Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Akhil Das
What's your Spark version? Not quite sure, but you could be hitting this issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516 On 26 Mar 2015 11:01, Xi Shen davidshe...@gmail.com wrote: Hi, My environment is Windows 64bit, Spark + YARN. I had a job that takes a long

Re: Serialization Problem in Spark Program

2015-03-26 Thread Akhil Das
Try registering your MyObject[] with Kryo. On 25 Mar 2015 13:17, donhoff_h 165612...@qq.com wrote: Hi, experts I wrote a very simple spark program to test the KryoSerialization function. The codes are as following: object TestKryoSerialization { def main(args: Array[String]) { val
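
A minimal sketch of the registration, where MyObject stands in for the poster's class:

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))
```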

Re: writing DStream RDDs to the same file

2015-03-26 Thread Akhil Das
Here's something similar which I used to do:

unionDStream.foreachRDD(rdd => {
  val events = rdd.count()
  println("Received Events : " + events)
  if (events > 0) {
    val fw = new FileWriter("events", true)
    fw.write(Calendar.getInstance().getTime + "," + events + "\n")
    fw.close()
  }
})

Sending from cellphone,

Re: Can LBFGS be used on streaming data?

2015-03-26 Thread EcoMotto Inc.
Hello DB, Thank you! Do you know how to run Linear Regression without SGD on streaming data in spark? I tried SGD but due to step size I was not getting the expected weights. Best Regards, Arunkumar On Wed, Mar 25, 2015 at 4:33 PM, DB Tsai dbt...@dbtsai.com wrote: Hi Arunkumar, I think

Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Xi Shen
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8 cores...the magic number in the bug. On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically there are only 5 aggregate functions in class org.apache.spark.sql.GroupedData: sum/max/min/mean/count. Can I plug in a function to calculate stddev? Thank you!

Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-26 Thread Takeshi Yamamuro
I think it is not `sqlContext` but hiveContext because `create temporary function` is not supported in SQLContext. On Wed, Mar 25, 2015 at 5:58 AM, Jon Chase jon.ch...@gmail.com wrote: Shahab - This should do the trick until Hao's changes are out: sqlContext.sql(create temporary function
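
A minimal sketch with a HiveContext, assuming an existing SparkContext `sc`; the UDAF class and table names are hypothetical placeholders:

```
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE TEMPORARY FUNCTION my_stddev AS 'com.example.MyStddevUDAF'")
hiveContext.sql("SELECT key, my_stddev(value) FROM my_table GROUP BY key").show()
```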

Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have a Hive table named dw_bid; when I run hive from the command prompt and run describe dw_bid, it works. I want to join an Avro file (table) in HDFS with this Hive dw_bid table, and I refer to it as dw_bid from my Spark SQL program; however, I see 15/03/26 00:31:01 INFO HiveMetaStore.audit:

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
Does not work 15/03/26 01:07:05 INFO HiveMetaStore.audit: ugi=dvasthimal ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark 15/03/26 01:07:06 ERROR ql.Driver: FAILED: SemanticException Line 1:23 Invalid path

Re: Unable to run Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-26 Thread ๏̯͡๏
Resolved. The bold text (in the original message) is the fix: ./bin/spark-submit -v --master yarn-cluster --jars

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Now it's clear that the workers do not have the file kv1.txt on their local filesystem. You can try putting it in HDFS and using the URI to that file, or try adding the file with sc.addFile. Thanks Best Regards On Thu, Mar 26, 2015 at 1:38 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Does not
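
A minimal sketch of the sc.addFile route (the path mirrors the one from the error message):

```
import org.apache.spark.SparkFiles

sc.addFile("examples/src/main/resources/kv1.txt")   // ships the file to every executor
val pathOnExecutor = SparkFiles.get("kv1.txt")      // resolve the local copy on the executor side
// alternatively, upload kv1.txt to HDFS and reference it with an hdfs:// URI
```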

Re: Hive Table not found from Spark SQL

2015-03-26 Thread Michael Armbrust
What does show tables return? You can also run SET optionName to make sure that entries from your hive-site.xml are being read correctly. On Thu, Mar 26, 2015 at 4:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a table dw_bid that is created in Hive and has nothing to do with Spark. I have
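
A minimal sketch of both checks from a HiveContext, assuming an existing SparkContext `sc` (the property name is just one example of a hive-site.xml entry to verify):

```
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("show tables").collect().foreach(println)
hiveContext.sql("SET hive.metastore.uris").collect().foreach(println)
```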

[Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-26 Thread NathanMarin
Hi, I’ve been trying to use Spark Streaming for my real-time analysis application using the Kafka Stream API on a cluster (using the yarn version) of 6 executors with 4 dedicated cores and 8192mb of dedicated RAM. The thing is, my application should run 24/7 but the disk usage is leaking. This

Parallel actions from driver

2015-03-26 Thread Aram Mkrtchyan
Hi. I'm trying to trigger DataFrame's save method in parallel from my driver. For that purpose I use an ExecutorService and Futures; here's my code: val futures = Seq(1, 2, 3).map(t => pool.submit(new Runnable { override def run(): Unit = { val commons = events.filter(_._1 == t).map(_._2.common)
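
A minimal sketch of the same idea using Scala Futures over a fixed thread pool; the `events` RDD and the DataFrame construction are assumed from the original post:

```
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))

val futures = Seq(1, 2, 3).map { t =>
  Future {
    val commons = events.filter(_._1 == t).map(_._2.common)
    // ... build a DataFrame from `commons` and call save() here
  }
}
Await.result(Future.sequence(futures), Duration.Inf)
```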

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
As I wrote previously, indexing is not your only choice: you can pre-aggregate data during load or, depending on your needs, think about other data structures such as graphs, HyperLogLog, Bloom filters etc. (a challenge to integrate in standard BI tools). On 26 Mar 2015 13:34, kundan

RDD Exception Handling

2015-03-26 Thread Kevin Conaway
How can we catch exceptions that are thrown from custom RDDs or custom map functions? We have a custom RDD that is throwing an exception that we would like to catch but the exception that is thrown back to the caller is a *org.apache.spark.SparkException* that does not contain any useful

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Sean Owen
An RDD is a very different creature than a NoSQL store, so I would not think of them as in the same ball-park for NoSQL-like workloads. It's not built for point queries or range scans, since any request would launch a distributed job to scan all partitions. It's not something built for, say,

Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-26 Thread Ravi Mody
After upgrading to 1.3.0, ALS.trainImplicit() has been returning vastly smaller factors (and hence scores). For example, the first few product's factor values in 1.2.0 are (0.04821, -0.00674, -0.0325). In 1.3.0, the first few factor values are (2.535456E-8, 1.690301E-8, 6.99245E-8). This

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
Hello Michael, Thanks for your time. 1. show tables from the Spark program returns nothing. 2. What entities are you talking about? (I am actually new to Hive as well) On Thu, Mar 26, 2015 at 8:35 PM, Michael Armbrust mich...@databricks.com wrote: What does show tables return? You can also run

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
hi Nick Unfortunately the Accumulo docs are woefully inadequate, and in some places, flat wrong. I'm not sure if this is a case where the docs are 'flat wrong', or if there's some wrinkle with spark-notebook in the mix that's messing everything up. I've been working with some people on stack

Re: Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Ted Yu
It is logged from RecurringTimer#loop():

private def loop() {
  try {
    while (!stopped) {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      prevTime = nextTime
      nextTime += period
      logDebug("Callback for " + name + " called at time " + prevTime)
    }

Re: Missing an output location for shuffle. : (

2015-03-26 Thread Michael Armbrust
I would suggest looking for errors in the logs of your executors. On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote: Again, when I run a Spark SQL query over a larger file, the error occurred. Has anyone got a fix for it? Please help me. Here is the stack trace.

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
Stack Trace: 15/03/26 08:25:42 INFO ql.Driver: OK 15/03/26 08:25:42 INFO log.PerfLogger: PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver 15/03/26 08:25:42 INFO log.PerfLogger: /PERFLOG method=releaseLocks start=1427383542966 end=1427383542966 duration=0
