Re: Best to execute SQL in Streaming data

2014-12-23 Thread Akhil Das
Here are the Java API docs: https://spark.apache.org/docs/latest/api/java/index.html You can start with this example and convert it into Java (it's pretty straightforward): http://stackoverflow.com/questions/25484879/sql-over-spark-streaming E.g. in Scala: val sparkConf = new SparkConf().setMaster
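
For reference, a minimal Scala sketch of that pattern (not the exact code from the link; the socket source, table name and batch interval are placeholder assumptions, using the Spark 1.2-era SchemaRDD API):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Word(text: String)

    object StreamingSqlSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingSQL")
        val ssc = new StreamingContext(sparkConf, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          // Obtain a SQLContext from the RDD's SparkContext inside the closure.
          val sqlContext = new SQLContext(rdd.sparkContext)
          import sqlContext.createSchemaRDD
          val words = rdd.flatMap(_.split(" ")).map(w => Word(w))
          words.registerTempTable("words")
          sqlContext.sql("SELECT text, COUNT(*) FROM words GROUP BY text")
            .collect()
            .foreach(println)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }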

Best to execute SQL in Streaming data

2014-12-23 Thread bigdata4u
Hi, I have an existing batch processing system which uses SQL queries to extract information from data. I want to replace this with a real-time system. I am coding in Java, and to use SQL on streaming data I found a few examples, but none of them is complete. http://apache-spark-user-list.1001560.n3.nabb

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, Latest hive 0.14/parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust wrote: > You might also try out the recently added support for views. > > On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang > wrote: > >> Ah... I see. Thanks for pointing it

Re: "Fetch Failure"

2014-12-23 Thread Stefano Ghezzi
I've eliminated the fetch failures with these parameters (don't know which one was the right one for the problem) passed to spark-submit, running with 1.2.0: --conf spark.shuffle.compress=false \ --conf spark.file.transferTo=false \ --conf spark.shuffle.manager=hash \ --conf spar

SchemaRDD to RDD[String]

2014-12-23 Thread Hafiz Mujadid
Hi dears! I want to convert a SchemaRDD into an RDD of String. How can we do that? Currently I am doing it like this, which is not converting correctly: no exception, but the resultant strings are empty. Here is my code: def SchemaRDDToRDD( schemaRDD : SchemaRDD ) : RDD[ String ] = { var types
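
One simple approach (a sketch, assuming Spark 1.x where a Row behaves like a Seq of its column values) is to map each Row to a delimited string:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SchemaRDD

    // Concatenate the column values of every Row with a separator.
    def schemaRDDToRDD(schemaRDD: SchemaRDD, sep: String = ","): RDD[String] =
      schemaRDD.map(row => row.mkString(sep))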

Not Serializable exception when integrating SQL and Spark Streaming

2014-12-23 Thread bigdata4u
I am trying to use SQL over Spark Streaming using Java, but I am getting a serialization exception. public static void main(String args[]) { SparkConf sparkConf = new SparkConf().setAppName("NumberCount"); JavaSparkContext jc = new JavaSparkContext(sparkConf); JavaStreamingContext jssc =

Fwd: Mesos resource allocation

2014-12-23 Thread Tim Chen
Hi Josh, If you want to cap the amount of memory per executor in coarse-grained mode, then yes, you only get 240GB of memory as you mentioned. What's the reason you don't want to raise the amount of memory you use per executor? In coarse-grained mode the Spark executor is long-lived and it internal

Re: How to build Spark against the latest

2014-12-23 Thread Ted Yu
See http://search-hadoop.com/m/JW1q5Cew0j On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 wrote: > Hi, > The official pom.xml file only have profile for hadoop version 2.4 as the > latest version, but I installed hadoop version 2.6.0 with ambari, how can I > build spark against it, just using mvn

How to build Spark against the latest

2014-12-23 Thread guxiaobo1982
Hi, The official pom.xml file only has a profile for Hadoop version 2.4 as the latest version, but I installed Hadoop version 2.6.0 with Ambari. How can I build Spark against it? Just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo

Re: How to run an action and get output?

2014-12-23 Thread Tobias Pfeiffer
Hi, On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab wrote: > > val doSomething(entry:SomeEntry, session:Session) : SomeOutput = { > val result = session.SomeOp(entry) > SomeOutput(entry.Key, result.SomeProp) > } > > I could use a transformation for rdd.map, but in case of failures, the map

Re:Re: Serialization issue when using HBase with Spark

2014-12-23 Thread yangliuyu
Thanks all. Finally, the working code is below: object PlayRecord { def getUserActions(accounts: RDD[String], idType: Int, timeStart: Long, timeStop: Long, cacheSize: Int, filterSongDays: Int, filterPlaylistDays: Int): RDD[(String, (Int, Set[Long], Set[Long]))] = { acc
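
For context, a common pattern behind this kind of fix (a hedged sketch with hypothetical table and column access, not the poster's actual code) is to create the non-serializable HBase client objects inside mapPartitions so they never cross the closure boundary:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def lookupAccounts(accounts: RDD[String]): RDD[(String, Array[Byte])] =
      accounts.mapPartitions { keys =>
        // Built per partition, on the executor, so nothing here is shipped from the driver.
        val conf = HBaseConfiguration.create()
        val table = new HTable(conf, "user_actions") // hypothetical table name
        keys.map { key =>
          val result = table.get(new Get(Bytes.toBytes(key)))
          (key, result.value())
        }
      }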

Re: Debugging a Spark application using Eclipse throws SecurityException

2014-12-23 Thread ey-chih chow
It's working now. Probably I didn't specify the excluded list correctly. I kept revising it and now it's working. Thanks. Ey-Chih Chow -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Debugging-a-Spark-application-using-Eclipse-throws-SecurityException-t

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira issue for this: https://issues.apache.org/jira/browse/SPARK-4944 Hope to come up with a solution soon. Cheng Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, December 24, 2014 4:26 AM To: user@spark.apac

Debugging a Spark application using Eclipse throws SecurityException

2014-12-23 Thread ey-chih chow
I am using Eclipse to develop a Spark application (using Spark 1.1.0). I use the ScalaTest framework to test the application. But I was blocked by the following exception: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer informatio

Re: Spark SQL job block when use hive udf from_unixtime

2014-12-23 Thread Cheng Lian
Could you please provide a complete stacktrace? Also it would be good if you can share your hive-site.xml as well. On 12/23/14 4:42 PM, Dai, Kevin wrote: Hi, there When I use hive udf from_unixtime with the HiveContext, the job block and the log is as follow: sun.misc.Unsafe.park(Native Me

Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the webUI or via spark-class. More info can be found in the thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html. HTH! On Tue, Dec 23, 2014 at 4:47 PM, durga wrote: > Hi All , > > It se

Re: Single worker locked at 100% CPU

2014-12-23 Thread Andrew Ash
Hi Phil, This sounds a lot like a deadlock in Hadoop's Configuration object that I ran into a while back. If you jstack the JVM and see a thread that looks like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546 "Executor task launch worker-6" daemon prio=10 tid=0x7f91f0

Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a more cleaned up version, can be used in ./sbt/sbt hive/console to easily reproduce this issue: sql("SELECT * FROM src WHERE key % 2 = 0"). sample(withReplacement = false, fraction = 0.05). registerTempTable("sampled") println(table("sampled").queryExecution) val query = sql("

Re: S3 files , Spark job hungsup

2014-12-23 Thread durga
Hi all, It seems the problem is a little more complicated. The job is hung up on reading the S3 file; even if I kill the unix process that started the job, it is not killing the spark job. It is still hung up there. Now the questions are: How do I find the spark job based on the name? How do I kill the spark-

Exception after changing RDDs

2014-12-23 Thread kai.lu
Hi All, We are getting an exception after we added one RDD to another RDD. We first declared an empty RDD "A", then received a new DStream "B" from Kafka; for each RDD in the DStream "B", we kept adding them to the existing RDD "A". The error happened when we were trying to use the updated RDD "A". Could

weights not changed with different reg param

2014-12-23 Thread Thomas Kwan
Hi there, We are on mllib 1.1.1 and trying different regularization parameters. We noticed that the regParam doesn't affect the weights at all. Is setting the reg param via the optimizer the right thing to do? Do we need to set our own updater? Anyone else seeing the same behaviour? thanks again tho

Escape commas in file names

2014-12-23 Thread Daniel Siegmann
I am trying to load a Parquet file which has a comma in its name. Yes, this is a valid file name in HDFS. However, sqlContext.parquetFile interprets this as a comma-separated list of parquet files. Is there any way to escape the comma so it is treated as part of a single file name? -- Daniel Sie

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-23 Thread Jon Chase
Turns out I was using the s3:// prefix (in a standalone Spark cluster). It was writing a LOT of block_* files to my S3 bucket, which was the cause for the slowness. I was coming from Amazon EMR, where Amazon's underlying FS implementation has re-mapped s3:// to s3n://, which doesn't use the block

Updating RDD used in DStream transform

2014-12-23 Thread Sean McKibben
I am wondering if there is a way to update an RDD that is used in a transform operation of a DStream. To use the example from the spark streaming programming guide, let’s say I want to update my spamInfoRDD once an hour without having to restart the streaming app. If an RDD used in Transform op
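
One approach that is sometimes suggested (a sketch, not an official API; the loader, paths, and field layout are assumptions) is to hold the reference data in a mutable field and read it inside transform, which is re-evaluated on every batch, then swap the reference from a separate scheduled task:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    object SpamFilter {
      // Hypothetical loader that rebuilds the reference data from storage.
      def loadSpamInfo(sc: SparkContext): RDD[(String, Double)] =
        sc.textFile("hdfs:///data/spam_info")
          .map(_.split("\t"))
          .map(f => (f(0), f(1).toDouble))

      @volatile var spamInfoRDD: RDD[(String, Double)] = _

      // transform's function is evaluated on every batch, so it always joins
      // against whatever spamInfoRDD currently points to.
      def joinSpamInfo(words: DStream[(String, Long)]): DStream[(String, (Long, Option[Double]))] =
        words.transform(rdd => rdd.leftOuterJoin(spamInfoRDD))

      // Elsewhere, e.g. from a scheduled task once an hour:
      //   spamInfoRDD = loadSpamInfo(sc)
    }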

Re: S3 files , Spark job hungsup

2014-12-23 Thread Jon Chase
http://www.jets3t.org/toolkit/configuration.html Put the following properties in a file named jets3t.properties and make sure it is available while your Spark job is running (just place it in ~/ and pass a reference to it when calling spark-submit with --files ~/jets3t.properties) httpclien

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create the external table. For example: rdd_table.saveAsParquetFile("/user/spark/my_

Re: JavaRDD (Data Aggregation) based on key

2014-12-23 Thread Jon Chase
Have a look at RDD.groupBy(...) and reduceByKey(...) On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh wrote: > Hi, > I have a csv file having fields as a,b,c . > I want to do aggregation(sum,average..) based on any field(a,b or c) as per > user input, > using Apache Spark Java API,Please Help Urgen
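
For example, a rough Scala sketch (hypothetical column positions; the Java API is analogous) that sums and averages field c keyed by field a:

    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

    val lines = sc.textFile("data.csv")                    // sc: an existing SparkContext
    val pairs = lines.map(_.split(","))
      .map(f => (f(0), f(2).toDouble))                     // key = a, value = c

    val sums = pairs.reduceByKey(_ + _)                    // sum of c per value of a

    val averages = pairs
      .mapValues(v => (v, 1L))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
      .mapValues { case (sum, count) => sum / count }      // average of c per value of a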

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems to work much better. Can't find the link ATM, but seems I recall that s3:// (Hadoop's original block format for s3) is no longer recommended for use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the scene

Re: "Fetch Failure"

2014-12-23 Thread Chen Song
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) but I am still seeing FetchFailedException. On Mon, Dec 22, 2014 at 8:27 AM, steghe wrote: > Which version of spark are you running? > > It could be related to this > https://issues.apache.org/jira/browse/SPARK-3633 > > fixed in

serialize protobuf messages

2014-12-23 Thread Chen Song
Silly question: what is the best way to shuffle protobuf messages in a Spark (Streaming) job? Can I use Kryo on top of the protobuf Message type? -- Chen Song
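
One option (a sketch, assuming the chill-protobuf artifact is on the classpath and MyProto.Event stands in for a generated Message class) is to register a protobuf-aware serializer through a KryoRegistrator:

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.protobuf.ProtobufSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // MyProto.Event is a placeholder for your generated protobuf Message class.
        kryo.register(classOf[MyProto.Event], new ProtobufSerializer())
      }
    }

    // and in the SparkConf:
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.set("spark.kryo.registrator", "com.example.ProtoRegistrator")  // fully-qualified name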

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Sean Owen
Yes, my change is slightly downstream of this point in the processing though. The code is still creating a counter for each distinct score value, and then binning. I don't think that would cause a failure - just might be slow. At the extremes, you might see 'fetch failure' as a symptom of things ru

Re: MLLib beginner question

2014-12-23 Thread boci
Xiangrui: Hi, I want to use this with streaming and with a job too. I am using Kafka (streaming) and Elasticsearch (job) as sources and want to calculate a sentiment value from the input text. Simon: great, do you have any doc on how I can embed it into my application without using the http interface? (how can I

Re: MLlib + Streaming

2014-12-23 Thread Fernando O.
Hey Xiangrui, Is there any plan to have a streaming compatible ALS version? Or if it's currently doable, is there any example? On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng wrote: > We have streaming linear regression (since v1.1) and k-means (v1.2) in > MLlib. You can check the user gu

Spark Port Configuration

2014-12-23 Thread Dan H.
Hi all, I'm trying to lock down ALL Spark ports and have tried using spark-defaults.conf and via the sparkContext. (The example below was run in local[*] mode, but all attempts to run in local or spark-submit.sh on cluster via jar all result in the same results). My goal is to define all commu
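
For reference, a sketch of pinning the ports that were configurable around Spark 1.1/1.2 (property names may differ slightly by version, and the port values here are arbitrary examples):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("LockedDownPorts")
      .set("spark.driver.port",       "7001")   // driver <-> executor communication
      .set("spark.fileserver.port",   "7002")   // driver HTTP file server
      .set("spark.broadcast.port",    "7003")   // HTTP broadcast server
      .set("spark.blockManager.port", "7005")   // block manager on driver and executors
      .set("spark.ui.port",           "4040")   // web UI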

Re: removing first record from RDD[String]

2014-12-23 Thread Hafiz Mujadid
Yep Michael Quinlan, it's working as suggested by Hao Ren. Thanks to you and Hao Ren. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/removing-first-record-from-RDD-String-tp20834p20840.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Xiangrui Meng
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643) before sending them to BinaryClassificationMetrics. This may not work well if the score distribution is very skewed. See
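
A sketch of that workaround, rounding the raw scores before handing them to BinaryClassificationMetrics:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    // Round scores to 4 decimal places so far fewer distinct thresholds are generated.
    def roundedMetrics(scoreAndLabels: RDD[(Double, Double)]): BinaryClassificationMetrics = {
      val rounded = scoreAndLabels.map { case (score, label) =>
        (math.rint(score * 10000) / 10000.0, label)
      }
      new BinaryClassificationMetrics(rounded)
    }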

Re: MLlib + Streaming

2014-12-23 Thread Xiangrui Meng
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue, D

Re: removing first record from RDD[String]

2014-12-23 Thread Michael Quinlan
Hafiz, You can probably use the RDD.mapPartitionsWithIndex method. Mike On Tue, Dec 23, 2014 at 8:35 AM, Hafiz Mujadid [via Apache Spark User List] wrote: > > hi dears! > > Is there some efficient way to drop first line of an RDD[String]? > > any suggestion? > > Thanks > > -

Re: Tuning Spark Streaming jobs

2014-12-23 Thread Timothy Chen
Hi Gerard, SPARK-4286 is the ticket I am working on, which besides supporting shuffle service it also supports the executor scaling callbacks (kill/request total) for coarse grain mode. I created SPARK-4940 to discuss more about the distribution problem, and let's bring our discussions there. Ti

MLlib + Streaming

2014-12-23 Thread Gianmarco De Francisci Morales
Hi, I have recently seen a demo of Spark where different pieces were put together (training via MLlib + deploying on Spark Streaming). I was wondering if MLlib currently works to directly train on Streaming. And, if so, what are the semantics of the algorithms? If not, would it be interesting to h

Single worker locked at 100% CPU

2014-12-23 Thread Phil Wills
I've been attempting to run a job based on MLlib's ALS implementation for a while now and have hit an issue I'm having a lot of difficulty getting to the bottom of. On a moderate size set of input data it works fine, but against larger (still well short of what I'd think of as big) sets of data, I

Re: removing first record from RDD[String]

2014-12-23 Thread Erik Erlandson
There is also a lazy implementation: http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/ I generated a PR for it -- there was also an alternate proposal for having it be a library in the new Spark Packages site: http://databricks.com/bl

Re: removing first record from RDD[String]

2014-12-23 Thread Hafiz Mujadid
that's nice if it works -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/removing-first-record-from-RDD-String-tp20834p20837.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Spark UI port issue when deploying Spark driver on YARN in yarn-cluster mode on EMR

2014-12-23 Thread Tomer Benyamini
On YARN, spark does not manage the cluster, but YARN does. Usually the cluster manager UI is under http://:9026/cluster. I believe that it chooses the port for the spark driver UI randomly, but an easy way of accessing it is by clicking on the "Application Master" link under the "Tracking UI" colum

retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Thomas Kwan
Hi there, We are using mllib 1.1.1, and doing Logistics Regression with a dataset of about 150M rows. The training part usually goes pretty smoothly without any retries. But during the prediction stage and BinaryClassificationMetrics stage, I am seeing retries with error of "fetch failure". The p

Re: removing first record from RDD[String]

2014-12-23 Thread Jörg Schad
Hi, maybe the drop function is helpful for you (even though this is probably more than you need, still interesting read) http://erikerlandson.github.io/blog/2014/07/27/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/ Joerg On Tue, Dec 23, 2014 at 5:45 PM, Hao Ren wrote: > H

Re: removing first record from RDD[String]

2014-12-23 Thread Hao Ren
Hi, I guess you would like to remove the header of a CSV file. You can play with partitions. =) // src is your RDD val noHeader = src.mapPartitionsWithIndex( (i, iterator) => if (i == 0 && iterator.hasNext) { iterator.next iterator } else iterator) Thus, you don't need to

Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
One observation is that: if fraction is big, say 50% - 80%, sampling is good, everything run as expected. But if fraction is small, for example, 5%, sampled data contains wrong rows which should have been filtered. The workaround is materializing t1 first: t1.cache t1.count These operations make

removing first record from RDD[String]

2014-12-23 Thread Hafiz Mujadid
Hi dears! Is there some efficient way to drop the first line of an RDD[String]? Any suggestions? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/removing-first-record-from-RDD-String-tp20834.html Sent from the Apache Spark User List mailing list archive

Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
Update: t1 is good. After collecting on t1, I find that all rows are ok (is_new = 0). Just after sampling, there are some rows where is_new = 1 which should have been filtered by the WHERE clause. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-

Spark UI port issue when deploying Spark driver on YARN in yarn-cluster mode on EMR

2014-12-23 Thread Roberto Coluccio
Hello folks, I'm trying to deploy a Spark driver on Amazon EMR in yarn-cluster mode expecting to be able to access the Spark UI from the :4040 address (default port). The problem here is that the Spark UI port is always defined randomly at runtime, although I also tried to specify it in the spark-

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh...figured it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817p20832.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Thanks for the clarification Sean. Best Regards, Guru Medasani > From: so...@cloudera.com > Date: Tue, 23 Dec 2014 15:39:59 + > Subject: Re: Spark Installation Maven PermGen OutOfMemoryException > To: gdm...@outlook.com > CC: protsenk...@gmail.com; user@spark.apache.org > > The text there

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Sean Owen
The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is the "-XX:MaxPermSize=512M" actually. Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani wrote: > Hi Vladimir, > > From the link Se

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Hi Vladimir, From the link Sean posted, if you use Java 8 there is the following note. Note: For Java 8 and above this step is not required. So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani > From: so...@cloudera.com > Date: Tue, 23 Dec 2014 15:04:42 +000

Re: Spark SQL with a sorted file

2014-12-23 Thread Cheng Lian
This Parquet bug only triggers when there exist some row groups which are either empty or contain only null binary values. So it's still safe to turn it on if the data types of all columns are boolean, numeric, and non-null binaries. You may turn it on by SET spark.sql.parquet.filterPushdown=tr
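
For instance, a small sketch (assuming sqlContext is an existing SQLContext):

    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
    // or, equivalently:
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")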

Re: How to export data from hive into hdfs in spark program?

2014-12-23 Thread Cheng Lian
This depends on which output format you want. For Parquet, you can simply do this: hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file") On 12/23/14 5:22 PM, LinQili wrote: Hi Leo: Thanks for your reply. I am talking about using hive from spark to export data fro
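
For a plain delimited-text export instead of Parquet, a sketch along the same lines (the delimiter and paths are placeholder assumptions):

    hiveContext.table("some_db.some_table")
      .map(row => row.mkString("\t"))
      .saveAsTextFile("hdfs://path/to/dir")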

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Somnath Pandeya
I think you should use a minimum of 2 GB of memory when building it with Maven. -Somnath -Original Message- From: Vladimir Protsenko [mailto:protsenk...@gmail.com] Sent: Tuesday, December 23, 2014 8:28 PM To: user@spark.apache.org Subject: Spark Installation Maven PermGen OutOfMemoryExceptio

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Sean Owen
You might try a little more. The official guidance suggests 2GB: https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko wrote: > I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded code from github, > in

Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Vladimir Protsenko
I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded code from github, installed maven and trying to compile system: git clone https://github.com/apache/spark.git git checkout v1.2.0 mvn -DskipTests clean package leads to OutOfMemoryException. What amount of memory does it requires? expor

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely to be deprecated anyway in favor of s3n. Also the comment section notes that Amazon has implemented an EmrFileSystem for S3 which is built using AWS SDK rather than JetS3t. On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Hey Jay :) I tried "s3n" which uses the Jets3tNativeFileSystemStore, and the double slash went away. As far as I can see, it does look like a bug in hadoop-common; I'll file a ticket for it. Hope you are doing well, by the way! PS: Jets3tNativeFileSystemStore's implementation of pathToKey is: =

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jay Vyas
Hi Enno. Might be worthwhile to cross-post this on dev@hadoop... Obviously a simple Spark way to test this would be to change the URI to write to hdfs:// or maybe you could do file://, and confirm that the extra slash goes away. - if it's indeed a jets3t issue we should add a new unit test for

ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Is anybody experiencing this? It looks like a bug in JetS3t to me, but thought I'd sanity check before filing an issue. I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with a S3 URL ("s3://fake-test/1234"). The code does write to S3, but with double forward slashes

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Right. I contacted the SummingBird users as well. It doesn't support Spark Streaming currently. We are heading towards Storm as it is the most widely used. Is Spark Streaming production ready? Thanks Ajay On Tue, Dec 23, 2014 at 3:47 PM, Gerard Maas wrote: > I'm not aware of a project trying to

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
I'm not aware of a project trying to achieve this integration. At some point Summingbird had the intention of adding an Spark port, and that could potentially bridge Storm and Spark. Not sure if that evolved into something concrete. In any case, an attempt to bring Storm and Spark together will co

Re: Consistent hashing of RDD row

2014-12-23 Thread lev
After checking the Spark code, I now realize that an RDD that was cached to disk can't be evicted, so I will just persist the RDD to disk after the random numbers are created. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Consistent-hashing-of-RDD-row-tp20
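
A sketch of that workaround (rdd stands for an existing RDD; the random values are generated once and read back from disk on recomputation):

    import org.apache.spark.storage.StorageLevel

    val withRandom = rdd
      .map(row => (row, scala.util.Random.nextDouble()))
      .persist(StorageLevel.DISK_ONLY)

    withRandom.count()  // force materialization so the random numbers are fixed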

JavaRDD (Data Aggregation) based on key

2014-12-23 Thread sachin Singh
Hi, I have a CSV file with fields a, b, c. I want to do aggregation (sum, average, ...) based on any field (a, b or c) as per user input, using the Apache Spark Java API. Please help, urgent! Thanks in advance, Regards Sachin -- View this message in context: http://apache-spark-user-list.1001560.n3

RE: How to export data from hive into hdfs in spark program?

2014-12-23 Thread LinQili
Hi Leo: Thanks for your reply. I am talking about using Hive from Spark to export data from Hive to HDFS, maybe like: val exportData = s"insert overwrite directory '/user/linqili/tmp/src' select * from $DB.$tableName" hiveContext.sql(exportData) but it is unsupported in Spark now: Exceptio

Re: Interpreting MLLib's linear regression o/p

2014-12-23 Thread Sean Owen
(In your libsvm example, your indices are not ascending.) The first weight corresponds to the first feature, of course. An indexing scheme doesn't change that or somehow make the first feature map to the second (where would the last one go then?). You'll find the first weight at offset 0 in an arr

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Hi, The question is how to do streaming in Spark with Storm (not using Spark Streaming). The idea is to use Spark as an in-memory computation engine, with static data coming from Cassandra/HBase and streaming data from Storm. Thanks Ajay On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas wrote: > Hi, >

Spark SQL job block when use hive udf from_unixtime

2014-12-23 Thread Dai, Kevin
Hi, there When I use hive udf from_unixtime with the HiveContext, the job block and the log is as follow: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueu

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Hi, I'm not sure what you are asking: Whether we can use spouts and bolts in Spark (=> no) or whether we can do streaming in Spark: http://spark.apache.org/docs/latest/streaming-programming-guide.html -kr, Gerard. On Tue, Dec 23, 2014 at 9:03 AM, Ajay wrote: > Hi, > > Can we use Storm Stre

How to export data from hive into hdfs in spark program?

2014-12-23 Thread LinQili
Hi all: I wonder if there is a way to export data from a Hive table into HDFS using Spark, like this: INSERT OVERWRITE DIRECTORY '/user/linqili/tmp/src' select * from $DB.$tableName

Re: Spark Job hangs up on multi-node cluster but passes on a single node

2014-12-23 Thread Akhil
That's because somewhere in your code you have specified localhost instead of the IP address of the machine running the service. When you run it in local mode it will work fine because everything happens on that machine, and hence it will be able to connect to localhost which runs the service; now o

RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Hi, Can we use Storm Streaming as RDD in Spark? Or any way to get Spark work with Storm? Thanks Ajay

Re: Spark Job hangs up on multi-node cluster but passes on a single node

2014-12-23 Thread shaiw75
Hi, I am having the same problem. Any solution to that? Thanks, Shai -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Job-hangs-up-on-multi-node-cluster-but-passes-on-a-single-node-tp15886p20826.html Sent from the Apache Spark User List mailing list a