Re: using pyspark with standalone cluster

2015-06-02 Thread Akhil Das
If you want to submit applications to a remote cluster where port 7077 is open publicly, then you would need to set spark.driver.host (with the public IP of your laptop) and spark.driver.port (optional, if there's no firewall between your laptop and the remote cluster). Keeping
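A rough sketch of that configuration (shown in Scala; the same two properties can be set from pyspark's SparkConf, and all host names, IPs and ports below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://remote-master:7077")      // placeholder master URL
      .setAppName("remote-submit-example")
      .set("spark.driver.host", "203.0.113.10")     // public IP of the machine running the driver
      .set("spark.driver.port", "51000")            // only needed if a firewall requires a fixed port
    val sc = new SparkContext(conf)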

Insert overwrite to hive - ArrayIndexOutOfBoundsException

2015-06-02 Thread patcharee
Hi, I am using Spark 1.3.1. I tried to insert (a new partition) into an existing partitioned Hive table, but got ArrayIndexOutOfBoundsException. Below is a code snippet and the debug log. Any suggestions, please? + case class Record4Dim(key: String,
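For context, a typical way to write such an insert from Spark SQL looks roughly like the sketch below; the table and column names are invented and this is not the poster's code:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    case class Staging(key: String, value: Double)
    val df = sc.parallelize(Seq(Staging("a", 1.0))).toDF()
    df.registerTempTable("staging")

    hiveContext.sql(
      "INSERT OVERWRITE TABLE my_table PARTITION (dt = '2015-06-02') " +
      "SELECT key, value FROM staging")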

Union data type

2015-06-02 Thread Agarwal, Shagun
Hi, As per the doc https://github.com/databricks/spark-avro/blob/master/README.md, the Union type doesn't support every combination. Is there any plan to support a union type containing string and long in the near future? Thanks Shagun Agarwal

spark sql - reading data from sql tables having space in column names

2015-06-02 Thread Sachin Goyal
Hi, We are using spark sql (1.3.1) to load data from Microsoft sql server using jdbc (as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases ). It is working fine except when there is a space in column names (we can't modify the schemas to remove

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in memory, and if there isn't sufficient space it will be spilled to disk. Thanks Best Regards On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
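Spelled out, the suggestion looks something like this sketch (rdd and the flatMap body are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Keep what fits in memory and spill the remaining partitions to disk,
    // then run the wide flatMap on top of the persisted data.
    val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)
    val expanded = persisted.flatMap(record => expand(record))   // expand(...) is a stand-in for the real logic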

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020); you can restart your HDFS by going into the Hadoop directory and running sbin/stop-dfs.sh and then sbin/start-dfs.sh. Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote: Hello All, A bit scared I did

Re: Spark stages very slow to complete

2015-06-02 Thread Karlson
Hi, the code is some hundreds of lines of Python. I can try to compose a minimal example as soon as I find the time, though. Any ideas until then? Would you mind posting the code? On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote: Hi, In all (pyspark) Spark jobs that become somewhat

Re: What is shuffle read and what is shuffle write ?

2015-06-02 Thread Akhil Das
I found an interesting presentation http://www.slideshare.net/colorant/spark-shuffle-introduction and you could also go through this thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html Thanks Best Regards On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with: mvn -Dhadoop.version=2.4.0 -Pyarn *-DskipTests* clean package Thanks Best Regards On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote: I downloaded the 1.3.1 distro tarball $ll ../spark-1.3.1.tar.gz -rw-r-@ 1 steve staff

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread octavian.ganea
I tried using reduceByKey, without success. I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey. However, I got the same error as before, namely the error described here:

Re: Streaming K-medoids

2015-06-02 Thread Marko Dinic
Erik, Thank you for your answer. It seems really good, but unfortunately I'm not very familiar with Scala, so I have only partly understood it. Could you please explain your idea with a Spark implementation? Best regards, Marko On Mon 01 Jun 2015 06:35:17 PM CEST, Erik Erlandson wrote: I haven't

Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Anders Arpteg
Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that worked fine for Spark 1.3. The job starts on the cluster (yarn-cluster mode), initial stage starts, but the job fails before any task succeeds with the following error. Any hints? [ERROR] [06/02/2015 09:05:36.962] [Executor

What is shuffle read and what is shuffle write ?

2015-06-02 Thread ๏̯͡๏
Is it input and output bytes/record size? -- Deepak

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Peter Rudenko
Hi Dimple, take a look to existing transformers: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
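For a concrete starting point, here is a rough sketch of a custom transformer modeled on the Tokenizer linked above. The class and its logic are invented for illustration, and the UnaryTransformer signatures changed between Spark 1.2 and 1.4, so this follows the 1.4-era shape and should be checked against the exact Spark version in use:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    // Hypothetical transformer that lower-cases and splits a string column.
    class SimpleSplitter(override val uid: String)
      extends UnaryTransformer[String, Seq[String], SimpleSplitter] {

      def this() = this(Identifiable.randomUID("simpleSplitter"))

      override protected def createTransformFunc: String => Seq[String] =
        _.toLowerCase.split("\\s+").toSeq

      override protected def validateInputType(inputType: DataType): Unit =
        require(inputType == StringType, s"Input type must be StringType but got $inputType")

      override protected def outputDataType: DataType = ArrayType(StringType, containsNull = false)
    }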

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and IO performance on the workers? The error you posted looks to be related to shuffle trying to obtain block data from another worker node and failing to do so in reasonable amount of time. It may still be memory related, but I'm not

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Cody Koeninger
If you're using the spark partition id directly as the key, then you don't need to access offset ranges at all, right? You can create a single instance of a partitioner in advance, and all it needs to know is the number of partitions (which is just the count of all the kafka topic/partitions). On

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source? On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg arp...@spotify.com wrote: The log is from the log aggregation tool (hortonworks, yarn logs ...), so both executors and driver. I'll send a private mail to you with the full logs. Also, tried

Re: Can't build Spark 1.3

2015-06-02 Thread Ritesh Kumar Singh
It did hang for me too. High RAM consumption during build. Had to free a lot of RAM and introduce swap memory just to get it built on my third attempt. Everything else looks fine. You can download the prebuilt versions from the Spark homepage to save yourself from all this trouble. Thanks, Ritesh

Re: HDFS Rest Service not available

2015-06-02 Thread Su She
Ahh, this did the trick; I had to get the name node out of safe mode, however, before it fully worked. Thanks! On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It says your namenode is down (connection refused on 8020), you can restart your HDFS by going into hadoop

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
So in Spark, after acquiring executors from the cluster manager, are tasks scheduled on executors based on data locality? I mean, if in an application there are 2 jobs and the output of job 1 is used as input of job 2, and in job 1 I did persist on some RDD, then while running job 2 will it use

Can't build Spark 1.3

2015-06-02 Thread Yakubovich, Alexey
I downloaded the latest Spark (1.3) from GitHub and then tried to build it, first for Scala 2.10 (and Hadoop 2.4): build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package. That resulted in a hang after printing a bunch of lines like [INFO] Dependency-reduced POM written at

DataFrames coming in SparkR in Apache Spark 1.4.0

2015-06-02 Thread Emaasit
For the impatient R-user, here is a link http://people.apache.org/~pwendell/spark-nightly/spark-1.4-docs/latest/sparkr.html to get started working with DataFrames using SparkR. Or copy and paste this link into your web browser:

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Thanks, this works. Hopefully I didn't miss something important with this approach. On Tue, 2 Jun 2015 at 20:15, Cody Koeninger c...@koeninger.org wrote: If you're using the spark partition id directly as the key, then you don't need to access offset ranges at all, right? You can create a single

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
Thanks Peter. Can you share the Tokenizer.java class for Spark 1.2.1. Dimple On Tue, Jun 2, 2015 at 10:51 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi Dimple, take a look to existing transformers:

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread David Mitchell
I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, e.g. "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to address this issue. On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote: I would think the

[OFFTOPIC] Big Data Application Meetup

2015-06-02 Thread Alex Baranau
Hi everyone, I wanted to drop a note about a newly organized developer meetup in Bay Area: the Big Data Application Meetup (http://meetup.com/bigdataapps) and call for speakers. The plan is for meetup topics to be focused on application use-cases: how developers can build end-to-end solutions

Re: IDE for sparkR

2015-06-02 Thread Emaasit
Rstudio is the best IDE for running sparkR. Instructions for this can be found at this link https://github.com/apache/spark/tree/branch-1.4/R . You will need to set some environment variables as described below. *Using SparkR from RStudio* If you wish to use SparkR from RStudio or other R

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
Thanks for the quick reply Ram. Will take a look at the Tokenizer code and try it out. Dimple On Tue, Jun 2, 2015 at 10:42 AM, Ram Sriharsha sriharsha@gmail.com wrote: Hi We are in the process of adding examples for feature transformations (

Issues with Spark Streaming and Manual Clock used for Unit Tests

2015-06-02 Thread mobsniuk
I have a situation where I have multiple tests that use Spark Streaming with a manual clock. The first run is OK and processes the data when I increment the clock to the sliding window duration. The second test deviates and doesn't process any data. The trace follows, which indicates the memory store is

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
You shouldn't have to persist the RDD at all, just call flatMap and reduce on it directly. If you try to persist it, that will try to load the original data into memory, but here you are only scanning through it once. Matei On Jun 2, 2015, at 2:09 AM, Octavian Ganea octavian.ga...@inf.ethz.ch

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Nice to hear from you Holden ! I ended up trying exactly that (Column) - but I may have done it wrong : In [*5*]: g.agg(Column(percentile(value, 0.5))) Py4JError: An error occurred while calling o97.agg. Trace: py4j.Py4JException: Method agg([class java.lang.String, class

Re: Can't build Spark 1.3

2015-06-02 Thread Ted Yu
Have you run zinc during build ? See build/mvn which installs zinc. Cheers On Tue, Jun 2, 2015 at 12:26 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: It did hang for me too. High RAM consumption during build. Had to free a lot of RAM and introduce swap memory just to get it

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
So for Column you need to pass in a Java function. I have some sample code which does this, but it does terrible things to access Spark internals. On Tuesday, June 2, 2015, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Nice to hear from you Holden ! I ended up trying exactly that

Re: Best strategy for Pandas - Spark

2015-06-02 Thread Olivier Girardot
Thanks for the answer, I'm currently doing exactly that. I'll try to sum up the usual Pandas -> Spark DataFrame caveats soon. Regards, Olivier. On Tue, Jun 2, 2015 at 02:38, Davies Liu dav...@databricks.com wrote: The second one sounds reasonable, I think. On Thu, Apr 30, 2015 at 1:42 AM,

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Shixiong Zhu
Ryan - I sent a PR to fix your issue: https://github.com/apache/spark/pull/6599 Edward - I have no idea why the following error happened. ContextCleaner doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to support both hadoop 1 and hadoop 2. * Exception in thread Spark Context

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great. On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu zsxw...@gmail.com wrote: Ryan - I sent a PR to fix your issue: https://github.com/apache/spark/pull/6599 Edward - I have no idea why the following error happened. ContextCleaner doesn't use any Hadoop API.

Re: Spark 1.4 YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
Just testing with Spark 1.3, it looks like it sets the proxy correctly to be the YARN RM host (0101) 15/06/03 10:34:19 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT] 15/06/03 10:34:20 INFO yarn.ApplicationMaster: ApplicationAttemptId:

Re: Spark 1.4 YARN Application Master fails with 500 connect refused

2015-06-02 Thread Marcelo Vanzin
That code hasn't changed at all between 1.3 and 1.4; it also has been working fine for me. Are you sure you're using exactly the same Hadoop libraries (since you're building with -Phadoop-provided) and Hadoop configuration in both cases? On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Sean Owen
We are having a separate discussion about this, but I don't understand why this is a problem. You're supposed to build Spark for Hadoop 1 if you run it on Hadoop 1, and I am not sure that is happening here, given the error. I do not think this should change as I do not see that it fixes something.

Re: Can't build Spark

2015-06-02 Thread Mulugeta Mammo
Spark 1.3.1, Scala 2.11.6, Maven 3.3.3, I'm behind proxy, have set my proxy settings in maven settings. Thanks, On Tue, Jun 2, 2015 at 2:54 PM, Ted Yu yuzhih...@gmail.com wrote: Can you give us some more information ? Such as: which Spark release you were building what command you used

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Dimp Bhat
I found this: https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/ml/feature/Tokenizer.html which indicates the Tokenizer did exist in Spark 1.2.0; so did it exist then but not in 1.2.1? On Tue, Jun 2, 2015 at 12:45 PM, Peter Rudenko petro.rude...@gmail.com wrote: I'm afraid there's no such class

Re: Can't build Spark

2015-06-02 Thread Ted Yu
Can you give us some more information? Such as: which Spark release you were building, what command you used, and which Scala version you used. Thanks On Tue, Jun 2, 2015 at 2:50 PM, Mulugeta Mammo mulugeta.abe...@gmail.com wrote: building Spark is throwing errors, any ideas? [FATAL] Non-resolvable

Scripting with groovy

2015-06-02 Thread Paolo Platter
Hi all, Has anyone tried to add scripting capabilities to Spark Streaming using Groovy? I would like to stop the streaming context, update a transformation function written in Groovy (for example, to manipulate JSON), restart the streaming context, and obtain a new behavior without re-submitting the

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
Yup, exactly. All the workers will use local disk in addition to RAM, but maybe one thing you need to configure is the directory to use for that. It should be set through spark.local.dir. By default it's /tmp, which on some machines is also in RAM, so that could be a problem. You should set it
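For example (the directory is a placeholder for a path on a real, large local disk):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("disk-heavy-job")
      .set("spark.local.dir", "/mnt/spark-scratch")   // not /tmp, which may be RAM-backed
    val sc = new SparkContext(conf)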

Can't build Spark

2015-06-02 Thread Mulugeta Mammo
building Spark is throwing errors, any ideas? [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central ( http://repo.maven.apache.org/maven2): Error transferring file: repo.maven.apache.org from

Re: How to monitor Spark Streaming from Kafka?

2015-06-02 Thread Ruslan Dautkhanov
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf -- Ruslan Dautkhanov On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thank you, Tathagata,

Re: Can't build Spark

2015-06-02 Thread Ted Yu
I ran dev/change-version-to-2.11.sh first. I used the following command but didn't reproduce the error below: mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package My env: maven 3.3.1 Possibly the error was related to proxy setting. FYI On Tue, Jun 2, 2015 at 3:14 PM, Mulugeta Mammo

Behavior of the spark.streaming.kafka.maxRatePerPartition config param?

2015-06-02 Thread dgoldenberg
Hi, Could someone explain the behavior of the spark.streaming.kafka.maxRatePerPartition parameter? The doc says An important (configuration) is spark.streaming.kafka.maxRatePerPartition which is the maximum rate at which each Kafka partition will be read by (the) direct API. What is the default
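Whatever its default, the parameter is set like any other Spark property; a sketch with an arbitrary value:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct-job")
      // Cap reads at 1000 records per second per Kafka partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")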

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
Hi, Thanks for the information. I'll give Spark 1.4 a try when it's released. On Wed, Jun 3, 2015 at 11:31 AM, Tathagata Das t...@databricks.com wrote: Could you try it out with Spark 1.4 RC3? Also pinging, Cloudera folks, they may be aware of something. BTW, the way I have debugged memory

Application is always in process when I check out logs of completed application

2015-06-02 Thread amghost
I run spark application in spark standalone cluster with client deploy mode. I want to check out the logs of my finished application, but I always get a page telling me Application history not found - Application xxx is still in process. I am pretty sure that the application has indeed completed

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Tathagata Das
Could you try it out with Spark 1.4 RC3? Also pinging, Cloudera folks, they may be aware of something. BTW, the way I have debugged memory leaks in the past is as follows. Run with a small driver memory, say 1 GB. Periodically (maybe a script), take snapshots of histogram and also do memory

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
Hi, Thanks for your reply. Here are the top 30 entries of the jmap -histo:live result:

    num    #instances    #bytes        class name
    1:     40802         145083848     [B
    2:     99264         12716112      methodKlass
    3:     99264         12291480

Re: in GraphX,program with Pregel runs slower and slower after several iterations

2015-06-02 Thread Cheuk Lam
I've been encountering something similar too. I suspected that was related to the lineage growth of the graph/RDDs. So I checkpoint the graph every 60 Pregel rounds, after doing which my program doesn't slow down any more (except that every checkpoint takes some extra time). -- View this

Re: Spark 1.4 YARN Application Master fails with 500 connect refused

2015-06-02 Thread Night Wolf
Thanks Marcelo - looks like it was my fault. Seems when we deployed the new version of spark it was picking up the wrong yarn site and setting the wrong proxy host. All good now! On Wed, Jun 3, 2015 at 11:01 AM, Marcelo Vanzin van...@cloudera.com wrote: That code hasn't changed at all between

Filter operation to return two RDDs at once.

2015-06-02 Thread ๏̯͡๏
I want to do this: val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE) and val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE). This will run two different stages; can this be done in one stage? val (qtSessionsWithQt,
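If the goal is mainly to avoid recomputing rawQtSession twice, one common workaround is to persist it before the two filters (a sketch; whether the two resulting jobs can literally share one stage is a separate question):

    import org.apache.spark.storage.StorageLevel

    val cached = rawQtSession.persist(StorageLevel.MEMORY_AND_DISK)
    val qtSessionsWithQt   = cached.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
    val guidUidMapSessions = cached.filter(_._2.qualifiedTreatmentId == NULL_VALUE)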

How to create fewer output files for Spark job ?

2015-06-02 Thread ๏̯͡๏
I am running a series of spark functions with 9000 executors and it's resulting in 9000+ files, which is exceeding the namespace file count quota. How can Spark be configured to use CombinedOutputFormat? {code} protected def writeOutputRecords(detailRecords: RDD[(AvroKey[DetailOutputRecord],
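One frequently suggested alternative, independent of the output format, is to shrink the number of partitions before writing; a sketch with an arbitrary target count:

    // Collapse the 9000+ partitions into far fewer output files without a full
    // shuffle; use repartition(...) instead if an even redistribution is needed.
    val fewerFiles = detailRecords.coalesce(500)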

build jar with all dependencies

2015-06-02 Thread Pa Rö
hello community, I have built a jar file from my Spark app with maven (mvn clean compile assembly:single) and the following pom file: project xmlns=http://maven.apache.org/POM/4.0.0; xmlns:xsi= http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation=http://maven.apache.org/POM/4.0.0

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Shixiong Zhu
How about other jobs? Is it an executor log, or a driver log? Could you post other logs near this error, please? Thank you. Best Regards, Shixiong Zhu 2015-06-02 17:11 GMT+08:00 Anders Arpteg arp...@spotify.com: Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that worked

Shared / NFS filesystems

2015-06-02 Thread Pradyumna Achar
Hello! I have Spark running in standalone mode, and there are a bunch of worker nodes connected to the master. The workers have a shared file system, but the master node doesn't. Is this something that's not going to work? i.e., should the master node also be on the same shared filesystem mounted

Re: Shared / NFS filesystems

2015-06-02 Thread Akhil Das
You can run/submit your code from one of the workers which has access to the file system and it should be fine, I think. Give it a try. Thanks Best Regards On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com wrote: Hello! I have Spark running in standalone mode, and there

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Anders Arpteg
The log is from the log aggregation tool (hortonworks, yarn logs ...), so both executors and driver. I'll send a private mail to you with the full logs. Also, tried another job as you suggested, and it actually worked fine. The first job was reading from a parquet source, and the second from an

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
Can you run using spark-submit? What is happening is that you are running a simple java program -- you've wrapped spark-core in your fat jar but at runtime you likely need the whole Spark system in order to run your application. I would mark the spark-core as provided (so you don't wrap it in your

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread ayan guha
I would think the easiest way would be to create a view in DB with column names with no space. In fact, you can pass a sql in place of a real table. From documentation: The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used. For
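Concretely, that means putting a subquery (with the spaced column aliased) where the table name would normally go; a sketch against the Spark 1.3 API with made-up connection details and column names:

    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:sqlserver://dbhost;databaseName=mydb;user=u;password=p",
      "dbtable" -> "(SELECT [Order Id] AS order_id, amount FROM dbo.Orders) AS t"
    ))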

Re: build jar with all dependencies

2015-06-02 Thread Pa Rö
okay, but how can I compile my app to run without -Dconfig.file=alt_reference1.conf? 2015-06-02 15:43 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com: This looks like your app is not finding your Typesafe config. The config should usually be placed in particular folder under your app to

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
This looks like your app is not finding your Typesafe config. The config should usually be placed in particular folder under your app to be seen correctly. If it's in a non-standard location you can pass -Dconfig.file=alt_reference1.conf to java to tell it where to look. If this is a config that

Re: How to read sequence File.

2015-06-02 Thread ๏̯͡๏
Spark Shell: val x = sc.sequenceFile(/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761, classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text]) OR val x = sc.sequenceFile(/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761, classOf[org.apache.hadoop.io.Text],

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this... sqlContext should be a HiveContext instance:

    case class KeyValue(key: Int, value: String)
    val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
    df.registerTempTable("table")
    sqlContext.sql("select percentile(key, 0.5) from table").show()

On Tue, Jun 2, 2015 at 8:07 AM,

Re: How to read sequence File.

2015-06-02 Thread Akhil Das
Basically, you need to convert it to a serializable format before doing the collect/take. You can fire up a spark shell and paste this:

    val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid").map(_._2.toString)
    sFile.take(5).foreach(println)

Use the

Transactional guarantee while saving DataFrame into a DB

2015-06-02 Thread Mohammad Tariq
Hi list, With the help of Spark DataFrame API we can save a DataFrame into a database table through insertIntoJDBC() call. However, I could not find any info about how it handles the transactional guarantee. What if my program gets killed during the processing? Would it end up in partial load?

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
I've finally come to the same conclusion, but isn't there any way to call these Hive UDAFs from agg(percentile(key,0.5))? On Tue, Jun 2, 2015 at 15:37, Yana Kadiyska yana.kadiy...@gmail.com wrote: Like this... sqlContext should be a HiveContext instance case class KeyValue(key: Int,

Re: build jar with all dependencies

2015-06-02 Thread Paul Röwer
Which Maven dependency do I need, too? http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html On 02.06.2015 at 16:04, Yana Kadiyska wrote: Can you run using spark-submit? What is happening is that you are running a simple java program -- you've

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily, the GroupedData class uses a strToExpr function which has a pretty limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a column with the expression you want and then pass it in to agg() that way

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690 https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383); attempting to build against Hadoop 1.0.4 yields errors like: 2015-06-02 15:57:44 ERROR Executor:96 -

updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Hi all, In my streaming job I'm using the Kafka streaming direct approach and want to maintain state with updateStateByKey. My PairRDD has the message's topic name + partition id as a key. So, I assume that updateStateByKey could work within the same partition as the KafkaRDD and not lead to shuffles. Actually this

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
Ah, interesting. While working on my new Tungsten shuffle manager, I came up with some nice testing interfaces for allowing me to manually trigger spills in order to deterministically test those code paths without requiring large amounts of data to be shuffled. Maybe I could make similar test

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
Is it possible with JavaSparkContext? JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD<String> lines = jsc.textFile(args[0]); If yes, is it the programmer's responsibility to first calculate split locations and then instantiate the Spark context with preferred locations? How is that achieved

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
My suggestion is that you change the Spark setting which controls the compression codec that Spark uses for internal data transfers. Set spark.io.compression.codec to lzf in your SparkConf. On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello Josh, Are you suggesting
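In code, that is a one-line change on the SparkConf (a sketch):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-job")
      // Switch Spark's internal block compression (shuffle, broadcast, RDD spills) to LZF.
      .set("spark.io.compression.codec", "lzf")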

Re: Spark 1.3.0: how to let Spark history load old records?

2015-06-02 Thread Otis Gospodnetic
I think Spark doesn't keep historical metrics. You can use something like SPM for that - http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support *

Scala By the Bay + Big Data Scala 2015 Program Announced

2015-06-02 Thread Alexy Khrabrov
The programs and schedules for Scala By the Bay (SBTB) and Big Data Scala By the Bay (BDS) 2015 conferences are announced and published: Scala By the Bay http://scala.bythebay.io — August 13-16 ( scala.bythebay.io) Big Data Scala By the Bay http://bigdatascala.bythebay.io — August 16-18 (

Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread dimple
Hi, I would like to embed my own transformer in the Spark.ml Pipeline but do not see an example of it. Can someone share an example of which classes/interfaces I need to extend/implement in order to do so? Thanks. Dimple -- View this message in context:

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document this). The primary difference between MR and Spark here is that MR runs each task in its own YARN container, while Spark runs multiple tasks within an executor, which

Re: updateStateByKey and kafka direct approach without shuffle

2015-06-02 Thread Krot Viacheslav
Cody, Thanks, good point. I fixed getting the partition id to:

    class MyPartitioner(offsetRanges: Array[OffsetRange]) extends Partitioner {
      override def numPartitions: Int = offsetRanges.size
      override def getPartition(key: Any): Int = {
        // this is set in .map(m =
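Filled out along the lines Cody suggests, a partitioner only needs to know the partition count, and the key itself can carry the target partition index. A minimal sketch (assuming each record was first mapped so that its key is the Kafka partition id, which is not necessarily the poster's exact setup):

    import org.apache.spark.Partitioner

    // The key is assumed to already be the integer partition index.
    class KafkaPartitionIdPartitioner(val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key.asInstanceOf[Int]
    }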

Re: Embedding your own transformer in Spark.ml Pipeline

2015-06-02 Thread Ram Sriharsha
Hi We are in the process of adding examples for feature transformations ( https://issues.apache.org/jira/browse/SPARK-7546) and this should be available shortly on Spark Master. In the meanwhile, the best place to start would be to look at how the Tokenizer works here:

Join highly skewed datasets

2015-06-02 Thread ๏̯͡๏
We use Scoobi + MR to perform joins and we particularly use blockJoin() API of scoobi /** Perform an equijoin with another distributed list where this list is considerably smaller * than the right (but too large to fit in memory), and where the keys of right may be * particularly skewed. */

SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-02 Thread Haopu Wang
Hi, I'm trying to save SparkSQL DataFrame to a persistent Hive table using the default parquet data source. I don't know how to change the replication factor of the generated parquet files on HDFS. I tried to set dfs.replication on HiveContext but that didn't work. Any suggestions are
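One thing that may be worth trying, offered here as an untested assumption rather than a known fix, is to set the property on the Hadoop configuration that the file writers actually consult:

    // Assumption: the parquet writers pick up dfs.replication from the
    // SparkContext's Hadoop configuration; verify on a small table first.
    sc.hadoopConfiguration.set("dfs.replication", "2")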

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Igor Berman
Hi, small mock data doesn't reproduce the problem. IMHO the problem is reproduced when we make the shuffle big enough to spill data to disk. We will work on it to understand and reproduce the problem (not first priority though...) On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote: How