Re: [Error:] Viewing Web UI on EMR cluster

2016-09-12 Thread Mohammad Tariq
Hi Divya, Do you have inbound access enabled on port 50070 of your NN machine? Also, it's a good idea to have the public DNS in your /etc/hosts for proper name resolution. Tariq, Mohammad | https://about.me/mti

[Error:] Viewing Web UI on EMR cluster

2016-09-12 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2. When I try to view any of the cluster's web UIs, either Hadoop or Spark, I get the error "This site can’t be reached". Has anybody using EMR been able to view the web UI? Could you please share the steps? Would really

unsubscribe

2016-09-12 Thread 常明敏
unsubscribe

Re: Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Takeshi Yamamuro
ISTM the only thing you can do is inject `collect` calls map-by-map, like `df.map(x => do something...).collect` // check intermediate results in maps. This only works for small datasets though. // maropu On Tue, Sep 13, 2016 at 1:38 AM, Attias, Hagai wrote: > Hi, > > Not
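
A minimal sketch of the collect-per-step idea described above (assuming Spark 2.0 and a toy Dataset; all names are illustrative):

    import org.apache.spark.sql.SparkSession

    object CollectDebug {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("collect-debug").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq(1, 2, 3, 4).toDS()

        // Materialize each intermediate step with collect() so it can be
        // inspected eagerly; only sensible on small or sampled data.
        val step1 = ds.map(_ * 10)
        println(step1.collect().mkString(", "))   // 10, 20, 30, 40

        val step2 = step1.filter(_ > 15)
        println(step2.collect().mkString(", "))   // 20, 30, 40

        spark.stop()
      }
    }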

Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using the spark-xml package. Since all XML columns are optional, some columns may or may not exist. When I register the DataFrame as a table, how do I check whether a nested column exists? My column name is "emp", which is already exploded, and I am trying to
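
A hedged sketch of one way to check this by inspecting the DataFrame schema rather than the registered table (the helper name and the "emp.address.city" path are illustrative):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Returns true if a dot-separated nested column such as "emp.name"
    // exists in the DataFrame's schema.
    def hasNestedColumn(df: DataFrame, path: String): Boolean = {
      def exists(schema: StructType, parts: List[String]): Boolean = parts match {
        case Nil => true
        case head :: tail =>
          schema.fields.find(_.name == head) match {
            case Some(field) => field.dataType match {
              case s: StructType => exists(s, tail)
              case _             => tail.isEmpty
            }
            case None => false
          }
      }
      exists(df.schema, path.split("\\.").toList)
    }

    // Usage: hasNestedColumn(df, "emp.address.city")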

Re: Strings not converted when calling Scala code from a PySpark app

2016-09-12 Thread Holden Karau
Ah yes, the Py4J conversions only apply in the driver program; your DStream, however, is RDDs of pickled objects. If you want to do this with a transform function, using Spark SQL to transfer DataFrames back and forth between Python and Scala can be much easier. On Monday, September 12, 2016, Alexis

Strings not converted when calling Scala code from a PySpark app

2016-09-12 Thread Alexis Seigneurin
Hi, *TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though.* I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are

Re: Spark with S3 DirectOutputCommitter

2016-09-12 Thread Srikanth
Thanks Steve! We are already using HDFS as an intermediate store. This is for the last stage of processing, which has to put data in S3. The output is partitioned by 3 fields, like .../field1=111/field2=999/date=-MM-DD/*. Given that there are 100s of folders and 1000s of subfolders and part

LDA spark ML visualization

2016-09-12 Thread janardhan shetty
Hi, I am trying to visualize the LDA model developed in Spark Scala (2.0 ML) in LDAvis. Are there any links on converting the Spark model parameters to the following 5 params for visualization? 1. φ, the K × W matrix containing the estimated probability mass function over the W terms in the vocabulary
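
A hedged sketch of where φ can come from in spark.ml, assuming `model` is a fitted org.apache.spark.ml.clustering.LDAModel and `cvModel` the CountVectorizerModel that produced the term-count vectors (LDAvis expects φ as K × W, while topicsMatrix is W × K, so it is transposed here):

    // topicsMatrix is vocabSize (W) x k; each column is one topic's
    // distribution over the vocabulary.
    val topics = model.topicsMatrix
    val vocab  = cvModel.vocabulary          // Array[String] of length W

    // Transpose into the K x W layout LDAvis expects for phi.
    val phi: Array[Array[Double]] =
      (0 until topics.numCols).map { k =>
        (0 until topics.numRows).map(w => topics(w, k)).toArray
      }.toArray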

Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Mario Ds Briggs
Daniel, I believe it is related to https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a task fails in an executor (probably for some other reason you hit the latter in Parquet and not CSV). The PR in there should be shortly available in IBM's Analytics for Spark. thanks

How to know how are the slaves for an application

2016-09-12 Thread Xiaoye Sun
Hi all, I am currently making some changes in Spark in my research project. In my development, after an application has been submitted to the spark master, I want to get the IP addresses of all the slaves used by that application, so that the spark master is able to talk to the slave machines
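
Not the master-side change asked about, but for reference, a quick driver-side sketch that lists the hosts currently running executors for an application (assuming `sc` is the application's SparkContext):

    // Keys of getExecutorMemoryStatus are "host:port" strings for the
    // driver block manager and every executor.
    val executorHosts = sc.getExecutorMemoryStatus.keys
      .map(_.split(":")(0))
      .toSet
    println(executorHosts.mkString(", "))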

Re: Spark transformations

2016-09-12 Thread Thunder Stumpges
Yep, totally with you on this. None of it is ideal, but it doesn't sound like there will be any changes coming to the visibility of the ML supporting classes. -Thunder On Mon, Sep 12, 2016 at 10:10 AM janardhan shetty wrote: > Thanks Thunder. To copy the code base is difficult

Re: Partition n keys into exactly n partitions

2016-09-12 Thread Denis Bolshakov
Just provide your own partitioner. Once I wrote a partitioner which keeps similar keys together in one partition. Best regards, Denis On 12 September 2016 at 19:44, sujeet jog wrote: > Hi, > > Is there a way to partition set of data with n keys into exactly n > partitions.
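
A minimal sketch of such a partitioner, assuming the key set is small and known up front (class and variable names are illustrative):

    import org.apache.spark.Partitioner

    // Maps each distinct key to its own partition, so n keys land in
    // exactly n partitions.
    class ExactKeyPartitioner(keys: Seq[Any]) extends Partitioner {
      private val index = keys.zipWithIndex.toMap
      override def numPartitions: Int = keys.size
      override def getPartition(key: Any): Int = index.getOrElse(key, 0)
    }

    // Usage on a pair RDD: rdd.partitionBy(new ExactKeyPartitioner(Seq("x", "y")))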

Re: Spark transformations

2016-09-12 Thread janardhan shetty
Thanks Thunder. Copying the code base is difficult since we would need to copy it in its entirety, including the transitive dependency files. Complex operations that take a column as a whole, instead of each element in a row, are not possible as of now. Trying to find a few pointers to easily solve

Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Evan Zamir
Yep, done. https://issues.apache.org/jira/browse/SPARK-17508 On Mon, Sep 12, 2016 at 9:06 AM Nick Pentreath wrote: > Could you create a JIRA ticket for it? > > https://issues.apache.org/jira/browse/SPARK > > On Thu, 8 Sep 2016 at 07:50 evanzamir

Re: Spark Java Heap Error

2016-09-12 Thread Baktaawar
Hi, I even tried calling dataframe.cache() before carrying out the crosstab transformation. However, I still get the same OOM error. recommender_ct.cache() --- Py4JJavaError Traceback (most recent

Partition n keys into exactly n partitions

2016-09-12 Thread sujeet jog
Hi, Is there a way to partition a set of data with n keys into exactly n partitions? For example: a tuple of 1008 rows with key x, a tuple of 1008 rows with key y, and so on, for a total of 10 keys (x, y, etc.). Total records = 10080, NumOfKeys = 10. I want to partition the 10080 elements into exactly 10

Re: Spark transformations

2016-09-12 Thread Thunder Stumpges
Hi Janardhan, I have run into similar issues and asked similar questions. I also ran into many problems with private code when trying to write my own Model/Transformer/Estimator. (you might be able to find my question to the group regarding this, I can't really tell if my emails are getting

Re: Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Attias, Hagai
Hi, Not sure what you mean, can you give an example? Hagai. From: Takeshi Yamamuro Date: Monday, September 12, 2016 at 7:24 PM To: Hagai Attias Cc: "user@spark.apache.org" Subject: Re: Debugging a Spark application in a non-

Re: Getting figures from spark streaming

2016-09-12 Thread Thunder Stumpges
Just a guess, but doesn't the `.apply(0)` at the end of each of your print statements take just the first element of the returned list? On Wed, Sep 7, 2016 at 12:36 AM Ashok Kumar wrote: > Any help on this warmly appreciated. > > > On Tuesday, 6 September 2016, 21:31,

Re: Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Takeshi Yamamuro
Hi, Spark does not have such a mode. How about getting local arrays via `collect` for debugging? // maropu On Tue, Sep 13, 2016 at 12:44 AM, Hagai wrote: > Hi guys, > Lately I was looking for a way to debug my spark application locally. > > However, since all

Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Nick Pentreath
Could you create a JIRA ticket for it? https://issues.apache.org/jira/browse/SPARK On Thu, 8 Sep 2016 at 07:50 evanzamir wrote: > When I am trying to use LinearRegression, it seems that unless there is a > column specified with weights, it will raise a py4j error. Seems

Debugging a Spark application in a non-lazy mode

2016-09-12 Thread Hagai
Hi guys, Lately I was looking for a way to debug my Spark application locally. However, since all transformations are actually executed only when an action is encountered, I have no way to look at the data after each transformation. Does Spark support a non-lazy mode which makes it possible to execute

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread Mich Talebzadeh
Hi, I don't understand why you need to add a row_number column when you can use rank or dense_rank. Why can one not use rank or dense_rank here? Thanks, Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
Hi Ayan, "My problem is to get data on to HDFS for the first time." Well, you have to put them on the cluster. With this simple command you can load them into HDFS: hdfs dfs -put $LOCAL_SRC_DIR $HDFS_PATH Then, I think you have to use coalesce in order to create an uber super mega file :)
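
A hedged sketch of that coalesce step, reading the small files already copied to HDFS and rewriting them as a handful of larger ones (paths and the partition count are placeholders; assumes a Spark 2.0 `spark` session):

    // Read the many small text files and rewrite them as ~16 larger files.
    val raw = spark.read.text("hdfs:///data/small_files/*")
    raw.coalesce(16)
       .write.text("hdfs:///data/consolidated/")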

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread luohui20001
Hi Kevin, a window function is what you need, like below: val hivetable = hc.sql("select * from house_sale_pv_location") val overLocation = Window.partitionBy(hivetable.col("lp_location_id")) val sortedDF = hivetable.withColumn("rowNumber",
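
A hedged sketch of the full pattern (the sort column "price" is a placeholder; row_number is used because, unlike rank or dense_rank, it returns exactly 100 rows per group even when there are ties):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.partitionBy(col("lp_location_id")).orderBy(col("price").desc)
    val top100 = hivetable
      .withColumn("rowNumber", row_number().over(w))
      .filter(col("rowNumber") <= 100)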

RE: questions about using dapply

2016-09-12 Thread xingye
Hi Felix, Thanks for the information. As in my previous email, I've capitalized MARGIN and it worked with dapplyCollect, but it does not work in dapply. If I use dapply and pass the original apply function as the function for dapply: cols_in <- dapply(df, function(x) {apply(x[, paste("cr_cd",

Re: Spark tasks block randomly on standalone cluster

2016-09-12 Thread Denis Bolshakov
Hello, I see such behavior from time to time. A similar problem is described here: http://apache-spark-user-list.1001560.n3.nabble.com/Executor-Memory-Task-hangs-td12377.html We also use speculative execution as a workaround (our Spark version is 1.6.0). But I would like to share one of our observations. We
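
For reference, the speculative-execution workaround as configuration (a sketch; the multiplier and quantile shown are Spark's defaults):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")             // re-launch slow-running tasks
      .set("spark.speculation.multiplier", "1.5")   // "slow" = 1.5x the median task time
      .set("spark.speculation.quantile", "0.75")    // only after 75% of tasks have finished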

Re: Spark metrics when running with YARN?

2016-09-12 Thread Vladimir Tretyakov
Hello Saisai Shao, Jacek Laskowski, thanks for the information. We are working on a Spark monitoring tool and our users have different setup modes (Standalone, Mesos, YARN). I looked at the code and found: /** * Attempt to start a Jetty server bound to the supplied hostName:port using the given * context

Spark tasks block randomly on standalone cluster

2016-09-12 Thread bogdanbaraila
We have a quite complex application that runs on Spark Standalone. In some cases the tasks from one of the workers block randomly for an infinite amount of time in the RUNNING state. Extra info:

Re: Small files

2016-09-12 Thread ayan guha
Hi, thanks for your mail. I have read a few of those posts, but the solutions always assume the data is on HDFS already. My problem is to get the data onto HDFS for the first time. One way I can think of is to load the small files onto each cluster machine in the same folder. For example load file 1-0.3 mil

Re: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-12 Thread Daniel Lopes
Thanks Steve, but this error occurs only with Parquet files; CSVs work. Best, Daniel Lopes, Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
That is a good question, Ayan. A few searches on SO return: http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge Good luck,

Small files

2016-09-12 Thread ayan guha
Hi, I have a general question: I have 1.6 million small files, about 200G all put together. I want to put them on HDFS for Spark processing. I know a sequence file is the way to go because putting small files on HDFS is not good practice. Also, I can write code to consolidate the small files into a seq
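
One hedged sketch of that consolidation step: wholeTextFiles yields (path, content) pairs, which can be written straight out as a SequenceFile (paths and the partition hint are placeholders):

    // Each record is (original file path, file content).
    sc.wholeTextFiles("hdfs:///staging/small_files/*", minPartitions = 32)
      .saveAsSequenceFile("hdfs:///data/consolidated_seq")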

Unsubscribe

2016-09-12 Thread bijuna
> > - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Spark word count program , need help on integration

2016-09-12 Thread gobi s
Hi, I am new to Spark. I want to develop a word count app and deploy it in local mode. From outside, I want to trigger the program, get the word count output, and show it in the UI. I need help on integrating Spark with an outside application. i) How to trigger the Spark app from the J2EE app
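
For (i), one supported route is the SparkLauncher API, which lets a J2EE process submit the job programmatically; a hedged sketch with placeholder paths and class names:

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("/path/to/wordcount-assembly.jar")   // placeholder jar
      .setMainClass("com.example.WordCount")               // placeholder main class
      .setMaster("local[*]")
      .startApplication()

    // handle.getState() can be polled from the web tier; the word counts
    // themselves are usually written to a shared store (HDFS, a database)
    // that the UI then reads.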

Re: Spark metrics when running with YARN?

2016-09-12 Thread Saisai Shao
Here is the YARN RM REST API for you to refer to (http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html). You can use these APIs to query applications running on YARN. On Sun, Sep 11, 2016 at 11:25 PM, Jacek Laskowski wrote: > Hi Vladimir, > >
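
A minimal sketch of hitting that endpoint from Scala (the ResourceManager host is a placeholder; 8088 is the usual default port):

    import scala.io.Source

    // Lists applications currently running on YARN as JSON.
    val json = Source.fromURL(
      "http://resourcemanager-host:8088/ws/v1/cluster/apps?states=RUNNING").mkString
    println(json)   // parse with any JSON library to pull out app ids, tracking URLs, etc.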

Re: Using Zeppelin with Spark FP

2016-09-12 Thread Sachin Janani
Yes, Zeppelin 0.6.1 works properly with Spark 2.0. On Mon, Sep 12, 2016 at 1:10 PM, Mich Talebzadeh wrote: > Does Zeppelin work OK with Spark 2? > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Using Zeppelin with Spark FP

2016-09-12 Thread Mich Talebzadeh
Does Zeppelin work OK with Spark 2? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com Disclaimer: Use it

Re: Using Zeppelin with Spark FP

2016-09-12 Thread Sachin Janani
Zeppelin imports the ZeppelinContext object in the Spark interpreter, with which you can plot a DataFrame, Dataset, and even an RDD. To do so you just need to use "z.show(df)" in the paragraph (here df is the DataFrame you want to plot). Regards, Sachin Janani On Mon, Sep 12, 2016 at 11:20 AM, andy