RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread Udbhav Agarwal
That sounds great. Thanks. Can I assume that the source for a stream in Spark can only be some external source like Kafka etc.? Can the source not be some RDD in Spark or some external file? Thanks, Udbhav From: ayan guha [mailto:guha.a...@gmail.com] Sent: Friday, September 16, 2016 3:01 AM To: Udbhav

Re: Missing output partition file in S3

2016-09-16 Thread Steve Loughran
On 15 Sep 2016, at 19:37, Chen, Kevin wrote: Hi, Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I noticed one partition file is missing. As a result,

Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran
On 16 Sep 2016, at 01:03, Peyman Mohajerian wrote: You can listen to files in a specific directory. Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html (streamingContext.fileStream). Yes, this works; here's
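
A minimal sketch of the fileStream approach mentioned above (paths and names are illustrative, not from the thread; textFileStream is the text-file convenience wrapper around fileStream):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical local setup; the monitored directory is an assumption.
    val conf = new SparkConf().setAppName("hdfs-dir-stream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Picks up files newly created in the directory on each batch interval.
    val lines = ssc.textFileStream("hdfs:///incoming/")
    lines.foreachRDD(rdd => println(s"new records: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()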

Re: Impersonate users using the same SparkContext

2016-09-16 Thread Steve Loughran
> On 16 Sep 2016, at 04:43, gsvigruha wrote: > > Hi, > > is there a way to impersonate multiple users using the same SparkContext > (e.g. like this > https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html) > when going

Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
are you using speculation? On 15 September 2016 at 21:37, Chen, Kevin wrote: > Hi, > > Has anyone encountered an issue of a missing output partition file in S3? > My Spark job writes output to an S3 location. Occasionally, I noticed one > partition file is missing. As a
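
A hedged illustration of switching speculation off, a common first check when partial output goes missing on S3 (the config key is standard; the rest of the setup is assumed):

    import org.apache.spark.{SparkConf, SparkContext}

    // Speculative task attempts can race each other when committing output files;
    // disabling speculation is a common first step when partitions vanish on S3.
    val conf = new SparkConf()
      .setAppName("s3-output-job")        // illustrative app name
      .set("spark.speculation", "false")  // the default, shown explicitly here
    val sc = new SparkContext(conf)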

RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
In fact you can use an RDD as well, using queueStream, but it is intended for testing, as per the documentation. On 16 Sep 2016 17:44, "ayan guha" wrote: > Rdd no. File yes, using fileStream. But fileStream does not support > replay, I think. You need to manage checkpointing yourself. >
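
A small sketch of the queueStream approach (intended for tests, as noted above); the RDD contents are made up:

    import scala.collection.mutable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "queue-stream-demo")  // assumed local setup
    val ssc = new StreamingContext(sc, Seconds(1))

    // Each RDD pushed into the queue is served as one batch of the DStream.
    val queue = mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(queue)
    stream.print()

    ssc.start()
    queue += sc.makeRDD(1 to 100)   // simulate data arriving
    queue += sc.makeRDD(101 to 200)
    ssc.awaitTerminationOrTimeout(5000)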

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
You may have to decrease the checkpoint interval to, say, 5 if you're getting StackOverflowError. You may have a particularly deep lineage being created during iterations. "No space left on device" means you don't have enough local disk to accommodate the big shuffles in some stage. You can add more
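
A hedged sketch of lowering the ALS checkpoint interval so the lineage stays short (paths, rank and iteration counts are placeholders, not taken from the thread; sc is assumed to exist as in spark-shell):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Checkpointing truncates the RDD lineage that grows with every ALS iteration,
    // which is what typically causes the StackOverflowError on deep lineages.
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // assumed path

    val ratings = sc.textFile("hdfs:///data/ratings.csv")  // assumed input
      .map(_.split(","))
      .map(a => Rating(a(0).toInt, a(1).toInt, a(2).toDouble))

    val model = new ALS()
      .setRank(30)
      .setIterations(100)
      .setCheckpointInterval(5)  // checkpoint every 5 iterations instead of the default 10
      .run(ratings)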

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
Hi Sean, At the moment I am using Zeppelin with Spark SQL to get data from Hive. So any connection here using visualisation has to be through this sort of API. I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or through the Spark Thrift Server. The question is a user may want to

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-16 Thread Olivier Girardot
Hi Michael, Well, for nested structs I saw in the tests the behaviour defined by SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal for me, I managed to make it work anyway like that: > df.withColumn("a", struct(struct(myUDF(df("a.b.c." // I didn't put back the
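
A hedged reconstruction of the pattern described here, rebuilding the enclosing structs around a UDF applied to a nested field (the UDF, column names and schema are illustrative; df is assumed to have a struct column "a" containing a struct "b" with a string field "c"):

    import org.apache.spark.sql.functions.{struct, udf}

    // withColumn cannot address "a.b.c" directly (SPARK-12512),
    // so the outer structs are rebuilt explicitly around the transformed field.
    val myUDF = udf((s: String) => s.toUpperCase)

    val transformed = df.withColumn(
      "a",
      struct(struct(myUDF(df("a.b.c")).as("c")).as("b"))
    )
    // Note: any other fields under "a" or "a.b" would need to be re-listed inside struct(...).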

RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
Rdd no. File yes, using fileStream. But fileStream does not support replay, I think. You need to manage checkpointing yourself. On 16 Sep 2016 16:56, "Udbhav Agarwal" wrote: > That sounds great. Thanks. > > Can I assume that the source for a stream in Spark can only be some

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are several ways here to query the source data directly with no extra step or latency. Even Spark SQL is real-time-ish for queries on the source data, and Impala (or heck, Drill etc.) are. On Thu, Sep 15, 2016 at 10:56 PM, Mich

Re: countApprox

2016-09-16 Thread Sean Owen
countApprox gives the best answer within some timeout. Is it possible that 1ms is more than enough to count this exactly? Then the confidence wouldn't matter. Although that seems way too fast, you're counting ranges whose values don't actually matter, and maybe the Python side is smart enough to
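
For reference, a minimal use of countApprox (timeout and confidence values are arbitrary; sc is assumed from the shell):

    // countApprox returns a PartialResult; initialValue gives the current
    // BoundedDouble estimate once the timeout (in ms) elapses or the count finishes.
    val rdd = sc.parallelize(1 to 1000000, 100)
    val approx = rdd.countApprox(timeout = 1, confidence = 0.95)
    println(approx.initialValue)   // an exact, tight bound if the count finished within the timeout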

Re: very slow parquet file write

2016-09-16 Thread tosaigan...@gmail.com
Hi, try this conf:

    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)

Regards, Sai Ganesh On Thu, Sep 15, 2016 at 11:34 PM, gaurav24 [via Apache Spark User List] < ml-node+s1001560n27738...@n3.nabble.com> wrote: > Hi Rok, > > facing

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Mich Talebzadeh
Is your Hive Thrift Server up and running on port 10001 (the jdbc:hive2 port)? Do the following: netstat -alnp | grep 10001 and see whether it is actually running. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Advait Mohan Raut
Hi, I am trying to run Spark 2.0.0 in the Microsoft Windows environment without Hadoop or Hive. I am running it in local mode, i.e. cmd> spark-shell, and can run the shell. When I try to run the sample example

How PolynomialExpansion works

2016-09-16 Thread Nirav Patel
Doc says: Take a 2-variable feature vector as an example: (x, y); if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y). I know the polynomial expansion (x+y)^2 = x^2 + 2xy + y^2 but can't relate it to the above. Thanks

Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread anupama . gangadhar
Hi, I am trying to connect to Hive from a Spark application in a Kerberized cluster and get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside of Spark the connection goes through fine. Am I missing any configuration parameters? java.sql.SQLException: Could not open

Re: How PolynomialExpansion works

2016-09-16 Thread Sean Owen
The result includes, essentially, all the terms in (x+y) and (x+y)^2, and so on up if you choose a higher power. It is not just the second-degree terms. On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel wrote: > Doc says: > > Take a 2-variable feature vector as an example: (x,
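
A small sketch with the ML PolynomialExpansion transformer, showing the degree-2 expansion described above (column names are assumptions; spark is a SparkSession as in the 2.0 shell):

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.ml.linalg.Vectors

    // For (x, y) with degree 2 the output is (x, x*x, y, x*y, y*y):
    // all first- and second-degree terms, not only the squares.
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(2.0, 3.0))
    )).toDF("features")

    val poly = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)

    poly.transform(data).show(false)   // -> [2.0, 4.0, 3.0, 6.0, 9.0]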

Re: Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Jacek Laskowski
Hi Advait, It's due to https://issues.apache.org/jira/browse/SPARK-15565. See http://stackoverflow.com/a/38945867/1305344 for a solution (that's spark.sql.warehouse.dir away). Upvote if it works for you. Thanks! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering
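
A hedged sketch of the workaround, pointing spark.sql.warehouse.dir at a writable local path before the session is created (the path itself is a placeholder):

    import org.apache.spark.sql.SparkSession

    // SPARK-15565: on Windows the default warehouse path is built as an invalid URI,
    // so set it explicitly to a valid local file: URI.
    val spark = SparkSession.builder()
      .appName("windows-local")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")  // assumed path
      .getOrCreate()

    spark.range(5).toDF("id").show()   // creating a DataFrame should now succeed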

Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you split the init code with the current command? I think it is the main problem in your code. On 16 Sep 2016 at 8:26 PM, "Benjamin Kim" wrote: > Has anyone using Spark 1.6.2 encountered very slow responses from pulling > data from PostgreSQL using JDBC? I can get to

Re: Spark output data to S3 is very slow

2016-09-16 Thread Takeshi Yamamuro
Hi, Have you seen the previous thread? https://www.mail-archive.com/user@spark.apache.org/msg56791.html // maropu On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li wrote: > Hi, > > > I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very > quickly, but the last
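
One mitigation often mentioned for the slow rename out of the _temporary directory on S3 is the version-2 file output committer; a hedged sketch (sc, df and the bucket path are assumed, and whether v2 commit semantics are acceptable depends on your job):

    // Algorithm version 2 commits task output directly to the destination,
    // skipping the job-level rename pass that is very slow on S3.
    sc.hadoopConfiguration.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2)

    df.write.parquet("s3a://my-bucket/output/")   // placeholder bucket/path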

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama, To me it looks like an issue with the SPN with which you are trying to connect to hive2, i.e. hive@hostname. Are you able to connect to Hive from spark-shell? Try getting the ticket using any other user's keytab, but not the hadoop services keytab, and then try running the spark submit. Thanks

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
Thanks, that is what I was missing. I have added cache before the action, and that 2nd processing is avoided. 2016-09-10 5:10 GMT-07:00 Cody Koeninger : > Hard to say without seeing the code, but if you do multiple actions on an > RDD without caching, the RDD will be computed
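
A minimal illustration of the point being quoted (the RDD, parser and predicate are made up):

    // Without cache(), each action below recomputes the whole lineage,
    // including re-reading from the source.
    val parsed = rawRdd.map(parseRecord).cache()    // rawRdd and parseRecord are hypothetical

    val total  = parsed.count()                     // first action: computes and caches
    val errors = parsed.filter(_.isError).count()   // second action: reuses the cached data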

Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
Hi, It'd be better to set `predicates` in jdbc arguments for loading in parallel. See: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200 // maropu On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim wrote: > I
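
A hedged sketch of the predicates variant of read.jdbc, which issues one query per predicate and so pulls the table over several connections in parallel (URL, table, column ranges and credentials are all assumptions; sqlContext as in a 1.6 shell):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "postgres")    // placeholder credentials
    props.setProperty("password", "secret")

    // Each predicate becomes the WHERE clause of one partition's query.
    val predicates = Array(
      "id >= 0 AND id < 1000000",
      "id >= 1000000 AND id < 2000000",
      "id >= 2000000"
    )

    val jdbcDF = sqlContext.read.jdbc(
      "jdbc:postgresql://dbhost:5432/mydb",  // assumed URL
      "public.events",                       // assumed table
      predicates,
      props
    )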

Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by simply adding the PostgreSQL driver to the Spark Classpath. SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell Then, I run the code below to connect to the PostgreSQL database to query. This is when I have

feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-16 Thread AlexG
Do Ignite and Alluxio offer reasonable means of transferring data, in memory, from Spark to MPI? A straightforward way to transfer data is to use piping, but unless you have MPI processes running in a one-to-one mapping to the Spark partitions, this will require some complicated logic to get working

Spark output data to S3 is very slow

2016-09-16 Thread Qiang Li
Hi, I ran some jobs with Spark 2.0 on Yarn and found all tasks finished very quickly, but in the last step Spark spends a lot of time renaming or moving data from the S3 temporary directory to the real directory, so then I try to set

Re: Spark metrics when running with YARN?

2016-09-16 Thread Vladimir Tretyakov
Hello. Found that there is also a Spark metrics Sink, MetricsServlet, which is enabled by default: https://apache.googlesource.com/spark/+/refs/heads/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#40 Tried urls: On master: http://localhost:8080/metrics/master/json/

Re: Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Sorry, it wasn't the count, it was the reduce method that retrieves information from the RDD. It has to go through all the RDD values to return the result. 2016-09-16 11:18 GMT-03:00 chen yong : > Dear Dirceu, > > > I am totally confused. In your reply you mentioned "...the

Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Dear Dirceu, I am totally confused. In your reply you mentioned "...the count does that, ...". However, in the code snippet shown in the attachment file FelixProblem.png of your previous mail, I cannot find any 'count' ACTION being called. Would you please clearly show me the line it is

Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Also, I wonder what is the right way to debug a Spark program. If I use ten anonymous functions in one Spark program, then for debugging each of them I have to place a COUNT action in advance and then remove it after debugging. Is that the right way? From: Dirceu

Spark can't connect to secure phoenix

2016-09-16 Thread Ashish Gupta
Hi All, I am running a Spark program on a secured cluster which creates a SQLContext for creating a dataframe over a Phoenix table. When I run my program in local mode with the --master option set to local[2], my program works completely fine; however, when I try to run the same program with the master option set

Re: Missing output partition file in S3

2016-09-16 Thread Tracy Li
Sent from my iPhone > On Sep 15, 2016, at 1:37 PM, Chen, Kevin wrote: > > Hi, > > Has anyone encountered an issue of a missing output partition file in S3? My > spark job writes output to a S3 location. Occasionally, I noticed one > partition file is missing. As a

Hive api vs Dataset api

2016-09-16 Thread igor.berman
Hi, I wanted to understand whether there is any other advantage besides API syntax when using the Hive/table API vs. the Dataset API in Spark SQL (v2.0)? Any additional optimizations maybe? I'm most interested in Parquet partitioned tables stored on S3. Is there any difference if I'm comfortable with dataset

Re: Reply: Reply: Reply: Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Hello Felix, No, this line isn't the one that is triggering the execution of the function; the count does that, unless your count val is a lazy val. The count method is the one that retrieves the information of the RDD: it has to go through all of its data to determine how many records the RDD
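
A tiny example of the behaviour under discussion: the body of an anonymous function (and any breakpoint inside it) only runs when an action forces evaluation (sc assumed from the shell):

    val data = sc.parallelize(1 to 10)

    // Nothing executes yet: map is lazy, so a breakpoint inside the lambda is not hit.
    val doubled = data.map { x =>
      x * 2          // this line only runs once an action is invoked
    }

    // reduce (like count) is an action; it walks the data and the lambda finally executes.
    val sum = doubled.reduce(_ + _)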

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
Oh this is the netflix dataset right? I recognize it from the number of users/items. It's not fast on a laptop or anything, and takes plenty of memory, but succeeds. I haven't run this recently but it worked in Spark 1.x. On Fri, Sep 16, 2016 at 5:13 PM, Roshani Nagmote

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
I am also surprised that I face these problems with a fairly small dataset on 14 M4.2xlarge machines. Could you please let me know on which dataset you can run 100 iterations of rank 30 on your laptop? I am currently just trying to run the default example code given with Spark to run ALS on the movie

Re: App works, but executor state is "killed"

2016-09-16 Thread satishl
Any solutions for this? Spark version: 1.4.1, running in standalone mode. All my applications complete successfully but the Spark master UI shows the executors in KILLED status. Is it just a UI bug or are my executors actually KILLED? -- View this message in context:

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
Hello, Thanks for your reply. Yes, Its netflix dataset. And when I get no space on device, my ‘/mnt’ directory gets filled up. I checked. /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class org.apache.spark.examples.mllib.MovieLensALS --jars

JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data from PostgreSQL using JDBC? I can get to the table and see the schema, but when I do a show, it takes very long or keeps timing out. The code is simple. val jdbcDF = sqlContext.read.format("jdbc").options(
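
For context, a hedged completion of that read (connection details are placeholders, not taken from the mail; sqlContext as in a 1.6 shell):

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:postgresql://dbhost:5432/mydb",   // assumed
      "dbtable"  -> "public.some_table",                     // assumed
      "user"     -> "postgres",
      "password" -> "secret"
    )).load()

    jdbcDF.show()   // with a single connection and a single partition this can be slow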