Re: use kafka streams API aggregate ?

2018-01-30 Thread 郭鹏飞
Hi, today I did the same. Check your Kafka version, then follow the matching guide below. http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

use kafka streams API aggregate ?

2018-01-30 Thread 446463...@qq.com
Hi, I am new to Kafka. Today I used the Kafka Streams API for real-time data processing, and I have no idea how to approach this. Can someone help me? 446463...@qq.com

Reply: Re: use kafka streams API aggregate ?

2018-01-30 Thread 446463...@qq.com
Oh sorry, I mean just using the Kafka Streams API to do the aggregation, not depending on Spark. 446463...@qq.com From: 郭鹏飞 Sent: 2018-01-30 23:06 To: 446463...@qq.com Cc: user Subject: Re: use kafka streams API aggregate ? hi, Today I do it too. check your kafka version, then follow one of the guides

Re: mapGroupsWithState in Python

2018-01-30 Thread ayan guha
Any help would be much appreciated :) On Mon, Jan 29, 2018 at 6:25 PM, ayan guha wrote: Hi, I want to write something in Structured Streaming: 1. I have a dataset which has 3 columns: id, last_update_timestamp, attribute. 2. I am receiving the data through
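Side note: in Spark 2.x, mapGroupsWithState is only exposed in the Scala/Java Dataset API, not in PySpark. The per-key update the thread describes (keep the most recent attribute per id) can still be sketched in plain Python; the dict-based state store below is an illustration of the update logic, not Spark's actual state backend.

```python
# Plain-Python sketch of the per-key state update that
# mapGroupsWithState would perform: for each id, keep the record
# with the newest last_update_timestamp. Field names follow the
# thread (id, last_update_timestamp, attribute).
def update_state(state, records):
    for rec in records:
        rec_id = rec["id"]
        current = state.get(rec_id)
        if current is None or rec["last_update_timestamp"] > current["last_update_timestamp"]:
            state[rec_id] = rec
    return state
```

In real Structured Streaming this logic would live inside the state-update function passed to mapGroupsWithState (Scala/Java only in 2.x); the records argument corresponds to the values arriving for one key in a trigger.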

ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-30 Thread michelleyang
I tried to use One vs Rest in Spark ML with a pipeline and crossValidator for multinomial logistic regression. It came out with empty coefficients. I figured out it was the setting of ParamGridBuilder. Can anyone help me understand how the parameter setting affects the crossValidator

spark job error

2018-01-30 Thread shyla deshpande
I am running Zeppelin on EMR with the default settings and am getting the following error. Restarting the Zeppelin application fixes the problem. Which default settings do I need to override to fix this error? org.apache.spark.SparkException: Job aborted due to stage failure: Task 71

[Doubt] GridSearch for Hyperparameter Tuning in Spark

2018-01-30 Thread Aakash Basu
Hi, Is there any available PySpark ML or MLlib API for grid search similar to GridSearchCV from sklearn's model_selection? I found this - https://spark.apache.org/docs/2.2.0/ml-tuning.html, but it covers cross-validation and train-validation for hyperparameter tuning, not pure grid search. Any help?
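For what it's worth, CrossValidator's ParamGridBuilder already performs an exhaustive grid search: it enumerates the Cartesian product of the supplied parameter values before cross-validating, much like GridSearchCV does. A minimal plain-Python sketch of that expansion (the parameter names and values here are made up for illustration):

```python
import itertools

def param_grid(grid):
    # Expand {param_name: [values]} into the full Cartesian product
    # of parameter maps, the same enumeration ParamGridBuilder and
    # GridSearchCV perform before fitting each combination.
    names = sorted(grid)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(grid[n] for n in names))]

# hypothetical hyperparameter values for illustration
maps = param_grid({"regParam": [0.01, 0.1], "maxIter": [50, 100]})
```

Evaluating each map on the full training set (rather than on CV folds) would give a "pure" grid search; Spark's TrainValidationSplit with a single split is the closest built-in approximation.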

spark.sql.adaptive.enabled has no effect

2018-01-30 Thread 张万新
Hi there, As far as I know, when *spark.sql.adaptive.enabled* is set to true, the number of post shuffle partitions should change with the map output size. But in my application there is a stage reading 900GB shuffled files only with 200 partitions (which is the default number of
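For context, in Spark 2.x the adaptive post-shuffle partition count is driven by spark.sql.adaptive.shuffle.targetPostShuffleInputSize (64 MB by default): roughly one partition per target-sized chunk of shuffle output. A rough plain-Python model of that sizing, assuming the 64 MB default:

```python
import math

def post_shuffle_partitions(total_shuffle_bytes, target_bytes=64 * 1024**2):
    # Rough model of how Spark 2.x adaptive execution picks the
    # post-shuffle partition count: enough partitions that each one
    # reads about targetPostShuffleInputSize bytes (64 MB default).
    return max(1, math.ceil(total_shuffle_bytes / target_bytes))

# 900 GB of shuffle data at the 64 MB default target
n = post_shuffle_partitions(900 * 1024**3)
```

By this model 900 GB of shuffle data should yield about 14400 partitions, so a stage stuck at the default 200 suggests adaptive execution is not being applied to that particular exchange at all, rather than the target size being mis-set.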

Data Integration with Chinese Social Media Sites

2018-01-30 Thread Sanjay Kulkarni
Hi All, One of the requirements in our project is to integrate data from Chinese social media platforms - WeChat, Weibo, Baidu (Search). Has anyone done this using Spark? Are there any pre-built connectors available, and what challenges are involved in setting this up? I am sure there could be multiple

[Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-30 Thread LongVehicle
Hello everyone, We are running Spark Streaming jobs in Spark 2.1 in cluster mode in YARN. We have an RDD (3GB) that we periodically (every 30min) refresh by reading from HDFS. Namely, we create a DataFrame df using sqlContext.read.parquet, and then we create RDD rdd = df.as[T].rdd. The

Issue with Cast in Spark Sql

2018-01-30 Thread Arnav kumar
Hi Experts, I am trying to convert a string with a decimal value to decimal in Spark SQL and load it into Hive/SQL Server. In Hive, instead of being converted to decimal, all my values come out as null. In SQL Server, the decimal values come through without precision. Can you please

why groupByKey still shuffle if SQL does "Distribute By" on same columns ?

2018-01-30 Thread Dibyendu Bhattacharya
Hi, I am trying something like this: val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX] The select statement is something like this: "select * from sometable DISTRIBUTE BY col1, col2, col3" Then comes groupByKey: val gpbyDS = sesDS.groupByKey(x => (x.col1, x.col2, x.col3))

Re: Issue with Cast in Spark Sql

2018-01-30 Thread naresh Goud
Spark/Hive converts a decimal to null if we specify more precision than is available in the file. The example below gives you the details. I am not sure why it converts to null. Note: you need to trim the string before casting to decimal. Table data with col1 and col2 columns val r =
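The null-on-overflow behavior described above can be mimicked in plain Python with the decimal module. This is an analogy to Spark SQL's CAST(trim(s) AS DECIMAL(p, s)), not its actual implementation: the string is trimmed first, the value is rounded to the declared scale, and anything exceeding the declared precision becomes None (where Spark returns null).

```python
from decimal import Decimal, InvalidOperation

def cast_decimal(s, precision=10, scale=2):
    # Mimic CAST(trim(s) AS DECIMAL(precision, scale)): trim the
    # string, round to `scale` fractional digits, and return None
    # (Spark's null) when the value needs more total digits than
    # `precision` allows or the string is not numeric.
    try:
        d = Decimal(s.strip()).quantize(Decimal(1).scaleb(-scale))
    except (InvalidOperation, AttributeError):
        return None
    t = d.as_tuple()
    if len(t.digits) > precision:
        return None
    return d
```

This is why an untrimmed string or an undersized DECIMAL(p, s) declaration both surface as nulls in Hive: the cast silently fails rather than erroring out.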

Re: spark job error

2018-01-30 Thread Jacek Laskowski
Hi, Start with spark.executor.memory 2g. You may also give spark.yarn.executor.memoryOverhead a try. See https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-yarn.html for more in-depth information. Regards, Jacek Laskowski
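As a sketch, the two settings mentioned above might look like this in spark-defaults.conf (or as --conf flags on spark-submit); the 2g and 512 MB values are illustrative starting points, not tuned numbers:

```
# spark-defaults.conf (values are examples, tune for your workload)
spark.executor.memory              2g
spark.yarn.executor.memoryOverhead 512
```

On EMR these can also be applied through the cluster's Spark configuration classification so Zeppelin's interpreter picks them up on restart.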

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-30 Thread Bryan Cutler
Hi Michelle, Your original usage of ParamGridBuilder was not quite right, `addGrid` expects (some parameter, array of values for that parameter). If you want to do a grid search with different regularization values, you would do the following: val paramMaps = new

Spark Structured Streaming for Twitter Streaming data

2018-01-30 Thread Divya Gehlot
Hi, I am exploring Spark Structured Streaming. When I turned to the internet to understand it, I found it is more integrated with Kafka or other streaming tools like Kinesis. I couldn't find where we can use the Spark Streaming API for Twitter streaming data. I would be grateful if anybody has used
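Worth noting: Structured Streaming ships built-in sources for files, Kafka, and TCP sockets, but no Twitter source (the old TwitterUtils is DStream-based). One common workaround is to bridge tweets into Kafka or a plain socket and read that. A minimal stdlib-only sketch of the socket side; the line content and port handling are illustrative, and a real pipeline would write tweet text into this socket from a Twitter client such as Tweepy:

```python
import socket
import threading

def serve_lines(lines, host="127.0.0.1", port=0):
    # Serve newline-delimited text over TCP -- the format the
    # "socket" Structured Streaming source reads. Returns the port
    # actually bound (port=0 asks the OS for a free one).
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def _run():
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        srv.close()

    threading.Thread(target=_run, daemon=True).start()
    return bound_port
```

Spark would then consume it with something like spark.readStream.format("socket").option("host", "127.0.0.1").option("port", bound_port).load(), each line arriving as one row.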