Re:Re: Re: How to change output mode to Update

2016-05-17 Thread Todd
Hi Sachin,

Could you please give the url of jira-15146? Thanks!







At 2016-05-18 13:33:47, "Sachin Aggarwal"  wrote:


Hi, there is some code I have added in jira-15146; please have a look at it. I
have not finished it yet. You can use the same code in your example for now.

On 18-May-2016 10:46 AM, "Saisai Shao"  wrote:

> .mode(SaveMode.Overwrite)


From my understanding mode is not supported in continuous query.


def mode(saveMode: SaveMode): DataFrameWriter = {
  // mode() is used for non-continuous queries
  // outputMode() is used for continuous queries
  assertNotStreaming("mode() can only be called on non-continuous queries")
  this.mode = saveMode
  this
}


On Wed, May 18, 2016 at 12:25 PM, Todd  wrote:

Thanks Ted.

I didn't try, but I think SaveMode and OutputMode are different things.
Currently, the Spark code contains two output modes, Append and Update. Append
is the default mode, but it looks like there is no way to change to Update.

Take a look at DataFrameWriter#startQuery

Thanks.








At 2016-05-18 12:10:11, "Ted Yu"  wrote:

Have you tried adding:


.mode(SaveMode.Overwrite)



On Tue, May 17, 2016 at 8:55 PM, Todd  wrote:

scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 
seconds")).option("checkpointLocation", 
"file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult")
org.apache.spark.sql.AnalysisException: Aggregations are not supported on 
streaming DataFrames/Datasets in Append output mode. Consider changing output 
mode to Update.;
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:142)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:59)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:46)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:46)
  at 
org.apache.spark.sql.ContinuousQueryManager.startQuery(ContinuousQueryManager.scala:190)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:351)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:279)



I briefly went through the Spark code, and it looks like there is no way to change the
output mode to Update?
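SPARK-15146, referred to above as jira-15146, proposes adding an outputMode() setter next to mode(). A speculative sketch of what the streaming write from the quoted example might then look like; the outputMode() call is an assumption taken from that JIRA and is not available in any released Spark at the time of this thread:

// speculative sketch only -- outputMode() is assumed from SPARK-15146, not a released API here
records.groupBy("name").count()
  .write
  .outputMode("update")
  .trigger(ProcessingTime("30 seconds"))
  .option("checkpointLocation", "file:///home/hadoop/jsoncheckpoint")
  .startStream("file:///home/hadoop/jsonresult")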






Re: duplicate jar problem in yarn-cluster mode

2016-05-17 Thread Saisai Shao
I think it is already fixed if your problem is exactly the same as what
mentioned in this JIRA (https://issues.apache.org/jira/browse/SPARK-14423).

Thanks
Jerry

On Wed, May 18, 2016 at 2:46 AM, satish saley 
wrote:

> Hello,
> I am executing a simple code with yarn-cluster
>
> --master
> yarn-cluster
> --name
> Spark-FileCopy
> --class
> my.example.SparkFileCopy
> --properties-file
> spark-defaults.conf
> --queue
> saleyq
> --executor-memory
> 1G
> --driver-memory
> 1G
> --conf
> spark.john.snow.is.back=true
> --jars
> hdfs://myclusternn.com:8020/tmp/saley/examples/examples-new.jar
> --conf
> spark.executor.extraClassPath=examples-new.jar
> --conf
> spark.driver.extraClassPath=examples-new.jar
> --verbose
> examples-new.jar
> hdfs://myclusternn.com:8020/tmp/saley/examples/input-data/text/data.txt
> hdfs://myclusternn.com:8020/tmp/saley/examples/output-data/spark
>
>
> I am facing
>
> Resource hdfs://
> myclusternn.com/user/saley/.sparkStaging/application_5181/examples-new.jar
> changed on src filesystem (expected 1463440119942, was 1463440119989
> java.io.IOException: Resource hdfs://
> myclusternn.com/user/saley/.sparkStaging/application_5181/examples-new.jar
> changed on src filesystem (expected 1463440119942, was 1463440119989
>
> I see a jira for this
> https://issues.apache.org/jira/browse/SPARK-1921
>
> Is this yet to be fixed or fixed as part of another jira and need some
> additional config?
>
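A possible workaround sketch, assuming the jar content is identical in both locations: let the HDFS copy of examples-new.jar be the application jar itself and drop the extra --jars and extraClassPath entries, so the same file is not staged twice with different timestamps (see SPARK-14423 above for the underlying fix):

spark-submit \
  --master yarn-cluster \
  --name Spark-FileCopy \
  --class my.example.SparkFileCopy \
  --properties-file spark-defaults.conf \
  --queue saleyq \
  --executor-memory 1G \
  --driver-memory 1G \
  --conf spark.john.snow.is.back=true \
  --verbose \
  hdfs://myclusternn.com:8020/tmp/saley/examples/examples-new.jar \
  hdfs://myclusternn.com:8020/tmp/saley/examples/input-data/text/data.txt \
  hdfs://myclusternn.com:8020/tmp/saley/examples/output-data/spark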


Re: Load Table as DataFrame

2016-05-17 Thread Jörn Franke
Do you have the full source code? Why do you convert a DataFrame to an RDD? That
does not make sense to me.

> On 18 May 2016, at 06:13, Mohanraj Ragupathiraj  wrote:
> 
> I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million 
> rows. From the DataFrame I created an RDD of JavaBean and use it for joining 
> with data from a file.
> 
>   Map phoenixInfoMap = new HashMap();
>   phoenixInfoMap.put("table", tableName);
>   phoenixInfoMap.put("zkUrl", zkURL);
>   DataFrame df = 
> sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
>   JavaRDD tableRows = df.toJavaRDD();
>   JavaPairRDD dbData = tableRows.mapToPair(
>   new PairFunction()
>   {
>   @Override
>   public Tuple2 call(Row row) throws Exception
>   {
>   return new Tuple2(row.getAs("ID"), 
> row.getAs("NAME"));
>   }
>   });
>  
> Now my question - Lets say the file has 2 unique million entries matching 
> with the table. Is the entire table loaded into memory as RDD or only the 
> matching 2 million records from the table will be loaded into memory as RDD ?
> 
> 
> http://stackoverflow.com/questions/37289849/phoenix-spark-load-table-as-dataframe
> 
> -- 
> Thanks and Regards
> Mohan
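A minimal sketch, in Scala, of the alternative being hinted at: keep the Phoenix table as a DataFrame and join it directly with the file data, so Spark SQL handles column pruning and the join instead of a hand-built pair RDD (fileDF and its "ID" column are assumptions; how much is pruned at the source still depends on the Phoenix connector's pushdown support):

val tableDF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> tableName, "zkUrl" -> zkURL))
  .load()
  .select("ID", "NAME")                  // only the columns actually needed

// fileDF is an assumed DataFrame built from the file, with a matching "ID" column
val matched = fileDF.join(tableDF, "ID") // inner join keeps only the matching IDs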


Re: Accessing Cassandra data from Spark Shell

2016-05-17 Thread Cassa L
Hi,
I followed instructions to run SparkShell with Spark-1.6. It works fine.
However, I need to use spark-1.5.2 version. With it, it does not work. I
keep getting NoSuchMethod Errors. Is there any issue running Spark Shell
for Cassandra using older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
wrote:

> Yes, it is very simple to access Cassandra data using Spark shell.
>
>
>
> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>
> $SPARK_HOME/bin/spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>
>
>
> Step 2: Create a DataFrame pointing to your Cassandra table
>
> val dfCassTable = sqlContext.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>   .load()
>
>
>
> From this point onward, you have complete access to the DataFrame API. You
> can even register it as a temporary table, if you would prefer to use
> SQL/HiveQL.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Monday, May 9, 2016 9:28 PM
> *To:* u...@cassandra.apache.org; user
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> You can use SparkShell to access Cassandra via the Spark Cassandra
> connector. The getting started article on our support page will probably
> give you a good steer to get started even if you’re not using Instaclustr:
> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>
>
>
> Cheers
>
> Ben
>
>
>
> On Tue, 10 May 2016 at 14:08 Cassa L  wrote:
>
> Hi,
>
> Has anyone tried accessing Cassandra data using SparkShell? How do you do
> it? Can you use HiveContext for Cassandra data? I'm using community version
> of Cassandra-3.0
>
>
>
> Thanks,
>
> LCassa
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>
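A short follow-on sketch of the temporary-table route mentioned in step 2 above (the table name "cass_table" is just a placeholder):

dfCassTable.registerTempTable("cass_table")
sqlContext.sql("SELECT count(*) FROM cass_table").show()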


Re: Re: How to change output mode to Update

2016-05-17 Thread Sachin Aggarwal
Hi, there is some code I have added in jira-15146; please have a look at it.
I have not finished it yet. You can use the same code in your example for now.
On 18-May-2016 10:46 AM, "Saisai Shao"  wrote:

> > .mode(SaveMode.Overwrite)
>
> From my understanding mode is not supported in continuous query.
>
> def mode(saveMode: SaveMode): DataFrameWriter = {
>   // mode() is used for non-continuous queries
>   // outputMode() is used for continuous queries
>   assertNotStreaming("mode() can only be called on non-continuous queries")
>   this.mode = saveMode
>   this
> }
>
>
> On Wed, May 18, 2016 at 12:25 PM, Todd  wrote:
>
>> Thanks Ted.
>>
>> I didn't try, but I think SaveMode and OuputMode are different things.
>> Currently, the spark code contain two output mode, Append and Update.
>> Append is the default mode,but looks that there is no way to change to
>> Update.
>>
>> Take a look at DataFrameWriter#startQuery
>>
>> Thanks.
>>
>>
>>
>>
>>
>>
>> At 2016-05-18 12:10:11, "Ted Yu"  wrote:
>>
>> Have you tried adding:
>>
>> .mode(SaveMode.Overwrite)
>>
>> On Tue, May 17, 2016 at 8:55 PM, Todd  wrote:
>>
>>> scala> records.groupBy("name").count().write.trigger(ProcessingTime("30
>>> seconds")).option("checkpointLocation",
>>> "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult")
>>> org.apache.spark.sql.AnalysisException: Aggregations are not supported
>>> on streaming DataFrames/Datasets in Append output mode. Consider changing
>>> output mode to Update.;
>>>   at
>>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:142)
>>>   at
>>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:59)
>>>   at
>>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:46)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>>>   at
>>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:46)
>>>   at
>>> org.apache.spark.sql.ContinuousQueryManager.startQuery(ContinuousQueryManager.scala:190)
>>>   at
>>> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:351)
>>>   at
>>> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:279)
>>>
>>>
>>> I brief the spark code, looks like there is no way to change output mode
>>> to Update?
>>>
>>
>>
>


Re: Re: How to change output mode to Update

2016-05-17 Thread Saisai Shao
> .mode(SaveMode.Overwrite)

From my understanding mode is not supported in continuous query.

def mode(saveMode: SaveMode): DataFrameWriter = {
  // mode() is used for non-continuous queries
  // outputMode() is used for continuous queries
  assertNotStreaming("mode() can only be called on non-continuous queries")
  this.mode = saveMode
  this
}


On Wed, May 18, 2016 at 12:25 PM, Todd  wrote:

> Thanks Ted.
>
> I didn't try, but I think SaveMode and OuputMode are different things.
> Currently, the spark code contain two output mode, Append and Update.
> Append is the default mode,but looks that there is no way to change to
> Update.
>
> Take a look at DataFrameWriter#startQuery
>
> Thanks.
>
>
>
>
>
>
> At 2016-05-18 12:10:11, "Ted Yu"  wrote:
>
> Have you tried adding:
>
> .mode(SaveMode.Overwrite)
>
> On Tue, May 17, 2016 at 8:55 PM, Todd  wrote:
>
>> scala> records.groupBy("name").count().write.trigger(ProcessingTime("30
>> seconds")).option("checkpointLocation",
>> "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult")
>> org.apache.spark.sql.AnalysisException: Aggregations are not supported on
>> streaming DataFrames/Datasets in Append output mode. Consider changing
>> output mode to Update.;
>>   at
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:142)
>>   at
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:59)
>>   at
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:46)
>>   at
>> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>>   at
>> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:46)
>>   at
>> org.apache.spark.sql.ContinuousQueryManager.startQuery(ContinuousQueryManager.scala:190)
>>   at
>> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:351)
>>   at
>> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:279)
>>
>>
>> I brief the spark code, looks like there is no way to change output mode
>> to Update?
>>
>
>


Can Pyspark access Scala API?

2016-05-17 Thread Abi
Can PySpark access the Scala API? The accumulator in PySpark does not have the local
value available. The Scala API does have it available.

SPARK - DataFrame for BulkLoad

2016-05-17 Thread Mohanraj Ragupathiraj
I have 100 million records to be inserted into an HBase table (PHOENIX) as a
result of a Spark job. I would like to know: if I convert it to a DataFrame
and save it, will it do a bulk load, or is that not an efficient way to write
data to an HBase table?

-- 
Thanks and Regards
Mohan
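A minimal sketch of the DataFrame save path (option names mirror the phoenix-spark read example elsewhere in this digest; verify against the Phoenix documentation for your version). Note that saving through the phoenix-spark data source issues UPSERTs via Phoenix rather than writing HFiles, so it is not a bulk load in the HBase HFile sense:

import org.apache.spark.sql.SaveMode

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> "TARGET_TABLE", "zkUrl" -> zkUrl))  // table name and zkUrl are placeholders
  .save()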


Load Table as DataFrame

2016-05-17 Thread Mohanraj Ragupathiraj
I have created a DataFrame from a HBase Table (PHOENIX) which has 500
million rows. From the DataFrame I created an RDD of JavaBean and use it
for joining with data from a file.

Map phoenixInfoMap = new HashMap();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df =
sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD tableRows = df.toJavaRDD();
JavaPairRDD dbData = tableRows.mapToPair(
new PairFunction()
{
@Override
public Tuple2 call(Row row) throws Exception
{
return new Tuple2(row.getAs("ID"), row.getAs("NAME"));
}
});

Now my question - Lets say the file has 2 unique million entries matching
with the table. Is the entire table loaded into memory as RDD or only the
matching 2 million records from the table will be loaded into memory as RDD
?


http://stackoverflow.com/questions/37289849/phoenix-spark-load-table-as-dataframe

-- 
Thanks and Regards
Mohan


Re: How to change output mode to Update

2016-05-17 Thread Ted Yu
Have you tried adding:

.mode(SaveMode.Overwrite)

On Tue, May 17, 2016 at 8:55 PM, Todd  wrote:

> scala> records.groupBy("name").count().write.trigger(ProcessingTime("30
> seconds")).option("checkpointLocation",
> "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult")
> org.apache.spark.sql.AnalysisException: Aggregations are not supported on
> streaming DataFrames/Datasets in Append output mode. Consider changing
> output mode to Update.;
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:142)
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:59)
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:46)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>   at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:46)
>   at
> org.apache.spark.sql.ContinuousQueryManager.startQuery(ContinuousQueryManager.scala:190)
>   at
> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:351)
>   at
> org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:279)
>
>
> I brief the spark code, looks like there is no way to change output mode
> to Update?
>



Re: Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Gajanan Satone
Thanks for sharing,

Please consider me.

Thanks,
Gajanan

On Wed, May 18, 2016 at 8:34 AM, 谭成灶  wrote:

> Thanks for your sharing!
> Please include me too
> --
> From: Mich Talebzadeh 
> Sent: 2016/5/18 5:16
> To: user @spark 
> Subject: Re: My notes on Spark Performance & Tuning Guide
>
> Hi all,
>
> Many thanks for your tremendous interest in the forthcoming notes. I have
> had nearly thirty requests and many supporting kind words from the
> colleagues in this forum.
>
> I will strive to get the first draft ready as soon as possible. Apologies
> for not being more specific. However, hopefully not too long for your
> perusal.
>
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 May 2016 at 11:08, Mich Talebzadeh 
> wrote:
>
> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>


How to change output mode to Update

2016-05-17 Thread Todd
scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 
seconds")).option("checkpointLocation", 
"file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult")
org.apache.spark.sql.AnalysisException: Aggregations are not supported on 
streaming DataFrames/Datasets in Append output mode. Consider changing output 
mode to Update.;
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:142)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:59)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForStreaming$1.apply(UnsupportedOperationChecker.scala:46)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:46)
  at 
org.apache.spark.sql.ContinuousQueryManager.startQuery(ContinuousQueryManager.scala:190)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:351)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:279)



I brief the spark code, looks like there is no way to change output mode to 
Update?


Re: How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Saisai Shao
It is not supported now; currently only the file stream source is supported.

Thanks
Jerry

On Wed, May 18, 2016 at 10:14 AM, Todd  wrote:

> Hi,
> I am wondering whether structured streaming supports Kafka as data source.
> I brief the source code(meanly related with DataSourceRegister trait), and
> didn't find kafka data source things
> If
> Thanks.
>
>
>
>
>


Re: SparkR query

2016-05-17 Thread Sun Rui
I guess that you are using an old version of Spark, 1.4.

please try Spark version 1.5+

> On May 17, 2016, at 18:42, Mike Lewis  wrote:
> 
> Thanks, I'm just using RStudio. Running locally is fine; the issue is just that
> the cluster is on Linux and the workers are looking for a Windows path,
> which must be being passed through by the driver, I guess. I checked the
> spark-env.sh on each node and the appropriate SPARK_HOME is set
> correctly….
>  
>  
> From: Sun Rui [mailto:sunrise_...@163.com] 
> Sent: 17 May 2016 11:32
> To: Mike Lewis
> Cc: user@spark.apache.org
> Subject: Re: SparkR query
>  
> Lewis,
> 1. Could you check the values of “SPARK_HOME” environment on all of your 
> worker nodes?
> 2. How did you start your SparkR shell?
>  
> On May 17, 2016, at 18:07, Mike Lewis  > wrote:
>  
> Hi,
>  
> I have a SparkR driver process that connects to a master running on Linux,
> I’ve tried to do a simple test, e.g.
>  
> sc <- sparkR.init(master="spark://my-linux-host.dev.local:7077",
>   sparkEnvir=list(spark.cores.max="4"))
> x <- SparkR:::parallelize(sc,1:100,2)
> y <- count(x)
>  
> But I can see that the worker nodes are failing, they are looking for the 
> Windows (rather than linux path) to
> Daemon.R
>  
> 16/05/17 06:52:33 INFO BufferedStreamThread: Fatal error: cannot open file 
> 'E:/Dev/spark/spark-1.4.0-bin-hadoop2.6/R/lib/SparkR/worker/daemon.R': No 
> such file or directory
>  
> Is this a configuration setting that I’m missing, the worker nodes (linux) 
> shouldn’t be looking in the spark home of the driver (windows) ?
> If so, I’d appreciate someone letting me know what I need to change/set.
> 
> Thanks,
> Mike Lewis
>  
> 
> --
>  
> This email has been sent to you on behalf of Nephila Advisors LLC 
> (“Advisors”). Advisors provides consultancy services to Nephila Capital Ltd. 
> (“Capital”), an investment advisor managed and carrying on business in 
> Bermuda. Advisors and its employees do not act as agents for Capital or the 
> funds it advises and do not have the authority to bind Capital or such funds 
> to any transaction or agreement. 
> 
> The information in this e-mail, and any attachment therein, is confidential 
> and for use by the addressee only. Any use, disclosure, reproduction, 
> modification or distribution of the contents of this e-mail, or any part 
> thereof, other than by the intended recipient, is strictly prohibited. If you 
> are not the intended recipient, please return the e-mail to the sender and 
> delete it from your computer. This email is for information purposes only, 
> nothing contained herein constitutes an offer to sell or buy securities, as 
> such an offer may only be made from a properly authorized offering document. 
> Although Nephila attempts to sweep e-mail and attachments for viruses, it 
> does not guarantee that either are virus-free and accepts no liability for 
> any damage sustained as a result of viruses. 
> --

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread 谭成灶
Thanks for your sharing!
Please include me too

From: Mich Talebzadeh
Sent: 2016/5/18 5:16
To: user @spark
Subject: Re: My notes on Spark Performance & Tuning Guide

Hi all,

Many thanks for your tremendous interest in the forthcoming notes. I have
had nearly thirty requests and many supporting kind words from the
colleagues in this forum.

I will strive to get the first draft ready as soon as possible. Apologies
for not being more specific. However, hopefully not too long for your
perusal.


Regards,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 12 May 2016 at 11:08, Mich Talebzadeh  wrote:

> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
I think you can write it in GitBook and share it on the user mailing list; then
everyone can comment on it.

On Wed, May 18, 2016 at 10:12 AM, Vinayak Agrawal <
vinayakagrawa...@gmail.com> wrote:

> Please include me too.
>
> Vinayak Agrawal
> Big Data Analytics
> IBM
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>
> On May 17, 2016, at 2:15 PM, Mich Talebzadeh 
> wrote:
>
> Hi all,
>
> Many thanks for your tremendous interest in the forthcoming notes. I have
> had nearly thirty requests and many supporting kind words from the
> colleagues in this forum.
>
> I will strive to get the first draft ready as soon as possible. Apologies
> for not being more specific. However, hopefully not too long for your
> perusal.
>
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 May 2016 at 11:08, Mich Talebzadeh 
> wrote:
>
>> Hi Al,,
>>
>>
>> Following the threads in spark forum, I decided to write up on
>> configuration of Spark including allocation of resources and configuration
>> of driver, executors, threads, execution of Spark apps and general
>> troubleshooting taking into account the allocation of resources for Spark
>> applications and OS tools at the disposal.
>>
>> Since the most widespread configuration as I notice is with "Spark
>> Standalone Mode", I have decided to write these notes starting with
>> Standalone and later on moving to Yarn
>>
>>
>>-
>>
>>*Standalone *– a simple cluster manager included with Spark that
>>makes it easy to set up a cluster.
>>-
>>
>>*YARN* – the resource manager in Hadoop 2.
>>
>>
>> I would appreciate if anyone interested in reading and commenting to get
>> in touch with me directly on mich.talebza...@gmail.com so I can send the
>> write-up for their review and comments.
>>
>>
>> Just to be clear this is not meant to be any commercial proposition or
>> anything like that. As I seem to get involved with members troubleshooting
>> issues and threads on this topic, I thought it is worthwhile writing a note
>> about it to summarise the findings for the benefit of the community.
>>
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


-- 
Best Regards

Jeff Zhang


How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Todd
Hi,
I am wondering whether structured streaming supports Kafka as a data source. I
briefly went through the source code (mainly the parts related to the
DataSourceRegister trait) and didn't find any Kafka data source.
Thanks.



 


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Vinayak Agrawal
Please include me too. 

Vinayak Agrawal
Big Data Analytics
IBM

"To Strive, To Seek, To Find and Not to Yield!"
~Lord Alfred Tennyson

> On May 17, 2016, at 2:15 PM, Mich Talebzadeh  
> wrote:
> 
> Hi all,
> 
> Many thanks for your tremendous interest in the forthcoming notes. I have had 
> nearly thirty requests and many supporting kind words from the colleagues in 
> this forum.
> 
> I will strive to get the first draft ready as soon as possible. Apologies for 
> not being more specific. However, hopefully not too long for your perusal.
> 
> 
> Regards,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 12 May 2016 at 11:08, Mich Talebzadeh  wrote:
>> Hi Al,,
>> 
>> 
>> Following the threads in spark forum, I decided to write up on configuration 
>> of Spark including allocation of resources and configuration of driver, 
>> executors, threads, execution of Spark apps and general troubleshooting 
>> taking into account the allocation of resources for Spark applications and 
>> OS tools at the disposal.
>> 
>> Since the most widespread configuration as I notice is with "Spark 
>> Standalone Mode", I have decided to write these notes starting with 
>> Standalone and later on moving to Yarn
>> 
>> Standalone – a simple cluster manager included with Spark that makes it easy 
>> to set up a cluster.
>> YARN – the resource manager in Hadoop 2.
>> 
>> I would appreciate if anyone interested in reading and commenting to get in 
>> touch with me directly on mich.talebza...@gmail.com so I can send the 
>> write-up for their review and comments.
>> 
>> Just to be clear this is not meant to be any commercial proposition or 
>> anything like that. As I seem to get involved with members troubleshooting 
>> issues and threads on this topic, I thought it is worthwhile writing a note 
>> about it to summarise the findings for the benefit of the community.
>> 
>> Regards.
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
> 


Re:Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Thank you guys for the help. I will try.






At 2016-05-18 07:17:08, "Mich Talebzadeh"  wrote:

Thanks Chris,


In a nutshell I don't think one can do that.


So let us see.  Here is my program that is looking for share prices > 95.9. It 
does work. It is pretty simple


import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.mutable.ArrayBuffer
//
object CEP_AVG {
  def main(args: Array[String]) {
// Create a local StreamingContext with two working thread and batch interval 
of n seconds.
val sparkConf = new SparkConf().
 setAppName("CEP_AVG").
 setMaster("local[2]").
 set("spark.cores.max", "2").
 set("spark.streaming.concurrentJobs", "2").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(sparkConf)
  // Create sqlContext based on HiveContext
  val sqlContext = new HiveContext(sc)

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val kafkaParams = Map[String, String]("bootstrap.servers" -> "rhes564:9092", 
"schema.registry.url" -> "http://rhes564:8081;, "zookeeper.connect" -> 
"rhes564:2181", "group.id" -> "CEP_AVG" )
val topics = Set("newtopic")


val DStream = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, topics)
DStream.cache()
val lines = DStream.map(_._2)
val price = lines.map(_.split(',').view(2)).map(_.toDouble)

val windowLength = 4
val slidingInterval = 2
val countByValueAndWindow = price.filter(_ > 
95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
countByValueAndWindow.print()

//
//Now how I can get the distinct price values here?
//
//val countDistinctByValueAndWindow = price.filter(_ > 0.0).reduceByKey((t1, 
t2) -> t1).countByValueAndWindow(Seconds(windowLength), 
Seconds(slidingInterval))
//countDistinctByValueAndWindow.print()

ssc.start()
ssc.awaitTermination()
//ssc.stop()
  }
}


Ok What can be used here below


//val countDistinctByValueAndWindow = price.filter(_ > 0.0).reduceByKey((t1, 
t2) -> t1).countByValueAndWindow(Seconds(windowLength), 
Seconds(slidingInterval))
//countDistinctByValueAndWindow.print()


Let me know your thoughts?


Thanks







Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 17 May 2016 at 23:47, Chris Fregly  wrote:

you can use HyperLogLog with Spark Streaming to accomplish this.


here is an example from my fluxcapacitor GitHub repo:


https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx


here's an accompanying SlideShare presentation from one of my recent meetups 
(slides 70-83):


http://www.slideshare.net/cfregly/spark-after-dark-20-apache-big-data-conf-vancouver-may-11-2016-61970037



and a YouTube video for those that prefer video (starting at 32 mins into the 
video for your convenience):


https://youtu.be/wM9Z0PLx3cw?t=1922




On Tue, May 17, 2016 at 12:17 PM, Mich Talebzadeh  
wrote:

Ok but how about something similar to


val countByValueAndWindow = price.filter(_ > 
95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))





Using a new count => countDistinctByValueAndWindow ?


val countDistinctByValueAndWindow = price.filter(_ > 
95.0).countDistinctByValueAndWindow(Seconds(windowLength), 
Seconds(slidingInterval))




HTH



Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 17 May 2016 at 20:02, Michael Armbrust  wrote:

In 2.0 you won't be able to do this.  The long term vision would be to make 
this possible, but a window will be required (like the 24 hours you suggest).


On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:

Hi,
We have a requirement to do count(distinct) in a processing batch against all 
the streaming data(eg, last 24 hours' data),that is,when we do 
count(distinct),we actually want to compute distinct against last 24 hours' 
data.
Does structured streaming support this scenario?Thanks!









--

Chris Fregly
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.com
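A minimal sketch, not from the thread, of one way to answer the commented-out question in the code above with the DStream API: window first, deduplicate inside each windowed RDD, then count (price, windowLength and slidingInterval are the values from Mich's example):

val distinctPerWindow = price
  .filter(_ > 95.0)
  .window(Seconds(windowLength), Seconds(slidingInterval))
  .transform(_.distinct())   // deduplicate within each window
  .count()                   // DStream[Long]: number of distinct prices per window
distinctPerWindow.print()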



Why spark 1.6.1 workers auto stopped and can not register with master: Worker registration failed: Duplicate worker ID?

2016-05-17 Thread sunday2000
Hi,
  
   A client worker auto-stopped and has this error message. Do you know why this
happens?
  
 16/05/18 03:34:37 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/18 03:34:37 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/18 03:34:37 INFO Worker: Master with url spark://master:7077 requested 
this worker to reconnect.
16/05/18 03:34:37 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
16/05/18 03:34:37 ERROR Worker: Worker registration failed: Duplicate worker ID

How to run hive queries in async mode using spark sql

2016-05-17 Thread Raju Bairishetti
I am using Spark SQL for running Hive queries as well. Is there any way to run
Hive queries in async mode using Spark SQL?

Does it return any Hive handle, and if so, how do I get the results from that
handle using Spark SQL?

-- 
Thanks,
Raju Bairishetti,

www.lazada.com
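A minimal sketch of one way to get asynchronous behaviour. This is an assumption about the approach, not a Spark SQL feature: SQLContext has no HiveServer2-style operation handle, so the usual pattern is to run the blocking call on a separate thread pool and keep a Future as the handle (sqlContext and the table name are placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

// returns immediately; the Future completes when the query finishes
def runAsync(query: String): Future[Array[org.apache.spark.sql.Row]] =
  Future { sqlContext.sql(query).collect() }

val handle = runAsync("SELECT count(*) FROM some_hive_table")  // placeholder table name
handle.foreach(rows => println(rows.mkString("\n")))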


Re: What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

2016-05-17 Thread nguyen duc tuan
There's no RowSimilarity method in the RowMatrix class. You have to transpose
your matrix to use that method. However, when the number of rows is large,
this approach is still very slow.
Try to use approximate nearest neighbor (ANN) methods instead, such as LSH.
There are several implementations of LSH on Spark that you can find on GitHub.
For example: https://github.com/karlhigley/spark-neighbors.

Another option: you can use ANN libraries on a single machine. There's a
good benchmark of ANN libraries here:
https://github.com/erikbern/ann-benchmarks

2016-05-17 23:24 GMT+07:00 Rex X :

> Each row of the given matrix is Vector[Double]. Want to find out the
> nearest neighbor row to each row using cosine similarity.
>
> The problem here is the complexity: O( 10^20 )
>
> We need to do *blocking*, and do the row-wise comparison within each
> block. Any tips for best practice?
>
> In Spark, we have RowMatrix.ColumnSimilarity, but I didn't find
> RowSimilarity method.
>
>
> Thank you.
>
>
> Regards
> Rex
>
>
>
>
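A minimal random-hyperplane LSH blocking sketch, given as an illustration rather than the spark-neighbors API (the RDD of (id, vector) rows is an assumption). Rows sharing a signature land in the same bucket, and only pairs within a bucket need an exact cosine comparison:

import scala.util.Random
import org.apache.spark.rdd.RDD

def lshBuckets(rows: RDD[(Long, Array[Double])], numPlanes: Int, dim: Int, seed: Long = 42L)
    : RDD[(String, Iterable[(Long, Array[Double])])] = {
  val rand = new Random(seed)
  // one random hyperplane per signature bit; shipped to executors via the closure
  val planes = Array.fill(numPlanes, dim)(rand.nextGaussian())
  rows.map { case (id, v) =>
    val signature = planes.map { p =>
      val dot = p.zip(v).map { case (a, b) => a * b }.sum
      if (dot >= 0) '1' else '0'
    }.mkString
    (signature, (id, v))
  }.groupByKey()
}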


Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-17 Thread Ramaswamy, Muthuraman
Thank you for the input.

Apparently, I was referring to incorrect Schema Registry Server. Once the 
correct Schema Registry Server IP is used, serializer worked for me.

Thanks again,

~Muthu

From: Jan Uyttenhove >
Reply-To: "j...@insidin.com" 
>
Date: Tuesday, May 17, 2016 at 3:18 AM
To: "Ramaswamy, Muthuraman" 
>
Cc: spark users >
Subject: Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent 
Serializers as Value Decoder.

I think that if the Confluent deserializer cannot fetch the schema for the avro 
message (key and/or value), you end up with no data. You should check the logs 
of the Schemaregistry, it should show the HTTP requests it receives so you can 
check if the deserializer can connect to it and if so, what the response code 
looks like.

If you use the Confluent serializer, each avro message is first serialized and 
afterwards the schema id is added to it. This way, the Confluent deserializer 
can fetch the schema id first and use it to lookup the schema in the 
Schemaregistry.


On Tue, May 17, 2016 at 2:19 AM, Ramaswamy, Muthuraman 
> wrote:
Yes, I can see the messages. Also, I wrote a quick custom decoder for avro and 
it works fine for the following:

>> kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
>> brokers}, valueDecoder=decoder)

But, when I use the Confluent Serializers to leverage the Schema Registry 
(based on the link shown below), it doesn’t work for me. I am not sure whether 
I need to configure any more details to consume the Schema Registry. I can 
fetch the schema from the schema registry based on is Ids. The decoder method 
is not returning any values for me.

~Muthu



On 5/16/16, 10:49 AM, "Cody Koeninger" 
> wrote:

>Have you checked to make sure you can receive messages just using a
>byte array for value?
>
>On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>> 
>wrote:
>> I am trying to consume AVRO formatted message through
>> KafkaUtils.createDirectStream. I followed the listed below example (refer
>> link) but the messages are not being fetched by the Stream.
>>
>> http://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser
>>
>> Is there any code missing that I must add to make the above sample work.
>> Say, I am not sure how the confluent serializers would know the avro schema
>> info as it knows only the Schema Registry URL info.
>>
>> Appreciate your help.
>>
>> ~Muthu
>>
>>
>>



--
Jan Uyttenhove
Streaming data & digital solutions architect @ Insidin bvba

j...@insidin.com
+32 474 56 24 39

https://twitter.com/xorto
https://www.linkedin.com/in/januyttenhove

This e-mail and any files transmitted with it are intended solely for the use 
of the individual or entity to whom they are addressed. It may contain 
privileged and confidential information. If you are not the intended recipient 
please notify the sender immediately and destroy this e-mail. Any form of 
reproduction, dissemination, copying, disclosure, modification, distribution 
and/or publication of this e-mail message is strictly prohibited. Whilst all 
efforts are made to safeguard e-mails, the sender cannot guarantee that 
attachments are virus free or compatible with your systems and does not accept 
liability in respect of viruses or computer problems experienced.
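For reference, a minimal sketch of the Confluent wire format Jan describes above: one magic byte, a 4-byte schema id, then the Avro payload. It only splits the envelope; decoding the payload still requires fetching the writer schema for that id from the Schema Registry. Written in Scala purely as an illustration:

import java.nio.ByteBuffer

def splitConfluentEnvelope(bytes: Array[Byte]): (Int, Array[Byte]) = {
  val buf = ByteBuffer.wrap(bytes)
  val magic = buf.get()
  require(magic == 0, s"unexpected magic byte: $magic")
  val schemaId = buf.getInt()                  // id to look up in the Schema Registry
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)                             // remaining bytes are the Avro-encoded record
  (schemaId, payload)
}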


Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
Thanks Chris,

In a nutshell I don't think one can do that.

So let us see.  Here is my program that is looking for share prices > 95.9.
It does work. It is pretty simple

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.mutable.ArrayBuffer
//
object CEP_AVG {
  def main(args: Array[String]) {
// Create a local StreamingContext with two working thread and batch
interval of n seconds.
val sparkConf = new SparkConf().
 setAppName("CEP_AVG").
 setMaster("local[2]").
 set("spark.cores.max", "2").
 set("spark.streaming.concurrentJobs", "2").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(sparkConf)
  // Create sqlContext based on HiveContext
  val sqlContext = new HiveContext(sc)

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val kafkaParams = Map[String, String]("bootstrap.servers" ->
"rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
"zookeeper.connect" -> "rhes564:2181", "group.id" -> "CEP_AVG" )
val topics = Set("newtopic")

val DStream = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](ssc, kafkaParams, topics)
DStream.cache()
val lines = DStream.map(_._2)
val price = lines.map(_.split(',').view(2)).map(_.toDouble)

val windowLength = 4
val slidingInterval = 2


val countByValueAndWindow = price.filter(_ >
95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
countByValueAndWindow.print()
//
//Now how I can get the distinct price values here?
//
//val countDistinctByValueAndWindow = price.filter(_ >
0.0).reduceByKey((t1, t2) ->
t1).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
//countDistinctByValueAndWindow.print()

ssc.start()
ssc.awaitTermination()
//ssc.stop()
  }
}
Ok What can be used here below

//val countDistinctByValueAndWindow = price.filter(_ >
0.0).reduceByKey((t1, t2) ->
t1).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
//countDistinctByValueAndWindow.print()

Let me know your thoughts?

Thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 May 2016 at 23:47, Chris Fregly  wrote:

> you can use HyperLogLog with Spark Streaming to accomplish this.
>
> here is an example from my fluxcapacitor GitHub repo:
>
>
> https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx
>
> here's an accompanying SlideShare presentation from one of my recent
> meetups (slides 70-83):
>
>
> http://www.slideshare.net/cfregly/spark-after-dark-20-apache-big-data-conf-vancouver-may-11-2016-61970037
>
>
> 
> and a YouTube video for those that prefer video (starting at 32 mins into
> the video for your convenience):
>
> https://youtu.be/wM9Z0PLx3cw?t=1922
>
>
> On Tue, May 17, 2016 at 12:17 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Ok but how about something similar to
>>
>> val countByValueAndWindow = price.filter(_ >
>> 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
>>
>>
>> Using a new count => countDistinctByValueAndWindow?
>>
>> val countDistinctByValueAndWindow = price.filter(_ >
>> 95.0).countDistinctByValueAndWindow(Seconds(windowLength),
>> Seconds(slidingInterval))
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 May 2016 at 20:02, Michael Armbrust  wrote:
>>
>>> In 2.0 you won't be able to do this.  The long term vision would be to
>>> make this possible, but a window will be required (like the 24 hours you
>>> suggest).
>>>
>>> On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:
>>>
 Hi,
 We have a requirement to do count(distinct) in a processing batch
 against all the streaming data(eg, last 24 hours' data),that is,when we do
 count(distinct),we actually want to compute distinct against last 24 hours'
 data.
 Does structured streaming support this scenario? Thanks!

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
you can use HyperLogLog with Spark Streaming to accomplish this.

here is an example from my fluxcapacitor GitHub repo:

https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx

here's an accompanying SlideShare presentation from one of my recent
meetups (slides 70-83):

http://www.slideshare.net/cfregly/spark-after-dark-20-apache-big-data-conf-vancouver-may-11-2016-61970037


and a YouTube video for those that prefer video (starting at 32 mins into
the video for your convenience):

https://youtu.be/wM9Z0PLx3cw?t=1922


On Tue, May 17, 2016 at 12:17 PM, Mich Talebzadeh  wrote:

> Ok but how about something similar to
>
> val countByValueAndWindow = price.filter(_ >
> 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
>
>
> Using a new count => countDistinctByValueAndWindow?
>
> val countDistinctByValueAndWindow = price.filter(_ >
> 95.0).countDistinctByValueAndWindow(Seconds(windowLength),
> Seconds(slidingInterval))
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 May 2016 at 20:02, Michael Armbrust  wrote:
>
>> In 2.0 you won't be able to do this.  The long term vision would be to
>> make this possible, but a window will be required (like the 24 hours you
>> suggest).
>>
>> On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:
>>
>>> Hi,
>>> We have a requirement to do count(distinct) in a processing batch
>>> against all the streaming data(eg, last 24 hours' data),that is,when we do
>>> count(distinct),we actually want to compute distinct against last 24 hours'
>>> data.
>>> Does structured streaming support this scenario?Thanks!
>>>
>>
>>
>

-- 
*Chris Fregly*
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.com
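A minimal sketch of the approximate route Chris points to, using the HyperLogLog-based RDD.countApproxDistinct inside each window (price, windowLength and slidingInterval are assumed from Mich's messages earlier in the thread):

val approxDistinctPerWindow = price
  .filter(_ > 95.0)
  .window(Seconds(windowLength), Seconds(slidingInterval))
  .transform(rdd => rdd.context.parallelize(Seq(rdd.countApproxDistinct(relativeSD = 0.01))))
approxDistinctPerWindow.print()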


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Abi
Please include me too

On May 12, 2016 6:08:14 AM EDT, Mich Talebzadeh  
wrote:
>Hi Al,,
>
>
>Following the threads in spark forum, I decided to write up on
>configuration of Spark including allocation of resources and
>configuration
>of driver, executors, threads, execution of Spark apps and general
>troubleshooting taking into account the allocation of resources for
>Spark
>applications and OS tools at the disposal.
>
>Since the most widespread configuration as I notice is with "Spark
>Standalone Mode", I have decided to write these notes starting with
>Standalone and later on moving to Yarn
>
>
>   -
>
> *Standalone *– a simple cluster manager included with Spark that makes
>   it easy to set up a cluster.
>   -
>
>   *YARN* – the resource manager in Hadoop 2.
>
>
>I would appreciate if anyone interested in reading and commenting to
>get in
>touch with me directly on mich.talebza...@gmail.com so I can send the
>write-up for their review and comments.
>
>
>Just to be clear this is not meant to be any commercial proposition or
>anything like that. As I seem to get involved with members
>troubleshooting
>issues and threads on this topic, I thought it is worthwhile writing a
>note
>about it to summarise the findings for the benefit of the community.
>
>
>Regards.
>
>
>Dr Mich Talebzadeh
>
>
>
>LinkedIn *
>https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>*
>
>
>
>http://talebzadehmich.wordpress.com


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Cesar Flores
Please send it to me too!


Thanks ! ! !


Cesar Flores

On Tue, May 17, 2016 at 4:55 PM, Femi Anthony  wrote:

> Please send it to me as well.
>
> Thanks
>
> Sent from my iPhone
>
> On May 17, 2016, at 12:09 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
> Can you please send me as well.
>
> Thanks
> Raghav
> On 12 May 2016 20:02, "Tom Ellis"  wrote:
>
>> I would like to also Mich, please send it through, thanks!
>>
>> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>>
>>> Me too, send me the guide.
>>>
>>> Sent from my iPhone
>>>
>>> On 12 May 2016, at 12:11, Ashok Kumar wrote:
>>>
>>> Hi Dr Mich,
>>>
>>> I will be very keen to have a look at it and review if possible.
>>>
>>> Please forward me a copy
>>>
>>> Thanking you warmly
>>>
>>>
>>> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> Hi Al,,
>>>
>>>
>>> Following the threads in spark forum, I decided to write up on
>>> configuration of Spark including allocation of resources and configuration
>>> of driver, executors, threads, execution of Spark apps and general
>>> troubleshooting taking into account the allocation of resources for Spark
>>> applications and OS tools at the disposal.
>>>
>>> Since the most widespread configuration as I notice is with "Spark
>>> Standalone Mode", I have decided to write these notes starting with
>>> Standalone and later on moving to Yarn
>>>
>>>
>>>- *Standalone *– a simple cluster manager included with Spark that
>>>makes it easy to set up a cluster.
>>>- *YARN* – the resource manager in Hadoop 2.
>>>
>>>
>>> I would appreciate if anyone interested in reading and commenting to get
>>> in touch with me directly on mich.talebza...@gmail.com so I can send
>>> the write-up for their review and comments.
>>>
>>> Just to be clear this is not meant to be any commercial proposition or
>>> anything like that. As I seem to get involved with members troubleshooting
>>> issues and threads on this topic, I thought it is worthwhile writing a note
>>> about it to summarise the findings for the benefit of the community.
>>>
>>> Regards.
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>


-- 
Cesar Flores


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Femi Anthony
Please send it to me as well.

Thanks

Sent from my iPhone

> On May 17, 2016, at 12:09 PM, Raghavendra Pandey 
>  wrote:
> 
> Can you please send me as well.
> 
> Thanks 
> Raghav
> 
>> On 12 May 2016 20:02, "Tom Ellis"  wrote:
>> I would like to also Mich, please send it through, thanks!
>> 
>>> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>>> Me too, send me the guide.
>>> 
>>> Sent from my iPhone
>>> 
 On 12 May 2016, at 12:11, Ashok Kumar wrote:
 
 Hi Dr Mich,
 
 I will be very keen to have a look at it and review if possible.
 
 Please forward me a copy
 
 Thanking you warmly
 
 
 On Thursday, 12 May 2016, 11:08, Mich Talebzadeh 
  wrote:
 
 
 Hi Al,,
 
 
 Following the threads in spark forum, I decided to write up on 
 configuration of Spark including allocation of resources and configuration 
 of driver, executors, threads, execution of Spark apps and general 
 troubleshooting taking into account the allocation of resources for Spark 
 applications and OS tools at the disposal.
 
 Since the most widespread configuration as I notice is with "Spark 
 Standalone Mode", I have decided to write these notes starting with 
 Standalone and later on moving to Yarn
 
 Standalone – a simple cluster manager included with Spark that makes it 
 easy to set up a cluster.
 YARN – the resource manager in Hadoop 2.
 
 I would appreciate if anyone interested in reading and commenting to get 
 in touch with me directly on mich.talebza...@gmail.com so I can send the 
 write-up for their review and comments.
 
 Just to be clear this is not meant to be any commercial proposition or 
 anything like that. As I seem to get involved with members troubleshooting 
 issues and threads on this topic, I thought it is worthwhile writing a 
 note about it to summarise the findings for the benefit of the community.
 
 Regards.
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com


Re: Error joining dataframes

2016-05-17 Thread Mich Talebzadeh
pretty simple, a similar construct to tables projected as DF

val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
val rs = s.join(t,"time_id").join(c,"channel_id")

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 May 2016 at 21:52, Bijay Kumar Pathak  wrote:

> Hi,
>
> Try this one:
>
>
> df_join = df1.*join*(df2, 'Id', "fullouter")
>
> Thanks,
> Bijay
>
>
> On Tue, May 17, 2016 at 9:39 AM, ram kumar 
> wrote:
>
>> Hi,
>>
>> I tried to join two dataframe
>>
>> df_join = df1.*join*(df2, ((df1("Id") === df2("Id")), "fullouter")
>>
>> df_join.registerTempTable("join_test")
>>
>>
>> When querying "Id" from "join_test"
>>
>> 0: jdbc:hive2://> *select Id from join_test;*
>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>> 0: jdbc:hive2://>
>>
>> Is there a way to merge the value of df1("Id") and df2("Id") into one "Id"
>>
>> Thanks
>>
>
>
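For reference, a minimal sketch of the difference between the two join forms (df1, df2 and the Id column are taken from the question; everything else is illustrative). Joining on the column name, as in the reply quoted above, is what collapses the two Id columns into one:

// Expression join: the result keeps both df1("Id") and df2("Id"), hence the
// "Reference 'Id' is ambiguous" error when selecting Id afterwards.
val byExpression = df1.join(df2, df1("Id") === df2("Id"), "fullouter")

// Joining on the column name(s) keeps a single Id column in the output
// (the Seq-based overload exists in recent Spark versions, 1.6 and later).
val byName = df1.join(df2, Seq("Id"), "fullouter")
byName.registerTempTable("join_test")
// sqlContext.sql("select Id from join_test") now resolves without ambiguity.

Note that for outer joins the value of the shared Id column may be taken from the left side depending on the Spark version, so it is worth checking the result on a small sample.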


Re: Error joining dataframes

2016-05-17 Thread Bijay Kumar Pathak
Hi,

Try this one:


df_join = df1.*join*(df2, 'Id', "fullouter")

Thanks,
Bijay


On Tue, May 17, 2016 at 9:39 AM, ram kumar  wrote:

> Hi,
>
> I tried to join two dataframe
>
> df_join = df1.*join*(df2, ((df1("Id") === df2("Id")), "fullouter")
>
> df_join.registerTempTable("join_test")
>
>
> When querying "Id" from "join_test"
>
> 0: jdbc:hive2://> *select Id from join_test;*
> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
> 0: jdbc:hive2://>
>
> Is there a way to merge the value of df1("Id") and df2("Id") into one "Id"
>
> Thanks
>


Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
Ok but how about something similar to

val countByValueAndWindow = price.filter(_ >
95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))


Using a new count => *countDistinctByValueAndWindow*?

val countDistinctByValueAndWindow = price.filter(_ >
95.0).countDistinctByValueAndWindow(Seconds(windowLength),
Seconds(slidingInterval))


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 May 2016 at 20:02, Michael Armbrust  wrote:

> In 2.0 you won't be able to do this.  The long term vision would be to
> make this possible, but a window will be required (like the 24 hours you
> suggest).
>
> On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:
>
>> Hi,
>> We have a requirement to do count(distinct) in a processing batch against
>> all the streaming data(eg, last 24 hours' data),that is,when we do
>> count(distinct),we actually want to compute distinct against last 24 hours'
>> data.
>> Does structured streaming support this scenario?Thanks!
>>
>
>
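For completeness, the closest equivalent that works with the existing DStream API is to apply distinct() inside a windowed transform; a rough sketch, reusing price, windowLength and slidingInterval from the snippet above (countDistinctByValueAndWindow itself is only a proposed name, not an existing method):

import org.apache.spark.streaming.Seconds

// Distinct values are computed per window via transform(), then counted.
val distinctCountByWindow = price
  .filter(_ > 95.0)
  .window(Seconds(windowLength), Seconds(slidingInterval))
  .transform(rdd => rdd.distinct())
  .count()

distinctCountByWindow.print()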


Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hmm. I see. Yes, I guess that won't work then.

I don't understand what you are proposing about UDFRegistration. I only see
methods that take tuples of various sizes (1 .. 22).

On Tue, May 17, 2016 at 1:00 PM, Michael Armbrust 
wrote:

> I don't think that you will be able to do that.  ScalaReflection is based
> on the TypeTag of the object, and thus the schema of any particular object
> won't be available to it.
>
> Instead I think you want to use the register functions in UDFRegistration
> that take a schema. Does that make sense?
>
> On Tue, May 17, 2016 at 11:48 AM, Andy Grove 
> wrote:
>
>>
>> Hi,
>>
>> I have a requirement to create types dynamically in Spark and then
>> instantiate those types from Spark SQL via a UDF.
>>
>> I tried doing the following:
>>
>> val addressType = StructType(List(
>>   new StructField("state", DataTypes.StringType),
>>   new StructField("zipcode", DataTypes.IntegerType)
>> ))
>>
>> sqlContext.udf.register("Address", (args: Seq[Any]) => new
>> GenericRowWithSchema(args.toArray, addressType))
>>
>> sqlContext.sql("SELECT Address('NY', 12345)").show(10)
>>
>> This seems reasonable to me but this fails with:
>>
>> Exception in thread "main" java.lang.UnsupportedOperationException:
>> Schema for type
>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is not
>> supported
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
>> at
>> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
>> at
>> org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)
>>
>> It looks like it would be simple to update ScalaReflection to be able to
>> infer the schema from a GenericRowWithSchema, but before I file a JIRA and
>> submit a patch I wanted to see if there is already a way of achieving this.
>>
>> Thanks,
>>
>> Andy.
>>
>>
>>
>
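For reference, one possible reading of the UDFRegistration suggestion is the Java-style register(name, UDFn, returnType) overloads, which take an explicit DataType instead of inferring it from a TypeTag. A sketch only, untested against the exact scenario above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF2
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

val addressType = StructType(List(
  StructField("state", DataTypes.StringType),
  StructField("zipcode", DataTypes.IntegerType)
))

// The UDF1..UDF22 overloads accept the return DataType explicitly, so no schema
// needs to be inferred from GenericRowWithSchema itself.
sqlContext.udf.register("Address",
  new UDF2[String, Int, Row] {
    override def call(state: String, zipcode: Int): Row =
      new GenericRowWithSchema(Array(state, zipcode), addressType)
  },
  addressType)

sqlContext.sql("SELECT Address('NY', 12345)").show()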


Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this.  The long term vision would be to make
this possible, but a window will be required (like the 24 hours you
suggest).

On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:

> Hi,
> We have a requirement to do count(distinct) in a processing batch against
> all the streaming data(eg, last 24 hours' data),that is,when we do
> count(distinct),we actually want to compute distinct against last 24 hours'
> data.
> Does structured streaming support this scenario?Thanks!
>


Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Michael Armbrust
I don't think that you will be able to do that.  ScalaReflection is based
on the TypeTag of the object, and thus the schema of any particular object
won't be available to it.

Instead I think you want to use the register functions in UDFRegistration
that take a schema. Does that make sense?

On Tue, May 17, 2016 at 11:48 AM, Andy Grove 
wrote:

>
> Hi,
>
> I have a requirement to create types dynamically in Spark and then
> instantiate those types from Spark SQL via a UDF.
>
> I tried doing the following:
>
> val addressType = StructType(List(
>   new StructField("state", DataTypes.StringType),
>   new StructField("zipcode", DataTypes.IntegerType)
> ))
>
> sqlContext.udf.register("Address", (args: Seq[Any]) => new
> GenericRowWithSchema(args.toArray, addressType))
>
> sqlContext.sql("SELECT Address('NY', 12345)").show(10)
>
> This seems reasonable to me but this fails with:
>
> Exception in thread "main" java.lang.UnsupportedOperationException: Schema
> for type org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is
> not supported
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
> at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)
>
> It looks like it would be simple to update ScalaReflection to be able to
> infer the schema from a GenericRowWithSchema, but before I file a JIRA and
> submit a patch I wanted to see if there is already a way of achieving this.
>
> Thanks,
>
> Andy.
>
>
>


Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hi,

I have a requirement to create types dynamically in Spark and then
instantiate those types from Spark SQL via a UDF.

I tried doing the following:

val addressType = StructType(List(
  new StructField("state", DataTypes.StringType),
  new StructField("zipcode", DataTypes.IntegerType)
))

sqlContext.udf.register("Address", (args: Seq[Any]) => new
GenericRowWithSchema(args.toArray, addressType))

sqlContext.sql("SELECT Address('NY', 12345)").show(10)

This seems reasonable to me but this fails with:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema
for type org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is
not supported
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)

It looks like it would be simple to update ScalaReflection to be able to
infer the schema from a GenericRowWithSchema, but before I file a JIRA and
submit a patch I wanted to see if there is already a way of achieving this.

Thanks,

Andy.


duplicate jar problem in yarn-cluster mode

2016-05-17 Thread satish saley
Hello,
I am executing a simple code with yarn-cluster

--master
yarn-cluster
--name
Spark-FileCopy
--class
my.example.SparkFileCopy
--properties-file
spark-defaults.conf
--queue
saleyq
--executor-memory
1G
--driver-memory
1G
--conf
spark.john.snow.is.back=true
--jars
hdfs://myclusternn.com:8020/tmp/saley/examples/examples-new.jar
--conf
spark.executor.extraClassPath=examples-new.jar
--conf
spark.driver.extraClassPath=examples-new.jar
--verbose
examples-new.jar
hdfs://myclusternn.com:8020/tmp/saley/examples/input-data/text/data.txt
hdfs://myclusternn.com:8020/tmp/saley/examples/output-data/spark


I am facing

Resource hdfs://
myclusternn.com/user/saley/.sparkStaging/application_5181/examples-new.jar
changed on src filesystem (expected 1463440119942, was 1463440119989
java.io.IOException: Resource hdfs://
myclusternn.com/user/saley/.sparkStaging/application_5181/examples-new.jar
changed on src filesystem (expected 1463440119942, was 1463440119989

I see a jira for this
https://issues.apache.org/jira/browse/SPARK-1921

Is this yet to be fixed or fixed as part of another jira and need some
additional config?


Pls Assist: error when creating cluster on AWS using spark's ec2 scripts

2016-05-17 Thread Marco Mistroni
Hi
I was wondering if anyone can assist here...
I am trying to create a spark cluster on AWS using scripts located in
spark-1.6.1/ec2 directory

When the spark_ec2.py script tries to do an rsync to copy directories over
to the AWS
master node, it fails miserably with this stack trace:

DEBUG:spark ecd logger:Issuing command..:['rsync', '-rv', '-e', 'ssh -o
StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i
ec2AccessKey.pem', 'c:/tmp-spark/',
u'r...@ec2-54-218-75-130.us-west-2.compute.amazonaws.com:/']
Traceback (most recent call last):
  File "./spark_ec2.py", line 1545, in 
main()
  File "./spark_ec2.py", line 1536, in main
real_main()
  File "./spark_ec2.py", line 1371, in real_main
setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "./spark_ec2.py", line 849, in setup_cluster
modules=modules
  File "./spark_ec2.py", line 1133, in deploy_files
subprocess.check_call(command)
  File "C:\Python27\lib\subprocess.py", line 535, in check_call
retcode = call(*popenargs, **kwargs)
  File "C:\Python27\lib\subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

here's my take on what's happening:
1. the spark_ec2.py script downloads some scripts from a git repo to a temporary
directory (from what I can see, there's only 1 file in the temp directory,
root\spark-ec2\ec2-variables.sh)
2. the spark_ec2.py script tries to copy the downloaded files over to AWS


the error happens at this line (roughly line 1130), while invoking this
command

DEBUG:spark ecd logger:Issuing command..:['rsync', '-rv', '-e', 'ssh -o
StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i
ec2AccessKey.pem', 'c:/tmp-spark/',
u'r...@ec2-54-218-75-130.us-west-2.compute.amazonaws.com:/']


subprocess.check_call(command)

what am I missing? perhaps an rsync executable?
as far as I can see, the current status of my cluster is that Spark is not
installed.


could anyone help?
thanks
Marco


Re: Silly Question on my part...

2016-05-17 Thread Gene Pang
Hi Michael,

Yes, you can use Alluxio to share Spark RDDs. Here is a blog post about
getting started with Spark and Alluxio (
http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/),
and some documentation (
http://alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html).

I hope that helps,
Gene

On Tue, May 17, 2016 at 8:36 AM, Michael Segel 
wrote:

> Thanks for the response.
>
> That’s what I thought, but I didn’t want to assume anything.
> (You know what happens when you ass u me … :-)
>
>
> Not sure about Tachyon though.  Its a thought, but I’m very conservative
> when it comes to design choices.
>
>
> On May 16, 2016, at 5:21 PM, John Trengrove 
> wrote:
>
> If you are wanting to share RDDs it might be a good idea to check out
> Tachyon / Alluxio.
>
> For the Thrift server, I believe the datasets are located in your Spark
> cluster as RDDs and you just communicate with it via the Thrift
> JDBC Distributed Query Engine connector.
>
> 2016-05-17 5:12 GMT+10:00 Michael Segel :
>
>> For one use case.. we were considering using the thrift server as a way
>> to allow multiple clients access shared RDDs.
>>
>> Within the Thrift Context, we create an RDD and expose it as a hive table.
>>
>> The question  is… where does the RDD exist. On the Thrift service node
>> itself, or is that just a reference to the RDD which is contained with
>> contexts on the cluster?
>>
>>
>> Thx
>>
>> -Mike
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
>
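For a concrete picture, sharing through Alluxio usually amounts to persisting the RDD to an alluxio:// path from one application and reading it back from another (the Alluxio client jar must be on the classpath). A minimal sketch with a hypothetical master host and paths:

// Application A materializes an RDD into Alluxio:
val data = sc.parallelize(1 to 1000000)
data.saveAsTextFile("alluxio://alluxio-master:19998/shared/prices")

// Application B (a separate SparkContext) reads the same data back:
val shared = sc.textFile("alluxio://alluxio-master:19998/shared/prices")
println(shared.count())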


Error joining dataframes

2016-05-17 Thread ram kumar
Hi,

I tried to join two dataframes:

df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")

df_join.registerTempTable("join_test")


When querying "Id" from "join_test":

0: jdbc:hive2://> select Id from join_test;
Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is
ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
0: jdbc:hive2://>

Is there a way to merge the values of df1("Id") and df2("Id") into one "Id" column?

Thanks


Re: yarn-cluster mode error

2016-05-17 Thread Sandeep Nemuri
Can you post the complete stack trace ?

On Tue, May 17, 2016 at 7:00 PM,  wrote:

> Hi,
>
> i am getting error below while running application on yarn-cluster mode.
>
> *ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM*
>
> Anyone can suggest why i am getting this error message?
>
> Thanks
> Raj
>
>
>
>
> Sent from Yahoo Mail. Get the app 
>



-- 
*  Regards*
*  Sandeep Nemuri*


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread rakesh sharma
It would be a rare doc. Please share

Get Outlook for Android



On Tue, May 17, 2016 at 9:14 AM -0700, "Natu Lauchande" 
> wrote:

Hi Mich,

I am also interested in the write up.

Regards,
Natu

On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh 
> wrote:
Hi Al,,


Following the threads in spark forum, I decided to write up on configuration of 
Spark including allocation of resources and configuration of driver, executors, 
threads, execution of Spark apps and general troubleshooting taking into 
account the allocation of resources for Spark applications and OS tools at the 
disposal.

Since the most widespread configuration as I notice is with "Spark Standalone 
Mode", I have decided to write these notes starting with Standalone and later 
on moving to Yarn


  *   Standalone - a simple cluster manager included with Spark that makes it 
easy to set up a cluster.

  *   YARN - the resource manager in Hadoop 2.


I would appreciate if anyone interested in reading and commenting to get in 
touch with me directly on 
mich.talebza...@gmail.com so I can send the 
write-up for their review and comments.


Just to be clear this is not meant to be any commercial proposition or anything 
like that. As I seem to get involved with members troubleshooting issues and 
threads on this topic, I thought it is worthwhile writing a note about it to 
summarise the findings for the benefit of the community.


Regards.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

2016-05-17 Thread Rex X
Each row of the given matrix is a Vector[Double]. I want to find the
nearest neighbor row to each row using cosine similarity.

The problem here is the complexity: O( 10^20 )

We need to do *blocking*, and do the row-wise comparison within each block.
Any tips for best practice?

In Spark, we have RowMatrix.columnSimilarities(), but I didn't find a
corresponding *RowSimilarity* method.


Thank you.


Regards
Rex
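One common way to do the blocking, sketched below purely as an illustration: hash each row with a handful of random hyperplanes (an LSH family for cosine similarity), group rows by signature, and only compare pairs inside the same bucket. rows: RDD[(Long, Vector)] is assumed.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import scala.util.Random

def blockBySignature(rows: RDD[(Long, Vector)], dim: Int, numPlanes: Int = 16)
    : RDD[(Int, Iterable[(Long, Vector)])] = {
  val rand = new Random(42)
  // Random hyperplanes; rows on the same side of all planes share a signature.
  val planes = Array.fill(numPlanes)(Array.fill(dim)(rand.nextGaussian()))
  val bcPlanes = rows.context.broadcast(planes)

  def signature(v: Vector): Int = {
    var sig = 0
    for (i <- 0 until numPlanes) {
      var dot = 0.0
      for (j <- 0 until dim) dot += v(j) * bcPlanes.value(i)(j)
      if (dot >= 0) sig |= (1 << i)
    }
    sig
  }

  rows.map { case (id, v) => (signature(v), (id, v)) }.groupByKey()
}

// The O(n^2) cosine comparison is then restricted to rows within each bucket;
// more planes give smaller buckets at the cost of more missed neighbours.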


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Natu Lauchande
Hi Mich,

I am also interested in the write up.

Regards,
Natu

On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh  wrote:

> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Raghavendra Pandey
Can you please send me as well.

Thanks
Raghav
On 12 May 2016 20:02, "Tom Ellis"  wrote:

> I would like to also Mich, please send it through, thanks!
>
> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>
>> Me too, send me the guide.
>>
>> Enviado desde mi iPhone
>>
>> El 12 may 2016, a las 12:11, Ashok Kumar > > escribió:
>>
>> Hi Dr Mich,
>>
>> I will be very keen to have a look at it and review if possible.
>>
>> Please forward me a copy
>>
>> Thanking you warmly
>>
>>
>> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi Al,,
>>
>>
>> Following the threads in spark forum, I decided to write up on
>> configuration of Spark including allocation of resources and configuration
>> of driver, executors, threads, execution of Spark apps and general
>> troubleshooting taking into account the allocation of resources for Spark
>> applications and OS tools at the disposal.
>>
>> Since the most widespread configuration as I notice is with "Spark
>> Standalone Mode", I have decided to write these notes starting with
>> Standalone and later on moving to Yarn
>>
>>
>>- *Standalone *– a simple cluster manager included with Spark that
>>makes it easy to set up a cluster.
>>- *YARN* – the resource manager in Hadoop 2.
>>
>>
>> I would appreciate if anyone interested in reading and commenting to get
>> in touch with me directly on mich.talebza...@gmail.com so I can send the
>> write-up for their review and comments.
>>
>> Just to be clear this is not meant to be any commercial proposition or
>> anything like that. As I seem to get involved with members troubleshooting
>> issues and threads on this topic, I thought it is worthwhile writing a note
>> about it to summarise the findings for the benefit of the community.
>>
>> Regards.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>


Re: Executor memory requirement for reduceByKey

2016-05-17 Thread Raghavendra Pandey
Even though it does not sound intuitive, reduceByKey expects all of the values
for a particular key within a partition to be loaded into memory. So once you
increase the number of partitions you can run the jobs.
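A minimal illustration of the partition knob (pairs is a hypothetical RDD[(String, Long)]):

// Passing an explicit partition count to reduceByKey spreads the shuffle output
// over more, smaller tasks:
val counts = pairs.reduceByKey(_ + _, 400)

// Equivalently, repartition the input before aggregating:
val counts2 = pairs.repartition(400).reduceByKey(_ + _)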


Re: Silly Question on my part...

2016-05-17 Thread Dood

On 5/16/2016 12:12 PM, Michael Segel wrote:

For one use case.. we were considering using the thrift server as a way to 
allow multiple clients access shared RDDs.

Within the Thrift Context, we create an RDD and expose it as a hive table.

The question  is… where does the RDD exist. On the Thrift service node itself, 
or is that just a reference to the RDD which is contained with contexts on the 
cluster?



You can share RDDs using Apache Ignite - it is a distributed memory 
grid/cache with tons of additional functionality. The advantage is extra 
resilience (you can mirror caches or just partition them), you can query 
the contents of the caches in standard SQL etc. Since the caches persist 
past the existence of the Spark app, you can share them (obviously). You 
also get read/write through to SQL or NoSQL databases on the back end 
for persistence and loading/dumping caches to secondary storage. It is 
written in Java so very easy to use from Scala/Spark apps.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
Thanks for the response. 

That’s what I thought, but I didn’t want to assume anything. 
(You know what happens when you ass u me … :-) 


Not sure about Tachyon though. It’s a thought, but I’m very conservative when 
it comes to design choices. 


> On May 16, 2016, at 5:21 PM, John Trengrove  
> wrote:
> 
> If you are wanting to share RDDs it might be a good idea to check out Tachyon 
> / Alluxio.
> 
> For the Thrift server, I believe the datasets are located in your Spark 
> cluster as RDDs and you just communicate with it via the Thrift JDBC 
> Distributed Query Engine connector.
> 
> 2016-05-17 5:12 GMT+10:00 Michael Segel  >:
> For one use case.. we were considering using the thrift server as a way to 
> allow multiple clients access shared RDDs.
> 
> Within the Thrift Context, we create an RDD and expose it as a hive table.
> 
> The question  is… where does the RDD exist. On the Thrift service node 
> itself, or is that just a reference to the RDD which is contained with 
> contexts on the cluster?
> 
> 
> Thx
> 
> -Mike
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 



dataframe stat corr for multiple columns

2016-05-17 Thread Ankur Jain
Hello Team,

In my current use case I am loading data from CSV using spark-csv and trying to 
correlate all variables.

As of now, if we want to correlate 2 columns in a DataFrame, df.stat.corr works 
great, but if we want to correlate multiple columns this won't work.
In R we can use corrplot to correlate all numeric columns in a single line of 
code. Can you guide me on how to achieve the same with a DataFrame or SQL?

There seems a way in spark-mllib
http://spark.apache.org/docs/latest/mllib-statistics.html


But it seems that it doesn't take a DataFrame as input...
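One way to bridge the gap, as a rough sketch: pack the numeric columns into an RDD[Vector] and hand that to Statistics.corr. The column names below are placeholders, and all selected columns are assumed to already be doubles.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.functions.col

val numericCols = Seq("price", "quantity", "discount")   // placeholder names
val vectors = df.select(numericCols.map(col): _*).rdd.map { row =>
  Vectors.dense(numericCols.indices.map(row.getDouble).toArray)
}
// Pairwise Pearson correlations of all selected columns, returned as a local Matrix:
val corrMatrix = Statistics.corr(vectors, "pearson")
println(corrMatrix)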

Regards,
Ankur
Information transmitted by this e-mail is proprietary to YASH Technologies and/ 
or its Customers and is intended for use only by the individual or entity to 
which it is addressed, and may contain information that is privileged, 
confidential or exempt from disclosure under applicable law. If you are not the 
intended recipient or it appears that this mail has been forwarded to you 
without proper authority, you are notified that any use or dissemination of 
this information in any manner is strictly prohibited. In such cases, please 
notify us immediately at i...@yash.com and delete this mail from your records.


Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
Hi, I know about that approach.
I don't want to run a mess of classes from a single jar; I want to utilize the
distributed cache functionality and ship the application jar and dependent jars
explicitly.
--deploy-mode client unfortunately copies and distributes all jars
repeatedly for every spark job started from driver class...

2016-05-17 15:41 GMT+02:00 :

> Hi Serega,
>
> Create a jar including all the the dependencies and execute it like below
> through shell script
>
> /usr/local/spark/bin/spark-submit \  //location of your spark-submit
> --class classname \  //location of your main classname
> --master yarn \
> --deploy-mode cluster \
> /home/hadoop/SparkSampleProgram.jar  //location of your jar file
>
> Thanks
> Raj
>
>
>
> Sent from Yahoo Mail. Get the app 
>
>
> On Tuesday, May 17, 2016 6:03 PM, Serega Sheypak 
> wrote:
>
>
> hi, I'm trying to:
> 1. upload my app jar files to HDFS
> 2. run spark-submit with:
> 2.1. --master yarn --deploy-mode cluster
> or
> 2.2. --master yarn --deploy-mode client
>
> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
>
> When spark job is submitted, SparkSubmit client outputs:
> Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar
> ...
>
> and then spark application main class fails with class not found exception.
> Is there any workaround?
>
>
>


Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
spark-submit --conf "spark.driver.userClassPathFirst=true" --class
com.MyClass --master yarn --deploy-mode client --jars
hdfs:///my-lib.jar,hdfs:///my-seocnd-lib.jar jar-wth-com-MyClass.jar
job_params



2016-05-17 15:41 GMT+02:00 Serega Sheypak :

> https://issues.apache.org/jira/browse/SPARK-10643
>
> Looks like it's the reason...
>
> 2016-05-17 15:31 GMT+02:00 Serega Sheypak :
>
>> No, and it looks like a problem.
>>
>> 2.2. --master yarn --deploy-mode client
>> means:
>> 1. submit spark as yarn app, but spark-driver is started on local
>> machine.
>> 2. A upload all dependent jars to HDFS and specify jar HDFS paths in
>> --jars arg.
>> 3. Driver runs my Spark Application main class named "MySuperSparkJob"
>> and MySuperSparkJob fails because it doesn't get jars, thay are all in
>> HDFS and not accessible from local machine...
>>
>>
>> 2016-05-17 15:18 GMT+02:00 Jeff Zhang :
>>
>>> Do you put your app jar on hdfs ? The app jar must be on your local
>>> machine.
>>>
>>> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak <
>>> serega.shey...@gmail.com> wrote:
>>>
 hi, I'm trying to:
 1. upload my app jar files to HDFS
 2. run spark-submit with:
 2.1. --master yarn --deploy-mode cluster
 or
 2.2. --master yarn --deploy-mode client

 specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar

 When spark job is submitted, SparkSubmit client outputs:
 Warning: Skip remote jar
 hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ...

 and then spark application main class fails with class not found
 exception.
 Is there any workaround?

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>


Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread spark.raj
Hi Serega,
Create a jar including all the dependencies and execute it like below 
through a shell script

/usr/local/spark/bin/spark-submit \  //location of your spark-submit
--class classname \  //location of your main classname
--master yarn \
--deploy-mode cluster \
/home/hadoop/SparkSampleProgram.jar  //location of your jar file

Thanks,
Raj
 

Sent from Yahoo Mail. Get the app 

On Tuesday, May 17, 2016 6:03 PM, Serega Sheypak  
wrote:
 

hi, I'm trying to:
1. upload my app jar files to HDFS
2. run spark-submit with:
2.1. --master yarn --deploy-mode cluster
or
2.2. --master yarn --deploy-mode client

specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar

When spark job is submitted, SparkSubmit client outputs:
Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ...

and then spark application main class fails with class not found exception.
Is there any workaround?

  

Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
https://issues.apache.org/jira/browse/SPARK-10643

Looks like it's the reason...

2016-05-17 15:31 GMT+02:00 Serega Sheypak :

> No, and it looks like a problem.
>
> 2.2. --master yarn --deploy-mode client
> means:
> 1. submit spark as yarn app, but spark-driver is started on local machine.
> 2. A upload all dependent jars to HDFS and specify jar HDFS paths in
> --jars arg.
> 3. Driver runs my Spark Application main class named "MySuperSparkJob" and 
> MySuperSparkJob
> fails because it doesn't get jars, thay are all in HDFS and not accessible
> from local machine...
>
>
> 2016-05-17 15:18 GMT+02:00 Jeff Zhang :
>
>> Do you put your app jar on hdfs ? The app jar must be on your local
>> machine.
>>
>> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak > > wrote:
>>
>>> hi, I'm trying to:
>>> 1. upload my app jar files to HDFS
>>> 2. run spark-submit with:
>>> 2.1. --master yarn --deploy-mode cluster
>>> or
>>> 2.2. --master yarn --deploy-mode client
>>>
>>> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
>>>
>>> When spark job is submitted, SparkSubmit client outputs:
>>> Warning: Skip remote jar
>>> hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ...
>>>
>>> and then spark application main class fails with class not found
>>> exception.
>>> Is there any workaround?
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


yarn-cluster mode error

2016-05-17 Thread spark.raj
Hi,
I am getting the error below while running an application in yarn-cluster mode.
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM

Can anyone suggest why I am getting this error message?

Thanks
Raj
 

Sent from Yahoo Mail. Get the app

Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
No, and it looks like a problem.

2.2. --master yarn --deploy-mode client
means:
1. submit spark as yarn app, but spark-driver is started on local machine.
2. I upload all dependent jars to HDFS and specify the jar HDFS paths in the
--jars arg.
3. The driver runs my Spark application main class named "MySuperSparkJob",
and MySuperSparkJob
fails because it doesn't get the jars; they are all in HDFS and not accessible
from the local machine...


2016-05-17 15:18 GMT+02:00 Jeff Zhang :

> Do you put your app jar on hdfs ? The app jar must be on your local
> machine.
>
> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak 
> wrote:
>
>> hi, I'm trying to:
>> 1. upload my app jar files to HDFS
>> 2. run spark-submit with:
>> 2.1. --master yarn --deploy-mode cluster
>> or
>> 2.2. --master yarn --deploy-mode client
>>
>> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
>>
>> When spark job is submitted, SparkSubmit client outputs:
>> Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar
>> ...
>>
>> and then spark application main class fails with class not found
>> exception.
>> Is there any workaround?
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Jeff Zhang
Do you put your app jar on hdfs ? The app jar must be on your local
machine.

On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak 
wrote:

> hi, I'm trying to:
> 1. upload my app jar files to HDFS
> 2. run spark-submit with:
> 2.1. --master yarn --deploy-mode cluster
> or
> 2.2. --master yarn --deploy-mode client
>
> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar
>
> When spark job is submitted, SparkSubmit client outputs:
> Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar
> ...
>
> and then spark application main class fails with class not found exception.
> Is there any workaround?
>



-- 
Best Regards

Jeff Zhang


spark job is not running on yarn clustor mode

2016-05-17 Thread spark.raj
Hi friends,
I am running a Spark Streaming job in yarn-cluster mode but it is failing. It 
works fine in yarn-client mode, and the spark-examples also run fine in cluster 
mode. Below is the log file for the Spark Streaming job in yarn-cluster mode. 
Can anyone help me with this?

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/15/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/17 16:17:47 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
16/05/17 16:17:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/05/17 16:17:48 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1463479181441_0003_02
16/05/17 16:17:49 INFO spark.SecurityManager: Changing view acls to: hadoop
16/05/17 16:17:49 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/05/17 16:17:49 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
16/05/17 16:17:49 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
16/05/17 16:17:49 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: found keyword== 
userTwitterToken=9ACWejzaHVyxpPDYCHnDsO98U 
01safwuyLO8B8S94v5i0p90SzxEPZqUUmCaDkYOj1FKN1dXKZC 
702828259411521536-PNoSkM8xNIvuEVvoQ9Pj8fj7D8CkYp1 
OntoQStrmwrztnzi1MSlM56sKc23bqUCC2WblbDPiiP8P
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = 
9ACWejzaHVyxpPDYCHnDsO98U 01safwuyLO8B8S94v5i0p90SzxEPZqUUmCaDkYOj1FKN1dXKZC 
702828259411521536-PNoSkM8xNIvuEVvoQ9Pj8fj7D8CkYp1 
OntoQStrmwrztnzi1MSlM56sKc23bqUCC2WblbDPiiP8P
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = 1
16/05/17 16:17:49 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = 2
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = Tue 
May 17 00:00:00 IST 2016
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = Tue 
May 17 00:00:00 IST 2016
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = 
nokia,samsung,iphone,blackberry
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = All
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = mo
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = en
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: DemoJava called = 
retweet
16/05/17 16:17:49 INFO spark.SparkTweetStreamingHDFSLoad: Twitter 
Token...[Ljava.lang.String;@3ee5e48d
16/05/17 16:17:49 INFO spark.SparkContext: Running Spark version 1.5.2
16/05/17 16:17:49 WARN spark.SparkConf: 
SPARK_JAVA_OPTS was detected (set to '-Dspark.driver.port=53411').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an 
application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or 
worker)

16/05/17 16:17:49 WARN spark.SparkConf: Setting 
'spark.executor.extraJavaOptions' to '-Dspark.driver.port=53411' as a 
work-around.
16/05/17 16:17:49 WARN spark.SparkConf: Setting 'spark.driver.extraJavaOptions' 
to '-Dspark.driver.port=53411' as a work-around.
16/05/17 16:17:49 INFO spark.SecurityManager: Changing view acls to: hadoop
16/05/17 16:17:49 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/05/17 16:17:49 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop); users 
with modify permissions: Set(hadoop)
16/05/17 16:17:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/05/17 16:17:49 INFO Remoting: Starting remoting
16/05/17 16:17:50 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@172.16.28.195:53411]
16/05/17 16:17:50 INFO util.Utils: Successfully started service 'sparkDriver' 
on port 53411.
16/05/17 16:17:50 INFO spark.SparkEnv: Registering MapOutputTracker
16/05/17 16:17:50 INFO spark.SparkEnv: Registering BlockManagerMaster
16/05/17 16:17:50 INFO storage.DiskBlockManager: Created local directory at 

Why does spark 1.6.0 can't use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
hi, I'm trying to:
1. upload my app jar files to HDFS
2. run spark-submit with:
2.1. --master yarn --deploy-mode cluster
or
2.2. --master yarn --deploy-mode client

specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar

When spark job is submitted, SparkSubmit client outputs:
Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar
...

and then spark application main class fails with class not found exception.
Is there any workaround?


RE: SparkR query

2016-05-17 Thread Mike Lewis
Thanks, I’m just using RStudio. Running locally is fine; the issue is that the 
cluster is on Linux and the workers are looking for a Windows path,
which must be being passed through by the driver, I guess. I checked the 
spark-env.sh on each node and the appropriate SPARK_HOME is set
correctly.


From: Sun Rui [mailto:sunrise_...@163.com]
Sent: 17 May 2016 11:32
To: Mike Lewis
Cc: user@spark.apache.org
Subject: Re: SparkR query

Lewis,
1. Could you check the values of “SPARK_HOME” environment on all of your worker 
nodes?
2. How did you start your SparkR shell?

On May 17, 2016, at 18:07, Mike Lewis 
> wrote:

Hi,

I have a SparkR driver process that connects to a master running on Linux,
I’ve tried to do a simple test, e.g.

sc <- sparkR.init(master="spark://my-linux-host.dev.local:7077",  
sparkEnvir=list(spark.cores.max="4"))
x <- SparkR:::parallelize(sc,1:100,2)
y <- count(x)

But I can see that the worker nodes are failing, they are looking for the 
Windows (rather than linux path) to
Daemon.R


16/05/17 06:52:33 INFO BufferedStreamThread: Fatal error: cannot open file 
'E:/Dev/spark/spark-1.4.0-bin-hadoop2.6/R/lib/SparkR/worker/daemon.R': No such 
file or directory

Is this a configuration setting that I’m missing, the worker nodes (linux) 
shouldn’t be looking in the spark home of the driver (windows) ?
If so, I’d appreciate someone letting me know what I need to change/set.

Thanks,
Mike Lewis



Re: SparkR query

2016-05-17 Thread Sun Rui
Lewis,
1. Could you check the values of “SPARK_HOME” environment on all of your worker 
nodes?
2. How did you start your SparkR shell?

> On May 17, 2016, at 18:07, Mike Lewis  wrote:
> 
> Hi,
>  
> I have a SparkR driver process that connects to a master running on Linux,
> I’ve tried to do a simple test, e.g.
>  
> sc <- sparkR.init(master="spark://my-linux-host.dev.local:7077 
> ",  
> sparkEnvir=list(spark.cores.max="4"))
> x <- SparkR:::parallelize(sc,1:100,2)
> y <- count(x)
>  
> But I can see that the worker nodes are failing, they are looking for the 
> Windows (rather than linux path) to
> Daemon.R
>  
> 16/05/17 06:52:33 INFO BufferedStreamThread: Fatal error: cannot open file 
> 'E:/Dev/spark/spark-1.4.0-bin-hadoop2.6/R/lib/SparkR/worker/daemon.R': No 
> such file or directory
>  
> Is this a configuration setting that I’m missing, the worker nodes (linux) 
> shouldn’t be looking in the spark home of the driver (windows) ?
> If so, I’d appreciate someone letting me know what I need to change/set.
> 
> Thanks,
> Mike Lewis
>  
> 



Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-17 Thread Jan Uyttenhove
I think that if the Confluent deserializer cannot fetch the schema for the
Avro message (key and/or value), you end up with no data. You should check
the logs of the Schema Registry: it should show the HTTP requests it
receives, so you can check whether the deserializer can connect to it and,
if so, what the response code looks like.

If you use the Confluent serializer, each Avro message is first serialized
and afterwards the schema id is added to it. This way, the Confluent
deserializer can fetch the schema id first and use it to look up the schema
in the Schema Registry.
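For reference, a sketch of the wiring with the Scala direct stream and the Confluent decoder. Class and parameter names follow the Confluent and Spark 1.x docs as I recall them, so treat this as an assumption to verify rather than a tested recipe (an existing StreamingContext named ssc is assumed):

import io.confluent.kafka.serializers.KafkaAvroDecoder
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// schema.registry.url must be present in kafkaParams, otherwise the decoder
// cannot look up the writer schema by id and you end up with no usable data.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "schema.registry.url"  -> "http://schema-registry:8081")

val stream = KafkaUtils.createDirectStream[String, Object, StringDecoder, KafkaAvroDecoder](
  ssc, kafkaParams, Set("my-topic"))

stream.map(_._2).print()   // values arrive as deserialized Avro records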


On Tue, May 17, 2016 at 2:19 AM, Ramaswamy, Muthuraman <
muthuraman.ramasw...@viasat.com> wrote:

> Yes, I can see the messages. Also, I wrote a quick custom decoder for avro
> and it works fine for the following:
>
> >> kvs = KafkaUtils.createDirectStream(ssc, [topic],
> {"metadata.broker.list": brokers}, valueDecoder=decoder)
>
> But, when I use the Confluent Serializers to leverage the Schema Registry
> (based on the link shown below), it doesn’t work for me. I am not sure
> whether I need to configure any more details to consume the Schema
> Registry. I can fetch the schema from the Schema Registry based on its IDs.
> The decoder method is not returning any values for me.
>
> ~Muthu
>
>
>
> On 5/16/16, 10:49 AM, "Cody Koeninger"  wrote:
>
> >Have you checked to make sure you can receive messages just using a
> >byte array for value?
> >
> >On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
> > wrote:
> >> I am trying to consume AVRO formatted message through
> >> KafkaUtils.createDirectStream. I followed the listed below example
> (refer
> >> link) but the messages are not being fetched by the Stream.
> >>
> >>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_30339636_spark-2Dpython-2Davro-2Dkafka-2Ddeserialiser=CwIBaQ=jcv3orpCsv7C4ly8-ubDob57ycZ4jvhoYZNDBA06fPk=NQ-dw5X8CJcqaXIvIdMUUdkL0fHDonD07FZzTY3CgiU=Nc-rPMFydyCrwOZuNWs2GmSL4NkN8eGoR-mkJUlkCx0=hwqxCKl3P4_9pKWeo1OGR134QegMRe3Xh22_WMy-5q8=
> >>
> >> Is there any code missing that I must add to make the above sample work.
> >> Say, I am not sure how the confluent serializers would know the avro
> schema
> >> info as it knows only the Schema Registry URL info.
> >>
> >> Appreciate your help.
> >>
> >> ~Muthu
> >>
> >>
> >>
>



-- 
Jan Uyttenhove
Streaming data & digital solutions architect @ Insidin bvba

j...@insidin.com
+32 474 56 24 39

https://twitter.com/xorto
https://www.linkedin.com/in/januyttenhove

This e-mail and any files transmitted with it are intended solely for the
use of the individual or entity to whom they are addressed. It may contain
privileged and confidential information. If you are not the intended
recipient please notify the sender immediately and destroy this e-mail. Any
form of reproduction, dissemination, copying, disclosure, modification,
distribution and/or publication of this e-mail message is strictly
prohibited. Whilst all efforts are made to safeguard e-mails, the sender
cannot guarantee that attachments are virus free or compatible with your
systems and does not accept liability in respect of viruses or computer
problems experienced.


SparkR query

2016-05-17 Thread Mike Lewis
Hi,

I have a SparkR driver process that connects to a master running on Linux,
I’ve tried to do a simple test, e.g.

sc <- sparkR.init(master="spark://my-linux-host.dev.local:7077",  
sparkEnvir=list(spark.cores.max="4"))
x <- SparkR:::parallelize(sc,1:100,2)
y <- count(x)

But I can see that the worker nodes are failing: they are looking for the 
Windows (rather than Linux) path to
daemon.R


16/05/17 06:52:33 INFO BufferedStreamThread: Fatal error: cannot open file 
'E:/Dev/spark/spark-1.4.0-bin-hadoop2.6/R/lib/SparkR/worker/daemon.R': No such 
file or directory

Is this a configuration setting that I’m missing, the worker nodes (linux) 
shouldn’t be looking in the spark home of the driver (windows) ?
If so, I’d appreciate someone letting me know what I need to change/set.

Thanks,
Mike Lewis




Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
I don't think this is related to filesystem recovery.
spark.deploy.recoveryDirectory
is a standalone configuration which takes effect in standalone mode, but you
are in local mode.

Can you just start pyspark using "bin/pyspark --master local[4]" ?

On Wed, May 11, 2016 at 3:52 AM, Imran Akbar  wrote:

> I have some Python code that consistently ends up in this state:
>
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the
> Java server
> Traceback (most recent call last):
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 690, in start
> self.socket.connect((self.address, self.port))
>   File "/usr/lib/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the
> Java server
> Traceback (most recent call last):
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 690, in start
> self.socket.connect((self.address, self.port))
>   File "/usr/lib/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/home/ubuntu/spark/python/pyspark/sql/dataframe.py", line 280, in
> collect
> port = self._jdf.collectToPython()
>   File "/home/ubuntu/spark/python/pyspark/traceback_utils.py", line 78, in
> __exit__
> self._context._jsc.setCallSite(None)
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 811, in __call__
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 624, in send_command
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 579, in _get_connection
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 585, in _create_connection
>   File
> "/home/ubuntu/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 697, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect
> to the Java server
>
> Even though I start pyspark with these options:
> ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g
> --packages com.databricks:spark-csv_2.11:1.4.0
> --spark.deploy.recoveryMode=FILESYSTEM
>
> and this in my /conf/spark-env.sh file:
> - SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM
> -Dspark.deploy.recoveryDirectory=/user/recovery"
>
> How can I get HA to work in Spark?
>
> thanks,
> imran
>
>


-- 
Best Regards

Jeff Zhang


Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang
The following sample code works for me. Could you share your code ?

df = DataFrame([1,2,3])
df_b=sc.broadcast(df)
def f(a):
  print(df_b.value)

sc.parallelize(range(1,10)).foreach(f)


On Sat, May 14, 2016 at 12:59 AM, abi  wrote:

> pandas dataframe is broadcasted successfully. giving errors in datanode
> function called kernel
>
> Code:
>
> dataframe_broadcast  = sc.broadcast(dataframe)
>
> def kernel():
> df_v = dataframe_broadcast.value
>
>
> Error:
>
> I get this error when I try accessing the value member of the broadcast
> variable. Apprently it does not have a value, hence it tries to load from
> the file again.
>
>   File
> "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py",
> line 97, in value
> self._value = self.load(self._path)
>   File
> "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py",
> line 88, in load
> return pickle.load(f)
> ImportError: No module named indexes.base
>
> at
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> at
>
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
> at
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/pandas-dataframe-broadcasted-giving-errors-in-datanode-function-called-kernel-tp26953.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


unsubscribe

2016-05-17 Thread aruna jakhmola



Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Hi,
We have a requirement to run count(distinct) in each processing batch against all
of the streaming data (e.g., the last 24 hours' data); that is, when we do
count(distinct), we actually want to compute the distinct count over the last 24
hours of data.
Does Structured Streaming support this scenario? Thanks!


Re:Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Todd
Thanks Ted!






At 2016-05-17 16:16:09, "Ted Yu"  wrote:

Please take a look at:


[SPARK-13146][SQL] Management API for continuous queries

[SPARK-14555] Second cut of Python API for Structured Streaming



On Mon, May 16, 2016 at 11:46 PM, Todd  wrote:

Hi,

Are there code examples about how to use the structured streaming feature? 
Thanks.




Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Ted Yu
Please take a look at:

[SPARK-13146][SQL] Management API for continuous queries
[SPARK-14555] Second cut of Python API for Structured Streaming

On Mon, May 16, 2016 at 11:46 PM, Todd  wrote:

> Hi,
>
> Are there code examples about how to use the structured streaming feature?
> Thanks.
>


Re: JDBC SQL Server RDD

2016-05-17 Thread Suresh Thalamati
What is the error you are getting?

At least on the main code line, I see that JDBCRDD is marked as private[sql].
A simple alternative might be to read from SQL Server using the DataFrame API and
get the RDD from the DataFrame.

eg:

val df = sqlContext.read.jdbc(
  "jdbc:sqlserver://usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com;database=ProdAWS;user=sa;password=?s3iY2mv6.H",
  "(select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY)",
  new java.util.Properties())

val rdd = df.rdd
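
An equivalent sketch using the option-style JDBC reader (the host, database, credentials,
and table below are placeholders based on the example above, not verified values; the SQL
Server JDBC driver still has to be on the classpath):

val df2 = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<host>;database=ProdAWS")
  .option("dbtable", "dbo.CTRY")
  .option("user", "sa")
  .option("password", "<password>")
  .load()

val rdd2 = df2.rdd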


Hope that helps
-suresh

> On May 15, 2016, at 12:05 PM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi ,
> 
> I am trying to test sql server connection with JDBC RDD but unable to connect.
> 
> val myRDD = new JdbcRDD( sparkContext, () => 
> DriverManager.getConnection(sqlServerConnectionString) ,
>   "select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY limit ?, ?",
>   0, 5, 1, r => r.getString("CTRY_NA") + ", " + 
> r.getString("CTRY_SHRT_NA"))
> 
> 
> sqlServerConnectionString here is 
> jdbc:sqlserver://usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com 
> ;database=ProdAWS;user=sa;password=?s3iY2mv6.H
> 
> 
> Can you please let me know what I am doing wrong? I tried solutions from all
> forums but didn't have any luck.
> 
> Thanks,
> Asmath.



Adding a new column to a temporary table

2016-05-17 Thread Mich Talebzadeh
Hi,

Let us create a DF based on an existing table in Hive using spark-shell

scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@7c666865
// Go to correct database in Hive
scala> HiveContext.sql("use oraclehadoop")
res31: org.apache.spark.sql.DataFrame = [result: string]
// Create a DF based on Hive table sales
scala> val s = HiveContext.table("sales")
s: org.apache.spark.sql.DataFrame = [prod_id: bigint, cust_id: bigint,
time_id: timestamp, channel_id: bigint, promo_id: bigint, quantity_sold:
decimal(10,0), amount_sold: decimal(10,0), year: int, month: int]
// Register it as a temporary table
scala> s.registerTempTable("tmp")
// Count the rows in the table
scala> HiveContext.sql("select count(1) from sales").show
+------+
|   _c0|
+------+
|917359|
+------+

// However, you cannot add a column to that temporary table, as shown below,
// because Hive expects the table to exist in the Hive database

HiveContext.sql("ALTER TABLE tmp ADD COLUMNS(newcol INT)")

16/05/17 08:20:24 ERROR Driver: FAILED: SemanticException [Error 10001]:
Table not found oraclehadoop.tmp
org.apache.hadoop.hive.ql.parse.SemanticException: Table not found
oraclehadoop.tmp

So the assumption is that the in-memory table is just a logical placeholder in Spark?
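
As a minimal sketch (not part of the original message): since tmp is only a logical view
over the DataFrame, the column can be added on the DataFrame itself with withColumn and
the temporary table re-registered. The column name and default value below are made up
for illustration:

scala> val s2 = s.withColumn("newcol", org.apache.spark.sql.functions.lit(0))
scala> s2.registerTempTable("tmp")
scala> HiveContext.sql("select newcol from tmp limit 1").show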

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Code Example of Structured Streaming of 2.0

2016-05-17 Thread Todd
Hi,

Are there code examples about how to use the structured streaming feature? 
Thanks.