Kafka Direct Stream - dynamic topic subscription

2017-10-27 Thread Ramanan, Buvana (Nokia - US/Murray Hill)
Hello,

Using Spark 2.2.0. I am interested in seeing dynamic topic subscription in action.

Tried this example: streaming.DirectKafkaWordCount (which uses 
org.apache.spark.streaming.kafka010)

I started with 8 Kafka partitions in my topic and found that Spark Streaming 
executes 8 tasks (one per partition), which is what is expected. While this 
example was running, I increased the number of Kafka partitions to 16 and 
started producing data to the new partitions as well.

I expected that the Kafka consumer Spark uses would detect this change 
and spawn new tasks for the new partitions, but I find that it only reads from 
the old partitions and does not read from the new ones. When I restart, 
it reads from all 16 partitions.

Is this expected?

What is meant by dynamic topic subscription?

Does it apply only to topics whose names match a regular expression, and not to 
a dynamically growing set of partitions?
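
For reference, the subscription variants I am comparing look roughly like this (a sketch only; the broker address, group id and topic names below are placeholders, not my actual settings):

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "dynamic-subscription-test"           // placeholder
)

// Fixed topic list, as in the DirectKafkaWordCount example (ssc is the StreamingContext):
val fixedStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams))

// Pattern-based subscription, which matches topics by regex:
val patternStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("mytopic.*"), kafkaParams))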

Thanks,
Buvana



Spark 2.2.0 GC Overhead Limit Exceeded and OOM errors in the executors

2017-10-27 Thread Supun Nakandala
Hi all,

I am trying to do an image-analytics-type workload using Spark. The
images are read in JPEG format and then converted to raw format in
map functions, which causes the size of the partitions to grow by roughly an
order of magnitude. In addition to this, I am caching some of the data because my
pipeline is iterative.

I get OOM errors and GC overhead limit exceeded errors, and I work around them by
increasing the heap size or the number of partitions, but even after doing
that there is still high GC pressure.

I know that my partitions should be small enough to fit in
memory. But when I did the calculation using the sizes of the cached partitions
shown in the Spark UI, the individual partitions look small enough
given the heap size and storage fraction. I am interested in getting your
input on what other things can cause OOM errors in executors. Can caching
data be a problem (SPARK-1777)?
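
For context, the back-of-the-envelope check I have in mind looks roughly like this (the heap size and core count below are illustrative placeholders, not my actual settings, and the fractions are the defaults of the unified memory manager):

// Rough per-executor memory budget under the unified memory manager.
val executorHeapBytes = 8L * 1024 * 1024 * 1024   // e.g. --executor-memory 8g (placeholder)
val memoryFraction    = 0.6                        // spark.memory.fraction default
val storageFraction   = 0.5                        // spark.memory.storageFraction default
val coresPerExecutor  = 4                          // concurrent tasks per executor (placeholder)

val storageBytes   = executorHeapBytes * memoryFraction * storageFraction        // room for cached blocks
val executionBytes = executorHeapBytes * memoryFraction * (1 - storageFraction)  // room for task processing

// Each concurrently running task has to hold its decompressed partition in its
// share of execution memory, on top of whatever is cached.
val perTaskBytes = executionBytes / coresPerExecutor
println(f"storage pool: ${storageBytes / 1e9}%.2f GB, per-task execution: ${perTaskBytes / 1e9}%.2f GB")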

Thank you in advance.
-Supun


StringIndexer on several columns in a DataFrame with Scala

2017-10-27 Thread Md. Rezaul Karim
Hi All,

There are several categorical columns in my dataset as follows:
[image: Inline images 1]

How can I transform the values in each categorical column into numeric form using
StringIndexer, so that the resulting DataFrame can be fed into
VectorAssembler to generate a feature vector?

A naive approach is to apply a StringIndexer to each categorical column by hand,
but I know that is clumsy.
A possible workaround in PySpark is to collect several StringIndexers in a list
and use a Pipeline to execute them all, as follows:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()

How can I do the same in Scala? I tried the following:

val featureCol = trainingDF.columns
var indexers: Array[StringIndexer] = null

for (colName <- featureCol) {
  val index = new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
    //.fit(trainDF)
  indexers = indexers :+ index
}

val pipeline = new Pipeline()
  .setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()

However, I am getting a NullPointerException in the

for (colName <- featureCol)

loop. I am sure I am doing something wrong. Any suggestions?
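
One variant I am considering (just a sketch, assuming the NPE comes from appending to the null-initialized array) is to build the indexers with map instead of a mutable var:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// Build one StringIndexer per column without a null-initialized mutable array.
val indexers = trainingDF.columns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

val pipeline = new Pipeline().setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()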



Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html



Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
Yes, I checked both the output location and the console. Neither has any
data.

The link below also has the code and the question that I have raised with Azure
HDInsight.

https://github.com/Azure/spark-eventhubs/issues/195


On Fri, Oct 27, 2017 at 3:22 PM, Shixiong(Ryan) Zhu  wrote:

> The codes in the link write the data into files. Did you check the output
> location?
>
> By the way, if you want to see the data on the console, you can use the
> console sink by changing this line *format("parquet").option("path",
> outputPath + "/ETL").partitionBy("creationTime").start()* to
> *format("console").start().*
>
> On Fri, Oct 27, 2017 at 8:41 AM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi TathagataDas,
>>
>> I was trying to use eventhub with spark streaming. Looks like I was able
>> to make connection successfully but cannot see any data on the console. Not
>> sure if eventhub is supported or not.
>>
>> https://github.com/Azure/spark-eventhubs/blob/master/example
>> s/src/main/scala/com/microsoft/spark/sql/examples/EventHubsS
>> tructuredStreamingExample.scala
>> is the code snippet I have used to connect to eventhub
>>
>> Thanks,
>> Asmath
>>
>>
>>
>> On Thu, Oct 26, 2017 at 9:39 AM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Thanks TD.
>>>
>>> On Wed, Oct 25, 2017 at 6:42 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Please do not confuse old Spark Streaming (DStreams) with Structured
 Streaming. Structured Streaming's offset and checkpoint management is far
 more robust than DStreams.
 Take a look at my talk - https://spark-summit.org/201
 7/speakers/tathagata-das/

 On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Mohammed <
 mdkhajaasm...@gmail.com> wrote:

> Thanks Subhash.
>
> Have you ever used zero data loss concept with streaming. I am bit
> worried to use streamig when it comes to data loss.
>
> https://blog.cloudera.com/blog/2017/06/offset-management-for
> -apache-kafka-with-apache-spark-streaming/
>
>
> does structured streaming handles it internally?
>
> On Wed, Oct 25, 2017 at 3:10 PM, Subhash Sriram <
> subhash.sri...@gmail.com> wrote:
>
>> No problem! Take a look at this:
>>
>> http://spark.apache.org/docs/latest/structured-streaming-pro
>> gramming-guide.html#recovering-from-failures-with-checkpointing
>>
>> Thanks,
>> Subhash
>>
>> On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Hi Sriram,
>>>
>>> Thanks. This is what I was looking for.
>>>
>>> one question, where do we need to specify the checkpoint directory
>>> in case of structured streaming?
>>>
>>> Thanks,
>>> Asmath
>>>
>>> On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram <
>>> subhash.sri...@gmail.com> wrote:
>>>
 Hi Asmath,

 Here is an example of using structured streaming to read from Kafka:

 https://github.com/apache/spark/blob/master/examples/src/mai
 n/scala/org/apache/spark/examples/sql/streaming/StructuredKa
 fkaWordCount.scala

 In terms of parsing the JSON, there is a from_json function that
 you can use. The following might help:

 https://databricks.com/blog/2017/02/23/working-complex-data-
 formats-structured-streaming-apache-spark-2-1.html

 I hope this helps.

 Thanks,
 Subhash

 On Wed, Oct 25, 2017 at 2:59 PM, KhajaAsmath Mohammed <
 mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> Could anyone provide suggestions on how to parse json data from
> kafka and load it back in hive.
>
> I have read about structured streaming but didn't find any
> examples. is there any best practise on how to read it and parse it 
> with
> structured streaming for this use case?
>
> Thanks,
> Asmath
>


>>>
>>
>

>>>
>>
>


Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jean Georges Perrin
May I ask what the use case is? It is a very interesting question, but I would be 
concerned about going further than a proof of concept. A lot of the enterprises I 
see and visit are barely on Java 8, so starting to talk about JDK 9 might be slight 
overkill, but if you have a good story, I’m all for it!

jg


> On Oct 27, 2017, at 03:44, Zhang, Liyun  wrote:
> 
> Thanks your suggestion, seems that scala 2.12.4 support jdk9
>  
> Scala 2.12.4 is now available.
> 
> Our benchmarks show a further reduction in compile times since 2.12.3 of 
> 5-10%.
> 
> Improved Java 9 friendliness, with more to come!
> 
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  
>  
>  
>  
>  
> From: Reynold Xin [mailto:r...@databricks.com] 
> Sent: Friday, October 27, 2017 10:26 AM
> To: Zhang, Liyun ; d...@spark.apache.org; 
> user@spark.apache.org
> Subject: Re: Anyone knows how to build and spark on jdk9?
>  
> It probably depends on the Scala version we use in Spark supporting Java 9 
> first. 
>  
> On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun  wrote:
> Hi all:
> 1.   I want to build spark on jdk9 and test it with Hadoop on jdk9 env. I 
> search for jiras related to JDK9. I only found SPARK-13278.  This means now 
> spark can build or run successfully on JDK9 ?
>  
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  


Re: Structured Stream in Spark

2017-10-27 Thread Shixiong(Ryan) Zhu
The code in the link writes the data into files. Did you check the output
location?

By the way, if you want to see the data on the console, you can use the
console sink by changing this line *format("parquet").option("path",
outputPath + "/ETL").partitionBy("creationTime").start()* to
*format("console").start().*

On Fri, Oct 27, 2017 at 8:41 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi TathagataDas,
>
> I was trying to use eventhub with spark streaming. Looks like I was able
> to make connection successfully but cannot see any data on the console. Not
> sure if eventhub is supported or not.
>
> https://github.com/Azure/spark-eventhubs/blob/master/
> examples/src/main/scala/com/microsoft/spark/sql/examples/
> EventHubsStructuredStreamingExample.scala
> is the code snippet I have used to connect to eventhub
>
> Thanks,
> Asmath
>
>
>
> On Thu, Oct 26, 2017 at 9:39 AM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Thanks TD.
>>
>> On Wed, Oct 25, 2017 at 6:42 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Please do not confuse old Spark Streaming (DStreams) with Structured
>>> Streaming. Structured Streaming's offset and checkpoint management is far
>>> more robust than DStreams.
>>> Take a look at my talk - https://spark-summit.org/201
>>> 7/speakers/tathagata-das/
>>>
>>> On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Mohammed <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
 Thanks Subhash.

 Have you ever used zero data loss concept with streaming. I am bit
 worried to use streamig when it comes to data loss.

 https://blog.cloudera.com/blog/2017/06/offset-management-for
 -apache-kafka-with-apache-spark-streaming/


 does structured streaming handles it internally?

 On Wed, Oct 25, 2017 at 3:10 PM, Subhash Sriram <
 subhash.sri...@gmail.com> wrote:

> No problem! Take a look at this:
>
> http://spark.apache.org/docs/latest/structured-streaming-pro
> gramming-guide.html#recovering-from-failures-with-checkpointing
>
> Thanks,
> Subhash
>
> On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi Sriram,
>>
>> Thanks. This is what I was looking for.
>>
>> one question, where do we need to specify the checkpoint directory in
>> case of structured streaming?
>>
>> Thanks,
>> Asmath
>>
>> On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram <
>> subhash.sri...@gmail.com> wrote:
>>
>>> Hi Asmath,
>>>
>>> Here is an example of using structured streaming to read from Kafka:
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/mai
>>> n/scala/org/apache/spark/examples/sql/streaming/StructuredKa
>>> fkaWordCount.scala
>>>
>>> In terms of parsing the JSON, there is a from_json function that you
>>> can use. The following might help:
>>>
>>> https://databricks.com/blog/2017/02/23/working-complex-data-
>>> formats-structured-streaming-apache-spark-2-1.html
>>>
>>> I hope this helps.
>>>
>>> Thanks,
>>> Subhash
>>>
>>> On Wed, Oct 25, 2017 at 2:59 PM, KhajaAsmath Mohammed <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
 Hi,

 Could anyone provide suggestions on how to parse json data from
 kafka and load it back in hive.

 I have read about structured streaming but didn't find any
 examples. is there any best practise on how to read it and parse it 
 with
 structured streaming for this use case?

 Thanks,
 Asmath

>>>
>>>
>>
>

>>>
>>
>


Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Sean Owen
Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is
in place already, and the last issue is dealing with how Scala closures are
now implemented quite differently with lambdas / invokedynamic. This affects
the ClosureCleaner, which is, as far as I know, the main remaining issue.

Despite the odd naming, all of these versions of Java are successors to
Java 9. Supporting any of them is probably the same thing, so the work for
now is still getting it working on Java 9.

Whereas Java has been very backwards-compatible in the past, the new module
structure is almost certain to break something in Spark or its
dependencies. Removing JAXB from the JDK alone causes issues. Getting it to
run at all on Java 9 may require changes, whereas compatibility with new
Java major releases in the past generally came for free. It'll be worth
trying to make that happen soonish. I'm guessing for Spark 3.x in first
half of next year?

But, first things first. Scala 2.12 support.

On Fri, Oct 27, 2017 at 6:02 PM Jörn Franke  wrote:

> Scala 2.12 is not yet supported on Spark - this means also not JDK9:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220
>
> If you look at the Oracle support then jdk 9 is anyway only supported for
> 6 months. JDK 8 is Lts (5 years) JDK 18.3 will be only 6 months and JDK
> 18.9 is lts (5 years).
> http://www.oracle.com/technetwork/java/eol-135779.html
>
> I do not think Spark should support non-lts releases. Especially for JDK9
> I do not see a strong technical need, but maybe I am overlooking something.
> Of course http2 etc would be nice for the web interfaces, but currently not
> very urgent.
>
> On 27. Oct 2017, at 04:44, Zhang, Liyun  wrote:
>
> Thanks your suggestion, seems that scala 2.12.4 support jdk9
>
>
>
> Scala 2.12.4  is now
> available.
>
> Our benchmarks
> 
>  show
> a further reduction in compile times since 2.12.3 of 5-10%.
>
> Improved Java 9 friendliness, with more to come!
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>
>
>
>
>
>
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com ]
> *Sent:* Friday, October 27, 2017 10:26 AM
> *To:* Zhang, Liyun ; d...@spark.apache.org;
> user@spark.apache.org
> *Subject:* Re: Anyone knows how to build and spark on jdk9?
>
>
>
> It probably depends on the Scala version we use in Spark supporting Java 9
> first.
>
>
>
> On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun 
> wrote:
>
> Hi all:
>
> 1.   I want to build spark on jdk9 and test it with Hadoop on jdk9
> env. I search for jiras related to JDK9. I only found SPARK-13278
> .  This means now
> spark can build or run successfully on JDK9 ?
>
>
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>
>


Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jörn Franke
Scala 2.12 is not yet supported by Spark, which also means no JDK 9:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220

If you look at Oracle support, JDK 9 is only supported for 6 months anyway. JDK 8 is 
LTS (5 years), JDK 18.3 will be supported for only 6 months, and JDK 18.9 is 
LTS (5 years).
http://www.oracle.com/technetwork/java/eol-135779.html

I do not think Spark should support non-LTS releases. Especially for JDK 9 I do 
not see a strong technical need, but maybe I am overlooking something. Of 
course HTTP/2 etc. would be nice for the web interfaces, but that is currently not 
very urgent. 

> On 27. Oct 2017, at 04:44, Zhang, Liyun  wrote:
> 
> Thanks your suggestion, seems that scala 2.12.4 support jdk9
>  
> Scala 2.12.4 is now available.
> 
> Our benchmarks show a further reduction in compile times since 2.12.3 of 
> 5-10%.
> 
> Improved Java 9 friendliness, with more to come!
> 
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  
>  
>  
>  
>  
> From: Reynold Xin [mailto:r...@databricks.com] 
> Sent: Friday, October 27, 2017 10:26 AM
> To: Zhang, Liyun ; d...@spark.apache.org; 
> user@spark.apache.org
> Subject: Re: Anyone knows how to build and spark on jdk9?
>  
> It probably depends on the Scala version we use in Spark supporting Java 9 
> first. 
>  
> On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun  wrote:
> Hi all:
> 1.   I want to build spark on jdk9 and test it with Hadoop on jdk9 env. I 
> search for jiras related to JDK9. I only found SPARK-13278.  This means now 
> spark can build or run successfully on JDK9 ?
>  
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  


Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
Hi TathagataDas,

I was trying to use Event Hubs with Spark streaming. It looks like I was able to
make the connection successfully, but I cannot see any data on the console. I am
not sure whether Event Hubs is supported or not.

https://github.com/Azure/spark-eventhubs/blob/master/examples/src/main/scala/com/microsoft/spark/sql/examples/EventHubsStructuredStreamingExample.scala

That is the code snippet I have used to connect to Event Hubs.

Thanks,
Asmath



On Thu, Oct 26, 2017 at 9:39 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Thanks TD.
>
> On Wed, Oct 25, 2017 at 6:42 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Please do not confuse old Spark Streaming (DStreams) with Structured
>> Streaming. Structured Streaming's offset and checkpoint management is far
>> more robust than DStreams.
>> Take a look at my talk - https://spark-summit.org/201
>> 7/speakers/tathagata-das/
>>
>> On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Thanks Subhash.
>>>
>>> Have you ever used zero data loss concept with streaming. I am bit
>>> worried to use streamig when it comes to data loss.
>>>
>>> https://blog.cloudera.com/blog/2017/06/offset-management-for
>>> -apache-kafka-with-apache-spark-streaming/
>>>
>>>
>>> does structured streaming handles it internally?
>>>
>>> On Wed, Oct 25, 2017 at 3:10 PM, Subhash Sriram <
>>> subhash.sri...@gmail.com> wrote:
>>>
 No problem! Take a look at this:

 http://spark.apache.org/docs/latest/structured-streaming-pro
 gramming-guide.html#recovering-from-failures-with-checkpointing

 Thanks,
 Subhash

 On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed <
 mdkhajaasm...@gmail.com> wrote:

> Hi Sriram,
>
> Thanks. This is what I was looking for.
>
> one question, where do we need to specify the checkpoint directory in
> case of structured streaming?
>
> Thanks,
> Asmath
>
> On Wed, Oct 25, 2017 at 2:52 PM, Subhash Sriram <
> subhash.sri...@gmail.com> wrote:
>
>> Hi Asmath,
>>
>> Here is an example of using structured streaming to read from Kafka:
>>
>> https://github.com/apache/spark/blob/master/examples/src/mai
>> n/scala/org/apache/spark/examples/sql/streaming/StructuredKa
>> fkaWordCount.scala
>>
>> In terms of parsing the JSON, there is a from_json function that you
>> can use. The following might help:
>>
>> https://databricks.com/blog/2017/02/23/working-complex-data-
>> formats-structured-streaming-apache-spark-2-1.html
>>
>> I hope this helps.
>>
>> Thanks,
>> Subhash
>>
>> On Wed, Oct 25, 2017 at 2:59 PM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Could anyone provide suggestions on how to parse json data from
>>> kafka and load it back in hive.
>>>
>>> I have read about structured streaming but didn't find any examples.
>>> is there any best practise on how to read it and parse it with 
>>> structured
>>> streaming for this use case?
>>>
>>> Thanks,
>>> Asmath
>>>
>>
>>
>

>>>
>>
>


Re: Structured streaming with event hubs

2017-10-27 Thread KhajaAsmath Mohammed
I was looking at this example but didn't get any output from it when I used it.

https://github.com/Azure/spark-eventhubs/blob/master/examples/src/main/scala/com/microsoft/spark/sql/examples/EventHubsStructuredStreamingExample.scala



On Fri, Oct 27, 2017 at 9:18 AM, ayan guha  wrote:

> Does Event Hubs support structured streaming at all yet?
>
> On Fri, 27 Oct 2017 at 1:43 pm, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> Could anyone share if there is any code snippet on how to use spark
>> structured streaming with event hubs ??
>>
>> Thanks,
>> Asmath
>>
>> Sent from my iPhone
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Structured streaming with event hubs

2017-10-27 Thread ayan guha
Does Event Hubs support structured streaming at all yet?

On Fri, 27 Oct 2017 at 1:43 pm, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> Could anyone share if there is any code snippet on how to use spark
> structured streaming with event hubs ??
>
> Thanks,
> Asmath
>
> Sent from my iPhone
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Best Regards,
Ayan Guha


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Thakrar, Jayesh
What you have is sequential code and hence sequential processing.
Also, Spark/Scala are not parallel programming languages, and even if they were, 
statements are executed sequentially unless you exploit the parallel/concurrent 
execution features.

Anyway, see if this works:

val (RDD1, RDD2) = (JavaFunctions.cassandraTable(...),
                    JavaFunctions.cassandraTable(...))

val (RDD3, RDD4) = (RDD1.flatMap(..), RDD2.flatMap(..))


I am hoping that, since Spark is based on Scala, the behavior below will apply:
scala> var x = 0
x: Int = 0

scala> val (a,b) = (x + 1, x+1)
a: Int = 1
b: Int = 1
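
If you want the two counts themselves to run at the same time, you can also submit the actions from separate threads so that the scheduler can run the jobs concurrently. A rough sketch, using the RDD names from your pseudocode:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each count is an action; launching the actions from separate threads lets the
// scheduler run the two jobs concurrently, subject to available cores.
val f3 = Future { RDD3.count }
val f4 = Future { RDD4.count }

val count3 = Await.result(f3, Duration.Inf)
val count4 = Await.result(f4, Duration.Inf)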



From: Cassa L 
Date: Friday, October 27, 2017 at 1:50 AM
To: Jörn Franke 
Cc: user , 
Subject: Re: Why don't I see my spark jobs running in parallel in 
Cassandra/Spark DSE cluster?

No, I dont use Yarn.  This is standalone spark that comes with DataStax 
Enterprise version of Cassandra.

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke 
> wrote:
Do you use yarn ? Then you need to configure the queues with the right 
scheduler and method.

On 27. Oct 2017, at 08:05, Cassa L 
> wrote:
Hi,
I have a spark job that has use case as below:
RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some 
transformation and after that I do a count on transformed data.

Code somewhat  looks like this:

RDD1=JavaFunctions.cassandraTable(...)
RDD2=JavaFunctions.cassandraTable(...)
RDD3 = RDD1.flatMap(..)
RDD4 = RDD2.flatMap()

RDD3.count
RDD4.count

In Spark UI I see count() functions are getting called one after another. How 
do I make it parallel? I also looked at below discussion from Cloudera, but it 
does not show how to run driver functions in parallel. Do I just add Executor 
and run them in threads?

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515

Attaching UI snapshot here?


Thanks.
LCassa



Re: Orc predicate pushdown with Spark Sql

2017-10-27 Thread Siva Gudavalli

I found a workaround: when I create the Hive table using Spark “saveAsTable”, I see 
filters being pushed down.

The other approaches I tried, where filters are not pushed down, are:

1) creating the Hive table upfront and loading ORC into it using Spark SQL
2) creating ORC files using Spark SQL and then creating a Hive external table

If my understanding is correct, when I use saveAsTable, Spark registers the table 
in the Hive metastore with its own custom SerDe and is therefore able to push down 
filters.
Please correct me if I am wrong.

Another question:

When I am writing ORC to Hive using “saveAsTable”, is there any way I can 
provide details about the ORC files, for instance stripe.size? Can I create 
bloom filters, etc.?
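
Something along these lines is what I have in mind (a sketch only; “orc.stripe.size” and “orc.bloom.filter.columns” are standard ORC writer properties, but I am not sure whether my Spark version forwards them when they are passed as data source options; df and the table name are placeholders):

// Sketch only: whether these options actually reach the ORC writer depends on
// the Spark/ORC version; df and the table name are placeholders.
df.write
  .format("orc")
  .option("orc.stripe.size", (64 * 1024 * 1024).toString)   // 64 MB stripes
  .option("orc.bloom.filter.columns", "usr")                // bloom filter on the usr column
  .saveAsTable("hlogsv5_bloom")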


Regards
Shiv



> On Oct 25, 2017, at 1:37 AM, Jörn Franke  wrote:
> 
> Well the meta information is in the file so I am not surprised that it reads 
> the file, but it should not read all the content, which is probably also not 
> happening. 
> 
> On 24. Oct 2017, at 18:16, Siva Gudavalli  > wrote:
> 
>> 
>> Hello,
>>  
>> I have an update here. 
>>  
>> spark SQL is pushing predicates down, if I load the orc files in spark 
>> Context and Is not the same when I try to read hive Table directly.
>> please let me know if i am missing something here.
>>  
>> Is this supported in spark  ? 
>>  
>> when I load the files in spark Context 
>> scala> val hlogsv5 = 
>> sqlContext.read.format("orc").load("/user/hive/warehouse/hlogsv5")
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5 on driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003 on driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others on 
>> driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others/usrpartkey=hhhUsers
>>  on driver
>> hlogsv5: org.apache.spark.sql.DataFrame = [id: string, chn: string, ht: 
>> string, br: string, rg: string, cat: int, scat: int, usr: string, org: 
>> string, act: int, ctm: int, c1: string, c2: string, c3: string, d1: int, d2: 
>> int, doc: binary, cdt: int, catpartkey: string, usrpartkey: string]
>> scala> hlogsv5.registerTempTable("tempo")
>> scala> sqlContext.sql ( "selecT id from tempo where cdt=20171003 and 
>> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
>> 17/10/24 16:11:22 INFO ParseDriver: Parsing command: selecT id from tempo 
>> where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id 
>> desc limit 10
>> 17/10/24 16:11:22 INFO ParseDriver: Parse Completed
>> 17/10/24 16:11:22 INFO DataSourceStrategy: Selected 1 partitions out of 1, 
>> pruned 0.0% partitions.
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6 stored as values in 
>> memory (estimated size 164.5 KB, free 468.0 KB)
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes 
>> in memory (estimated size 18.3 KB, free 486.4 KB)
>> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory 
>> on 172.21.158.61:43493 (size: 18.3 KB, free: 511.4 MB)
>> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 6 from explain at 
>> :33
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7 stored as values in 
>> memory (estimated size 170.2 KB, free 656.6 KB)
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes 
>> in memory (estimated size 18.8 KB, free 675.4 KB)
>> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory 
>> on 172.21.158.61:43493 (size: 18.8 KB, free: 511.4 MB)
>> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 7 from explain at 
>> :33
>> == Physical Plan ==
>> TakeOrderedAndProject(limit=10, orderBy=[id#145 DESC], output=[id#145])
>> +- ConvertToSafe
>> +- Project [id#145]
>> +- Filter (usr#152 = AA0YP)
>> +- Scan OrcRelation[id#145,usr#152] InputPaths: 
>> maprfs:///user/hive/warehouse/hlogsv5, PushedFilters: 
>> [EqualTo(usr,AA0YP)]
>>  
>> when i read this as hive Table 
>>  
>> scala> sqlContext.sql ( "selecT id from hlogsv5 where cdt=20171003 and 
>> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
>> 17/10/24 16:11:32 INFO ParseDriver: Parsing command: selecT id from 
>> hlogsv5 where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' 
>> order by id desc limit 10
>> 17/10/24 16:11:32 INFO ParseDriver: Parse Completed
>> 17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8 stored as values in 
>> memory (estimated size 399.1 KB, free 1074.6 KB)
>> 17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes 
>> in memory (estimated size 42.7 KB, free 1117.2 KB)
>> 17/10/24 16:11:32 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory 
>> on 172.21.158.61:43493 

cosine similarity between rows

2017-10-27 Thread Donni Khan
I have a Spark job to compute the similarity between text documents:

RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowsimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowsimilarity.entries().toJavaRDD();
List<MatrixEntry> list = entries.collect();
for (MatrixEntry s : list) System.out.println(s);

The MatrixEntry(i, j, value) represents the similarity between
columns (let's say the features of the documents). But how can I show the
similarity between rows? Suppose I have five documents, Doc1 to Doc5; I
would like to show the similarity between all of those documents. How do I get
that?
Any help?
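
The only idea I have so far is to transpose the matrix first so that the documents become columns (a rough sketch in Scala, assuming vectorsRDD is the same JavaRDD of feature vectors as above), but I am not sure whether this is the right approach:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}

// Transpose so that the original rows (documents) become columns; then
// columnSimilarities compares documents instead of features.
val indexed = new IndexedRowMatrix(
  vectorsRDD.rdd.zipWithIndex().map { case (v, i) => IndexedRow(i, v) })
val docSims: CoordinateMatrix =
  indexed.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities(0.5)

// Each MatrixEntry(i, j, value) is now the similarity between document i and document j.
docSims.entries.take(10).foreach(println)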


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Cassa L
No, I don't use YARN. This is the standalone Spark that comes with the DataStax
Enterprise version of Cassandra.

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke  wrote:

> Do you use yarn ? Then you need to configure the queues with the right
> scheduler and method.
>
> On 27. Oct 2017, at 08:05, Cassa L  wrote:
>
> Hi,
> I have a spark job that has use case as below:
> RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some
> transformation and after that I do a count on transformed data.
>
> Code somewhat  looks like this:
>
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
>
> RDD3.count
> RDD4.count
>
> In Spark UI I see count() functions are getting called one after another.
> How do I make it parallel? I also looked at below discussion from Cloudera,
> but it does not show how to run driver functions in parallel. Do I just add
> Executor and run them in threads?
>
> https://community.cloudera.com/t5/Advanced-Analytics-
> Apache-Spark/Getting-Spark-stages-to-run-in-parallel-
> inside-an-application/td-p/38515
>
> Attaching UI snapshot here?
>
>
> Thanks.
> LCassa
>
>


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Jörn Franke
See also https://spark.apache.org/docs/latest/job-scheduling.html

> On 27. Oct 2017, at 08:05, Cassa L  wrote:
> 
> Hi,
> I have a spark job that has use case as below: 
> RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some 
> transformation and after that I do a count on transformed data.
> 
> Code somewhat  looks like this:
> 
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
> 
> RDD3.count
> RDD4.count
> 
> In Spark UI I see count() functions are getting called one after another. How 
> do I make it parallel? I also looked at below discussion from Cloudera, but 
> it does not show how to run driver functions in parallel. Do I just add 
> Executor and run them in threads?
> 
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
> 
> Attaching UI snapshot here?
> 
> 
> Thanks.
> LCassa


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Jörn Franke
Do you use YARN? If so, you need to configure the queues with the right 
scheduler and method.
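
Within a single application (also without YARN) the relevant knob is the scheduler mode; a rough sketch, with a placeholder path for the pool definition file:

import org.apache.spark.{SparkConf, SparkContext}

// Enable the FAIR scheduler so jobs submitted from different threads can share
// resources instead of running strictly FIFO. The allocation file path is a placeholder.
val conf = new SparkConf()
  .setAppName("parallel-counts")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Optionally assign jobs issued from a given thread to a named pool:
sc.setLocalProperty("spark.scheduler.pool", "poolA")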

> On 27. Oct 2017, at 08:05, Cassa L  wrote:
> 
> Hi,
> I have a spark job that has use case as below: 
> RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some 
> transformation and after that I do a count on transformed data.
> 
> Code somewhat  looks like this:
> 
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
> 
> RDD3.count
> RDD4.count
> 
> In Spark UI I see count() functions are getting called one after another. How 
> do I make it parallel? I also looked at below discussion from Cloudera, but 
> it does not show how to run driver functions in parallel. Do I just add 
> Executor and run them in threads?
> 
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
> 
> Attaching UI snapshot here?
> 
> 
> Thanks.
> LCassa