Re: Adding header to an rdd before saving to text file

2017-06-05 Thread Yan Facai
Hi, upendra.
It will be easier to use a DataFrame to read/save a CSV file with a header, if
you'd like.
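
A minimal sketch of that approach (the input/output paths here are hypothetical,
and this assumes the Spark 2.x CSV source):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvHeaderExample").getOrCreate()

// Read the CSV with its header row, so the column names become part of the schema.
val df = spark.read.option("header", "true").csv("source.csv")

// ... transformations on df ...

// Write the result back out with a header line in each output file.
df.write.option("header", "true").csv("newfile")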

On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 
wrote:

> I am reading a CSV (the file has headers: header1, header2) and generating an
> RDD. After a few transformations I create an RDD and finally write it to a
> text file.
>
> What's the best way to add the header from the source file into the RDD and
> have it available as a header in the new file, i.e. when I save the RDD as a
> text file using saveAsTextFile("newfile"), header1 and header2 shall be
> available.
>
>
> Thanks,
> Upendra
>


Spark on Kubernetes: Birds-of-a-Feather Session 12:50pm 6/6 @ Spark Summit

2017-06-05 Thread Erik Erlandson
Come learn about the community development project to add a native
Kubernetes scheduling back-end to Apache Spark!  Meet contributors
and network with community members interested in running Spark on
Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster;
find out how to contribute to the project.

https://spark-summit.org/2017/schedule/


Spark Streaming Job Stuck

2017-06-05 Thread Jain, Nishit
I have a very simple Spark Streaming job running locally in standalone mode.
There is a custom receiver which reads from a database and passes the data to the
main job, which prints the total. It is not an actual use case, but I am playing
around to learn. The problem is that the job gets stuck forever; the logic is very
simple, so I think it is neither a processing nor a memory issue. What is strange is
that if I STOP the job, I suddenly see the output of the job execution in the logs,
and the other backed-up jobs follow! Can someone help me understand what is going on here?

// Local SparkSession with a single thread and a 5-second batch interval.
val spark = SparkSession
  .builder()
  .master("local[1]")
  .appName("SocketStream")
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Custom receiver that reads from the database.
val lines = ssc.receiverStream(new HanaCustomReceiver())

// Print the record count of each micro-batch.
lines.foreachRDD { x => println("==" + x.count()) }

ssc.start()
ssc.awaitTermination()



After terminating the program, the following logs roll, which show the execution of the
batch:

17/06/05 15:56:16 INFO JobGenerator: Stopping JobGenerator immediately
17/06/05 15:56:16 INFO RecurringTimer: Stopped timer for JobGenerator after time 1496696175000
17/06/05 15:56:16 INFO JobGenerator: Stopped JobGenerator
==100

Thanks!
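
A likely explanation (an assumption on my part, not confirmed in this thread):
with master("local[1]") there is only one local thread, and a receiver-based
stream occupies one thread permanently, so nothing is left to process the
batches; the queued jobs only run once the receiver is stopped. The Spark
Streaming documentation recommends at least local[2] when using receivers, e.g.:

val spark = SparkSession
  .builder()
  .master("local[2]")   // one thread for the receiver, at least one for processing
  .appName("SocketStream")
  .getOrCreate()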


Adding header to an rdd before saving to text file

2017-06-05 Thread upendra 1991
I am reading a CSV (the file has headers: header1, header2) and generating an RDD.
After a few transformations I create an RDD and finally write it to a text file.
What's the best way to add the header from the source file into the RDD and have it
available as a header in the new file, i.e. when I save the RDD as a text file using
saveAsTextFile("newfile"), header1 and header2 shall be available.

Thanks,
Upendra
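
One possible approach (an editorial sketch, not from the thread; the paths and
the transformation are placeholders): keep the first line of the source file as
the header and prepend it to the transformed RDD before saving. Note that
writing a single file via coalesce(1) is only practical for small outputs.

val raw = sc.textFile("source.csv")
val header = raw.first()                  // e.g. "header1,header2"
val data = raw.filter(_ != header)        // drop the header before transforming
val transformed = data.map(identity)      // stand-in for the real transformations

// Prepend the header as a one-element RDD; coalesce(1) keeps it as the first line.
(sc.parallelize(Seq(header)) ++ transformed)
  .coalesce(1)
  .saveAsTextFile("newfile")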

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
My main concern is that the choice of Isilon is not for one use case. It
will be a strategic decision for the client, and if we decide to go that way
we are effectively moving away from HDFS principles (3x replication etc.) as
well.

Granted one can argue this may be OK but of course we have to look at our
future needs. From my experience of these tools, you cannot simply roll it
back without incurring considerable work and considerable cost.

And, after all, will the cost justify the whole of this setup? What about
performance and other bottlenecks?

Thanks



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 June 2017 at 15:46, John Leach  wrote:

> Mich,
>
> Yes, Isilon is in production...
>
> Isilon is a serious product and has been around for quite a while.  For
> on-premise external storage, we see it quite a bit.  Separating the compute
> from the storage actually helps.  It is also a nice transition to the cloud
> providers.
>
> Have you looked at MapR?  Usually the system guys target snapshots,
> volumes, and posix compliance if they are bought into Isilon.
>
> Good luck Mich.
>
> Regards,
> John Leach
>
>
>
>
> On Jun 5, 2017, at 9:27 AM, Mich Talebzadeh 
> wrote:
>
> Hi John,
>
> Thanks. Did you end up in production or in other words besides PoC did you
> use it in anger?
>
> The intention is to build Isilon on top of the whole HDFS cluster! If we
> go that way we also need to adopt it for DR as well.
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 5 June 2017 at 15:19, John Leach  wrote:
>
>> Mich,
>>
>> We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase
>> for real-time).  We were concerned initially and the initial setup took a
> bit longer than expected, but it performed well on both low latency and
>> high throughput use cases at scale (our POC ~ 100 TB).
>>
>> Just a data point.
>>
>> Regards,
>> John Leach
>>
>> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh 
>> wrote:
>>
>> I am concerned about the use case of tools like Isilon or Panasas to
> create a layer on top of HDFS, essentially an HCFS on top of HDFS with the
>> usual 3x replication gone into the tool itself.
>>
> There is interest to push Isilon as the solution forward, but my
> caution is about the scalability and future-proofing of such tools. So I was
> wondering if anyone else has tried such a solution.
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 June 2017 at 19:09, Gene Pang  wrote:
>>
>>> As Vincent mentioned earlier, I think Alluxio can work for this. You can 
>>> mount
>>> your (potentially remote) storage systems to Alluxio
>>> ,
>>> and deploy Alluxio co-located to the compute cluster. The computation
>>> framework will still achieve data locality since Alluxio workers are
>>> co-located, even though the existing storage systems may be remote. You can
>>> also use tiered storage
>>> 
>>> to deploy using only memory, and/or other physical media.
>>>
>>> Here are some blogs (Alluxio with Minio
>>> 

Edge Node in Spark

2017-06-05 Thread Ashok Kumar
Hi,

I am a bit confused about the terms Edge node, Edge server and gateway node in Spark.

Do these mean the same thing?

How does one set up an Edge node to be used in Spark? Is this different from an
Edge node for Hadoop, please?

Thanks




Re: Incorrect CAST to TIMESTAMP in Hive compatibility

2017-06-05 Thread Anton Okolnychyi
Hi,

I also noticed this issue. Actually, it was already mentioned several
times. There is an existing JIRA (SPARK-17914).

I am going to submit a PR to fix this in a few days.

Best,
Anton

On Jun 5, 2017 21:42, "verbamour"  wrote:

> Greetings,
>
> I am using Hive compatibility in Spark 2.1.1 and it appears that the CAST
> string to TIMESTAMP improperly trims the sub-second value. In particular,
> leading zeros in the decimal portion appear to be dropped.
>
> Steps to reproduce:
> 1. From `spark-shell` issue: `spark.sql("SELECT CAST('2017-04-05
> 16:00:48.0297580' AS TIMESTAMP)").show(100, false)`
>
> 2. Note erroneous result (i.e. ".0297580" becomes ".29758")
> ```
> +----------------------------------------------+
> |CAST(2017-04-05 16:00:48.0297580 AS TIMESTAMP)|
> +----------------------------------------------+
> |2017-04-05 16:00:48.29758                     |
> +----------------------------------------------+
> ```
>
> I am not currently plugged into the JIRA system for Spark, so if this is
> truly a bug please bring it to the attention of the appropriate
> authorities.
>
> Cheers,
>  -tom
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incorrect-CAST-to-TIMESTAMP-in-Hive-compatibility-tp28744.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Incorrect CAST to TIMESTAMP in Hive compatibility

2017-06-05 Thread verbamour
Greetings,

I am using Hive compatibility in Spark 2.1.1 and it appears that the CAST
string to TIMESTAMP improperly trims the sub-second value. In particular,
leading zeros in the decimal portion appear to be dropped.

Steps to reproduce:
1. From `spark-shell` issue: `spark.sql("SELECT CAST('2017-04-05
16:00:48.0297580' AS TIMESTAMP)").show(100, false)`

2. Note erroneous result (i.e. ".0297580" becomes ".29758")
```
+----------------------------------------------+
|CAST(2017-04-05 16:00:48.0297580 AS TIMESTAMP)|
+----------------------------------------------+
|2017-04-05 16:00:48.29758                     |
+----------------------------------------------+
```

I am not currently plugged into the JIRA system for Spark, so if this is
truly a bug please bring it to the attention of the appropriate authorities.

Cheers,
 -tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Incorrect-CAST-to-TIMESTAMP-in-Hive-compatibility-tp28744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SparkAppHandle.Listener.infoChanged behaviour

2017-06-05 Thread Mohammad Tariq
Hi Marcelo,

Thank you so much for the response. Appreciate it!


Tariq, Mohammad
about.me/mti



On Mon, Jun 5, 2017 at 7:24 AM, Marcelo Vanzin  wrote:

> On Sat, Jun 3, 2017 at 7:16 PM, Mohammad Tariq  wrote:
> > I am having a bit of difficulty in understanding the exact behaviour of
> > SparkAppHandle.Listener.infoChanged(SparkAppHandle handle) method. The
> > documentation says :
> >
> > Callback for changes in any information that is not the handle's state.
> >
> > What exactly is meant by any information here? Apart from state, the other
> > piece of information I can see is the ID.
>
> So, you answered your own question.
>
> If there's ever any new kind of information, it would use the same event.
>
> --
> Marcelo
>
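
For reference, a minimal sketch of a launcher listener (the jar path, main class
and master below are hypothetical): stateChanged fires on state transitions,
while infoChanged covers non-state information such as the application ID
becoming available.

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val listener = new SparkAppHandle.Listener {
  // Called when the handle's state changes, e.g. SUBMITTED -> RUNNING -> FINISHED.
  override def stateChanged(handle: SparkAppHandle): Unit =
    println(s"state changed: ${handle.getState}")

  // Called for information other than state, e.g. once the application ID is known.
  override def infoChanged(handle: SparkAppHandle): Unit =
    println(s"info changed, appId = ${handle.getAppId}")
}

val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // hypothetical application jar
  .setMainClass("com.example.MyApp")    // hypothetical main class
  .setMaster("local[2]")
  .startApplication(listener)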


Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
I am using Spark Streaming Checkpoint and Kafka Direct Stream.
It uses a 30 sec batch duration and normally the job is successful in 15-20
sec.

If the Spark application fails after the successful completion
(149668428 ms in the log below) and restarts, it duplicates the last
batch again.

Is this the expected behavior? I was expecting this to start a new batch
window.


Here are some logs:

Last successful run:
17/06/05 13:38:00 INFO JobScheduler: Total delay: 0.040 s for time
149668428 ms (execution: 0.029 s)
17/06/05 13:38:00 INFO KafkaRDD: Removing RDD 0 from persistence list
17/06/05 13:38:00 INFO BlockManager: Removing RDD 0
17/06/05 13:38:00 INFO JobGenerator: Checkpointing graph for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Updating checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Updated checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO CheckpointWriter: Submitted checkpoint of time
149668428 ms to writer queue
17/06/05 13:38:00 INFO CheckpointWriter: Saving checkpoint for time
149668428 ms to file 'file:/Users/anbucheeralan/
IdeaProjects/Spark2Example/ckpt/checkpoint-149668428'
17/06/05 13:38:00 INFO CheckpointWriter: *Checkpoint for time 149668428
ms saved to file
'file:/Users/anbucheeralan/IdeaProjects/Spark2Example/ckpt/checkpoint-149668428',
took 4032 bytes and 9 ms*
17/06/05 13:38:00 INFO DStreamGraph: Clearing checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Cleared checkpoint data for time
149668428 ms

After the restart,

17/06/05 13:42:31 INFO DirectKafkaInputDStream$
DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for time
149668428 ms [(my_test,0,2000,2000)]
17/06/05 13:42:31 INFO DirectKafkaInputDStream: Restored checkpoint data
*17/06/05 13:42:31 INFO JobGenerator: Batches during down time (10
batches): 149668428 ms, 149668431 ms, 149668434 ms,
149668437 ms, 149668440 ms, 149668443 ms, 149668446 ms,
149668449 ms, 149668452 ms, 149668455 ms*
*17/06/05 13:42:31 INFO JobGenerator: Batches pending processing (0
batches): *
*17/06/05 13:42:31 INFO JobGenerator: Batches to reschedule (10
batches): *149668428
ms, 149668431 ms, 149668434 ms, 149668437 ms, 149668440 ms,
149668443 ms, 149668446 ms, 149668449 ms, 149668452 ms,
149668455 ms
17/06/05 13:42:31 INFO JobScheduler: Added jobs for time 149668428 ms
17/06/05 13:42:31 INFO JobScheduler: Starting job streaming job
149668428 ms.0 from job set of time 149668428 ms




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Spark-Streaming-Checkpoint-and-Exactly-Once-Guarantee-on-Kafka-Direct-Stream-tp28743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming Checkpoint and Kafka Direct Stream.
It uses a 30 sec batch duration and normally the job is successful in 15-20
sec.

If the Spark application fails after the successful completion
(149668428 ms in the log below) and restarts, it duplicates the last
batch again.

Is this the expected behavior? I was expecting this to start a new batch
window.


Here are some logs:

Last successful run:
17/06/05 13:38:00 INFO JobScheduler: Total delay: 0.040 s for time
149668428 ms (execution: 0.029 s)
17/06/05 13:38:00 INFO KafkaRDD: Removing RDD 0 from persistence list
17/06/05 13:38:00 INFO BlockManager: Removing RDD 0
17/06/05 13:38:00 INFO JobGenerator: Checkpointing graph for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Updating checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Updated checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO CheckpointWriter: Submitted checkpoint of time
149668428 ms to writer queue
17/06/05 13:38:00 INFO CheckpointWriter: Saving checkpoint for time
149668428 ms to file
'file:/Users/anbucheeralan/IdeaProjects/Spark2Example/ckpt/checkpoint-149668428'
17/06/05 13:38:00 INFO CheckpointWriter: *Checkpoint for time 149668428
ms saved to file
'file:/Users/anbucheeralan/IdeaProjects/Spark2Example/ckpt/checkpoint-149668428',
took 4032 bytes and 9 ms*
17/06/05 13:38:00 INFO DStreamGraph: Clearing checkpoint data for time
149668428 ms
17/06/05 13:38:00 INFO DStreamGraph: Cleared checkpoint data for time
149668428 ms

After the restart,

17/06/05 13:42:31 INFO
DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring
KafkaRDD for time 149668428 ms [(my_test,0,2000,2000)]
17/06/05 13:42:31 INFO DirectKafkaInputDStream: Restored checkpoint data
*17/06/05 13:42:31 INFO JobGenerator: Batches during down time (10
batches): 149668428 ms, 149668431 ms, 149668434 ms,
149668437 ms, 149668440 ms, 149668443 ms, 149668446 ms,
149668449 ms, 149668452 ms, 149668455 ms*
*17/06/05 13:42:31 INFO JobGenerator: Batches pending processing (0
batches): *
*17/06/05 13:42:31 INFO JobGenerator: Batches to reschedule (10
batches): *149668428
ms, 149668431 ms, 149668434 ms, 149668437 ms, 149668440 ms,
149668443 ms, 149668446 ms, 149668449 ms, 149668452 ms,
149668455 ms
17/06/05 13:42:31 INFO JobScheduler: Added jobs for time 149668428 ms
17/06/05 13:42:31 INFO JobScheduler: Starting job streaming job
149668428 ms.0 from job set of time 149668428 ms
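
For context, the usual checkpoint-recovery pattern is sketched below (the
checkpoint directory and batch logic are placeholders). On restart the context
is rebuilt from the checkpoint, and any batch whose completion was not yet
checkpointed is rescheduled, so the output operations need to be idempotent or
transactional to get effectively-once results.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/ckpt"   // placeholder checkpoint directory

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointExample")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... set up the Kafka direct stream and output operations here ...
  ssc
}

// On a clean start this builds a new context; after a failure it restores the
// checkpointed one and replays batches that were not fully processed.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()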


Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have
the following Kafka Consumer. The consumer is reading from a topic, and I
have a mechanism which blocks the consumer from time to time.

The producer is a separate thread which is continuously sending data. I
want to ensure that the consumer does not drop, or fail to read, data sent during
the period when the consumer was "blocked".

*In case the "blocked" part is confusing - we have a modified Spark
scheduler where we take a lock on the scheduler.*

public static JavaDStream<String> getKafkaDStream(String inputTopics,
        String broker, int kafkaPort, JavaStreamingContext ssc) {
    // Topics and Kafka connection parameters for the direct stream.
    HashSet<String> inputTopicsSet =
        new HashSet<String>(Arrays.asList(inputTopics.split(",")));
    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", broker + ":" + kafkaPort);

    JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(
            ssc,
            String.class,
            String.class,
            StringDecoder.class,
            StringDecoder.class,
            kafkaParams,
            inputTopicsSet
        );

    // Keep only the message values (drop the keys).
    JavaDStream<String> lines = messages.map(
        new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

    return lines;
}

Thanks
Nipun
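
One point that may help (a sketch, not from the thread): with the direct
approach each batch covers an explicit offset range per partition, so records
produced while the consumer is "blocked" are not skipped; they are picked up by
the next batch. The ranges can also be inspected or persisted. A Scala
equivalent of the method above, assuming the same parameters (inputTopics,
broker, kafkaPort, ssc):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val inputTopicsSet = inputTopics.split(",").toSet
val kafkaParams = Map("metadata.broker.list" -> (broker + ":" + kafkaPort))

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, inputTopicsSet)

messages.foreachRDD { rdd =>
  // Each RDD from the direct stream knows exactly which offsets it covers.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
  // ... process the records, then store the ranges durably if you need to resume later ...
}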


Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich,

Yes, Isilon is in production...

Isilon is a serious product and has been around for quite a while.  For 
on-premise external storage, we see it quite a bit.  Separating the compute 
from the storage actually helps.  It is also a nice transition to the cloud 
providers.  

Have you looked at MapR?  Usually the system guys target snapshots, volumes, 
and posix compliance if they are bought into Isilon.  

Good luck Mich.

Regards,
John Leach




> On Jun 5, 2017, at 9:27 AM, Mich Talebzadeh  wrote:
> 
> Hi John,
> 
> Thanks. Did you end up in production or in other words besides PoC did you 
> use it in anger?
> 
> The intention is to build Isilon on top of the whole HDFS cluster! If we go 
> that way we also need to adopt it for DR as well.
> 
> Cheers
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 5 June 2017 at 15:19, John Leach  > wrote:
> Mich,
> 
> We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase for 
> real-time).  We were concerned initially and the initial setup took a bit 
> longer than expected, but it performed well on both low latency and high 
> throughput use cases at scale (our POC ~ 100 TB).  
> 
> Just a data point.
> 
> Regards,
> John Leach
> 
>> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh > > wrote:
>> 
>> I am concerned about the use case of tools like Isilon or Panasas to create 
>> a layer on top of HDFS, essentially an HCFS on top of HDFS with the usual 3x 
>> replication gone into the tool itself.
>> 
>> There is interest to push Isilon as the solution forward, but my caution 
>> is about the scalability and future-proofing of such tools. So I was wondering if 
>> anyone else has tried such a solution.
>> 
>> Thanks
>>  
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 June 2017 at 19:09, Gene Pang > > wrote:
>> As Vincent mentioned earlier, I think Alluxio can work for this. You can 
>> mount your (potentially remote) storage systems to Alluxio 
>> ,
>>  and deploy Alluxio co-located to the compute cluster. The computation 
>> framework will still achieve data locality since Alluxio workers are 
>> co-located, even though the existing storage systems may be remote. You can 
>> also use tiered storage 
>>  to 
>> deploy using only memory, and/or other physical media.
>> 
>> Here are some blogs (Alluxio with Minio 
>> ,
>>  Alluxio with HDFS 
>> ,
>>  Alluxio with S3 
>> )
>>  which use similar architecture.
>> 
>> Hope that helps,
>> Gene
>> 
>> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh > > wrote:
>> As a matter of interest what is the best way of creating virtualised 
>> clusters all pointing to the same physical data?
>> 
>> thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any 

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
Hi John,

Thanks. Did you end up in production, or in other words, besides the PoC did you
use it in anger?

The intention is to build Isilon on top of the whole HDFS cluster! If we
go that way we also need to adopt it for DR as well.

Cheers



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 5 June 2017 at 15:19, John Leach  wrote:

> Mich,
>
> We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase for
> real-time).  We were concerned initially and the initial setup took a bit
> longer than expected, but it performed well on both low latency and high
> throughput use cases at scale (our POC ~ 100 TB).
>
> Just a data point.
>
> Regards,
> John Leach
>
> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh 
> wrote:
>
> I am concerned about the use case of tools like Isilon or Panasas to
> create a layer on top of HDFS, essentially an HCFS on top of HDFS with the
> usual 3x replication gone into the tool itself.
>
> There is interest to push Isilon as the solution forward, but my caution
> is about the scalability and future-proofing of such tools. So I was wondering if
> anyone else has tried such a solution.
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 June 2017 at 19:09, Gene Pang  wrote:
>
>> As Vincent mentioned earlier, I think Alluxio can work for this. You can 
>> mount
>> your (potentially remote) storage systems to Alluxio
>> ,
>> and deploy Alluxio co-located to the compute cluster. The computation
>> framework will still achieve data locality since Alluxio workers are
>> co-located, even though the existing storage systems may be remote. You can
>> also use tiered storage
>> 
>> to deploy using only memory, and/or other physical media.
>>
>> Here are some blogs (Alluxio with Minio
>> ,
>> Alluxio with HDFS
>> ,
>> Alluxio with S3
>> )
>> which use similar architecture.
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> As a matter of interest what is the best way of creating virtualised
>>> clusters all pointing to the same physical data?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 June 2017 at 09:27, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
 If mandatory, you can use a local cache like alluxio

 Le 1 juin 2017 10:23 AM, "Mich Talebzadeh" 
 a écrit :

> Thanks Vincent. I assume by physical data locality you mean you are
> going through Isilon and HCFS and not through direct HDFS.
>
> Also I agree with you that shared network could be an issue as well.
> However, it allows you to reduce data redundancy (you do not need R3 in
> HDFS anymore) and also you can build virtual clusters on the same data. 
> One
> cluster for read/writes 

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
Mich,

We used Isilon for a POC of Splice Machine (Spark for Analytics, HBase for 
real-time).  We were concerned initially and the initial setup took a bit 
longer than expected, but it performed well on both low latency and high 
throughput use cases at scale (our POC ~ 100 TB).  

Just a data point.

Regards,
John Leach

> On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh  wrote:
> 
> I am concerned about the use case of tools like Isilon or Panasas to create a 
> layer on top of HDFS, essentially an HCFS on top of HDFS with the usual 3x 
> replication gone into the tool itself.
> 
> There is interest to push Isilon as the solution forward, but my caution is 
> about the scalability and future-proofing of such tools. So I was wondering if 
> anyone else has tried such a solution.
> 
> Thanks
>  
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 June 2017 at 19:09, Gene Pang  > wrote:
> As Vincent mentioned earlier, I think Alluxio can work for this. You can 
> mount your (potentially remote) storage systems to Alluxio 
> ,
>  and deploy Alluxio co-located to the compute cluster. The computation 
> framework will still achieve data locality since Alluxio workers are 
> co-located, even though the existing storage systems may be remote. You can 
> also use tiered storage 
>  to 
> deploy using only memory, and/or other physical media.
> 
> Here are some blogs (Alluxio with Minio 
> ,
>  Alluxio with HDFS 
> ,
>  Alluxio with S3 
> )
>  which use similar architecture.
> 
> Hope that helps,
> Gene
> 
> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh  > wrote:
> As a matter of interest what is the best way of creating virtualised clusters 
> all pointing to the same physical data?
> 
> thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 1 June 2017 at 09:27, vincent gromakowski  > wrote:
> If mandatory, you can use a local cache like alluxio
> 
> Le 1 juin 2017 10:23 AM, "Mich Talebzadeh"  > a écrit :
> Thanks Vincent. I assume by physical data locality you mean you are going 
> through Isilon and HCFS and not through direct HDFS.
> 
> Also I agree with you that shared network could be an issue as well. However, 
> it allows you to reduce data redundancy (you do not need R3 in HDFS anymore) 
> and also you can build virtual clusters on the same data. One cluster for 
> read/writes and another for reads? That is what has been suggested!
> 
> regards
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 1 June 2017 at 08:55, vincent gromakowski 

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread Mich Talebzadeh
I am concerned about the use case of tools like Isilon or Panasas to create
a layer on top of HDFS, essentially an HCFS on top of HDFS with the usual 3x
replication gone into the tool itself.

There is interest to push Isilon as the solution forward, but my caution
is about the scalability and future-proofing of such tools. So I was wondering if
anyone else has tried such a solution.

Thanks



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 June 2017 at 19:09, Gene Pang  wrote:

> As Vincent mentioned earlier, I think Alluxio can work for this. You can mount
> your (potentially remote) storage systems to Alluxio
> ,
> and deploy Alluxio co-located to the compute cluster. The computation
> framework will still achieve data locality since Alluxio workers are
> co-located, even though the existing storage systems may be remote. You can
> also use tiered storage
>  to
> deploy using only memory, and/or other physical media.
>
> Here are some blogs (Alluxio with Minio
> ,
> Alluxio with HDFS
> ,
> Alluxio with S3
> )
> which use similar architecture.
>
> Hope that helps,
> Gene
>
> On Thu, Jun 1, 2017 at 1:45 AM, Mich Talebzadeh  > wrote:
>
>> As a matter of interest what is the best way of creating virtualised
>> clusters all pointing to the same physical data?
>>
>> thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 June 2017 at 09:27, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> If mandatory, you can use a local cache like alluxio
>>>
>>> Le 1 juin 2017 10:23 AM, "Mich Talebzadeh" 
>>> a écrit :
>>>
 Thanks Vincent. I assume by physical data locality you mean you are
 going through Isilon and HCFS and not through direct HDFS.

 Also I agree with you that shared network could be an issue as well.
 However, it allows you to reduce data redundancy (you do not need R3 in
 HDFS anymore) and also you can build virtual clusters on the same data. One
 cluster for read/writes and another for Reads? That is what has been
 suggested!

 regards

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 1 June 2017 at 08:55, vincent gromakowski <
 vincent.gromakow...@gmail.com> wrote:

> I don't recommend this kind of design because you lose physical data
> locality and you will be affected by "bad neighbours" that are also using
> the network storage... We have one similar design but restricted to small
> clusters (more for experiments than production)
>
> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh :
>
>> Thanks Jorn,
>>
>> This was a proposal made by someone as the firm is already using this
>> tool on other SAN based storage and extend it to Big Data
>>
>> On paper 

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
I run spark-submit (https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications)
in client mode, which starts the micro-service. If you keep the event loop going
then the Spark context would remain active.

Thanks,
Muthu

On Mon, Jun 5, 2017 at 2:44 PM, kant kodali  wrote:

> Are you launching SparkSession from a MicroService or through spark-submit
> ?
>
> On Sun, Jun 4, 2017 at 11:52 PM, Muthu Jayakumar 
> wrote:
>
>> Hello Kant,
>>
>> >I still don't understand How SparkSession can use Akka to communicate
>> with SparkCluster?
>> Let me use your initial requirement as a way to illustrate what I mean --
>> i.e, "I want my Micro service app to be able to query and access data on
>> HDFS"
>> In order to run a query say a DF query (equally possible with SQL as
>> well), you'll need a sparkSession to build a query right? If you can have
>> your main thread launched in client-mode (https://spark.apache.org/docs
>> /latest/spark-standalone.html#launching-spark-applications) then you'll
>> be able to use play/akka based microservice as you used to.
>> Here is what I have in one of my applications do...
>> a. I have an akka-http as a micro-service that takes a query-like JSON
>> request (based on simple scala parser combinator) and runs a spark job
>> using dataframe/dataset and sends back JSON responses (synchronous and
>> asynchronously).
>> b. have another akka-actor that takes an object request to generate
>> parquet(s)
>> c. Another akka-http endpoint (based on web-sockets) to perform similar
>> operation as (a)
>> d. Another akka-http end-point to get progress on a running query /
>> parquet generation (which is based on SparkContext / SparkSQL internal API
>> which is similar to https://spark.apache.org/docs/latest/monitoring.html)
>> The idea is to make sure to have only one sparkSession per JVM. But you
>> can set the execution to be in FAIR (which defaults to FIFO) to be able to
>> run multiple queries in parallel. The application I use runs spark in Spark
>> Standalone with a 32 node cluster.
>>
>> Hope this gives some better idea.
>>
>> Thanks,
>> Muthu
>>
>>
>> On Sun, Jun 4, 2017 at 10:33 PM, kant kodali  wrote:
>>
>>> Hi Muthu,
>>>
>>> I am actually using Play framework for my Micro service which uses Akka
>>> but I still don't understand How SparkSession can use Akka to communicate
>>> with SparkCluster? SparkPi or SparkPl? any link?
>>>
>>> Thanks!
>>>
>>
>>
>


Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread kant kodali
Are you launching SparkSession from a MicroService or through spark-submit ?

On Sun, Jun 4, 2017 at 11:52 PM, Muthu Jayakumar  wrote:

> Hello Kant,
>
> >I still don't understand How SparkSession can use Akka to communicate
> with SparkCluster?
> Let me use your initial requirement as a way to illustrate what I mean --
> i.e, "I want my Micro service app to be able to query and access data on
> HDFS"
> In order to run a query say a DF query (equally possible with SQL as
> well), you'll need a sparkSession to build a query right? If you can have
> your main thread launched in client-mode (https://spark.apache.org/
> docs/latest/spark-standalone.html#launching-spark-applications) then
> you'll be able to use play/akka based microservice as you used to.
> Here is what I have in one of my applications do...
> a. I have an akka-http as a micro-service that takes a query-like JSON
> request (based on simple scala parser combinator) and runs a spark job
> using dataframe/dataset and sends back JSON responses (synchronous and
> asynchronously).
> b. have another akka-actor that takes an object request to generate
> parquet(s)
> c. Another akka-http endpoint (based on web-sockets) to perform similar
> operation as (a)
> d. Another akka-http end-point to get progress on a running query /
> parquet generation (which is based on SparkContext / SparkSQL internal API
> which is similar to https://spark.apache.org/docs/latest/monitoring.html)
> The idea is to make sure to have only one sparkSession per JVM. But you
> can set the execution to be in FAIR (which defaults to FIFO) to be able to
> run multiple queries in parallel. The application I use runs spark in Spark
> Standalone with a 32 node cluster.
>
> Hope this gives some better idea.
>
> Thanks,
> Muthu
>
>
> On Sun, Jun 4, 2017 at 10:33 PM, kant kodali  wrote:
>
>> Hi Muthu,
>>
>> I am actually using Play framework for my Micro service which uses Akka
>> but I still don't understand How SparkSession can use Akka to communicate
>> with SparkCluster? SparkPi or SparkPl? any link?
>>
>> Thanks!
>>
>
>


Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
Hello Kant,

>I still don't understand How SparkSession can use Akka to communicate with
SparkCluster?
Let me use your initial requirement as a way to illustrate what I mean --
i.e., "I want my Micro service app to be able to query and access data on
HDFS".
In order to run a query, say a DF query (equally possible with SQL as well),
you'll need a SparkSession to build the query, right? If you can have your
main thread launched in client mode (
https://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications)
then you'll be able to use a Play/Akka based microservice as you used to.
Here is what one of my applications does:
a. I have an akka-http micro-service that takes a query-like JSON
request (based on a simple Scala parser combinator), runs a Spark job
using DataFrame/Dataset, and sends back JSON responses (synchronously and
asynchronously).
b. Have another akka-actor that takes an object request to generate
parquet(s).
c. Another akka-http endpoint (based on web-sockets) to perform a similar
operation to (a).
d. Another akka-http endpoint to get progress on a running query / parquet
generation (which is based on SparkContext / SparkSQL internal APIs similar
to https://spark.apache.org/docs/latest/monitoring.html).
The idea is to make sure to have only one SparkSession per JVM. But you can
set the scheduling mode to FAIR (it defaults to FIFO) to be able to run
multiple queries in parallel. The application I use runs Spark in Spark
Standalone mode on a 32-node cluster.

Hope this gives some better idea.

Thanks,
Muthu
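
A minimal sketch of that pattern (names and paths below are hypothetical): one
lazily created SparkSession shared by the whole service JVM, with FAIR
scheduling enabled so that concurrent requests can run their queries in
parallel.

import org.apache.spark.sql.SparkSession

object ParquetQueryService {
  // Single shared session for the service JVM, created once on first use.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("parquet-query-service")
    .config("spark.scheduler.mode", "FAIR")   // FIFO is the default
    .getOrCreate()

  // Each incoming request can run its own DataFrame query against the shared session.
  def countRows(parquetPath: String): Long =
    spark.read.parquet(parquetPath).count()
}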


On Sun, Jun 4, 2017 at 10:33 PM, kant kodali  wrote:

> Hi Muthu,
>
> I am actually using Play framework for my Micro service which uses Akka
> but I still don't understand How SparkSession can use Akka to communicate
> with SparkCluster? SparkPi or SparkPl? any link?
>
> Thanks!
>