Re: fetching and joining data from two different clusters

2017-06-18 Thread Mich Talebzadeh
It is a proprietary solution to an open-source problem.

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



Re: fetching and joining data from two different clusters

2017-06-18 Thread Jörn Franke
Sorry, I cannot help you there - I do not know the cost of Isilon. I also cannot
predict what the majority will do ...


Re: fetching and joining data from two different clusters

2017-06-18 Thread Mich Talebzadeh
Thanks, Jörn.

I have been told that Hadoop 3 (in alpha testing now) will support Docker
containers and virtualised Hadoop clusters.

Also, if we decided to use something like Isilon and BlueData to create
zoning (meaning two different Hadoop clusters migrated to Isilon storage,
each residing in its own zone/compartment) and virtualised clusters, we have
to migrate two separate physical Hadoop clusters to Isilon and then create
the structure.

My point is that if we went that way we would have to weigh up the cost and
effort of migrating two Hadoop clusters to Isilon, versus merging the two
Hadoop clusters into one, which would keep the underlying HDFS file system.
And then of course: how many companies are going this way, and what is the
overriding reason to use such an approach? If we hit performance issues, where
do we pinpoint the bottleneck: Isilon or the third-party Hadoop vendor? There
is also no real community to rely on.

Your thoughts?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 June 2017 at 21:27, Jörn Franke  wrote:

> On HDFS you have storage policies where you can define SSD tiers etc.:
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
>
> Not sure if this is a similar offering to what you refer to.
>
> OpenStack Swift is similar to S3, but for your own data center:
> https://docs.openstack.org/developer/swift/associated_projects.html
>
> On 15. Jun 2017, at 21:55, Mich Talebzadeh 
> wrote:
>
> In Isilon etc. you have SSD, a middle layer, and an archive layer to which data
> is moved. Can that be implemented in HDFS itself, Jörn? What is Swift? Is that a
> low-level archive disk?
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 June 2017 at 20:42, Jörn Franke  wrote:
>
>> Well, this also happens if you use Amazon EMR: most data will be stored
>> on S3, where you also have no data locality. You can move it temporarily
>> to HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the
>> need to process all the data. In fact, that is done in Spark machine
>> learning algorithms (stochastic gradient descent etc.). This avoids moving
>> all the data through the network, and you lose only a little precision
>> (and you can reason about that statistically).
>> For a lot of data I also see the trend that companies move it to cheap
>> object storage (Swift etc.) to reduce cost, particularly because it is
>> not used often.
>>
>>
>> On 15. Jun 2017, at 21:34, Mich Talebzadeh 
>> wrote:
>>
>> Thanks, Jörn.
>>
>> If the idea is to separate compute from data using Isilon etc., then one is
>> going to lose data locality.
>>
>> Also, the argument is that we would like to run queries/reports against
>> two independent clusters simultaneously, so we would do this:
>>
>>
>>    1. Use Isilon OneFS for Big Data to migrate two independent Hadoop
>>    clusters into Isilon OneFS
>>    2. Locate the data from each cluster in its own zone in Isilon
>>    3. Run queries that combine data from each zone
>>    4. Use BlueData to create virtual Hadoop clusters on top of Isilon, so the
>>    performance impact of analytics/data science is isolated from other users
>>
>>
>> Now that is easier said than done, as usual. First you have to migrate the
>> two existing clusters' data into zones in Isilon. Then you are effectively
>> separating compute from data, so data locality is lost. This is no different
>> from your Spark cluster accessing data from each cluster. There are a lot
>> of tangential arguments here, such as Isilon using RAID so you don't need
>> to replicate your data three ways (R3). Even including the Isilon licensing
>> cost, the total cost goes down!
>>
>> The side effect is the network now that you have lost data

Re: how to call udf with parameters

2017-06-18 Thread Yong Zhang
What version of Spark are you using? I cannot reproduce your error:


scala> spark.version
res9: String = 2.1.1
scala> val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
dataset: org.apache.spark.sql.DataFrame = [id: int, text: string]
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

// define a method in a similar way to what you did
scala> def len = udf { (data: String) => data.length > 0 }
len: org.apache.spark.sql.expressions.UserDefinedFunction

// use it
scala> dataset.select(len($"text").as('length)).show
+------+
|length|
+------+
|  true|
|  true|
+------+


Yong




From: Pralabh Kumar 
Sent: Friday, June 16, 2017 12:19 AM
To: lk_spark
Cc: user.spark
Subject: Re: how to call udf with parameters

Sample UDF:
val getlength = udf((data: String) => data.length())
data.select(getlength(data("col1")))

On Fri, Jun 16, 2017 at 9:21 AM, lk_spark <lk_sp...@163.com> wrote:
Hi all,
 I define a UDF with multiple parameters, but I don't know how to call it
with a DataFrame.

UDF:

def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, 
minTermLen: Int) =>
val terms = HanLP.segment(sentence).asScala
.

Call :

scala> val output = input.select(ssplit2($"text",true,true,2).as('words))
:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
   val output = input.select(ssplit2($"text",true,true,2).as('words))
 ^
:40: error: type mismatch;
 found   : Boolean(true)
 required: org.apache.spark.sql.Column
   val output = input.select(ssplit2($"text",true,true,2).as('words))
  ^
:40: error: type mismatch;
 found   : Int(2)
 required: org.apache.spark.sql.Column
   val output = input.select(ssplit2($"text",true,true,2).as('words))
   ^

scala> val output = 
input.select(ssplit2($"text",$"true",$"true",$"2").as('words))
org.apache.spark.sql.AnalysisException: cannot resolve '`true`' given input 
columns: [id, text];;
'Project [UDF(text#6, 'true, 'true, '2) AS words#16]
+- Project [_1#2 AS id#5, _2#3 AS text#6]
   +- LocalRelation [_1#2, _2#3]

I need help!!


2017-06-16

lk_spark
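
A likely fix, not spelled out above: literal arguments have to be passed to a UDF as
Columns, so wrap them in lit(). A minimal sketch (the UDF body below is only a
placeholder for the original HanLP segmentation, and spark.implicits._ is assumed to
be in scope, as in the snippets above):

import org.apache.spark.sql.functions.{udf, lit}

// Sketch only: delNum/delEn are ignored here; the real body would call HanLP.segment(...)
def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) =>
  sentence.split("\\s+").filter(_.length >= minTermLen)
}

// Wrap the non-Column arguments in lit(...) so they become Columns:
val output = input.select(ssplit2($"text", lit(true), lit(true), lit(2)).as('words))
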



Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-18 Thread Yong Zhang
I assume you use Scala to implement your UDFs.


In this case, the Scala language itself already provides some options for you.


If you want more control over the logic when your UDFs are initialised, you can define a 
Scala object and define your UDF as part of it; the Scala object will then behave like the 
Singleton pattern for you.


So the Scala object's constructor logic can be treated as the init/configure contract, as in 
Hive. It will be called once per JVM to initialise your Scala object. That should meet your 
requirement.


The only tricky part is the context reference for the configure() method, which allows you to 
pass some dynamic configuration to your UDF at runtime. Since an object in Scala is fixed at 
compile time, you cannot pass any parameters to its constructor. But there is nothing stopping 
you from building a Scala class/companion object that accepts parameters at constructor/init 
time, which can control your UDF's behaviour.
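
A minimal sketch of that pattern (the names here are purely illustrative, not from any
existing code base):

import org.apache.spark.sql.functions.udf

// Once-per-JVM initialisation: the object's constructor plays the role of Hive's initialize().
object TermDictionary {
  // Imagine an expensive one-time load here (dictionary file, model, configuration, ...).
  private val stopWords: Set[String] = Set("a", "an", "the")
  def clean(s: String): String =
    s.split("\\s+").filterNot(w => stopWords.contains(w.toLowerCase)).mkString(" ")
}

// Runtime configuration: a serializable class takes its parameters at construction time,
// playing the role of Hive's configure().
class ConfiguredCleaner(minLen: Int) extends Serializable {
  def clean(s: String): String =
    s.split("\\s+").filter(_.length >= minLen).mkString(" ")
}

val cleanUdf = udf((s: String) => TermDictionary.clean(s))

val cleaner = new ConfiguredCleaner(minLen = 3)
val configuredCleanUdf = udf((s: String) => cleaner.clean(s))
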


If you have a concrete example of something you cannot do with a Spark Scala UDF, you can 
post it here.


Yong



From: RD 
Sent: Friday, June 16, 2017 11:37 AM
To: Georg Heiler
Cc: user@spark.apache.org
Subject: Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Thanks Georg. But I'm not sure how mapPartitions is relevant here.  Can you 
elaborate?



On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
What about using mapPartitions instead?

RD <rdsr...@gmail.com> wrote on Thu, 15 June 2017 at 06:52:
Hi Spark folks,

Is there any plan to support the richer UDF API that Hive supports for 
Spark UDFs? Hive supports the GenericUDF API which has, among other methods, 
initialize() and configure() (called once on the cluster), which a lot of 
our users use. We now have a lot of UDFs in Hive which make use of these 
methods. We plan to move these UDFs to Spark UDFs but are limited by not 
having similar lifecycle methods.
   Are there plans to address this? Or do people usually adopt some sort of 
workaround?

   If we use the Hive UDFs directly in Spark we pay a performance penalty. I 
think Spark anyway does a conversion from InternalRow to Row and back to 
InternalRow for native Spark UDFs, and for Hive it does InternalRow to Hive 
object and back to InternalRow, but somehow the conversion for native UDFs is 
more performant.

-Best,
R.



Unsubscribe

2017-06-18 Thread Palash Gupta
 Thanks & Best Regards,
Engr. Palash Gupta
Consultant, OSS/CEM/Big Data
Skype: palash2494
https://www.linkedin.com/in/enggpalashgupta




the scheme in stream reader

2017-06-18 Thread ??????????
Hi all,


I set the schema for the DataStreamReader, but when I print the schema it just
prints:
root
 |-- value: string (nullable = true)


My code is


import org.apache.spark.sql.types._

val line = ss.readStream.format("socket")
  .option("ip", xxx)
  .option("port", xxx)
  .schema(StructType(StructField("name", StringType) :: StructField("age", IntegerType) :: Nil))
  .load
line.printSchema


My Spark version is 2.1.0.
I want printSchema to print the schema I set in the code. How can I do that?
My original goal is for the data received from the socket to be handled with
that schema directly. What should I do?


thanks
Fei Shao
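
For what it is worth (this is not from the thread): the socket source always produces a
single value: string column, and as far as I can tell a user-specified schema is simply
not applied to it (later Spark versions reject one outright). The usual approach is to
parse the value column into the columns you want. A minimal sketch, assuming each socket
line is comma-separated like "tom,25", that ss is the SparkSession from the original code,
and that the host/port values are placeholders:

import org.apache.spark.sql.functions.split
import org.apache.spark.sql.types.IntegerType
import ss.implicits._   // assumes ss is a SparkSession val, for the $"..." syntax

// The socket source expects "host" and "port" options and always yields value: string.
val line = ss.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder
  .option("port", "9999")        // placeholder
  .load()

// Derive the desired schema by parsing the value column.
val parsed = line.select(
  split($"value", ",").getItem(0).as("name"),
  split($"value", ",").getItem(1).cast(IntegerType).as("age")
)

parsed.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)
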