Spark 3.0.0-preview and s3a

2019-12-12 Thread vincent gromakowski
Hi Spark users,
I am testing the Spark 3 preview with s3a and Hadoop 3.2, but I get the
NoSuchMethodError below and cannot find the cause. I suppose there is
some library conflict. Can someone provide a working configuration?



Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
        at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
        at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)


Here is my SBT file

libraryDependencies ++= {
  Seq(
    "org.apache.spark" %% "spark-core" % "3.0.0-preview" % "provided",
    "org.apache.spark" %% "spark-sql" % "3.0.0-preview" % "provided",
    "org.apache.hadoop" % "hadoop-cloud-storage" % "3.2.1",
    "org.scalactic" %% "scalactic" % "3.1.0",
    "org.scalatest" %% "scalatest" % "3.1.0" % Test
  )
}

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("javax.xml.stream.**" -> "shaded-javax.xml.stream.@1")
    .inLibrary("javax.xml.stream" % "stax-api" % "1.0-2"),
  ShadeRule.rename("*" -> "shaded-@1")
    .inLibrary("com.fasterxml.jackson.core" % "jackson-core" % "2.10.0"),
  ShadeRule.rename("*" -> "shaded2-@1")
    .inLibrary("com.fasterxml.jackson.core" % "jackson-databind" % "2.10.0"),
  ShadeRule.rename("mozilla.**" -> "shaded-mozilla.@1")
    .inLibrary("com.amazonaws" % "aws-java-sdk-bundle" % "1.11.375")
)

assemblyMergeStrategy in assembly := {
  case "mime.types" => MergeStrategy.rename
  case x if x.contains("versions.properties") => MergeStrategy.rename
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
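
The stack trace points at a Guava conflict: Hadoop 3.2.x calls the four-argument Preconditions.checkArgument that only exists in Guava 20+, while the Guava on Spark's runtime classpath is older. Below is a hedged sketch of one possible workaround, assuming sbt-assembly and that the Hadoop classes end up in the fat jar; the version number is illustrative and this has not been verified against this exact build:

// Hypothetical sketch: pull in a newer Guava and relocate com.google.common inside the
// assembly so the bundled Hadoop 3.2.x classes resolve checkArgument from it rather than
// from the older Guava that Spark ships at runtime.
libraryDependencies += "com.google.guava" % "guava" % "27.0-jre"

assemblyShadeRules in assembly ++= Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
)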


Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
There is probably a limit on the number of elements you can pass in the
list of partitions for the listPartitionsWithAuthInfo API call. I'm not sure
whether the dynamic-overwrite logic is implemented in Spark or in Hive; if it
is in Hive, using Hive 1.2.1 is probably the reason for the unoptimized logic,
but it is also a major constraint for solving this issue, since upgrading the
Hive version is a real challenge.
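
For reference, a rough sketch of the manual workaround Juho describes in the quoted message below (write the files straight to the partition locations, then register the partitions without listing the existing ones). Table, column and path names are hypothetical:

// Hypothetical names throughout; outDf is the DataFrame holding one partition's data.
val day = "2019-04-25"
outDf.write.mode("overwrite").parquet(s"s3://bucket/warehouse/my_table/ds=$day")

spark.sql(
  s"ALTER TABLE my_db.my_table ADD IF NOT EXISTS PARTITION (ds='$day') " +
  s"LOCATION 's3://bucket/warehouse/my_table/ds=$day'")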

Le jeu. 25 avr. 2019 à 15:10, Juho Autio  a écrit :

> Ok, I've verified that hive> SHOW PARTITIONS is using get_partition_names,
> which is always quite fast. Spark's insertInto uses
> get_partitions_with_auth which is much slower (it also gets location etc.
> of each partition).
>
> I created a test in Java with a local metastore client to measure the time:
>
> I used the Short.MAX_VALUE (32767) as max for both (so also get 32767
> partitions in both responses). I didn't get next page of results, but this
> gives the idea already:
>
> listPartitionNames completed in: 1540 ms ~= 1,5 seconds
> listPartitionsWithAuthInfo completed in: 303400 ms ~= 5 minutes
>
> I wonder if this can be optimized on metastore side, but at least it
> doesn't seem to be CPU-bound on the RDS db (we're using Hive metastore,
> backed by AWS RDS).
>
> So my original question remains; does spark need to know about all
> existing partitions for dynamic overwrite? I don't see why it would.
>
> On Thu, Apr 25, 2019 at 10:12 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Which metastore are you using?
>>
>> Le jeu. 25 avr. 2019 à 09:02, Juho Autio  a écrit :
>>
>>> Would anyone be able to answer this question about the non-optimal
>>> implementation of insertInto?
>>>
>>> On Thu, Apr 18, 2019 at 4:45 PM Juho Autio  wrote:
>>>
>>>> Hi,
>>>>
>>>> My job is writing ~10 partitions with insertInto. With the same input /
>>>> output data the total duration of the job is very different depending on
>>>> how many partitions the target table has.
>>>>
>>>> Target table with 10 of partitions:
>>>> 1 min 30 s
>>>>
>>>> Target table with ~1 partitions:
>>>> 13 min 0 s
>>>>
>>>> It seems that spark is always fetching the full list of partitions in
>>>> target table. When this happens, the cluster is basically idling while
>>>> driver is listing partitions.
>>>>
>>>> Here's a thread dump for executor driver from such idle time:
>>>> https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20
>>>>
>>>> Is there any way to optimize this currently? Is this a known issue? Any
>>>> plans to improve?
>>>>
>>>> My code is essentially:
>>>>
>>>> spark = SparkSession.builder \
>>>> .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER')
>>>> \
>>>> .config("hive.exec.dynamic.partition", "true") \
>>>> .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
>>>> .config("hive.exec.dynamic.partition.mode", "nonstrict") \
>>>> .enableHiveSupport() \
>>>> .getOrCreate()
>>>>
>>>> out_df.write \
>>>> .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
>>>> .insertInto(target_table_name, overwrite=True)
>>>>
>>>> Table has been originally created from spark with saveAsTable.
>>>>
>>>> Does spark need to know anything about the existing partitions though?
>>>> As a manual workaround I would write the files directly to the partition
>>>> locations, delete existing files first if there's anything in that
>>>> partition, and then call metastore to ALTER TABLE IF NOT EXISTS ADD
>>>> PARTITION. This doesn't require previous knowledge on existing partitions.
>>>>
>>>> Thanks.
>>>>
>>>
>
> --
> *Juho Autio*
> Senior Data Engineer
>
> Data Engineering, Games
> Rovio Entertainment Corporation
> Mobile: + 358 (0)45 313 0122
> juho.au...@rovio.com
> www.rovio.com
>
> *This message and its attachments may contain confidential information and
> is intended solely for the attention and use of the named addressee(s). If
> you are not the intended recipient and / or you have received this message
> in error, please contact the sender immediately and delete all material you
> have received in this message. You are hereby notified that any use of the
> information, which you have received in error in whatsoever form, is
> strictly prohibited. Thank you for your co-operation.*
>


Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
Which metastore are you using?

Le jeu. 25 avr. 2019 à 09:02, Juho Autio  a écrit :

> Would anyone be able to answer this question about the non-optimal
> implementation of insertInto?
>
> On Thu, Apr 18, 2019 at 4:45 PM Juho Autio  wrote:
>
>> Hi,
>>
>> My job is writing ~10 partitions with insertInto. With the same input /
>> output data the total duration of the job is very different depending on
>> how many partitions the target table has.
>>
>> Target table with 10 of partitions:
>> 1 min 30 s
>>
>> Target table with ~1 partitions:
>> 13 min 0 s
>>
>> It seems that spark is always fetching the full list of partitions in
>> target table. When this happens, the cluster is basically idling while
>> driver is listing partitions.
>>
>> Here's a thread dump for executor driver from such idle time:
>> https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20
>>
>> Is there any way to optimize this currently? Is this a known issue? Any
>> plans to improve?
>>
>> My code is essentially:
>>
>> spark = SparkSession.builder \
>> .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER') \
>> .config("hive.exec.dynamic.partition", "true") \
>> .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
>> .config("hive.exec.dynamic.partition.mode", "nonstrict") \
>> .enableHiveSupport() \
>> .getOrCreate()
>>
>> out_df.write \
>> .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
>> .insertInto(target_table_name, overwrite=True)
>>
>> Table has been originally created from spark with saveAsTable.
>>
>> Does spark need to know anything about the existing partitions though? As
>> a manual workaround I would write the files directly to the partition
>> locations, delete existing files first if there's anything in that
>> partition, and then call metastore to ALTER TABLE IF NOT EXISTS ADD
>> PARTITION. This doesn't require previous knowledge on existing partitions.
>>
>> Thanks.
>>
>


Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
I totally missed this new feature. Thanks for the pointer

Le mar. 18 déc. 2018 à 21:18, Priya Matpadi  a écrit :

> Changes in streaming query that allow or disallow recovery from checkpoint
> is clearly provided in
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query
> .
>
> On Tue, Dec 18, 2018 at 9:45 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Checkpointing is only used for failure recovery not for app upgrades. You
>> need to manually code the unload/load and save it to a persistent store
>>
>> Le mar. 18 déc. 2018 à 17:29, Priya Matpadi  a
>> écrit :
>>
>>> Using checkpointing for graceful updates is my understanding as well,
>>> based on the writeup in
>>> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing,
>>> and some prototyping. Have you faced any missed events?
>>>
>>> On Mon, Dec 17, 2018 at 6:56 PM Yuta Morisawa <
>>> yu-moris...@kddi-research.jp> wrote:
>>>
>>>> Hi
>>>>
>>>> Now I'm trying to update my structured streaming application.
>>>> But I have no idea how to update it gracefully.
>>>>
>>>> Should I stop it, replace a jar file then restart it?
>>>> In my understanding, in that case, all the state will be recovered if I
>>>> use checkpoints.
>>>> Is this correct?
>>>>
>>>> Thank you,
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>


Re: How to update structured streaming apps gracefully

2018-12-18 Thread vincent gromakowski
Checkpointing is only used for failure recovery, not for app upgrades. You
need to manually code the unload/load and save the state to a persistent store.
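
For context, a minimal sketch of a query that uses a checkpoint location (broker, topic and paths are hypothetical). Whether a restart with a replaced jar can resume from that checkpoint is governed by the "recovery semantics after changes in a streaming query" section of the structured streaming guide referenced in this thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("upgradable-stream").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
  .option("subscribe", "events")                      // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

events.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")                        // hypothetical output path
  .option("checkpointLocation", "hdfs:///checkpoints/events")   // offsets and state live here
  .start()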

Le mar. 18 déc. 2018 à 17:29, Priya Matpadi  a écrit :

> Using checkpointing for graceful updates is my understanding as well,
> based on the writeup in
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing,
> and some prototyping. Have you faced any missed events?
>
> On Mon, Dec 17, 2018 at 6:56 PM Yuta Morisawa <
> yu-moris...@kddi-research.jp> wrote:
>
>> Hi
>>
>> Now I'm trying to update my structured streaming application.
>> But I have no idea how to update it gracefully.
>>
>> Should I stop it, replace a jar file then restart it?
>> In my understanding, in that case, all the state will be recovered if I
>> use checkpoints.
>> Is this correct?
>>
>> Thank you,
>>
>>
>> --
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread vincent gromakowski
On the other hand, increasing parallelism via Kafka partitions avoids the
shuffle that Spark would need to repartition.
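
A small sketch of the trade-off, using the spark-streaming-kafka-0-10 direct stream (broker, topic and group names hypothetical): each Kafka partition maps 1:1 to a Spark partition, so raising the topic's partition count raises parallelism without any shuffle, whereas repartition() adds a shuffle on every batch.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-parallelism")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

// Option A: 100 Kafka partitions -> 100 tasks per batch, no shuffle.
stream.map(_.value()).foreachRDD(rdd => println(s"rows: ${rdd.count()}"))

// Option B: 8 Kafka partitions shuffled up to 100 Spark partitions on every batch.
// stream.map(_.value()).repartition(100).foreachRDD(rdd => println(rdd.count()))

ssc.start()
ssc.awaitTermination()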

Le mer. 7 nov. 2018 à 09:51, Michael Shtelma  a écrit :

> If you configure too many Kafka partitions, you can run into memory issues.
> This will increase the memory requirements of the Spark job a lot.
>
> Best,
> Michael
>
>
> On Wed, Nov 7, 2018 at 8:28 AM JF Chen  wrote:
>
>> I have a Spark Streaming application which reads data from Kafka and saves
>> the transformation result to HDFS.
>> The original partition count of my Kafka topic is 8, and I repartition the
>> data to 100 to increase the parallelism of the Spark job.
>> Now I am wondering: if I increase the Kafka partition count to 100 instead
>> of repartitioning to 100, will performance be enhanced? (I know the
>> repartition action costs a lot of CPU.)
>> If I set the Kafka partition count to 100, does it have any negative
>> effects?
>> I just have one production environment, so it's not convenient for me to
>> run the test.
>>
>> Thanks!
>>
>> Regard,
>> Junfeng Chen
>>
>


Re: External shuffle service on K8S

2018-10-26 Thread vincent gromakowski
No, it's on the roadmap for >2.4.

Le ven. 26 oct. 2018 à 11:15, 曹礼俊  a écrit :

> Hi all:
>
> Does Spark 2.3.2 support external shuffle service on Kubernetes?
>
> I have looked up the documentation(
> https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
> couldn't find related suggestions.
>
> If it is supported, how can I enable it?
>
> Best Regards
>
> Lijun Cao
>
>
>


Re: Use SparkContext in Web Application

2018-10-04 Thread vincent gromakowski
Decoupling the web app from the Spark backend is recommended. Training the
model can be launched in the background via a scheduling tool. Inferring with
Spark in interactive mode is not a good option, as it would run on single
records while Spark is better at processing large datasets. The original
purpose of inference with Spark was to do it offline for large datasets and
store the results in a KV store, for instance; any consumer like your web app
would then just read the KV store. I would personally store the trained model
in PFA or PMML and serve it via another tool. There are lots of tools to serve
models via an API, from managed solutions like Amazon SageMaker to open-source
solutions like PredictionIO.
If you still want to call the Spark backend from your web app, which I don't
recommend, I would do it using Spark Jobserver or Livy to interact via a REST
API.
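
A sketch of the decoupled pattern described above, reusing the FPGrowth model mentioned later in the thread (all paths and column names are hypothetical): a scheduled batch job trains and saves the model, a second batch job scores offline and writes the recommendations where the web app can read them, and the web app itself never holds a SparkContext.

import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nightly-trainer").getOrCreate()
import spark.implicits._

// Nightly training job: fit on the full transaction history and persist the model.
val transactions = spark.read.parquet("s3://bucket/orders")   // expects an "items" array column
val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.01)
  .setMinConfidence(0.2)
  .fit(transactions)
model.write.overwrite().save("s3://bucket/models/fpgrowth/latest")

// Separate scoring job (possibly a different application): load, score offline,
// and publish the results for the web tier to read.
val loaded = FPGrowthModel.load("s3://bucket/models/fpgrowth/latest")
loaded.transform(transactions)                 // adds a "prediction" column of recommended items
  .select($"orderId", $"prediction")           // "orderId" is a hypothetical key column
  .write.mode("overwrite").parquet("s3://bucket/recommendations/latest")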

Le jeu. 4 oct. 2018 à 08:25, Jörn Franke  a écrit :

> Depending on your model size you can store it as PFA or PMML and run the
> prediction in Java. For larger models you will need a custom solution ,
> potentially using a spark thrift Server/spark job server/Livy and a cache
> to store predictions that have been already calculated (eg based on
> previous requests to predict). Then you run also into thoughts on caching
> prediction results on the model version that has been used, evicting
> non-relevant predictions etc
> Making the model available as a service is currently a topic where a lot
> of custom „plumbing“ is required , especially if models are a little bit
> larger.
>
> Am 04.10.2018 um 06:55 schrieb Girish Vasmatkar <
> girish.vasmat...@hotwaxsystems.com>:
>
>
>
> On Mon, Oct 1, 2018 at 12:18 PM Girish Vasmatkar <
> girish.vasmat...@hotwaxsystems.com> wrote:
>
>> Hi All
>>
>> We are very early into our Spark days so the following may sound like a
>> novice question :) I will try to keep this as short as possible.
>>
>> We are trying to use Spark to introduce a recommendation engine that can
>> be used to provide product recommendations and need help on some design
>> decisions before moving forward. Ours is a web application running on
>> Tomcat. So far, I have created a simple POC (standalone java program) that
>> reads in a CSV file and feeds to FPGrowth and then fits the data and runs
>> transformations. I would like to be able to do the following -
>>
>>
>>- Scheduler runs nightly in Tomcat (which it does currently) and
>>reads everything from the DB to train/fit the system. This can grow into
>>really some large data and everyday we will have new data. Should I just
>>use SparkContext here, within my scheduler, to FIT the system? Is this
>>correct way to go about this? I am also planning to save the model on S3
>>which should be okay. We also thought on using HDFS. The scheduler's job
>>will be just to create model and save the same and be done with it.
>>- On the product page, we can then use the saved model to display the
>>product recommendations for a particular product.
>>- My understanding is that I should be able to use SparkContext here
>>in my web application to just load the saved model and use it to derive 
>> the
>>recommendations. Is this a good design? The problem I see using this
>>approach is that the SparkContext does take time to initialize and this 
>> may
>>cost dearly. Or should we keep SparkContext per web application to use a
>>single instance of the same? We can initialize a SparkContext during
>>application context initializaion phase.
>>
>>
>> Since I am fairly new to using Spark, please help me decide whether the way
>> I plan to use Spark is the recommended way. I have also seen use cases where
>> Kafka handles the communication with Spark,
>> but can we not do it directly using Spark Context? I am sure a lot of my
>> understanding is wrong, so please feel free to correct me.
>>
>> Thanks and Regards,
>> Girish Vasmatkar
>> HotWax Systems
>>
>>
>>
>>


Re: Spark (Scala) Streaming [Convert rdd [org.bson.document] - > dataframe]

2018-07-18 Thread vincent gromakowski
The Mongo libraries provide a way to convert documents to case classes.
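
One hedged way to do the conversion without relying on connector helpers (field names are hypothetical): map each org.bson.Document into a case class and let Spark derive the schema.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.bson.Document

case class Event(id: String, value: Double)   // hypothetical shape of the documents

def toDataFrame(rdd: RDD[Document], spark: SparkSession): DataFrame = {
  import spark.implicits._
  rdd.map(doc => Event(doc.getString("id"), doc.getDouble("value"))).toDF()
}

// Inside the streaming job, for example:
// stream.foreachRDD(rdd => toDataFrame(rdd, spark).join(otherDf, "id"))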

On Wed 18 Jul 2018 at 20:23, Chethan  wrote:

> Hi Dev,
>
> I am streaming from MongoDB using Kafka with Spark Streaming. It returns
> me documents [org.bson.Document]. I want to convert this RDD to a DataFrame
> to process with other data.
>
> Any suggestions will be helpful.
>
>
> Thanks,
> Chethan.
>


Re: Append In-Place to S3

2018-06-02 Thread vincent gromakowski
Structured Streaming can provide idempotent and exactly-once writes to Parquet,
but I don't know how it works under the hood.
Without that, you need to load your entire dataset, then dedup, then write back
the entire dataset. This overhead can be minimized by partitioning the
output files.
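
A sketch of the "load, dedup, rewrite" approach described above (paths and key columns hypothetical). Writing to a temporary location before swapping avoids reading and overwriting the same path in one job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-append").getOrCreate()

val existing = spark.read.parquet("s3://bucket/existing/")
val incoming = spark.read.parquet("s3://bucket/incoming/")

val merged = existing.union(incoming)
  .dropDuplicates("source", "source_id", "target", "target_id")

merged.write.mode("overwrite").parquet("s3://bucket/existing_tmp/")
// ...then swap existing_tmp/ into place with an external rename or copy step.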

Le ven. 1 juin 2018, 18:01, Benjamin Kim  a écrit :

> I have a situation where I trying to add only new rows to an existing data
> set that lives in S3 as gzipped parquet files, looping and appending for
> each hour of the day. First, I create a DF from the existing data, then I
> use a query to create another DF with the data that is new. Here is the
> code snippet.
>
> df = spark.read.parquet(existing_data_path)
> df.createOrReplaceTempView('existing_data')
> new_df = spark.read.parquet(new_data_path)
> new_df.createOrReplaceTempView('new_data')
> append_df = spark.sql(
> """
> WITH ids AS (
> SELECT DISTINCT
> source,
> source_id,
> target,
> target_id
> FROM new_data i
> LEFT ANTI JOIN existing_data im
> ON i.source = im.source
> AND i.source_id = im.source_id
> AND i.target = im.target
> AND i.target = im.target_id
> """
> )
> append_df.coalesce(1).write.parquet(existing_data_path, mode='append',
> compression='gzip')
>
>
> I thought this would append new rows and keep the data unique, but I am
> see many duplicates. Can someone help me with this and tell me what I am
> doing wrong?
>
> Thanks,
> Ben
>


Re: Advice on multiple streaming job

2018-05-06 Thread vincent gromakowski
Use a scheduler that abstracts the network away with a CNI, for instance, or
other mechanisms (Mesos, Kubernetes, YARN). A CNI lets every job always bind
to the same ports because each container has its own IP. Other solutions like
Mesos with Marathon can work without a CNI, using host IP binding, but then
they manage the ports for you, ensuring there isn't any conflict.
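
As a stop-gap inside Spark itself (a sketch, values illustrative), the port probing mentioned in the question is controlled by two settings: spark.ui.port to pin an explicit per-job port, and spark.port.maxRetries, whose default of 16 is exactly why the probing stops at 4056.

val conf = new org.apache.spark.SparkConf()
  .setAppName("streaming-job-42")          // hypothetical job name
  .set("spark.ui.port", "4142")            // pin a distinct UI port per job, or
  .set("spark.port.maxRetries", "128")     // let Spark probe a much wider range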

Le sam. 5 mai 2018 à 17:10, Dhaval Modi  a écrit :

> Hi All,
>
> Need advice on executing multiple streaming jobs.
>
> Problem: We have 100s of streaming jobs. Every streaming job uses a new
> port, and Spark automatically probes ports from 4040 to 4056 and then fails.
> One workaround is to provide the port explicitly.
>
> Is there a way to tackle this situation, or am I missing anything?
>
> Thanking you in advance.
>
> Regards,
> Dhaval Modi
> dhavalmod...@gmail.com
>


Re: Spark Optimization

2018-04-26 Thread vincent gromakowski
Ideal parallelism is 2-3x the number of cores, but it depends on the number
of partitions of your source and the operations you use (shuffle or not). It
can be worth paying the extra cost of an initial repartition to match your
cluster, but that clearly depends on your DAG.
Optimizing Spark apps depends on lots of things, so it's hard to answer:
- cluster size
- scheduler
- spark version
- transformation graph (DAG)
...
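
As an illustration of the 2-3x rule of thumb against the 28-core cluster described in the quoted message below (paths and grouping keys hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("agg-poc").getOrCreate()
// ~3 x 28 cores; keep spark.default.parallelism at a similar value for the RDD side.
spark.conf.set("spark.sql.shuffle.partitions", "84")

spark.read.parquet("hdfs:///data/input")        // hypothetical input path
  .groupBy("key1", "key2")                      // hypothetical grouping keys
  .agg(sum("amount"), avg("amount"))            // stand-ins for the 25 user-defined aggregations
  .write.mode("overwrite").parquet("hdfs:///data/output")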

Le jeu. 26 avr. 2018 à 17:49, Pallavi Singh 
a écrit :

> Hi Team,
>
>
>
> We are currently working on POC based on Spark and Scala.
>
> We have to read 18 million records from a Parquet file and perform 25
> user-defined aggregations based on grouping keys.
>
> We used Spark's high-level DataFrame API for the aggregation. On a
> two-node cluster we could finish the end-to-end job
> (Read + Aggregation + Write) in 2 minutes.
>
>
>
> *Cluster Information:*
>
> Number of Node:2
>
> Total Core:28Core
>
> Total RAM:128GB
>
>
>
> *Component: *
>
> Spark Core
>
>
>
> *Scenario:*
>
> How-to
>
>
>
> *Tuning Parameter:*
>
> spark.serializer org.apache.spark.serializer.KryoSerializer
>
> spark.default.parallelism 24
>
> spark.sql.shuffle.partitions 24
>
> spark.executor.extraJavaOptions -XX:+UseG1GC
>
> spark.speculation true
>
> spark.executor.memory 16G
>
> spark.driver.memory 8G
>
> spark.sql.codegen true
>
> spark.sql.inMemoryColumnarStorage.batchSize 10
>
> spark.locality.wait 1s
>
> spark.ui.showConsoleProgress false
>
> spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
>
> Please let us know, If you have any ideas/tuning parameter that we can use
> to finish the job in less than one min.
>
>
>
>
>
> Regards,
>
> Pallavi
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>


Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Bare-metal servers with 2 dedicated clusters (Spark and Cassandra) versus
1 cluster with co-location. In both cases, a dedicated 10 Gbps network.

Le sam. 14 avr. 2018 à 23:17, Mich Talebzadeh  a
écrit :

> Thanks Vincent. You mean 20 times improvement with data being local as
> opposed to Spark running on compute nodes?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 April 2018 at 21:06, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Not with hadoop but with Cassandra, i have seen 20x data locality
>> improvement on partitioned optimized spark jobs
>>
>> Le sam. 14 avr. 2018 à 21:17, Mich Talebzadeh 
>> a écrit :
>>
>>> Hi,
>>>
>>> This is a sort of your mileage varies type question.
>>>
>>> In a classic Hadoop cluster, one has data locality when each node
>>> includes the Spark libraries and HDFS data. this helps certain queries like
>>> interactive BI.
>>>
>>> However running Spark over remote storage say Isilon scaled out NAS
>>> instead of LOCAL HDFS becomes problematic. The full-scan Spark needs to
>>> do will take much longer when it is done over the network (access the
>>> remote Isilon storage) instead of local I/O request to HDFS.
>>>
>>> Has anyone done some comparative studies on this?
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>


Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Not with Hadoop, but with Cassandra I have seen a 20x improvement from data
locality on partitioned, optimized Spark jobs.

Le sam. 14 avr. 2018 à 21:17, Mich Talebzadeh  a
écrit :

> Hi,
>
> This is a sort of your mileage varies type question.
>
> In a classic Hadoop cluster, one has data locality when each node includes
> the Spark libraries and HDFS data. this helps certain queries like
> interactive BI.
>
> However running Spark over remote storage say Isilon scaled out NAS
> instead of LOCAL HDFS becomes problematic. The full-scan Spark needs to
> do will take much longer when it is done over the network (access the
> remote Isilon storage) instead of local I/O request to HDFS.
>
> Has anyone done some comparative studies on this?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-21 Thread vincent gromakowski
If it's not resilient at the Spark level, can't you just relaunch your job with
your orchestration tool?

Le 21 déc. 2017 09:34, "Georg Heiler"  a écrit :

> Did you try to use the YARN shuffle service?
> chopinxb  schrieb am Do. 21. Dez. 2017 um 04:43:
>
>> In my experience running Spark applications (mostly Spark SQL), when there is
>> a complete node failure in my cluster, jobs which have shuffle blocks on that
>> node fail completely after 4 task retries. It seems that data lineage
>> didn't work. What's more, our applications use multiple SQL statements for
>> data analysis; after a lengthy calculation, having the entire application
>> fail because of one job failure is unacceptable. So in a way we value
>> stability over speed.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-20 Thread vincent gromakowski
The probability of a complete node failure is low. I would rely on data lineage
and accept the reprocessing overhead. Another option would be to write to a
distributed FS, but it would drastically reduce the speed of all your jobs.

Le 20 déc. 2017 11:23, "chopinxb"  a écrit :

> Yes, the shuffle service was already started in each NodeManager. What I mean
> by a node failure is that the machine is down; all the services, including the
> NodeManager process on that machine, are down. So in this case the shuffle
> service is no longer helpful.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Can spark shuffle leverage Alluxio to obtain higher stability?

2017-12-20 Thread vincent gromakowski
In your case you need to externalize the shuffle files to a component
outside of your Spark cluster so they persist after a Spark worker's death.
https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
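
For reference, the settings behind that link (a sketch; note the external shuffle service runs inside the NodeManager, so it does not survive the loss of the whole machine discussed in this thread):

val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.service.enabled", "true")     // serve shuffle files from the NodeManager
  .set("spark.dynamicAllocation.enabled", "true")   // commonly paired with the external service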


2017-12-20 10:46 GMT+01:00 chopinxb :

> In my use case, I run Spark in yarn-client mode with dynamicAllocation
> enabled. When a node shuts down abnormally, my Spark application fails
> because tasks fail to fetch shuffle blocks from that node 4 times.
> Why doesn't Spark leverage Alluxio (a distributed in-memory filesystem) to
> write shuffle blocks with replicas? In that situation, when a node shuts
> down, tasks could fetch shuffle blocks from other replicas and we could
> obtain higher stability.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Chaining Spark Streaming Jobs

2017-09-12 Thread vincent gromakowski
What about chaining with Akka or Akka Streams and the fair scheduler?
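
For comparison, a rough sketch of the file-sink/file-source chaining Michael suggests at the bottom of this thread (paths and schema are hypothetical); note that the file source requires an explicit schema, which is the first error reported below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("chained-streams").getOrCreate()

// Job 1 (raw persistence) would write with the file sink, e.g.:
// rawDf.writeStream.format("parquet")
//   .option("path", "s3://bucket/raw")
//   .option("checkpointLocation", "s3://bucket/checkpoints/raw")
//   .start()

// Job 2 (enrichment) reads the same directory as a stream; the schema is mandatory.
val rawSchema = new StructType()
  .add("id", StringType)
  .add("col1", LongType)
  .add("col2", LongType)

val raw = spark.readStream
  .schema(rawSchema)
  .parquet("s3://bucket/raw")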

Le 13 sept. 2017 01:51, "Sunita Arvind"  a écrit :

Hi Michael,

I am wondering what I am doing wrong. I get error like:

Exception in thread "main" java.lang.IllegalArgumentException: Schema must
be specified when creating a streaming source DataFrame. If some files
already exist in the directory, then depending on the file format you may
be able to create a static DataFrame on that directory with
'spark.read.load(directory)' and infer schema from it.
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:223)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:125)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:134)
at com.aol.persist.UplynkAggregates$.aggregator(UplynkAggregates.scala:23)
at com.aol.persist.UplynkAggregates$.main(UplynkAggregates.scala:41)
at com.aol.persist.UplynkAggregates.main(UplynkAggregates.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
17/09/12 14:46:22 INFO SparkContext: Invoking stop() from shutdown hook


I tried specifying the schema as well.
Here is my code:

object Aggregates {

  val aggregation =
    """select sum(col1), sum(col2), id, first(name)
      from enrichedtb
      group by id
    """.stripMargin

  def aggregator(conf: Config) = {
    implicit val spark =
      SparkSession.builder().appName(conf.getString("AppName")).getOrCreate()
    implicit val sqlctx = spark.sqlContext
    printf("Source path is" + conf.getString("source.path"))
    val schemadf = sqlctx.read.parquet(conf.getString("source.path"))
    // Added this as it was complaining about schema.
    val df = spark.readStream.format("parquet")
      .option("inferSchema", true)
      .schema(schemadf.schema)
      .load(conf.getString("source.path"))
    df.createOrReplaceTempView("enrichedtb")
    val res = spark.sql(aggregation)

    res.writeStream.format("parquet").outputMode("append")
      .option("checkpointLocation", conf.getString("checkpointdir"))
      .start(conf.getString("sink.outputpath"))
  }

  def main(args: Array[String]): Unit = {
    val mainconf = ConfigFactory.load()
    val conf = mainconf.getConfig(mainconf.getString("pipeline"))
    print(conf.toString)
    aggregator(conf)
  }

}


I tried to extract the schema from a static read of the input path and
provide it to the readStream API. With that, I get this error:

at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:109)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:222)

While running on the EMR cluster all paths point to S3. In my laptop,
they all point to local filesystem.

I am using Spark2.2.0

Appreciate your help.

regards

Sunita


On Wed, Aug 23, 2017 at 2:30 PM, Michael Armbrust 
wrote:

> If you use structured streaming and the file sink, you can have a
> subsequent stream read using the file source.  This will maintain exactly
> once processing even if there are hiccups or failures.
>
> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind 
> wrote:
>
>> Hello Spark Experts,
>>
>> I have a design question w.r.t Spark Streaming. I have a streaming job
>> that consumes protocol buffer encoded real time logs from a Kafka cluster
>> on premise. My spark application runs on EMR (aws) and persists data onto
>> s3. Before I persist, I need to strip header and convert protobuffer to
>> parquet (I use sparksql-scalapb to convert from Protobuff to
>> Spark.sql.Row). I need to persist Raw logs as is. I can continue the
>> enrichment on the same dataframe after persisting the raw data, however, in
>> order to modularize I am planning to have a separate job which picks up the
>> raw data and performs enrichment on it. Also,  I am trying to avoid all in
>> 1 job a

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vincent gromakowski
I think Kafka Streams is good when the processing of each row is
independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window functions...).

Le 11 juin 2017 8:15 PM, "yohann jardin"  a
écrit :

Hey,
Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams
I don't know much about it, unfortunately. I can only repeat what I heard at
conferences: one should give Kafka Streams a try when the whole pipeline is
using Kafka. I have no pros/cons to argue on this topic.

*Yohann Jardin*
Le 6/11/2017 à 7:08 PM, vaquar khan a écrit :

Hi Kant,

Kafka is a message broker used with producers and consumers, and Spark
Streaming is used for real-time processing; Kafka and Spark Streaming
work together, they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches:
in simple terms, it collects data for some time, builds an RDD,
and then processes these micro-batches.


Please read the doc: https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark's machine learning and graph
processing algorithms on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali  wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783 <(224)%20436-0783>
Greater Chicago


Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
If mandatory, you can use a local cache like alluxio

Le 1 juin 2017 10:23 AM, "Mich Talebzadeh"  a
écrit :

> Thanks Vincent. I assume by physical data locality you mean you are going
> through Isilon and HCFS and not through direct HDFS.
>
> Also I agree with you that shared network could be an issue as well.
> However, it allows you to reduce data redundancy (you do not need R3 in
> HDFS anymore) and also you can build virtual clusters on the same data. One
> cluster for read/writes and another for Reads? That is what has been
> suggestes!.
>
> regards
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 June 2017 at 08:55, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> I don't recommend this kind of design because you lose physical data
>> locality and you will be affected by "bad neighbors" that are also using
>> the network storage... We have one similar design, but restricted to small
>> clusters (more for experiments than production).
>>
>> 2017-06-01 9:47 GMT+02:00 Mich Talebzadeh :
>>
>>> Thanks Jorn,
>>>
>>> This was a proposal made by someone as the firm is already using this
>>> tool on other SAN based storage and extend it to Big Data
>>>
>>> On paper it seems like a good idea; in practice it may be a Wandisco
>>> scenario again. Of course, as ever, one needs to ask EMC for reference calls
>>> and whether anyone is using this product in anger.
>>>
>>>
>>>
>>> At the end of the day it's not HDFS.  It is OneFS with a HCFS API.
>>>  However that may suit our needs.  But  would need to PoC it and test it
>>> thoroughly!
>>>
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 June 2017 at 08:21, Jörn Franke  wrote:
>>>
>>>> Hi,
>>>>
>>>> I have done this (not Isilon, but another storage system). It can be
>>>> efficient for small clusters and depending on how you design the network.
>>>>
>>>> What I have also seen is the microservice approach with object stores
>>>> (e.g. In the cloud s3, on premise swift) which is somehow also similar.
>>>>
>>>> If you want additional performance you could fetch the data from the
>>>> object stores and store it temporarily in a local HDFS. Not sure to what
>>>> extent this affects regulatory requirements though.
>>>>
>>>> Best regards
>>>>
>>>> On 31. May 2017, at 18:07, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I realize this may not have direct relevance to Spark but has anyone
>>>> tried to create virtualized HDFS clusters using tools like ISILON or
>>>> similar?
>>>>
>>>> The prime motive behind this approach is to minimize the propagation or
>>>> copy of data which has regulatory implication. In shoret you want your data
>>>> to be in one place regardless of artefacts used against it such as Spark?
>>>>
>>>> Thanks,
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
I don't recommend this kind of design because you lose physical data
locality and you will be affected by "bad neighbors" that are also using
the network storage... We have one similar design, but restricted to small
clusters (more for experiments than production).

2017-06-01 9:47 GMT+02:00 Mich Talebzadeh :

> Thanks Jorn,
>
> This was a proposal made by someone as the firm is already using this tool
> on other SAN based storage and extend it to Big Data
>
> On paper it seems like a good idea; in practice it may be a Wandisco
> scenario again. Of course, as ever, one needs to ask EMC for reference calls
> and whether anyone is using this product in anger.
>
>
>
> At the end of the day it's not HDFS.  It is OneFS with a HCFS API.
>  However that may suit our needs.  But  would need to PoC it and test it
> thoroughly!
>
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 June 2017 at 08:21, Jörn Franke  wrote:
>
>> Hi,
>>
>> I have done this (not Isilon, but another storage system). It can be
>> efficient for small clusters and depending on how you design the network.
>>
>> What I have also seen is the microservice approach with object stores
>> (e.g. In the cloud s3, on premise swift) which is somehow also similar.
>>
>> If you want additional performance you could fetch the data from the
>> object stores and store it temporarily in a local HDFS. Not sure to what
>> extent this affects regulatory requirements though.
>>
>> Best regards
>>
>> On 31. May 2017, at 18:07, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I realize this may not have direct relevance to Spark but has anyone
>> tried to create virtualized HDFS clusters using tools like ISILON or
>> similar?
>>
>> The prime motive behind this approach is to minimize the propagation or
>> copy of data which has regulatory implication. In shoret you want your data
>> to be in one place regardless of artefacts used against it such as Spark?
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread vincent gromakowski
Akka has been replaced by netty in 1.6

Le 22 mai 2017 15:25, "Chin Wei Low"  a écrit :

> I think akka has been removed since 2.0.
>
> On 22 May 2017 10:19 pm, "Gene Pang"  wrote:
>
>> Hi,
>>
>> Tachyon has been renamed to Alluxio. Here is the documentation for
>> running Alluxio with Spark.
>>
>> Hope this helps,
>> Gene
>>
>> On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>>
>>> Hi all,
>>> I read some papers about the source code; the papers are based on version 1.2
>>> and refer to Tachyon and Akka. When I read the 2.1 code, I cannot find the
>>> code for Akka and Tachyon.
>>>
>>> Are Tachyon and Akka removed from 2.1.1, please?
>>>
>>
>>


Re: Restful API Spark Application

2017-05-12 Thread vincent gromakowski
It's in Scala, but it should be portable to Java:
https://github.com/vgkowski/akka-spark-experiments
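
A minimal sketch in the spirit of that repo, not taken from it: a long-lived SparkSession shared by an Akka HTTP endpoint. It assumes Akka HTTP 10.1.x; host, port and the parquet path are hypothetical, and blocking inside the route is kept only for brevity (a real service should offload Spark work to a dedicated dispatcher).

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.sql.SparkSession

object SparkRestService extends App {
  implicit val system = ActorSystem("spark-rest")
  implicit val mat    = ActorMaterializer()

  // One SparkSession for the lifetime of the service, reused by every request.
  val spark = SparkSession.builder().appName("spark-rest").getOrCreate()

  val route = path("count") {
    get {
      parameter("path") { p =>
        complete(spark.read.parquet(p).count().toString)
      }
    }
  }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}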


Le 12 mai 2017 10:54 PM, "Василец Дмитрий"  a
écrit :

and Livy: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/

On Fri, May 12, 2017 at 10:51 PM, Sam Elamin 
wrote:
> Hi Nipun
>
> Have you checked out the job server?
>
> https://github.com/spark-jobserver/spark-jobserver
>
> Regards
> Sam
> On Fri, 12 May 2017 at 21:00, Nipun Arora 
wrote:
>>
>> Hi,
>>
>> We have written a Java Spark application (it primarily uses Spark SQL). We
>> want to expand this to provide our application "as a service". For this, we
>> are trying to write a REST API. While a simple REST API can easily be made,
>> and I can get Spark to run through the launcher, I wonder how the Spark
>> context can be used by service requests to process data.
>>
>> Are there any simple JAVA examples to illustrate this use-case? I am sure
>> people have faced this before.
>>
>>
>> Thanks
>> Nipun

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread vincent gromakowski
Use cache or persist. The DataFrame will be materialized when the first
action is called and then reused from memory for each subsequent use.
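
A sketch of the suggestion above: persist() pins the snapshot after the first action, so later counts reuse the cached data instead of re-querying Postgres.

val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql:host/database")
  .option("dbtable", "tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
  .persist()

df.count()   // materializes the cache (first and only JDBC read)
df.count()   // served from the cached snapshot, same result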

Le 1 mai 2017 4:51 PM, "Saulo Ricci"  a écrit :

> Hi,
>
>
> I have the following code that is reading a table to a apache spark
> DataFrame:
>
>  val df = spark.read.format("jdbc")
>  .option("url","jdbc:postgresql:host/database")
>  .option("dbtable","tablename").option("user","username")
>  .option("password", "password")
>  .load()
>
> When I first invoke df.count() I get a smaller number than the next time
> I invoke the same count method.
>
> Why this happen?
>
> Doesn't Spark load a snapshot of my table in a DataFrame on my Spark
> Cluster when I first read that table?
>
> My table on postgres keeps being fed and it seems my data frame is
> reflecting this behavior.
>
> How should I manage to load just a static snapshot my table to spark's
> DataFrame by the time `read` method was invoked?
>
>
> Any help is appreciated,
>
> --
> Saulo
>


Re: help/suggestions to setup spark cluster

2017-04-26 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe you
can try fair scheduling and use one cluster, but I doubt it will be
production-ready...

Le 27 avr. 2017 5:28 AM, "anna stax"  a écrit :

> Thanks Cody,
>
> As I already mentioned I am running spark streaming on EC2 cluster in
> standalone mode. Now in addition to streaming, I want to be able to run
> spark batch job hourly and adhoc queries using Zeppelin.
>
> Can you please confirm that a standalone cluster is OK for this. Please
> provide me some links to help me get started.
>
> Thanks
> -Anna
>
> On Wed, Apr 26, 2017 at 7:46 PM, Cody Koeninger 
> wrote:
>
>> The standalone cluster manager is fine for production.  Don't use Yarn
>> or Mesos unless you already have another need for it.
>>
>> On Wed, Apr 26, 2017 at 4:53 PM, anna stax  wrote:
>> > Hi Sam,
>> >
>> > Thank you for the reply.
>> >
>> > What do you mean by
>> > I doubt people run spark in a. Single EC2 instance, certainly not in
>> > production I don't think
>> >
>> > What is wrong in having a data pipeline on EC2 that reads data from
>> kafka,
>> > processes using spark and outputs to cassandra? Please explain.
>> >
>> > Thanks
>> > -Anna
>> >
>> > On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin 
>> wrote:
>> >>
>> >> Hi Anna
>> >>
>> >> There are a variety of options for launching spark clusters. I doubt
>> >> people run spark in a. Single EC2 instance, certainly not in
>> production I
>> >> don't think
>> >>
>> >> I don't have enough information of what you are trying to do but if you
>> >> are just trying to set things up from scratch then I think you can
>> just use
>> >> EMR which will create a cluster for you and attach a zeppelin instance
>> as
>> >> well
>> >>
>> >>
>> >> You can also use databricks for ease of use and very little management
>> but
>> >> you will pay a premium for that abstraction
>> >>
>> >>
>> >> Regards
>> >> Sam
>> >> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
>> >>>
>> >>> I need to setup a spark cluster for Spark streaming and scheduled
>> batch
>> >>> jobs and adhoc queries.
>> >>> Please give me some suggestions. Can this be done in standalone mode.
>> >>>
>> >>> Right now we have a spark cluster in standalone mode on AWS EC2
>> running
>> >>> spark streaming application. Can we run spark batch jobs and zeppelin
>> on the
>> >>> same. Do we need a better resource manager like Mesos?
>> >>>
>> >>> Are there any companies or individuals that can help in setting this
>> up?
>> >>>
>> >>> Thank you.
>> >>>
>> >>> -Anna
>> >
>> >
>>
>
>


spark streaming resiliency

2017-04-25 Thread vincent gromakowski
Hi,
I have a question regarding Spark Streaming resiliency, and the
documentation is ambiguous:

The documentation says that the default configuration uses a replication
factor of 2 for received data, but the recommendation is to use write-ahead
logs to guarantee data resiliency with receivers.

"Additionally, it is recommended that the replication of the received data
within Spark be disabled when the write ahead log is enabled as the log is
already stored in a replicated storage system."
The doc says it is useless to replicate when the WAL is enabled, but what is
the benefit of using the WAL instead of the internal in-memory replication? I
would assume it's better to replicate in memory than to write to a replicated
FS, performance-wise...
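
For reference, the settings the question is about (a sketch, not a recommendation):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // enable the receiver WAL

// With the WAL enabled, the docs suggest a non-replicated storage level for receivers, e.g.:
// ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)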

Can a streaming expert explain this?
BR


Re: Authorizations in thriftserver

2017-04-25 Thread vincent gromakowski
Does someone have the answer?

2017-04-24 9:32 GMT+02:00 vincent gromakowski :

> Hi,
> Can someone confirm authorizations aren't implemented in Spark
> thriftserver for SQL standard based hive authorizations?
> https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+
> Authorization
> If confirmed, any plan to implement it ?
> Thanks
>
>


Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread vincent gromakowski
Look at Spark Jobserver's NamedRDDs, which are supposed to be thread-safe...
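
A small sketch of the distinction Felix makes further down: a global temporary view is visible across sessions of the same SparkContext (spark.newSession()), but two spark-shell processes are two contexts, so the view cannot cross that boundary.

spark.sql("select 1 as col1").createGlobalTempView("gView1")

val otherSession = spark.newSession()                         // same SparkContext, new session
otherSession.sql("select * from global_temp.gView1").show()   // works

// A second spark-shell is a second driver with its own SparkContext,
// so it will not see global_temp.gView1.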

2017-04-24 16:01 GMT+02:00 Hemanth Gudela :

> Hello Gene,
>
>
>
> Thanks, but Alluxio did not solve my spark streaming use case because my
> source parquet files in Alluxio in-memory are not ”appended” but are
> periodically being ”overwritten” due to the nature of business need.
>
> Spark jobs fail when trying to read parquet files at the same time when
> other job is writing parquet files in Alluxio.
>
>
>
> Could you suggest a way to synchronize Parquet reads and writes in Alluxio
> in-memory, i.e. when one Spark job is writing a DataFrame as a Parquet file
> in alluxio in-memory, the other spark jobs trying to read must wait until
> the write is finished.
>
>
>
> Thanks,
>
> Hemanth
>
>
>
> *From: *Gene Pang 
> *Date: *Monday, 24 April 2017 at 16.41
> *To: *vincent gromakowski 
> *Cc: *Hemanth Gudela , "user@spark.apache.org"
> , Felix Cheung 
>
> *Subject: *Re: Spark SQL - Global Temporary View is not behaving as
> expected
>
>
>
> As Vincent mentioned, Alluxio helps with sharing data across different
> Spark contexts. This blog post about Spark dataframes and Alluxio
> discusses that use case
> <https://alluxio.com/blog/effective-spark-dataframes-with-alluxio>.
>
>
>
> Thanks,
>
> Gene
>
>
>
> On Sat, Apr 22, 2017 at 2:14 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> Look at alluxio for sharing across drivers or spark jobserver
>
>
>
> Le 22 avr. 2017 10:24 AM, "Hemanth Gudela"  a
> écrit :
>
> Thanks for your reply.
>
>
>
> Creating a table is an option, but such approach slows down reads & writes
> for a real-time analytics streaming use case that I’m currently working on.
>
> If at all global temporary view could have been accessible across
> sessions/spark contexts, that would have simplified my usecase a lot.
>
>
>
> But yeah, thanks for explaining the behavior of global temporary view, now
> it’s clear J
>
>
>
> -Hemanth
>
>
>
> *From: *Felix Cheung 
> *Date: *Saturday, 22 April 2017 at 11.05
> *To: *Hemanth Gudela , "user@spark.apache.org"
> 
> *Subject: *Re: Spark SQL - Global Temporary View is not behaving as
> expected
>
>
>
> Cross-session in this context means multiple Spark sessions from the same
> Spark context. Since you are running two shells, you have two different
> Spark contexts.
>
>
>
> Do you have to use a temp view? Could you create a table?
>
>
>
> _
> From: Hemanth Gudela 
> Sent: Saturday, April 22, 2017 12:57 AM
> Subject: Spark SQL - Global Temporary View is not behaving as expected
> To: 
>
>
> Hi,
>
>
>
> According to documentation
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view>,
> global temporary views are cross-session accessible.
>
>
>
> But when I try to query a global temporary view from another spark shell
> like thisà
>
> *Instance 1 of spark-shell*
>
> --
>
> scala> spark.sql("select 1 as col1").createGlobalTempView("gView1")
>
>
>
> *Instance 2 of spark-shell *(while Instance 1 of spark-shell is still
> alive)
>
> -
>
> scala> spark.sql("select * from global_temp.gView1").show()
>
> org.apache.spark.sql.AnalysisException: Table or view not found:
> `global_temp`.`gView1`
>
> 'Project [*]
>
> +- 'UnresolvedRelation `global_temp`.`gView1`
>
>
>
> I am expecting that global temporary view created in shell 1 should be
> accessible in shell 2, but it isn’t!
>
> Please correct me if I missing something here.
>
>
>
> Thanks (in advance),
>
> Hemanth
>
>
>
>
>


Authorizations in thriftserver

2017-04-24 Thread vincent gromakowski
Hi,
Can someone confirm authorizations aren't implemented in Spark thriftserver
for SQL standard based hive authorizations?
https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
If confirmed, any plan to implement it ?
Thanks


Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-22 Thread vincent gromakowski
Look at alluxio for sharing across drivers or spark jobserver

Le 22 avr. 2017 10:24 AM, "Hemanth Gudela"  a
écrit :

> Thanks for your reply.
>
>
>
> Creating a table is an option, but such approach slows down reads & writes
> for a real-time analytics streaming use case that I’m currently working on.
>
> If at all global temporary view could have been accessible across
> sessions/spark contexts, that would have simplified my usecase a lot.
>
>
>
> But yeah, thanks for explaining the behavior of global temporary view, now
> it’s clear J
>
>
>
> -Hemanth
>
>
>
> *From: *Felix Cheung 
> *Date: *Saturday, 22 April 2017 at 11.05
> *To: *Hemanth Gudela , "user@spark.apache.org"
> 
> *Subject: *Re: Spark SQL - Global Temporary View is not behaving as
> expected
>
>
>
> Cross-session in this context means multiple Spark sessions from the same
> Spark context. Since you are running two shells, you have two different
> Spark contexts.
>
>
>
> Do you have to use a temp view? Could you create a table?
>
>
>
> _
> From: Hemanth Gudela 
> Sent: Saturday, April 22, 2017 12:57 AM
> Subject: Spark SQL - Global Temporary View is not behaving as expected
> To: 
>
>
>
> Hi,
>
>
>
> According to documentation
> ,
> global temporary views are cross-session accessible.
>
>
>
> But when I try to query a global temporary view from another spark shell
> like thisà
>
> *Instance 1 of spark-shell*
>
> --
>
> scala> spark.sql("select 1 as col1").createGlobalTempView("gView1")
>
>
>
> *Instance 2 of spark-shell *(while Instance 1 of spark-shell is still
> alive)
>
> -
>
> scala> spark.sql("select * from global_temp.gView1").show()
>
> org.apache.spark.sql.AnalysisException: Table or view not found:
> `global_temp`.`gView1`
>
> 'Project [*]
>
> +- 'UnresolvedRelation `global_temp`.`gView1`
>
>
>
> I am expecting that global temporary view created in shell 1 should be
> accessible in shell 2, but it isn’t!
>
> Please correct me if I missing something here.
>
>
>
> Thanks (in advance),
>
> Hemanth
>
>
>


Re: Reading ASN.1 files in Spark

2017-04-06 Thread vincent gromakowski
I would also be interested...

2017-04-06 11:09 GMT+02:00 Hamza HACHANI :

> Does anybody have a Spark code example for reading ASN.1 files?
> Thx
>
> Best regards
> Hamza
>


data cleaning and error routing

2017-03-21 Thread vincent gromakowski
Hi,
In a context of ugly data, I am trying to find an efficient way to parse a
Kafka stream of CSV lines into a clean data model and route the lines in
error to a specific topic.

Generally I do this:
1. First a map to split my lines on the separator character (";")
2. Then a filter where I put all my conditions (number of fields...)
3. Then subtract the second from the first to get the lines in error and
save them to a topic

The problem with this approach is that I cannot efficiently test the parsing
of String fields into other types like Int or Date. I would like to:
- test for incomplete lines (array length < x)
- test for empty fields
- test field casting into Int, Long...
- handle the fact that some errors are evicting (the line goes to the error
topic), some aren't (use Try getOrElse?)

How do you generally achieve this? I cannot find any good data cleaning
example...
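For what it's worth, a rough sketch of the kind of parsing I have in mind,
using Either to keep both the clean records and the lines in error (the
Event fields, the 3-column layout and the lines RDD are just an example):

import scala.util.{Failure, Success, Try}

case class Event(id: Long, name: String, date: java.sql.Date)

def parse(line: String): Either[(String, String), Event] = {
  val fields = line.split(";", -1)
  if (fields.length < 3) Left(line -> "incomplete line")
  else if (fields.exists(_.trim.isEmpty)) Left(line -> "empty field")
  else Try(Event(fields(0).toLong, fields(1), java.sql.Date.valueOf(fields(2)))) match {
    case Success(e)  => Right(e)
    case Failure(ex) => Left(line -> s"cast error: ${ex.getMessage}")
  }
}

// lines: RDD[String], e.g. obtained inside foreachRDD on the Kafka stream
val parsed = lines.map(parse).cache()
val clean  = parsed.collect { case Right(e) => e }                       // clean data model
val errors = parsed.collect { case Left((l, reason)) => s"$reason;$l" }  // to the error topic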


Re: Fast write datastore...

2017-03-15 Thread vincent gromakowski
Hi
If queries are static and the filters are on the same columns, Cassandra is
a good option.
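For example, with the DataStax spark-cassandra-connector, writing the
pre-aggregated results and reading them back with a pushdown-friendly filter
could look roughly like this (keyspace, table and column names are made up,
"aggregates" is the DataFrame of results, and the table's partition key must
match the static filter columns):

import org.apache.spark.sql.SaveMode

aggregates.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "daily_aggregates"))
  .mode(SaveMode.Append)
  .save()

// Subsequent simple queries: the partition-key filter is pushed down to Cassandra
val slice = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "daily_aggregates"))
  .load()
  .filter("day = '2017-03-15'")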

Le 15 mars 2017 7:04 AM, "muthu"  a écrit :

Hello there,

I have one or more parquet files to read and perform some aggregate queries
using Spark Dataframe. I would like to find a reasonably fast datastore that
allows me to write the results for subsequent (simpler) queries.
I did attempt to use ElasticSearch to write the query results using the
ElasticSearch Hadoop connector. But I am running into connector write issues
if the number of Spark executors is too many for ElasticSearch to handle.
But in the schema sense, this seems a great fit as ElasticSearch has smarts
in place to discover the schema. Also in the query sense, I can perform
simple filters and sorts using ElasticSearch and for more complex aggregates,
Spark Dataframe can come back to the rescue :).
Please advise on other possible data-stores I could use?

Thanks,
Muthu



--
View this message in context: http://apache-spark-user-list.
1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Scaling Kafka Direct Streaming application

2017-03-15 Thread vincent gromakowski
You would probably need dynamic allocation, which is only available on YARN
and Mesos. Or wait for the ongoing Spark-on-Kubernetes integration.


Le 15 mars 2017 1:54 AM, "Pranav Shukla"  a
écrit :

> How to scale or possibly auto-scale a spark streaming application
> consuming from kafka and using kafka direct streams. We are using spark
> 1.6.3, cannot move to 2.x unless there is a strong reason.
>
> Scenario:
> Kafka topic with 10 partitions
> Standalone cluster running on kubernetes with 1 master and 2 workers
>
> What we would like to know?
> Increase the number of partitions (say from 10 to 15)
> Add additional worker node without restarting the streaming application
> and start consuming off the additional partitions.
>
> Is this possible? i.e. start additional workers in standalone cluster to
> auto-scale an existing spark streaming application that is already running
> or we have to stop and resubmit the streaming app?
>
> Best Regards,
> Pranav Shukla
>


Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
I forgot to mention that it also depends on the Spark Kafka connector you
use. If it's receiver-based, I recommend a dedicated ZooKeeper cluster
because it is used to store offsets. If it's receiver-less, ZooKeeper can be
shared.
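For reference, the receiver-less (direct) approach with the 0.10 integration
looks roughly like this; offsets then live in Kafka itself rather than in
ZooKeeper (broker addresses, group id and topic are placeholders, and ssc is
an existing StreamingContext):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-app",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// No receiver, and no ZooKeeper dependency for offset storage
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))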

2017-03-03 9:29 GMT+01:00 Jörn Franke :

> I think this highly depends on the risk that you want to be exposed to. If
> you have it on dedicated nodes there is less influence of other processes.
>
> I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not
> recommend to put it on data nodes/heavily utilized nodes.
>
> Zookeeper does not need many resources (if you do not abuse it) and you
> may think about putting it on a dedicated small infrastructure of several
> nodes.
>
> On 3 Mar 2017, at 09:15, Mich Talebzadeh 
> wrote:
>
>
> hi,
>
> In DEV, Kafka and ZooKeeper services can be co-located on the same
> physical hosts
>
> In Prod moving forward do we need to set up Zookeeper on its own cluster
> not sharing with Hadoop cluster? Can these services be shared within the
> Hadoop cluster?
>
> How best to set up Zookeeper that is needed for Kafka for use with Spark
> Streaming?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread vincent gromakowski
Hi,
Depending on the Kafka version (< 0.8.2 I think), offsets are managed in
ZooKeeper, and if you have lots of consumers it's recommended to use a
dedicated ZooKeeper cluster (always with dedicated disks; even SSD is
better). On newer versions offsets are managed in special Kafka topics and
ZooKeeper is only used to store metadata, so you can share it with Hadoop.
Maybe you can reach a limit depending on the size of your Kafka deployment,
the number of topics, producers/consumers... but I have never heard of it
yet. Another point is to be careful about security on ZooKeeper: sharing a
cluster means you get the same security level (authentication or not).

2017-03-03 9:15 GMT+01:00 Mich Talebzadeh :

>
> hi,
>
> In DEV, Kafka and ZooKeeper services can be co-located on the same
> physical hosts
>
> In Prod moving forward do we need to set up Zookeeper on its own cluster
> not sharing with Hadoop cluster? Can these services be shared within the
> Hadoop cluster?
>
> How best to set up Zookeeper that is needed for Kafka for use with Spark
> Streaming?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: is dataframe thread safe?

2017-02-15 Thread vincent gromakowski
I would like to have your opinion about an idea I had...

I am thinking of addressing the issue of interactive queries on small/medium
datasets (max 500 GB or 1 TB) with a solution based on the Thrift Server and
Spark cache management. Currently the problem with caching the dataset in
Spark is that you cannot have high data freshness and the cache isn't
resilient.
If DataFrame is thread safe, would it be possible to implement a cache
management strategy that periodically refreshes the cached dataset from the
backends?

Another question regarding persist with MEMORY_AND_DISK: what
promotion/eviction strategy is implemented? Is it FIFO, LIFO, or heat-based?

Note: I already know Alluxio and it could potentially also solve this
issue, but my question is on Spark only, as I would like to benefit from the
Tungsten project and the no-serialization options...
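To make the refresh idea concrete, a rough sketch (assuming, as discussed
below, that DataFrames are thread safe; "spark" is an existing SparkSession
and the source path is made up) of a background thread that periodically
rebuilds the cached dataset while query threads keep using the currently
published reference:

import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.DataFrame

def load(): DataFrame = {
  val df = spark.read.parquet("/data/source").cache()
  df.count()                     // force materialization of the cache
  df
}

// Query threads always read this reference; the refresher swaps it atomically
val current = new AtomicReference[DataFrame](load())

val refresher = Executors.newSingleThreadScheduledExecutor()
refresher.scheduleAtFixedRate(new Runnable {
  def run(): Unit = {
    val fresh = load()
    val old = current.getAndSet(fresh)
    old.unpersist()              // evict the stale cached copy
  }
}, 5, 5, TimeUnit.MINUTES)

// e.g. in a Thrift Server / query thread:
// current.get().filter("day = current_date()").count()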

2017-02-15 9:05 GMT+01:00 萝卜丝炒饭 <1427357...@qq.com>:

> updating dataframe  returns NEW dataframe  like RDD please?
>
> ---Original---
> *From:* "vincent gromakowski"
> *Date:* 2017/2/14 01:15:35
> *To:* "Reynold Xin";
> *Cc:* "user";"Mendelson, Assaf"<
> assaf.mendel...@rsa.com>;
> *Subject:* Re: is dataframe thread safe?
>
> How about having a thread that update and cache a dataframe in-memory next
> to other threads requesting this dataframe, is it thread safe ?
>
> 2017-02-13 9:02 GMT+01:00 Reynold Xin :
>
>> Yes your use case should be fine. Multiple threads can transform the same
>> data frame in parallel since they create different data frames.
>>
>>
>> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf 
>> wrote:
>>
>>> Hi,
>>>
>>> I was wondering if dataframe is considered thread safe. I know the spark
>>> session and spark context are thread safe (and actually have tools to
>>> manage jobs from different threads) but the question is, can I use the same
>>> dataframe in both threads.
>>>
>>> The idea would be to create a dataframe in the main thread and then in
>>> two sub threads do different transformations and actions on it.
>>>
>>> I understand that some things might not be thread safe (e.g. if I
>>> unpersist in one thread it would affect the other. Checkpointing would
>>> cause similar issues), however, I can’t find any documentation as to what
>>> operations (if any) are thread safe.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Assaf.
>>>
>>
>


Re: is dataframe thread safe?

2017-02-13 Thread vincent gromakowski
How about having a thread that update and cache a dataframe in-memory next
to other threads requesting this dataframe, is it thread safe ?

2017-02-13 9:02 GMT+01:00 Reynold Xin :

> Yes your use case should be fine. Multiple threads can transform the same
> data frame in parallel since they create different data frames.
>
>
> On Sun, Feb 12, 2017 at 9:07 AM Mendelson, Assaf 
> wrote:
>
>> Hi,
>>
>> I was wondering if dataframe is considered thread safe. I know the spark
>> session and spark context are thread safe (and actually have tools to
>> manage jobs from different threads) but the question is, can I use the same
>> dataframe in both threads.
>>
>> The idea would be to create a dataframe in the main thread and then in
>> two sub threads do different transformations and actions on it.
>>
>> I understand that some things might not be thread safe (e.g. if I
>> unpersist in one thread it would affect the other. Checkpointing would
>> cause similar issues), however, I can’t find any documentation as to what
>> operations (if any) are thread safe.
>>
>>
>>
>> Thanks,
>>
>> Assaf.
>>
>


Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread vincent gromakowski
Spark JobServer or the Livy server are the best options for a purely
technical API. If you want to publish a business API you will probably have
to build your own app, like the one I wrote a year ago:
https://github.com/elppc/akka-spark-experiments
It combines Akka actors and a shared Spark context to serve concurrent
sub-second jobs.
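The core of that project is roughly the following pattern; this is a
simplified sketch rather than the actual code, and the Query message, table
and path are made up:

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.sql.SparkSession

case class Query(sql: String)

// Each actor handles requests against the same pre-loaded SparkSession
class QueryActor(spark: SparkSession) extends Actor {
  def receive = {
    case Query(sql) => sender() ! spark.sql(sql).collect()
  }
}

val spark = SparkSession.builder().appName("shared-context").getOrCreate()
spark.read.parquet("/data/events").createOrReplaceTempView("events")

val system = ActorSystem("spark-actors")
val workers = (1 to 4).map(i => system.actorOf(Props(new QueryActor(spark)), s"worker-$i"))
// Requests dispatched to the workers run concurrently on the single Spark context,
// so every job reuses the already-warm executors instead of paying a new spark-submit.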


2017-02-07 15:28 GMT+01:00 ayan guha :

> I think you are loking for livy or spark  jobserver
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca 
> wrote:
>
>> I want to run different jobs on demand with same spark context, but i
>> don't know how exactly i can do this.
>>
>> I try to get current context, but seems it create a new spark
>> context(with new executors).
>>
>> I call spark-submit to add new jobs.
>>
>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with
>> yarn as resource manager.
>>
>> My code:
>>
>> val sparkContext = SparkContext.getOrCreate()
>> val content = 1 to 4
>> val result = sparkContext.parallelize(content, 5)
>> result.map(value => value.toString).foreach(loop)
>>
>> def loop(x: String): Unit = {
>>for (a <- 1 to 3000) {
>>
>>}
>> }
>>
>> spark-submit:
>>
>> spark-submit --executor-cores 1 \
>>  --executor-memory 1g \
>>  --driver-memory 1g \
>>  --master yarn \
>>  --deploy-mode cluster \
>>  --conf spark.dynamicAllocation.enabled=true \
>>  --conf spark.shuffle.service.enabled=true \
>>  --conf spark.dynamicAllocation.minExecutors=1 \
>>  --conf spark.dynamicAllocation.maxExecutors=3 \
>>  --conf spark.dynamicAllocation.initialExecutors=3 \
>>  --conf spark.executor.instances=3 \
>>
>> If i run twice spark-submit it create 6 executors, but i want to run all
>> this jobs on same spark application.
>>
>> How can achieve adding jobs to an existing spark application?
>>
>> I don't understand why SparkContext.getOrCreate() don't get existing
>> spark context.
>>
>>
>> Thanks,
>>
>> Cosmin P.
>>
> --
> Best Regards,
> Ayan Guha
>


RE: filters Pushdown

2017-02-02 Thread vincent gromakowski
There are some native connectors (in the doc) and some third-party ones (in
Spark Packages: https://spark-packages.org/?q=tags%3A"Data%20Sources")
Parquet is the preferred native one. Cassandra/FiloDB provide the most
advanced pushdowns.
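An easy way to check what a given source actually pushes down is to look at
the physical plan; for instance with a Parquet source (path and column names
are made up):

import spark.implicits._

val events = spark.read.parquet("/data/events")
events.filter($"eventDate" >= "2017-01-01" && $"userId" === 42L).explain()
// The file scan node of the plan lists the predicates it received, e.g.
// PushedFilters: [GreaterThanOrEqual(eventDate,2017-01-01), EqualTo(userId,42)]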

Le 2 févr. 2017 11:23 AM, "Peter Shmukler"  a écrit :

> Hi Vincent,
>
> Thank you for answer. (I don’t see your answer in mailing list, so I’m
> answering directly)
>
>
>
> What connectors can I work with from Spark?
>
> Can you provide any link to read about it because I see nothing in Spark
> documentation?
>
>
>
>
>
> *From:* vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
> *Sent:* Thursday, February 2, 2017 12:12 PM
> *To:* Peter Shmukler 
> *Cc:* user@spark.apache.org
> *Subject:* Re: filters Pushdown
>
>
>
> Pushdowns depend on the source connector.
> Join pushdown with Cassandra only
> Filter pushdown with mainly all sources with some specific constraints
>
>
>
> Le 2 févr. 2017 10:42 AM, "Peter Sg"  a écrit :
>
> Can community help me to figure out some details about Spark:
> -   Does Spark support filter Pushdown for types:
>   o Int/long
>   o DateTime
>   o String
> -   Does Spark support Pushdown of join operations for partitioned
> tables (in
> case of join condition includes partitioning field)?
> -   Does Spark support Pushdown on Parquet, ORC ?
>   o Should I use Hadoop, or is NTFS/NFS an option as well?
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/filters-Pushdown-tp28357.html
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dspark-2Duser-2Dlist.1001560.n3.nabble.com_filters-2DPushdown-2Dtp28357.html&d=DwMFaQ&c=7s4bs_giP1ngjwWhX4oayQ&r=kLWLAWGkyIRgjRCprqh7QX1OMFp1eBZjlRawqzDlMWc&m=Zss0q3yuZVzxFuqvPaXLIOHACrxzZOjevU-VE8Eeh04&s=dupzi0-PiyPLCmvPqwWSt2NaEE5hUKlbzmB4-NRuhfg&e=>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> This email and any attachments thereto may contain private, confidential,
> and privileged material for the sole use of the intended recipient. Any
> review, copying, or distribution of this email (or any attachments thereto)
> by others is strictly prohibited. If you are not the intended recipient,
> please contact the sender immediately and permanently delete the original
> and any copies of this email and any attachments thereto.
>


Re: filters Pushdown

2017-02-02 Thread vincent gromakowski
Pushdowns depend on the source connector.
Join pushdown is available with Cassandra only.
Filter pushdown works with mainly all sources, with some source-specific
constraints.

Le 2 févr. 2017 10:42 AM, "Peter Sg"  a écrit :

> Can community help me to figure out some details about Spark:
> -   Does Spark support filter Pushdown for types:
>   o Int/long
>   o DateTime
>   o String
> -   Does Spark support Pushdown of join operations for partitioned
> tables (in
> case of join condition includes partitioning field)?
> -   Does Spark support Pushdown on Parquet, ORC ?
>   o Should I use Hadoop, or is NTFS/NFS an option as well?
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/filters-Pushdown-tp28357.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Having multiple spark context

2017-01-29 Thread vincent gromakowski
A clustering lib is necessary to manage multiple JVMs. Akka Cluster, for
instance.

Le 30 janv. 2017 8:01 AM, "Rohit Verma"  a
écrit :

> Hi,
>
> If I am right, you need to launch other context from another jvm. If you
> are trying to launch from same jvm another context it will return you the
> existing context.
>
> Rohit
>
> On Jan 30, 2017, at 12:24 PM, Mark Hamstra 
> wrote:
>
> More than one Spark Context in a single Application is not supported.
>
> On Sun, Jan 29, 2017 at 9:08 PM,  wrote:
>
>> Hi,
>>
>>
>>
>> I have a requirement in which, my application creates one Spark context
>> in Distributed mode whereas another Spark context in local mode.
>>
>> When I am creating this, my complete application is working on only one
>> SparkContext (created in Distributed mode). Second spark context is not
>> getting created.
>>
>>
>>
>> Can you please help me out in how to create two spark contexts.
>>
>>
>>
>> Regards,
>>
>> Jasbir singh
>>
>> --
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Where allowed
>> by local law, electronic communications with Accenture and its affiliates,
>> including e-mail and instant messaging (including content), may be scanned
>> by our systems for the purposes of information security and assessment of
>> internal compliance with Accenture policy.
>> 
>> __
>>
>> www.accenture.com
>>
>
>
>


Re: spark locality

2017-01-14 Thread vincent gromakowski
Should I open a ticket to allow data locality in an IP-per-container context?

2017-01-12 23:41 GMT+01:00 Michael Gummelt :

> If the executor reports a different hostname inside the CNI container,
> then no, I don't think so.
>
> On Thu, Jan 12, 2017 at 2:28 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> So even if I make the Spark executors run on the same node as Casssandra
>> nodes, I am not sure each worker will connect to c* nodes on the same mesos
>> agent ?
>>
>> 2017-01-12 21:13 GMT+01:00 Michael Gummelt :
>>
>>> The code in there w/ docs that reference CNI doesn't actually run when
>>> CNI is in effect, and doesn't have anything to do with locality.  It's just
>>> making Spark work in a no-DNS environment
>>>
>>> On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>>> I have found this but I am not sure how it can help...
>>>> https://github.com/mesosphere/spark-build/blob/a9efef8850976
>>>> f787956660262f3b77cd636f3f5/conf/spark-env.sh
>>>>
>>>>
>>>> 2017-01-12 20:16 GMT+01:00 Michael Gummelt :
>>>>
>>>>> That's a good point. I hadn't considered the locality implications of
>>>>> CNI yet.  I think tasks are placed based on the hostname reported by the
>>>>> executor, which in a CNI container will be different than the
>>>>> HDFS/Cassandra hostname.  I'm not aware of anyone running Spark+CNI in 
>>>>> prod
>>>>> yet, either.
>>>>>
>>>>> However, locality in Mesos isn't great right now anyway.  Executors
>>>>> are placed w/o regard to locality.  Locality is only taken into account
>>>>> when tasks are assigned to executors.  So if you get a locality-poor
>>>>> executor placement, you'll also have locality poor task placement.  It
>>>>> could be better.
>>>>>
>>>>> On Thu, Jan 12, 2017 at 7:55 AM, vincent gromakowski <
>>>>> vincent.gromakow...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> Does anyone have experience running Spark on Mesos with CNI (ip per
>>>>>> container) ?
>>>>>> How would Spark use IP or hostname for data locality with backend
>>>>>> framework like HDFS or Cassandra ?
>>>>>>
>>>>>> V
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Michael Gummelt
>>>>> Software Engineer
>>>>> Mesosphere
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: spark locality

2017-01-12 Thread vincent gromakowski
So even if I make the Spark executors run on the same nodes as the Cassandra
nodes, I am not sure each worker will connect to the C* nodes on the same
Mesos agent?

2017-01-12 21:13 GMT+01:00 Michael Gummelt :

> The code in there w/ docs that reference CNI doesn't actually run when CNI
> is in effect, and doesn't have anything to do with locality.  It's just
> making Spark work in a no-DNS environment
>
> On Thu, Jan 12, 2017 at 12:04 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> I have found this but I am not sure how it can help...
>> https://github.com/mesosphere/spark-build/blob/a9efef8850976
>> f787956660262f3b77cd636f3f5/conf/spark-env.sh
>>
>>
>> 2017-01-12 20:16 GMT+01:00 Michael Gummelt :
>>
>>> That's a good point. I hadn't considered the locality implications of
>>> CNI yet.  I think tasks are placed based on the hostname reported by the
>>> executor, which in a CNI container will be different than the
>>> HDFS/Cassandra hostname.  I'm not aware of anyone running Spark+CNI in prod
>>> yet, either.
>>>
>>> However, locality in Mesos isn't great right now anyway.  Executors are
>>> placed w/o regard to locality.  Locality is only taken into account when
>>> tasks are assigned to executors.  So if you get a locality-poor executor
>>> placement, you'll also have locality poor task placement.  It could be
>>> better.
>>>
>>> On Thu, Jan 12, 2017 at 7:55 AM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> Does anyone have experience running Spark on Mesos with CNI (ip per
>>>> container) ?
>>>> How would Spark use IP or hostname for data locality with backend
>>>> framework like HDFS or Cassandra ?
>>>>
>>>> V
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: spark locality

2017-01-12 Thread vincent gromakowski
I have found this but I am not sure how it can help...
https://github.com/mesosphere/spark-build/blob/a9efef8850976f787956660262f3b77cd636f3f5/conf/spark-env.sh


2017-01-12 20:16 GMT+01:00 Michael Gummelt :

> That's a good point. I hadn't considered the locality implications of CNI
> yet.  I think tasks are placed based on the hostname reported by the
> executor, which in a CNI container will be different than the
> HDFS/Cassandra hostname.  I'm not aware of anyone running Spark+CNI in prod
> yet, either.
>
> However, locality in Mesos isn't great right now anyway.  Executors are
> placed w/o regard to locality.  Locality is only taken into account when
> tasks are assigned to executors.  So if you get a locality-poor executor
> placement, you'll also have locality poor task placement.  It could be
> better.
>
> On Thu, Jan 12, 2017 at 7:55 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Hi all,
>> Does anyone have experience running Spark on Mesos with CNI (ip per
>> container) ?
>> How would Spark use IP or hostname for data locality with backend
>> framework like HDFS or Cassandra ?
>>
>> V
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


spark locality

2017-01-12 Thread vincent gromakowski
Hi all,
Does anyone have experience running Spark on Mesos with CNI (ip per
container) ?
How would Spark use IP or hostname for data locality with backend framework
like HDFS or Cassandra ?

V


Re: Question about Spark and filesystems

2016-12-18 Thread vincent gromakowski
I am using Gluster and I have decent performance with basic maintenance
effort. Advantage of Gluster: you can plug Alluxio on top to improve
performance, but I still need to validate that...

Le 18 déc. 2016 8:50 PM,  a écrit :

> Hello,
>
> We are trying out Spark for some file processing tasks.
>
> Since each Spark worker node needs to access the same files, we have
> tried using Hdfs. This worked, but there were some oddities making me a
> bit uneasy. For dependency hell reasons I compiled a modified Spark, and
> this version exhibited the odd behaviour with Hdfs. The problem might
> have nothing to do with Hdfs, but the situation made me curious about
> the alternatives.
>
> Now I'm wondering what kind of file system would be suitable for our
> deployment.
>
> - There won't be a great number of nodes. Maybe 10 or so.
>
> - The datasets won't be big by big-data standards(Maybe a couple of
>   hundred gb)
>
> So maybe I could just use a NFS server, with a caching client?
> Or should I try Ceph, or Glusterfs?
>
> Does anyone have any experiences to share?
>
> --
> Joakim Verona
> joa...@verona.se
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Is there synchronous way to predict against model for real time data

2016-12-15 Thread vincent gromakowski
Something like this?
https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/
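On the synchronous side, note that an MLlib model loaded once in the serving
layer can score a single feature vector locally, without launching a Spark
job per request; a rough sketch (model path and feature layout are made up,
sc is the SparkContext):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Loaded once at startup of the serving layer
val model = LogisticRegressionModel.load(sc, "/models/churn-lr")

// Called per request: purely local computation, returns immediately
def score(features: Array[Double]): Double =
  model.predict(Vectors.dense(features))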

Le 16 déc. 2016 1:08 AM, "suyogchoudh...@gmail.com" <
suyogchoudh...@gmail.com> a écrit :

> Hi,
>
> I have question about, how can I real time make decision using a model I
> have created with Spark ML.
>
> 1. I have some data and created model using it.
> // Train the model
>
> val model = new
> LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)
>
> 2. I believe, I can use spark streaming to get real time feed and then
> predict result against model created in step1
>
> 3. My question is, how can I do it in synchronous way?
>
> For e.g. lets say if some customer logs in to my site, then according to
> his
> data, I want to personalize his site. I want to send his attributes to
> model
> and get prediction before rendering anything on page.
>
> How can I do this synchronously?
>
> Regards,
>
> Suyog
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Is-there-synchronous-way-to-
> predict-against-model-for-real-time-data-tp28222.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
What about ephemeral storage on SSD? If performance is required it's
generally for production, so the cluster would never be stopped. A Spark job
to back up to / restore from S3 then allows you to shut the cluster down
completely.
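A sketch of the backup side of that idea, assuming the s3a connector is
configured (bucket, keyspace and table names are made up):

// Dump the Cassandra table to Parquet on S3 before tearing the cluster down
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "prod", "table" -> "events"))
  .load()
  .write.mode("overwrite")
  .parquet("s3a://my-backups/prod/events")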

Le 3 déc. 2016 1:28 PM, "David Mitchell"  a
écrit :

> To get a node local read from Spark to Cassandra, one has to use a read
> consistency level of LOCAL_ONE.  For some use cases, this is not an
> option.  For example, if you need to use a read consistency level
> of LOCAL_QUORUM, as many use cases demand, then one is not going to get a
> node local read.
>
> Also, to insure a node local read, one has to set spark.locality.wait to
> zero.  Whether or not a partition will be streamed to another node or
> computed locally is dependent on the spark.locality.wait parameters. This
> parameter can be set to 0 to force all partitions to only be computed on
> local nodes.
>
> If you do some testing, please post your performance numbers.
>
>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
You get more latency on reads, so the overall execution time is longer.

Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :

>
> I wonder what benefits I really get if I colocate my spark worker
> process and Cassandra server process on each node?
>
> I understand the concept of moving compute towards the data instead of
> moving data towards computation but It sounds more like one is trying to
> optimize for network latency.
>
> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
> second) Network throughput.
>
> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>
> so In this case I don't see how colocation can help even if there is one
> to one mapping from spark worker node to a colocated Cassandra node where
> say we are doing a table scan of billion rows ?
>
> Thanks!
>
>


Re: Do I have to wrap akka around spark streaming app?

2016-11-29 Thread vincent gromakowski
You can still achieve it by implementing an actor in each partition, but I
am not sure it's a good design regarding scalability, because your
distributed actors would send a message for each event to your single app
actor, which would be a huge load.
If you want to experiment with this, and because actors are thread safe, you
can use the following pattern, which allows reusing actors between
micro-batches in each partition:
http://allegro.tech/2015/08/spark-kafka-integration.html
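The gist of that pattern is a lazily created, per-executor singleton that
every partition reuses across micro-batches; a simplified sketch (the
completedEvents DStream and its userId/videoId string fields are
assumptions, and the Kafka producer here stands for whatever actor or sink
you need):

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// One instance per executor JVM, created lazily and reused across micro-batches
object NotificationSink {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new java.util.Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

completedEvents.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    events.foreach { e =>
      // Runs on the executor; NotificationSink is initialized there, not serialized
      NotificationSink.producer.send(new ProducerRecord("completions", e.userId, e.videoId))
    }
  }
}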


2016-11-29 2:18 GMT+01:00 shyla deshpande :

> Hello All,
>
> I just want to make sure this is a right use case for Kafka --> Spark
> Streaming
>
> Few words about my use case :
>
> When the user watches a video, I get the position events from the user
> that indicates how much they have completed viewing and at a certain point,
> I mark that Video as complete and that triggers a lot of other events. I
> need a way to notify the app about the creation of the completion event.
>
> Appreciate any suggestions.
>
> Thanks
>
>
> On Mon, Nov 28, 2016 at 2:35 PM, shyla deshpande  > wrote:
>
>> In this case, persisting to Cassandra is for future analytics and
>> Visualization.
>>
>> I want to notify that the app of the event, so it makes the app
>> interactive.
>>
>> Thanks
>>
>> On Mon, Nov 28, 2016 at 2:24 PM, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> Sorry I don't understand...
>>> Is it a cassandra acknowledge to actors that you want ? Why do you want
>>> to ack after writing to cassandra ? Your pipeline kafka=>spark=>cassandra
>>> is supposed to be exactly once, so you don't need to wait for cassandra
>>> ack, you can just write to kafka from actors and then notify the user ?
>>>
>>> 2016-11-28 23:15 GMT+01:00 shyla deshpande :
>>>
>>>> Thanks Vincent for the input. Not sure I understand your suggestion.
>>>> Please clarify.
>>>>
>>>> Few words about my use case :
>>>> When the user watches a video, I get the position events from the user
>>>> that indicates how much they have completed viewing and at a certain point,
>>>> I mark that Video as complete and persist that info to cassandra.
>>>>
>>>> How do I notify the user that it was marked complete?
>>>>
>>>> Are you suggesting I write the completed events to kafka(different
>>>> topic) and the akka consumer could read from this? There could be many
>>>> completed events from different users in this topic. So the akka consumer
>>>> should pretty much do what a spark streaming does to process this without
>>>> the knowledge of the kafka offset.
>>>>
>>>> So not sure what you mean by kafka offsets will do the job, how will
>>>> the akka consumer know the kafka offset?
>>>>
>>>> On Mon, Nov 28, 2016 at 12:52 PM, vincent gromakowski <
>>>> vincent.gromakow...@gmail.com> wrote:
>>>>
>>>>> You don't need actors to do kafka=>spark processing=>kafka
>>>>> Why do you need to notify the akka producer ? If you need to get back
>>>>> the processed message in your producer, then implement an akka consummer 
>>>>> in
>>>>> your akka app and kafka offsets will do the job
>>>>>
>>>>> 2016-11-28 21:46 GMT+01:00 shyla deshpande :
>>>>>
>>>>>> Thanks Daniel for the response.
>>>>>>
>>>>>> I am planning to use Spark streaming to do Event Processing. I will
>>>>>> have akka actors sending messages to kafka. I process them using Spark
>>>>>> streaming and as a result a new events will be generated. How do I notify
>>>>>> the akka actor(Message producer)  that a new event has been generated?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 28, 2016 at 9:51 AM, Daniel van der Ende <
>>>>>> daniel.vandere...@gmail.com> wrote:
>>>>>>
>>>>>>> Well, I would say it depends on what you're trying to achieve. Right
>>>>>>> now I don't know why you are considering using Akka. Could you please
>>>>>>> explain your use case a bit?
>>>>>>>
>>>>>>> In general, there is no single correct answer to your current
>>>>>>> question as it's quite broad.
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> On Mon, Nov 28, 2016 at 9:11 AM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My data pipeline is Kafka --> Spark Streaming --> Cassandra.
>>>>>>>>
>>>>>>>> Can someone please explain me when would I need to wrap akka around
>>>>>>>> the spark streaming app. My knowledge of akka and the actor system is 
>>>>>>>> poor.
>>>>>>>> Please help!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Daniel
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
Sorry, I don't understand...
Is it a Cassandra acknowledgement to the actors that you want? Why do you
want to ack after writing to Cassandra? Your pipeline kafka => spark =>
cassandra is supposed to be exactly-once, so you don't need to wait for the
Cassandra ack; you can just write to Kafka from the actors and then notify
the user.

2016-11-28 23:15 GMT+01:00 shyla deshpande :

> Thanks Vincent for the input. Not sure I understand your suggestion.
> Please clarify.
>
> Few words about my use case :
> When the user watches a video, I get the position events from the user
> that indicates how much they have completed viewing and at a certain point,
> I mark that Video as complete and persist that info to cassandra.
>
> How do I notify the user that it was marked complete?
>
> Are you suggesting I write the completed events to kafka(different topic)
> and the akka consumer could read from this? There could be many completed
> events from different users in this topic. So the akka consumer should
> pretty much do what a spark streaming does to process this without the
> knowledge of the kafka offset.
>
> So not sure what you mean by kafka offsets will do the job, how will the
> akka consumer know the kafka offset?
>
> On Mon, Nov 28, 2016 at 12:52 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> You don't need actors to do kafka=>spark processing=>kafka
>> Why do you need to notify the akka producer ? If you need to get back the
>> processed message in your producer, then implement an akka consummer in
>> your akka app and kafka offsets will do the job
>>
>> 2016-11-28 21:46 GMT+01:00 shyla deshpande :
>>
>>> Thanks Daniel for the response.
>>>
>>> I am planning to use Spark streaming to do Event Processing. I will have
>>> akka actors sending messages to kafka. I process them using Spark streaming
>>> and as a result a new events will be generated. How do I notify the akka
>>> actor(Message producer)  that a new event has been generated?
>>>
>>>
>>>
>>> On Mon, Nov 28, 2016 at 9:51 AM, Daniel van der Ende <
>>> daniel.vandere...@gmail.com> wrote:
>>>
>>>> Well, I would say it depends on what you're trying to achieve. Right
>>>> now I don't know why you are considering using Akka. Could you please
>>>> explain your use case a bit?
>>>>
>>>> In general, there is no single correct answer to your current question
>>>> as it's quite broad.
>>>>
>>>> Daniel
>>>>
>>>> On Mon, Nov 28, 2016 at 9:11 AM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> My data pipeline is Kafka --> Spark Streaming --> Cassandra.
>>>>>
>>>>> Can someone please explain me when would I need to wrap akka around
>>>>> the spark streaming app. My knowledge of akka and the actor system is 
>>>>> poor.
>>>>> Please help!
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Daniel
>>>>
>>>
>>>
>>
>


Re: Do I have to wrap akka around spark streaming app?

2016-11-28 Thread vincent gromakowski
You don't need actors to do kafka => spark processing => kafka.
Why do you need to notify the Akka producer? If you need to get back the
processed message in your producer, then implement an Akka consumer in your
Akka app and the Kafka offsets will do the job.

2016-11-28 21:46 GMT+01:00 shyla deshpande :

> Thanks Daniel for the response.
>
> I am planning to use Spark streaming to do Event Processing. I will have
> akka actors sending messages to kafka. I process them using Spark streaming
> and as a result a new events will be generated. How do I notify the akka
> actor(Message producer)  that a new event has been generated?
>
>
>
> On Mon, Nov 28, 2016 at 9:51 AM, Daniel van der Ende <
> daniel.vandere...@gmail.com> wrote:
>
>> Well, I would say it depends on what you're trying to achieve. Right now
>> I don't know why you are considering using Akka. Could you please explain
>> your use case a bit?
>>
>> In general, there is no single correct answer to your current question as
>> it's quite broad.
>>
>> Daniel
>>
>> On Mon, Nov 28, 2016 at 9:11 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My data pipeline is Kafka --> Spark Streaming --> Cassandra.
>>>
>>> Can someone please explain me when would I need to wrap akka around the
>>> spark streaming app. My knowledge of akka and the actor system is poor.
>>> Please help!
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Daniel
>>
>
>


Re: Possible DR solution

2016-11-12 Thread vincent gromakowski
An HDFS tiering policy with good tags should be similar.

Le 11 nov. 2016 11:19 PM, "Mich Talebzadeh"  a
écrit :

> I really don't see why one wants to set up streaming replication unless
> for situations where similar functionality to transactional databases is
> required in big data?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 November 2016 at 17:24, Mich Talebzadeh 
> wrote:
>
>> I think it differs as it starts streaming data through its own port as
>> soon as the first block is landed. so the granularity is a block.
>>
>> however, think of it as oracle golden gate replication or sap replication
> for databases. the only difference is that if there is corruption in the
> block with hdfs it will be replicated much like srdf.
>>
>> whereas with oracle or sap it is log based replication which stops when
>> it encounters corruption.
>>
>> replication depends on the block. so can replicate hive metadata and
>> fsimage etc. but cannot replicate hbase memstore if hbase crashes.
>>
>> so that is the gist of it. streaming replication as opposed to snapshot.
>>
>> sounds familiar. think of it as log shipping in oracle old days versus
>> goldengate etc.
>>
>> hth
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 11 November 2016 at 17:14, Deepak Sharma 
>> wrote:
>>
>>> Reason being you can set up hdfs duplication on your own to some other
>>> cluster .
>>>
>>> On Nov 11, 2016 22:42, "Mich Talebzadeh" 
>>> wrote:
>>>
 reason being ?

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 11 November 2016 at 17:11, Deepak Sharma 
 wrote:

> This is waste of money I guess.
>
> On Nov 11, 2016 22:41, "Mich Talebzadeh" 
> wrote:
>
>> starts at $4,000 per node per year all inclusive.
>>
>> With discount it can be halved but we are talking a node itself so if
>> you have 5 nodes in primary and 5 nodes in DR we are talking about $40K
>> already.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 11 November 2016 at 16:43, Mudit Kumar 
>> wrote:
>>
>>> Is it feasible cost wise?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mudit
>>>
>>>
>>>
>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>> *Sent:* Friday, November 11, 2016 2:56 PM
>>> *To:* user @spark
>>> *Subject:* Possible DR solution
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> Has anyone had experience of using WanDisco
>>>  block replication to create a fault
>>> to

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread vincent gromakowski
I have already integrated common actors. I am also interested, especially to
see how we can achieve end-to-end back pressure.

2016-11-10 8:46 GMT+01:00 shyla deshpande :

> I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka,
> Spark Streaming and Cassandra using Structured Streaming. But the kafka
> source support for Structured Streaming is not yet available. So now I am
> trying to use Akka Stream as the source to Spark Streaming.
>
> Want to make sure I am heading in the right direction. Please direct me to
> any sample code and reading material for this.
>
> Thanks
>
>


Re: A Spark long running program as web server

2016-11-06 Thread vincent gromakowski
Hi,
Spark JobServer seems to be more mature than Livy, but both would work I
think. You will just get more functionality with the JobServer, except for
impersonation, which is only in Livy.
If you need to publish a business API I would recommend using Akka HTTP with
Spark actors sharing a preloaded Spark context, so you can publish a more
user-friendly API. JobServer has no way to specify endpoint URLs and API
verbs; its endpoints look more like a series of random numbers.
The other way to publish a business API is to build a classic API
application that triggers JobServer or Livy jobs through HTTP, but I think
it adds too much latency to run two HTTP requests.
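To make the Akka HTTP option concrete, a minimal sketch of a business
endpoint backed by a pre-loaded SparkSession (route, table, column and
parameter names are all made up; a real service would also add proper
marshalling, error handling and a bounded dispatcher or actors in front of
Spark):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.sql.SparkSession
import scala.concurrent.Future

implicit val system = ActorSystem("business-api")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher

val spark = SparkSession.builder().appName("api-backend").getOrCreate()
spark.read.parquet("/data/sales").createOrReplaceTempView("sales")  // pre-loaded once

val route =
  path("top-products") {
    parameter('limit.as[Int]) { limit =>
      complete {
        Future {  // keep the Spark action off the HTTP dispatcher thread
          spark.sql(s"select product, sum(amount) as total from sales group by product order by total desc limit $limit")
            .toJSON.collect().mkString("[", ",", "]")
        }
      }
    }
  }

Http().bindAndHandle(route, "0.0.0.0", 8080)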

2016-11-06 14:06 GMT+01:00 Reza zade :

> Hi
>
> I have written multiple spark driver programs that load some data from
> HDFS to data frames and accomplish spark sql queries on it and persist the
> results in HDFS again. Now I need to provide a long running java program in
> order to receive requests and their some parameters(such as the number of
> top rows should be returned) from a web application (e.g. a dashboard) via
> post and get and send back the results to web application. My web
> application is somewhere out of the Spark cluster. Briefly my goal is to
> send requests and their accompanying data from web application via
> something such as POST to long running java program. then it receives the
> request and runs the corresponding spark driver (spark app) and returns the
> results for example in JSON format.
>
>
> Whats is your suggestion to develop this use case?
> Is Livy a good choise? If your answer is positive what should I do?
>
> Thanks.
>
>


Re: How to avoid unnecessary spark starkups on every request?

2016-11-02 Thread vincent gromakowski
Hi
I am currently using Akka HTTP to send requests to multiple Spark actors
that use a preloaded Spark context and the FAIR scheduler. It's only a
prototype and I haven't tested the concurrency, but it seems like one of the
right ways to do it. Complete processing time is around 600 ms. The other
way would be to use a Spark job server, but I don't like splitting my REST
API in two (one business API in Akka HTTP and one technical API in the
jobserver).
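For the fair-scheduling part, these are roughly the knobs involved (the pool
name, allocation file path and the events table are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rest-backend")
  .config("spark.scheduler.mode", "FAIR")   // default is FIFO
  .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // optional pools
  .getOrCreate()

// In each request-handling thread / actor, before submitting the job:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "interactive")
val rows = spark.sql("select count(*) from events").collect()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)  // back to the default pool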

Le 2 nov. 2016 8:34 AM, "Fanjin Zeng"  a écrit :

>  Hi,
>
>  I am working on a project that takes requests from HTTP server and
> computes accordingly on spark. But the problem is when I receive many
> request at the same time, users have to waste a lot of time on the
> unnecessary startups that occur on each request. Does Spark have built-in
> job scheduler function to solve this problem or is there any trick can be
> used to avoid these unnecessary startups?
>
>  Thanks,
>  Fanjin
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
Bad idea: no caching, cluster over-consumption...
Have a look at instantiating a custom Thrift Server on top of temp tables,
with the FAIR scheduler to allow concurrent SQL requests. It's not a public
API but you can find some examples.
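What I mean by a custom Thrift Server is roughly this kind of bootstrap (a
sketch only; port, path and table name are illustrative, and it relies on an
internal API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("shared-thriftserver")
  .config("spark.sql.hive.thriftServer.singleSession", "true") // expose this session's temp views
  .config("hive.server2.thrift.port", "10001")
  .enableHiveSupport()
  .getOrCreate()

// Cache the hot dataset and register it once
spark.read.parquet("/data/events").cache().createOrReplaceTempView("events")

// Not a stable public API, but it lets JDBC/ODBC clients hit the cached temp view
HiveThriftServer2.startWithContext(spark.sqlContext)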

Le 28 oct. 2016 11:12 AM, "Mich Talebzadeh"  a
écrit :

> Hi,
>
> I think tempTable is private to the session that creates it. In Hive temp
> tables created by "CREATE TEMPORARY TABLE" are all private to the session.
> Spark is no different.
>
> The alternative may be everyone creates tempTable from the same DF?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 10:03, Chanh Le  wrote:
>
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>>
>> It just reuse 1 Spark Context by not letting it stop when the application
>> had done. Should check: livy, spark-jobserver
>> FAIR https://spark.apache.org/docs/1.2.0/job-scheduling.html just how
>> you scheduler your job in the pool but FAIR help you run job in parallel vs
>> FIFO (default) 1 job at the time.
>>
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>>
>>
>> Store metadata in Hive may help but I am not sure about this.
>> I use Spark Thrift Server create table on that then let Zeppelin query
>> from that.
>>
>> Regards,
>> Chanh
>>
>>
>>
>>
>>
>> On Oct 27, 2016, at 9:01 PM, Victor Shafran 
>> wrote:
>>
>> Hi Vincent,
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>> You help is highly appreciated
>> Victor
>>
>> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:
>>
>>> Hi Mich,
>>>
>>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>>> DataFrames among different applications and contexts. The data typically
>>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>>> om/blog/effective-spark-rdds-with-alluxio
>>>
>>> Also, Alluxio also has the concept of an "Under filesystem", which can
>>> help you access your existing data across different storage systems. Here
>>> is more information about the unified namespace abilities:
>>> http://www.alluxio.org/docs/master/en/Unified-and
>>> -Transparent-Namespace.html
>>>
>>> Hope that helps,
>>> Gene
>>>
>>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks Chanh,

 Can it share RDDs.

 Personally I have not used either Alluxio or Ignite.


1. Are there major differences between these two
2. Have you tried Alluxio for sharing Spark RDDs and if so do you
have any experience you can kindly share

 Regards


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 27 October 2016 at 11:29, Chanh Le  wrote:

> Hi Mich,
> Alluxio is the good option to go.
>
> Regards,
> Chanh
>
> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> There was a mention of using Zeppelin to share RDDs with many users.
> From the notes on Zeppelin it appears that this is sharing UI and I am not
> sure how easy it is going to be changing the result set with different
> users modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache
> Ignite. Has anyone really tried this. Will that work with multiple
> ap

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi
Just point all users at the same app with a common Spark context.
For instance, Akka HTTP receives queries from users and launches concurrent
Spark SQL queries in different actor threads. The only prerequisite is to
launch the different jobs in different threads (like with actors).
Be careful, it's not CRUD if one of the jobs modifies the dataset; it's OK
for read-only.

Le 27 oct. 2016 4:02 PM, "Victor Shafran"  a
écrit :

> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
> You help is highly appreciated
> Victor
>
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:
>
>> Hi Mich,
>>
>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>> DataFrames among different applications and contexts. The data typically
>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>> om/blog/effective-spark-rdds-with-alluxio
>>
>> Also, Alluxio also has the concept of an "Under filesystem", which can
>> help you access your existing data across different storage systems. Here
>> is more information about the unified namespace abilities:
>> http://www.alluxio.org/docs/master/en/Unified-and
>> -Transparent-Namespace.html
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>
 Hi Mich,
 Alluxio is the good option to go.

 Regards,
 Chanh

 On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
 wrote:


 There was a mention of using Zeppelin to share RDDs with many users.
 From the notes on Zeppelin it appears that this is sharing UI and I am not
 sure how easy it is going to be changing the result set with different
 users modifying say sql queries.

 There is also the idea of caching RDDs with something like Apache
 Ignite. Has anyone really tried this. Will that work with multiple
 applications?

 It looks feasible as RDDs are immutable and so are registered
 tempTables etc.

 Thanks


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.





>>>
>>
>
>
> --
>
> Victor Shafran
>
> VP R&D| Equalum
>
> Mobile: +972-523854883 | Email: victor.shaf...@equalum.io
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
For this you will need to contribute...

Le 27 oct. 2016 1:35 PM, "Mich Talebzadeh"  a
écrit :

> so I assume Ignite will not work with Spark version >=2?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 12:27, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> some options:
>> - ignite for spark 1.5, can deep store on cassandra
>> - alluxio for all spark versions, can deep store on hdfs, gluster...
>>
>> ==> these are best for sharing between jobs
>>
>> - shared sparkcontext and fair scheduling, seems to be not thread safe
>> - spark jobserver and namedRDD, CRUD thread safe RDD sharing between
>> spark jobs
>> ==> these are best for sharing between users
>>
>> 2016-10-27 12:59 GMT+02:00 vincent gromakowski <
>> vincent.gromakow...@gmail.com>:
>>
>>> I would prefer sharing the spark context  and using FAIR scheduler for
>>> user concurrency
>>>
>>> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh" 
>>> a écrit :
>>>
>>>> thanks Vince.
>>>>
>>>> So Ignite uses some hash/in-memory indexing.
>>>>
>>>> The question is in practice is there much use case to use these two
>>>> fabrics for sharing RDDs.
>>>>
>>>> Remember all RDBMSs do this through shared memory.
>>>>
>>>> In layman's term if I have two independent spark-submit running, can
>>>> they share result set. For example the same tempTable etc?
>>>>
>>>> Cheers
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 27 October 2016 at 11:44, vincent gromakowski <
>>>> vincent.gromakow...@gmail.com> wrote:
>>>>
>>>>> Ignite works only with spark 1.5
>>>>> Ignite leverage indexes
>>>>> Alluxio provides tiering
>>>>> Alluxio easily integrates with underlying FS
>>>>>
>>>>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" 
>>>>> a écrit :
>>>>>
>>>>>> Thanks Chanh,
>>>>>>
>>>>>> Can it share RDDs.
>>>>>>
>>>>>> Personally I have not used either Alluxio or Ignite.
>>>>>>
>>>>>>
>>>>>>1. Are there major differences between these two
>>>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>>>have any experience you can kindly share
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>&

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Some options:
- Ignite for Spark 1.5, can deep-store on Cassandra
- Alluxio for all Spark versions, can deep-store on HDFS, GlusterFS...

==> these are best for sharing between jobs

- a shared SparkContext with FAIR scheduling, which does not seem to be
thread safe
- Spark JobServer and NamedRDD, for thread-safe CRUD sharing of RDDs between
Spark jobs
==> these are best for sharing between users
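
For illustration, here is a rough sketch of the Alluxio option (untested; the
paths, hostnames and column names are made up, and the Alluxio client jar must
be on the Spark classpath). Two independent spark-submit jobs exchange a
DataFrame through an alluxio:// path, with Alluxio optionally persisting it to
HDFS or GlusterFS underneath:

import org.apache.spark.sql.SparkSession

// Producer job: materializes a result once on Alluxio.
object ProducerJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("producer").getOrCreate()
    spark.read.parquet("hdfs:///raw/events")            // placeholder source
      .filter("event_type = 'click'")                   // placeholder column
      .write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/shared/clicks")
    spark.stop()
  }
}

// Consumer job: a completely separate spark-submit that reuses the shared
// data at memory speed instead of recomputing it.
object ConsumerJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consumer").getOrCreate()
    val clicks = spark.read.parquet("alluxio://alluxio-master:19998/shared/clicks")
    println(clicks.count())
    spark.stop()
  }
}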

2016-10-27 12:59 GMT+02:00 vincent gromakowski <
vincent.gromakow...@gmail.com>:

> I would prefer sharing the spark context  and using FAIR scheduler for
> user concurrency
>
> Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh"  a
> écrit :
>
>> thanks Vince.
>>
>> So Ignite uses some hash/in-memory indexing.
>>
>> The question is in practice is there much use case to use these two
>> fabrics for sharing RDDs.
>>
>> Remember all RDBMSs do this through shared memory.
>>
>> In layman's term if I have two independent spark-submit running, can they
>> share result set. For example the same tempTable etc?
>>
>> Cheers
>>
>>
>> On 27 October 2016 at 11:44, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> Ignite works only with spark 1.5
>>> Ignite leverage indexes
>>> Alluxio provides tiering
>>> Alluxio easily integrates with underlying FS
>>>
>>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" 
>>> a écrit :
>>>
>>>> Thanks Chanh,
>>>>
>>>> Can it share RDDs.
>>>>
>>>> Personally I have not used either Alluxio or Ignite.
>>>>
>>>>
>>>>1. Are there major differences between these two
>>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>>have any experience you can kindly share
>>>>
>>>> Regards
>>>>
>>>>
>>>>
>>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>>
>>>>> Hi Mich,
>>>>> Alluxio is the good option to go.
>>>>>
>>>>> Regards,
>>>>> Chanh
>>>>>
>>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> There was a mention of using Zeppelin to share RDDs with many users.
>>>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>>>> sure how easy it is going to be changing the result set with different
>>>>> users modifying say sql queries.
>>>>>
>>>>> There is also the idea of caching RDDs with something like Apache
>>>>> Ignite. Has anyone really tried this. Will that work with multiple
>>>>> applications?
>>>>>
>>>>> It looks feasible as RDDs are immutable and so are registered
>>>>> tempTables etc.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
I would prefer sharing the SparkContext and using the FAIR scheduler for user
concurrency.
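
Something along these lines (rough sketch only; the pool names, table name and
queries are made up):

import org.apache.spark.sql.SparkSession
import scala.concurrent.{ExecutionContext, Future}

object SharedContextServer {
  def main(args: Array[String]): Unit = {
    // One long-lived context with FAIR scheduling (the default is FIFO).
    val spark = SparkSession.builder()
      .appName("shared-context")
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()

    implicit val ec: ExecutionContext = ExecutionContext.global

    // Local properties are thread-local, so each concurrent request can be
    // tagged with its own scheduler pool without interfering with the others.
    def runForUser(user: String, query: String): Future[Long] = Future {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", s"pool_$user")
      try spark.sql(query).count()
      finally spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
    }

    // Two users querying the same cached tables concurrently.
    runForUser("alice", "SELECT * FROM shared_table WHERE country = 'FR'")
    runForUser("bob", "SELECT * FROM shared_table")
  }
}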

Le 27 oct. 2016 12:48 PM, "Mich Talebzadeh"  a
écrit :

> thanks Vince.
>
> So Ignite uses some hash/in-memory indexing.
>
> The question is in practice is there much use case to use these two
> fabrics for sharing RDDs.
>
> Remember all RDBMSs do this through shared memory.
>
> In layman's term if I have two independent spark-submit running, can they
> share result set. For example the same tempTable etc?
>
> Cheers
>
>
> On 27 October 2016 at 11:44, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Ignite works only with spark 1.5
>> Ignite leverage indexes
>> Alluxio provides tiering
>> Alluxio easily integrates with underlying FS
>>
>> Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh" 
>> a écrit :
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>
>>>> Hi Mich,
>>>> Alluxio is the good option to go.
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>>
>>>> There was a mention of using Zeppelin to share RDDs with many users.
>>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>>> sure how easy it is going to be changing the result set with different
>>>> users modifying say sql queries.
>>>>
>>>> There is also the idea of caching RDDs with something like Apache
>>>> Ignite. Has anyone really tried this. Will that work with multiple
>>>> applications?
>>>>
>>>> It looks feasible as RDDs are immutable and so are registered
>>>> tempTables etc.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with Spark 1.5.
Ignite leverages indexes.
Alluxio provides tiering.
Alluxio integrates easily with the underlying FS.

Le 27 oct. 2016 12:39 PM, "Mich Talebzadeh"  a
écrit :

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
>any experience you can kindly share
>
> Regards
>
>
>
> On 27 October 2016 at 11:29, Chanh Le  wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users. From
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure
>> how easy it is going to be changing the result set with different users
>> modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache Ignite.
>> Has anyone really tried this. Will that work with multiple applications?
>>
>> It looks feasible as RDDs are immutable and so are registered tempTables
>> etc.
>>
>> Thanks
>>
>>
>>
>>
>


Re: Microbatches length

2016-10-20 Thread vincent gromakowski
You can still implement your own logic, with Akka actors for instance: based
on some threshold the actor can launch a Spark batch job using the same
SparkContext... It's only an idea, I have no real experience with it.
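
Purely to illustrate the idea (untested; the message type, threshold and
processing below are made up), an actor could buffer incoming records and fire
a regular batch job on the long-lived SparkContext every time a size threshold
is reached, giving fixed-size micro-batches instead of fixed-interval ones:

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

case class Record(value: String)   // hypothetical incoming record

class BatchingActor(spark: SparkSession, threshold: Int) extends Actor {
  private val buffer = ArrayBuffer.empty[Record]

  def receive: Receive = {
    case r: Record =>
      buffer += r
      if (buffer.size >= threshold) {
        val batch = buffer.toVector
        buffer.clear()
        // Reuse the same SparkContext for each fixed-size batch.
        val nonEmpty = spark.sparkContext
          .parallelize(batch.map(_.value))
          .filter(_.nonEmpty)
          .count()
        println(s"Processed batch of ${batch.size} records, $nonEmpty non-empty")
      }
  }
}

object FixedSizeBatching extends App {
  val spark = SparkSession.builder()
    .appName("fixed-size-batches")
    .master("local[*]")              // local master just for the sketch
    .getOrCreate()
  val system = ActorSystem("batching")
  val batcher = system.actorOf(Props(new BatchingActor(spark, threshold = 1000)), "batcher")
  // batcher ! Record("...")  // fed by whatever reads the socket
}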

Le 20 oct. 2016 1:31 PM, "Paulo Candido"  a écrit :

> In this case I haven't any alternatives to get microbatches with same
> length? Using another class or any configuration? I'm using socket.
>
> Thank you for attention.
>
> Em qui, 20 de out de 2016 às 09:24, 王贺(Gabriel) 
> escreveu:
>
>> The interval is for time, so you won't get micro-batches in same data
>> size but same time length.
>>
>> Yours sincerely,
>> Gabriel (王贺)
>> Mobile: +86 18621263813
>>
>>
>> On Thu, Oct 20, 2016 at 6:38 PM, pcandido  wrote:
>>
>> Hello folks,
>>
>> I'm using Spark Streaming. My question is simple:
>> The documentation says that microbatches arrive in intervals. The
>> intervals
>> are in real time (minutes, seconds). I want to get microbatches with same
>> length, so, I can configure SS to return microbatches when it reach a
>> determined length?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microbatches-length-tp27927.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>> --
>
> Paulo Cândido
>


Re: mllib model in production web API

2016-10-18 Thread vincent gromakowski
Hi,
Did you try applying the model with Akka instead of Spark?
https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/
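
For illustration, a minimal sketch of the "load once, score in-process"
pattern discussed below (untested; the S3 path is a placeholder, the model
type is just an example, and serving it behind Akka HTTP or another web
framework is left out):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

object ModelServer {
  // A small local SparkContext is only needed to deserialize the model at
  // startup (assumes the S3A connector is on the classpath).
  private val sc = new SparkContext(
    new SparkConf().setAppName("model-loader").setMaster("local[1]"))

  // Loaded exactly once and kept in memory for the lifetime of the service.
  val model: RandomForestModel =
    RandomForestModel.load(sc, "s3a://my-bucket/models/rf")

  // Per-request scoring: mllib's predict(Vector) runs in-process, so no Spark
  // job is scheduled for an individual request.
  def score(features: Array[Double]): Double =
    model.predict(Vectors.dense(features))
}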

Le 18 oct. 2016 5:58 AM, "Aseem Bansal"  a écrit :

> @Nicolas
>
> No, ours is different. We required predictions within 10ms time frame so
> we needed much less latency than that.
>
> Every algorithm has some parameters. Correct? We took the parameters from
> the mllib and used them to create ml package's model. ml package's model's
> prediction time was much faster compared to mllib package's transformation.
> So essentially use spark's distributed machine learning library to train
> the model, save to S3, load from S3 in a different system and then convert
> it into the vector based API model for actual predictions.
>
> There were obviously some transformations involved but we didn't use
> Pipeline for those transformations. Instead, we re-wrote them for the
> Vector based API. I know it's not perfect but if we had used the
> transformations within the pipeline that would make us dependent on spark's
> distributed API and we didn't see how we will really reach our latency
> requirements. Would have been much simpler and more DRY if the
> PipelineModel had a predict method based on vectors and was not distributed.
>
> As you can guess it is very much model-specific and more work. If we
> decide to use another type of Model we will have to add conversion
> code/transformation code for that also. Only if spark exposed a prediction
> method which is as fast as the old machine learning package.
>
> On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long 
> wrote:
>
>> Hi Sean and Aseem,
>>
>> thanks both. A simple thing which sped things up greatly was simply to
>> load our sql (for one record effectively) directly and then convert to a
>> dataframe, rather than using Spark to load it. Sounds stupid, but this took
>> us from > 5 seconds to ~1 second on a very small instance.
>>
>> Aseem: can you explain your solution a bit more? I'm not sure I
>> understand it. At the moment we load our models from S3
>> (RandomForestClassificationModel.load(..) ) and then store that in an
>> object property so that it persists across requests - this is in Scala. Is
>> this essentially what you mean?
>>
>>
>>
>>
>>
>>
>> On 12 October 2016 at 10:52, Aseem Bansal  wrote:
>>
>>> Hi
>>>
>>> Faced a similar issue. Our solution was to load the model, cache it
>>> after converting it to a model from mllib and then use that instead of ml
>>> model.
>>>
>>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen  wrote:
>>>
 I don't believe it will ever scale to spin up a whole distributed job
 to serve one request. You can look possibly at the bits in mllib-local. You
 might do well to export as something like PMML either with Spark's export
 or JPMML and then load it into a web container and score it, without Spark
 (possibly also with JPMML, OpenScoring)


 On Tue, Oct 11, 2016, 17:53 Nicolas Long  wrote:

> Hi all,
>
> so I have a model which has been stored in S3. And I have a Scala
> webapp which for certain requests loads the model and transforms submitted
> data against it.
>
> I'm not sure how to run this quickly on a single instance though. At
> the moment Spark is being bundled up with the web app in an uberjar (sbt
> assembly).
>
> But the process is quite slow. I'm aiming for responses < 1 sec so
> that the webapp can respond quickly to requests. When I look the Spark UI 
> I
> see:
>
> Summary Metrics for 1 Completed Tasks
>
> Metric                      Min      25th percentile  Median   75th percentile  Max
> Duration                    94 ms    94 ms            94 ms    94 ms            94 ms
> Scheduler Delay             0 ms     0 ms             0 ms     0 ms             0 ms
> Task Deserialization Time   3 s      3 s              3 s      3 s              3 s
> GC Time                     2 s      2 s              2 s      2 s              2 s
> Result Serialization Time   0 ms     0 ms             0 ms     0 ms             0 ms
> Getting Result Time         0 ms     0 ms             0 ms     0 ms             0 ms
> Peak Execution Memory       0.0 B    0.0 B            0.0 B    0.0 B            0.0 B
>
> I don't really understand why deserialization and GC should take so
> long when the models are already loaded. Is this evidence I am doing
> something wrong? And where can I get a better understanding on how Spark
> works under the hood here, and how best to do a standalone/bundled jar
> deployment?
>
> Thanks!
>
> Nic
>

>>>
>>
>


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
Instead of (or in addition to) saving results somewhere, you just start a
thriftserver that exposes the Spark tables of the SQLContext (or SparkSession
now). That means you can implement any logic (and maybe use Structured
Streaming) to expose your data. Today, using the standalone thriftserver means
reading data from the persistent store on every query, so if the data model
doesn't fit the query it can be quite slow. What you generally do in a regular
Spark job is load the data and cache it as an in-memory columnar table, which
is quite efficient for any kind of query; the counterpart is that the cache
isn't updated, so you have to implement a reload mechanism, and that isn't
available with the standalone thriftserver. What I propose is to mix the two
worlds: periodically (or incrementally) load data into the Spark table cache
and expose it through the thriftserver. You have to implement the loading
logic yourself; it can range from very simple to very complex depending on
your needs.
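
A rough sketch of that mix (untested; the table name, source path and refresh
period are assumptions, and HiveThriftServer2.startWithContext lives in the
spark-hive-thriftserver module and is not a stable public API):

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedThriftServer {
  @volatile private var previous: Option[DataFrame] = None

  // Load a fresh snapshot into the in-memory columnar cache and swap the temp
  // view that JDBC clients query.
  def reload(spark: SparkSession): Unit = {
    val fresh = spark.read.parquet("hdfs:///warehouse/events").cache()
    fresh.count()                            // force materialization
    fresh.createOrReplaceTempView("events")  // swap what clients see
    previous.foreach(_.unpersist())          // free the previous snapshot
    previous = Some(fresh)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    reload(spark)

    // Expose the session's tables over JDBC/ODBC (default port 10000).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // The rolling cache: full reload every 15 minutes; a delta load would
    // follow the same pattern.
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
      new Runnable { def run(): Unit = reload(spark) }, 15, 15, TimeUnit.MINUTES)

    Thread.currentThread().join()
  }
}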


2016-10-17 19:48 GMT+02:00 Benjamin Kim :

> Is this technique similar to what Kinesis is offering or what Structured
> Streaming is going to have eventually?
>
> Just curious.
>
> Cheers,
> Ben
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> I would suggest to code your own Spark thriftserver which seems to be very
> easy.
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any
> logic because it's a spark job and then start a thrift server on temporary
> table. For example you can query a micro batch rdd from a kafka stream, or
> pre load some tables and implement a rolling cache to periodically update
> the spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues
> doing this but I think Spark community should look at this path: making the
> thriftserver be instantiable in any spark job.
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel :
>
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…) :
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn’t perform well, it’s probably the cleanest solution.
>> I’m not keen on Phoenix… wouldn’t recommend it….
>>
>>
>> The issue is that you’re trying to make HBase, a key/value object store,
>> a Relational Engine… it’s not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases and you may find better performance with Parquet files.
>>
>> One thing missing is the use of secondary indexing and query
>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>> etc …  so your performance will vary.
>>
>> With respect to Tableau… their entire interface in to the big data world
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>> part of your solution, you’re DOA w respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>> another Apache project)
>>
>>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about
>> the Tableau side of things. The reports don’t need to be real-time, and
>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP
>> HBase for ID generation, but after further investigation into the use-case
>> and architecture, the ID generation needs to happen local to the Ad Servers
>> where we generate a unique ID and store it in an ID linking table. Even
>> better, many of the 3rd party services supply this ID. So, data only needs
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>> required. This also goes for the REST endpoints. 3rd party services will
>> hit ours to update our data with no need to read from our data. And, when
>> we want to update their data, we will hit theirs to update their data using
>> a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>>
>> please keep also in mind that Tableau Server has the capabilities to
>> store data in-memory and refresh only when needed the in-memory data. This

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
I would suggest coding your own Spark thriftserver, which seems to be very easy:
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any
logic, because it's a Spark job, and then start a thrift server on temporary
tables. For example you can query a micro-batch RDD from a Kafka stream, or
pre-load some tables and implement a rolling cache that periodically refreshes
the in-memory Spark tables from the persistent store...
It's not part of the public API and I don't yet know what issues this raises,
but I think the Spark community should look at this path: making the
thriftserver instantiable from any Spark job.
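
For reference, a bare-bones version of what that Stack Overflow answer
describes (untested; the path and table name are placeholders, and
startWithContext requires the spark-hive-thriftserver dependency and is not a
guaranteed-stable API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("embedded-thriftserver").getOrCreate()

    // Any DataFrame the job builds can be exposed: here a cached Parquet
    // load, but it could just as well be the output of a Kafka micro-batch.
    val df = spark.read.parquet("hdfs:///data/some_table").cache()
    df.createOrReplaceTempView("some_table")

    // Registered temp tables become queryable over JDBC/ODBC (default port
    // 10000), e.g. from beeline: !connect jdbc:hive2://localhost:10000
    HiveThriftServer2.startWithContext(spark.sqlContext)

    Thread.currentThread().join()
  }
}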

2016-10-17 18:17 GMT+02:00 Michael Segel :

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, it’s probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… it’s not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in an ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This also goes for the REST endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
> This all boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>> provided already a good alternative. However, you should check if it
>> contains a recent version of Hbase and Phoenix. That being said, I just
>> wonder what is the dataflow, data model and the analysis you plan to do.
>> Maybe there are completely different solutions possible. Especially these
>> single inserts, upserts etc. should be avoided as much as possible in the
>> Big Data (analysis) world with any technology, because they do not perform
>> well.
>>
>> Hive with Llap will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using Ignite HDFS in-memory
>> solution. All this does only make sense if you do not use MR as an engine,
>> the right input format (ORC, parquet) and a recent Hive version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim  wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your tableau users so Hive may be a better choice with its rich SQL and
>> ODBC.JDBC connection to Tableau already.
>>
>> I would go for Hive especially the new release will have an in-memory
>> offering

spark on mesos memory sizing with offheap

2016-10-13 Thread vincent gromakowski
Hi,
I am trying to understand how Mesos allocates memory when off-heap is
enabled, but it seems that the framework only takes the heap + 400 MB overhead
into consideration for resource allocation.
Example: spark.executor.memory=3g and spark.memory.offHeap.size=1g ==> Mesos
reports 3.4g allocated for the executor.
Is there any configuration to make Mesos account for both heap and off-heap
memory in its allocation?
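
For reference, this is roughly the configuration I am describing; bumping
spark.mesos.executor.memoryOverhead so that it also covers the off-heap size
is only a guess at a workaround, not a confirmed fix:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "3g")                  // heap; drives the Mesos offer
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")              // not reflected in the offer by itself
  .set("spark.mesos.executor.memoryOverhead", "1408")  // MB: ~384 default + 1024 for off-heap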