Re: Increase batch interval in case of delay

2021-07-01 Thread Mich Talebzadeh
Just looking at this, what is your batch interval when ingesting ~1000
records per second? As a rule of thumb, your capacity planning should account
for twice the normal ingestion rate.

Regarding your point:

"...  Hence, ideally I'd like to increase the number of batches/records
that are being processed after a delay reaches a certain time"

The only way you can do this is by allocating more resources to your
cluster up front so that additional capacity is available when needed.
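
To illustrate (these figures and the Kafka source are assumptions, not taken
from your setup), you would give the job roughly twice the capacity it needs
at the normal rate and let backpressure throttle ingestion when batches lag:

# sized for roughly 2x the normal ~1000 records/sec; adjust to your cluster
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.kafka.maxRatePerPartition=2000 \
  streaming_job.py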

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 14:28, András Kolbert  wrote:

> After a 10-minute delay, processing one 10-minute batch will not take 10
> times longer than processing a 1-minute batch.
>
> That is mainly because of the I/O write operations to HDFS, and also because
> certain users are active across several 1-minute batches; processing such a
> customer only once (rather than in 10 separate batches) saves time.
>
>
>
> On Thu, 1 Jul 2021 at 13:45, Sean Owen  wrote:
>
>> Wouldn't this happen naturally? The large batches would just take a
>> longer time to complete anyway.
>>
>> On Thu, Jul 1, 2021 at 6:32 AM András Kolbert 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a Spark Streaming application which is generally able to process the
>>> data within the given time frame. However, in certain hours the processing
>>> time starts increasing, which causes a delay.
>>>
>>> In my scenario, the number of input records does not increase the
>>> processing time linearly. Hence, ideally I'd like to increase the number of
>>> batches/records that are processed once the delay reaches a certain
>>> threshold.
>>>
>>> Is there a possibility/setting to do so?
>>>
>>> Thanks
>>> Andras
>>>
>>>
>>> [image: image.png]
>>>
>>


[Spark conf setting] spark.sql.parquet.cacheMetadata = true still invalidates cache in memory.

2021-07-01 Thread Parag Mohanty
Hi Team
I am trying to read a parquet file, cache it, do a transformation and then
overwrite the parquet file, all within one session.
However, the first count action doesn't cache the dataframe; it only gets
cached when the transformed dataframe is cached.
Even with spark.sql.parquet.cacheMetadata = true, the write operation still
destroys the cache.
Is this expected? What is the relevance of this conf setting?

We are using PySpark with Spark in cluster mode.
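
For reference, a minimal sketch of the sequence I mean (the path and the
column used in the transformation are placeholders, not our actual job):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
path = "/data/events_parquet"   # placeholder location

df = spark.read.parquet(path)
df.cache()
df.count()                      # first action; I expected this to populate the cache

transformed = df.withColumn("value_doubled", F.col("value") * 2)   # placeholder transformation
transformed.cache()
transformed.count()             # the cache only shows up at this point

# overwrite the same parquet location in the same session; after this the cache is gone
transformed.write.mode("overwrite").parquet(path)
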
Regards
Parag Mohanty


Re: Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
After a 10-minute delay, processing one 10-minute batch will not take 10 times
longer than processing a 1-minute batch.

That is mainly because of the I/O write operations to HDFS, and also because
certain users are active across several 1-minute batches; processing such a
customer only once (rather than in 10 separate batches) saves time.



On Thu, 1 Jul 2021 at 13:45, Sean Owen  wrote:

> Wouldn't this happen naturally? The large batches would just take a longer
> time to complete anyway.
>
> On Thu, Jul 1, 2021 at 6:32 AM András Kolbert 
> wrote:
>
>> Hi,
>>
>> I have a Spark Streaming application which is generally able to process the
>> data within the given time frame. However, in certain hours the processing
>> time starts increasing, which causes a delay.
>>
>> In my scenario, the number of input records does not increase the
>> processing time linearly. Hence, ideally I'd like to increase the number of
>> batches/records that are processed once the delay reaches a certain
>> threshold.
>>
>> Is there a possibility/setting to do so?
>>
>> Thanks
>> Andras
>>
>>
>> [image: image.png]
>>
>


Unsubscribe

2021-07-01 Thread kushagra deep



Re: Increase batch interval in case of delay

2021-07-01 Thread Sean Owen
Wouldn't this happen naturally? The large batches would just take a longer
time to complete anyway.

On Thu, Jul 1, 2021 at 6:32 AM András Kolbert 
wrote:

> Hi,
>
> I have a Spark Streaming application which is generally able to process the
> data within the given time frame. However, in certain hours the processing
> time starts increasing, which causes a delay.
>
> In my scenario, the number of input records does not increase the
> processing time linearly. Hence, ideally I'd like to increase the number of
> batches/records that are processed once the delay reaches a certain
> threshold.
>
> Is there a possibility/setting to do so?
>
> Thanks
> Andras
>
>
> [image: image.png]
>


Re: OutOfMemoryError

2021-07-01 Thread Sean Owen
You need to set driver memory before the driver starts, on the CLI or
however you run your app, not in the app itself. By the time the driver
starts to run your app, its heap is already set.
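
For example (a sketch only; the class and jar names are placeholders, and the
same applies to a PySpark script):

# set driver memory on the command line (or in spark-defaults.conf) so the
# driver JVM is launched with a 24g heap; .config("spark.driver.memory", ...)
# inside the app runs after that JVM already exists and is too late
spark-submit \
  --master "local[*]" \
  --driver-memory 24g \
  --class com.example.MyJob \
  my-job.jar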

On Thu, Jul 1, 2021 at 12:10 AM javaguy Java  wrote:

> Hi,
>
> I'm getting Java OOM errors even though I'm setting my driver memory to
> 24g and I'm executing against local[*]
>
> I was wondering if anyone can give me any insight. The server this job is
> running on has more than enough memory, as does the spark driver.
>
> The final result writes 3 csv files that are 300MB each, so there's no way
> it's coming close to the 24g.
>
> From the OOM alone, I don't know enough about the internals of Spark to tell
> where this is failing or how I should refactor or change anything.
>
> Would appreciate any advice on how I can resolve this.
>
> Thx
>
>
> Parameters here:
>
> val spark = SparkSession
>   .builder
>   .master("local[*]")
>   .appName("OOM")
>   .config("spark.driver.host", "localhost")
>   .config("spark.driver.maxResultSize", "0")
>   .config("spark.sql.caseSensitive", "false")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>   .config("spark.driver.memory", "24g")
>   .getOrCreate()
>
>
> My OOM errors are below:
>
> driver): java.lang.OutOfMemoryError: Java heap space
>   at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
>   at
> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:109)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:110)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/1058609963.apply(Unknown
>  Source)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   
>   
>   
>   
> driver): java.lang.OutOfMemoryError: Java heap space
>   at 
> net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:102)
>   at 
> org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:145)
>   at 
> org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:158)
>   at 
> org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:133)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:122)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/249605067.apply(Unknown
>  Source)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>
>
>


Re: Structuring a PySpark Application

2021-07-01 Thread Kartik Ohri
Hi Mich!

The shell script indeed looks more robust now :D

Yes, the current setup works fine. I am mainly wondering whether it is the
right way to set things up. That is, should I run the program that accepts
requests from the queue independently of Spark and have it invoke the
spark-submit CLI, or is there a better approach?
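
To make the alternative concrete, something like this rough sketch (the queue
client, message format and file names here are made up, not our real code):

import json
import subprocess

def handle_message(raw_message: bytes) -> None:
    """Launch one short-lived Spark application per queued request."""
    request = json.loads(raw_message)          # e.g. {"job_name": "...", "params": {...}}
    subprocess.run(
        [
            "spark-submit",
            "--master", "yarn",
            "--py-files", "jobs.zip",          # zipped job modules, as we do today
            "run_job.py",                      # thin entry point that dispatches to the named job
            request["job_name"],
            json.dumps(request.get("params", {})),
        ],
        check=True,
    )

The trade-off versus the current long-running RequestConsumer would be that
each request pays the Spark application start-up cost, but the consumer itself
stays free of Spark and is isolated from job failures.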

Thanks again.

Regards

On Thu, Jul 1, 2021 at 4:44 PM Mich Talebzadeh 
wrote:

> Hi Kartik,
>
> I parameterized your shell script and tested it on a stub Python file; it
> looks OK and makes the shell script more robust.
>
>
> #!/bin/bash
> set -e
>
> #cd "$(dirname "${BASH_SOURCE[0]}")/../"
>
> pyspark_venv="pyspark_venv"
> source_zip_file="DSBQ.zip"
> [ -d ${pyspark_venv} ] && rm -r -d ${pyspark_venv}
> [ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
> [ -f ${source_zip_file} ] && rm -r -f ${source_zip_file}
>
> python3 -m venv ${pyspark_venv}
> source ${pyspark_venv}/bin/activate
> pip install -r requirements_spark.txt
> pip install venv-pack
> venv-pack -o ${pyspark_venv}.tar.gz
>
> export PYSPARK_DRIVER_PYTHON=python
> export PYSPARK_PYTHON=./${pyspark_venv}/bin/python
> spark-submit \
> --master local[4] \
> --conf "spark.yarn.dist.archives"=${pyspark_venv}.tar.gz#${pyspark_venv} \
> /home/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 30 Jun 2021 at 19:21, Kartik Ohri  wrote:
>
>> Hi Mich!
>>
>> We use this in production but indeed there is much scope for
>> improvements, configuration being one of those :).
>>
>> Yes, we have a private on-premise cluster. We run Spark on YARN (no
>> airflow etc.) which controls the scheduling and use HDFS as a datastore.
>>
>> Regards
>>
>> On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for the details Kartik.
>>>
>>> Let me go through these. The code itself and indentation looks good.
>>>
>>> One minor thing I noticed is that you are not using a yaml file
>>> (config.yml) for your variables; you seem to embed them in your config.py
>>> code. That is what I used to do before :) until a friend advised me to
>>> initialise them in yaml and read them from the Python file. However, I
>>> guess that is a matter of personal style.
>>>
>>> Overall looking neat. I believe you are running all these on-premises
>>> and not using airflow or composer for your scheduling.
>>>
>>>
>>> Cheers
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri 
>>> wrote:
>>>
 Hi Mich!

 Thanks for the reply.

 The zip file contains all of the spark related code, particularly contents
 of this folder. The requirements_spark.txt is contained in the project and it
 contains the non-spark dependencies of the python code. The tar.gz file is
 created according to the Pyspark docs for dependency management. The
 spark.yarn.dist.archives setting also comes from there.

 This is the python file invoked by spark-submit to start the "RequestConsumer".

 Regards,
 Kartik


 On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Kartik,
>
> Can you explain how you create your zip file? Does that include all in
> your top project directory as per PyCharm etc.
>
> The rest looks Ok as you are creating a Python Virtual Env
>
> python3 -m venv pyspark_venv
> source pyspark_venv/bin/activate
>
> How do you create that requirements_spark.txt file?
>
> pip install -r requirements_spark.txt
> pip install venv-pack
>
>
> Where is this gz file used?
> venv-pack -o pyspark_venv.tar.gz
>
> Because I am not clear 

Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
Hi,

I have a Spark Streaming application which is generally able to process the
data within the given time frame. However, in certain hours the processing
time starts increasing, which causes a delay.

In my scenario, the number of input records does not increase the processing
time linearly. Hence, ideally I'd like to increase the number of
batches/records that are processed once the delay reaches a certain threshold.

Is there a possibility/setting to do so?

Thanks
Andras


[image: image.png]


Re: Structuring a PySpark Application

2021-07-01 Thread Mich Talebzadeh
Hi Kartik,

I parameterized your shell script and tested it on a stub Python file; it
looks OK and makes the shell script more robust.


#!/bin/bash
set -e

#cd "$(dirname "${BASH_SOURCE[0]}")/../"

pyspark_venv="pyspark_venv"
source_zip_file="DSBQ.zip"
[ -d ${pyspark_venv} ] && rm -r -d ${pyspark_venv}
[ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
[ -f ${source_zip_file} ] && rm -r -f ${source_zip_file}

python3 -m venv ${pyspark_venv}
source ${pyspark_venv}/bin/activate
pip install -r requirements_spark.txt
pip install venv-pack
venv-pack -o ${pyspark_venv}.tar.gz

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./${pyspark_venv}/bin/python
spark-submit \
--master local[4] \
--conf "spark.yarn.dist.archives"=${pyspark_venv}.tar.gz#${pyspark_venv} \
/home/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Jun 2021 at 19:21, Kartik Ohri  wrote:

> Hi Mich!
>
> We use this in production but indeed there is much scope for improvements,
> configuration being one of those :).
>
> Yes, we have a private on-premise cluster. We run Spark on YARN (no
> airflow etc.) which controls the scheduling and use HDFS as a datastore.
>
> Regards
>
> On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks for the details Kartik.
>>
>> Let me go through these. The code itself and indentation looks good.
>>
>> One minor thing I noticed is that you are not using a yaml file
>> (config.yml) for your variables; you seem to embed them in your config.py
>> code. That is what I used to do before :) until a friend advised me to
>> initialise them in yaml and read them from the Python file. However, I
>> guess that is a matter of personal style.
>>
>> Overall looking neat. I believe you are running all these on-premises and
>> not using airflow or composer for your scheduling.
>>
>>
>> Cheers
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri  wrote:
>>
>>> Hi Mich!
>>>
>>> Thanks for the reply.
>>>
>>> The zip file contains all of the spark related code, particularly contents
>>> of this folder. The requirements_spark.txt is contained in the project and
>>> it contains the non-spark dependencies of the python code. The tar.gz file
>>> is created according to the Pyspark docs for dependency management. The
>>> spark.yarn.dist.archives setting also comes from there.
>>>
>>> This is the python file invoked by spark-submit to start the
>>> "RequestConsumer".
>>>
>>> Regards,
>>> Kartik
>>>
>>>
>>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Kartik,

 Can you explain how you create your zip file? Does that include all in
 your top project directory as per PyCharm etc.

 The rest looks Ok as you are creating a Python Virtual Env

 python3 -m venv pyspark_venv
 source pyspark_venv/bin/activate

 How do you create that requirements_spark.txt file?

 pip install -r requirements_spark.txt
 pip install venv-pack


 Where is this gz file used?
 venv-pack -o pyspark_venv.tar.gz

 Because I am not clear about below line

 --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \

 It helps if you walk us through the shell itself for clarification HTH,

 Mich




view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. 

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh,

You need to check the latest compatibility matrix, i.e. which Spark version
can successfully work as the Hive execution engine.

This is from my old settings file, using spark-1.3.1 as the execution engine:

set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6;
--set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn-client;
set hive.execution.engine=spark;


Hive is great as a data warehouse, but its default MapReduce execution engine
is Jurassic Park.

On the other hand, Spark has a performant built-in API for Hive. Otherwise,
you can connect to Hive on a remote cluster through JDBC.
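
For the JDBC route, a rough sketch (host, port, credentials and table are
placeholders, and the Hive JDBC driver jar needs to be on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read a remote Hive table over HiveServer2 via the generic JDBC source
jdbc_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .option("dbtable", "test.randomDataPy")
    .option("user", "hive_user")
    .option("password", "changeme")
    .load())
jdbc_df.show()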

For the built-in API, in Python you can do

from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext


And use it like below


# a Hive-enabled session and the target table name are needed for the
# snippet to run on its own
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
fullyQualifiedTableName = "test.randomDataPy"

sqltext = ""
if spark.sql("SHOW TABLES IN test LIKE 'randomDataPy'").count() == 1:
    # table already exists; just report its row count
    rows = spark.sql(f"""SELECT COUNT(1) FROM
    {fullyQualifiedTableName}""").collect()[0][0]
    print("number of rows is", rows)
else:
    print("\nTable test.randomDataPy does not exist, creating table ")
    sqltext = """
    CREATE TABLE test.randomDataPy(
      ID INT
    , CLUSTERED INT
    , SCATTERED INT
    , RANDOMISED INT
    , RANDOM_STRING VARCHAR(50)
    , SMALL_VC VARCHAR(50)
    , PADDING  VARCHAR(4000)
    )
    STORED AS PARQUET
    """
    spark.sql(sqltext)

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 11:50, Pralabh Kumar  wrote:

> Hi mich
>
> Thanks for replying. Your answer really helps. The comparison was done in
> 2016; I would like to know the latest comparison with Spark 3.0.
>
> Also, what you are suggesting is to migrate the queries to Spark, i.e.
> HiveContext rather than Hive on Spark, which is what Facebook also did.
> Is that understanding correct?
>
> Regards
> Pralabh
>
> On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
> wrote:
>
>> Hi Prahabh,
>>
>> This question has been asked before :)
>>
>> Few years ago (late 2016),  I made a presentation on running Hive Queries
>> on the Spark execution engine for Hortonworks.
>>
>>
>> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>>
>> The issue you will face will be compatibility problems with versions of
>> Hive and Spark.
>>
>> My suggestion would be to use Spark as a massively parallel processing
>> engine and Hive as a storage layer. However, you need to test what can and
>> cannot be migrated.
>>
>> HTH
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar 
>> wrote:
>>
>>> Hi Dev
>>>
>>> I have thousands of legacy Hive queries. As part of a plan to move to
>>> Spark, we are planning to migrate the Hive queries to Spark. There are
>>> two approaches:
>>>
>>>
>>>1. One is Hive on Spark, which is similar to changing the
>>>execution engine in Hive queries, like Tez.
>>>2. The other is migrating the Hive queries to HiveContext/Spark SQL,
>>>an approach used by Facebook and presented at a Spark conference:
>>>
>>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>>.
>>>
>>>
>>> Can you please guide me on which option to go for? I am personally
>>> inclined to go for option 2, which also allows the use of the latest Spark.
>>>
>>> Please help me with this, as there are not many comparisons available
>>> online that keep Spark 3.0 in perspective.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>>
>>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi mich

Thanks for replying. Your answer really helps. The comparison was done in
2016; I would like to know the latest comparison with Spark 3.0.

Also, what you are suggesting is to migrate the queries to Spark, i.e.
HiveContext rather than Hive on Spark, which is what Facebook also did.
Is that understanding correct?

Regards
Pralabh

On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, 
wrote:

> Hi Prahabh,
>
> This question has been asked before :)
>
> Few years ago (late 2016),  I made a presentation on running Hive Queries
> on the Spark execution engine for Hortonworks.
>
>
> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>
> The issue you will face will be compatibility problems with versions of
> Hive and Spark.
>
> My suggestion would be to use Spark as a massively parallel processing
> engine and Hive as a storage layer. However, you need to test what can and
> cannot be migrated.
>
> HTH
>
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:
>
>> Hi Dev
>>
>> I have thousands of legacy Hive queries. As part of a plan to move to
>> Spark, we are planning to migrate the Hive queries to Spark. There are
>> two approaches:
>>
>>
>>1. One is Hive on Spark, which is similar to changing the execution
>>engine in Hive queries, like Tez.
>>2. The other is migrating the Hive queries to HiveContext/Spark SQL, an
>>approach used by Facebook and presented at a Spark conference:
>>
>> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>.
>>
>>
>> Can you please guide me on which option to go for? I am personally inclined
>> to go for option 2, which also allows the use of the latest Spark.
>>
>> Please help me with this, as there are not many comparisons available
>> online that keep Spark 3.0 in perspective.
>>
>> Regards
>> Pralabh Kumar
>>
>>
>>


Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Prahabh,

This question has been asked before :)

A few years ago (late 2016), I made a presentation on running Hive queries
on the Spark execution engine for Hortonworks.

https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations

The issue you will face is compatibility problems between versions of
Hive and Spark.

My suggestion would be to use Spark as a massively parallel processing engine
and Hive as a storage layer. However, you need to test what can and cannot be
migrated.

HTH


Mich


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar  wrote:

> Hi Dev
>
> I have thousands of legacy Hive queries. As part of a plan to move to Spark,
> we are planning to migrate the Hive queries to Spark. There are two
> approaches:
>
>
>1. One is Hive on Spark, which is similar to changing the execution
>engine in Hive queries, like Tez.
>2. The other is migrating the Hive queries to HiveContext/Spark SQL, an
>approach used by Facebook and presented at a Spark conference:
>
> https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>.
>
>
> Can you please guide me on which option to go for? I am personally inclined
> to go for option 2, which also allows the use of the latest Spark.
>
> Please help me with this, as there are not many comparisons available online
> that keep Spark 3.0 in perspective.
>
> Regards
> Pralabh Kumar
>
>
>


Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Dev

I have thousands of legacy Hive queries. As part of a plan to move to Spark,
we are planning to migrate the Hive queries to Spark. There are two
approaches:


   1. One is Hive on Spark, which is similar to changing the execution
   engine in Hive queries, like Tez.
   2. The other is migrating the Hive queries to HiveContext/Spark SQL, an
   approach used by Facebook and presented at a Spark conference:
https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
   .


Can you please guide me on which option to go for? I am personally inclined
to go for option 2, which also allows the use of the latest Spark.
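
For what it's worth, my understanding of what option 2 looks like in practice
(the query and table names below are just placeholders):

from pyspark.sql import SparkSession

# enableHiveSupport() points Spark SQL at the existing Hive metastore,
# so legacy HiveQL can be submitted largely unchanged through spark.sql()
spark = (SparkSession.builder
         .appName("legacy-hive-query")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT dept, COUNT(*) AS cnt FROM sales.orders GROUP BY dept")
df.show()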

Please help me with this, as there are not many comparisons available online
that keep Spark 3.0 in perspective.

Regards
Pralabh Kumar


Re: Structuring a PySpark Application

2021-07-01 Thread Kartik Ohri
Hi Gourav,

Thanks for the suggestion, I'll check it out.

Regards,
Kartik

On Thu, Jul 1, 2021 at 5:38 AM Gourav Sengupta 
wrote:

> Hi,
>
> I think that reading Matei Zaharia's book "Spark: The Definitive Guide"
> will be a good starting point.
>
> Regards,
> Gourav Sengupta
>
> On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri 
> wrote:
>
>> Hi all!
>>
>> I am working on a Pyspark application and would like suggestions on how
>> it should be structured.
>>
>> We have a number of possible jobs, organized in modules. There is also a
>> "RequestConsumer" class which consumes from a messaging queue. Each message
>> contains the name of the job to invoke and the arguments to be passed to it.
>> Messages are put into the message queue by cronjobs, manually, etc.
>>
>> We submit a zip file containing all python files to a Spark cluster running
>> on YARN and ask it to run the RequestConsumer. This is the exact
>> spark-submit command, for the interested. The results of the jobs are
>> collected by the request consumer and pushed into another queue.
>>
>> My question is whether this type of structure makes sense. Should the
>> Request Consumer instead run independently of Spark and invoke spark-submit
>> scripts when it needs to trigger a job? Or is there another recommendation?
>>
>> Thank you all in advance for taking the time to read this email and
>> helping.
>>
>> Regards,
>> Kartik.
>>
>>
>>