Hi Palash

So you have a PySpark application running on Spark 2.0.
You have Python scripts dropping CSV files on HDFS,
and then you have two Spark jobs:
- 1 loads the expected hour's data (pls explain: how many files on average?)
- 1 loads the delayed data (pls explain: how many files on average?)

Do these scripts run continuously (i.e. they have a while loop), or do you
kick them off via a job scheduler on an hourly basis?
Do these scripts run on a cluster?


So, at T1 there are 3 CSV files in HDFS. Your job starts up, loads all 3
of them, does aggregation etc., then populates Mongo.
At T+1 hour there are now 5 files in HDFS (the previous 3 plus 2
additional; presumably these files are not deleted). Does your job now load
all 5 files, do the aggregation, and store the data in MongoDB? Or does
your job at T+1 only load the deltas (the two new CSV files which appeared
at T+1)?
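
For illustration, the two options might look like this in PySpark (the
paths here are made up, just to show the shape):

    # Option A: reload everything present in HDFS each hour
    df = spark.read.csv("hdfs:///data/csv/*.csv", header=True)

    # Option B: load only the delta, assuming files land in hourly folders
    df = spark.read.csv("hdfs:///data/csv/2016-12-30/14/*.csv", header=True)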

You said before that simply parsing CSV files via Spark in a standalone app
works fine. So what you can try is to do exactly the same processing you
are doing now, but instead of loading the CSV files from HDFS, load them
from a local directory and see if the problem persists (this is just to
exclude any issues with loading HDFS data).
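
Something like this, keeping the rest of the pipeline identical (the paths
are hypothetical):

    # instead of reading from HDFS...
    df = spark.read.csv("hdfs:///data/csv/*.csv", header=True)
    # ...read the same files copied to the local filesystem
    df = spark.read.csv("file:///home/palash/data/csv/*.csv", header=True)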

hth
   Marco

On Fri, Dec 30, 2016 at 2:02 PM, Palash Gupta <spline_pal...@yahoo.com>
wrote:

> Hi Marco & Ayan,
>
> I now have a clearer idea of what Marco means by "reduce". I will use it
> to dig down.
>
> Let me answer to your queries:
>
> When you see the broadcast errors, does your job terminate?
> Palash>> Yes, it terminated the app.
>
> Or are you assuming that something is wrong just because you see the
> message in the logs?
>
> Palash>> No, it terminated at the very first step of the Spark processing
> (in my case, loading CSV from HDFS).
>
> Plus... wrt the logic... who writes the CSV? With what frequency?
> Palash>> We parse XML files using Python (not in Spark scope), make CSVs,
> and put them in HDFS.
>
> Does the app run all the time loading CSV from Hadoop?
>
> Palash>> Every hour, two separate PySpark apps are running:
> 1. One loads the current expected hour's data, prepares KPIs, does
> aggregation, and loads into MongoDB.
> 2. The same operations run for the delayed hour's data.
>
>
> Are you using Spark Streaming?
> Palash>> No
>
> Does the app run fine with an older version of Spark (1.6)?
> Palash>> I didn't test with Spark 1.6. My app has been running well since
> I stopped the second app (delayed data loading) two days ago. Even before
> that, both were running well in most cases, failing only a few times...
>
>
>
> On Fri, 30 Dec, 2016 at 4:57 pm, Marco Mistroni
> <mmistr...@gmail.com> wrote:
> Correct. I mean reduce the functionality.
> Uhm, I realised I didn't ask you a fundamental question. When you see the
> broadcast errors, does your job terminate? Or are you assuming that
> something is wrong just because you see the message in the logs?
> Plus... wrt the logic... who writes the CSV? With what frequency?
> Does the app run all the time loading CSV from Hadoop?
> Are you using Spark Streaming?
> Does the app run fine with an older version of Spark (1.6)?
> Hth
>
> On 30 Dec 2016 12:44 pm, "ayan guha" <guha.a...@gmail.com> wrote:
>
>> @Palash: I think what Marco meant by "reduce functionality" is to reduce
>> the scope of your application's functionality so that you can isolate the
>> issue to certain part(s) of the app... I do not think he meant the
>> "reduce" operation :)
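>>
>> As a minimal sketch of that idea (paths and column names below are
>> hypothetical, just to illustrate the staging):
>>
>>     from pyspark.sql import SparkSession
>>
>>     spark = SparkSession.builder.appName("isolate-stages").getOrCreate()
>>
>>     # Stage 1: only load the CSVs and force an action. If the broadcast
>>     # error already appears here, the problem is in ingestion.
>>     df = spark.read.csv("hdfs:///data/hour=14/*.csv", header=True)
>>     print(df.count())
>>
>>     # Stage 2: re-add joins/aggregations once stage 1 is stable, e.g.
>>     # kpi = df.groupBy("cell_id").count()
>>     # print(kpi.count())
>>
>>     # Stage 3: only then re-add the MongoDB write.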
>>
>> On Fri, Dec 30, 2016 at 9:26 PM, Palash Gupta <spline_pal...@yahoo.com.
>> invalid> wrote:
>>
>>> Hi Marco,
>>>
>>> All of your suggestions so far are highly appreciated. I will apply them
>>> in my code and let you know.
>>>
>>> Let me answer your query:
>>>
>>> What does your program do?
>>> Palash>> Each hour I load many CSV files and then make some KPI(s) out of
>>> them. Finally I do some aggregation and insert into MongoDB from Spark.
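>>>
>>> In shape, each hourly job looks roughly like this (the table, column, and
>>> connector option names are simplified placeholders, not my real code):
>>>
>>>     df = spark.read.csv("hdfs:///data/csv/*.csv", header=True)
>>>     df.createOrReplaceTempView("raw")
>>>     kpi = spark.sql(
>>>         "SELECT cell_id, avg(value) AS kpi FROM raw GROUP BY cell_id")
>>>     kpi.write.format("com.mongodb.spark.sql.DefaultSource") \
>>>         .option("uri", "mongodb://host/db.kpi").mode("append").save()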
>>>
>>> you say it runs for 2-3 hours, what is the logic? just processing a huge
>>> amount of data? doing ML ?
>>> Palash>> Yes, you are right: whatever I'm processing should not take much
>>> time. Initially my processing took only 5 minutes, when I was using all
>>> cores and running only one application. When I created more separate Spark
>>> applications, to handle delayed data loading and to implement more use
>>> cases in parallel, I started facing the error randomly. And because
>>> resources are now distributed among four Spark applications running in
>>> parallel, some tasks take longer than usual. But it still should not take
>>> 2-3 hours...
>>>
>>> Currently all the applications are running in a development environment
>>> where we have only a two-VM cluster; I will migrate to the production
>>> platform next week. I will let you know if there is any improvement there.
>>>
>>> I'd say break down your application: reduce functionality, run, and see
>>> the outcome; then add more functionality, run, and see again.
>>>
>>> Palash>> Marco, as I'm not very good with Spark, it would be helpful for
>>> me if you provided some example of reducing functionality, because I'm
>>> using Spark data frames, joining data frames, and using SQL statements to
>>> manipulate KPI(s). How could I apply "reduce functionality" here?
>>>
>>>
>>>
>>> Thanks & Best Regards,
>>> Palash Gupta
>>>
>>>
>>> ------------------------------
>>> *From:* Marco Mistroni <mmistr...@gmail.com>
>>> *To:* "spline_pal...@yahoo.com" <spline_pal...@yahoo.com>
>>> *Cc:* User <user@spark.apache.org>
>>> *Sent:* Thursday, December 29, 2016 11:28 PM
>>>
>>> *Subject:* Re: [TorrentBroadcast] Pyspark Application terminated saying
>>> "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"
>>>
>>> Hello,
>>> no, sorry, I don't have any further insight into that... I have seen
>>> similar errors, but for completely different issues, and in most of the
>>> cases it had to do with my data or my processing rather than Spark itself.
>>> What does your program do? You say it runs for 2-3 hours; what is the
>>> logic? Just processing a huge amount of data? Doing ML?
>>> I'd say break down your application: reduce functionality, run, and see
>>> the outcome; then add more functionality, run, and see again.
>>> I found myself doing these kinds of things when I got errors in my Spark
>>> apps.
>>>
>>> To get concrete help you will have to trim the code down to a few lines
>>> that reproduce the error. That will be a great start.
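>>>
>>> For example, a minimal standalone script along these lines (the file path
>>> is hypothetical) would let others run it and comment:
>>>
>>>     from pyspark.sql import SparkSession
>>>
>>>     spark = SparkSession.builder.appName("broadcast-repro").getOrCreate()
>>>     df = spark.read.csv("hdfs:///tmp/sample/*.csv", header=True)
>>>     # the smallest action that still triggers the failure
>>>     print(df.count())
>>>     spark.stop()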
>>>
>>> Sorry for not being of much help
>>>
>>> hth
>>>  marco
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Dec 29, 2016 at 12:00 PM, Palash Gupta <spline_pal...@yahoo.com>
>>> wrote:
>>>
>>> Hi Marco,
>>>
>>> Thanks for your response.
>>>
>>> Yes, I tested it before and am able to load from the Linux filesystem,
>>> and it also sometimes has a similar issue.
>>>
>>> However, in both cases (whether from Hadoop or the Linux filesystem),
>>> this error comes in some specific scenarios, as per my observations:
>>>
>>> 1. When two separate Spark applications are initiated in parallel from
>>> one driver (not all the time, just sometimes).
>>> 2. If one Spark job has been running for longer than the expected hour,
>>> say 2-3 hours, the second application terminates giving the error.
>>>
>>> To debug the problem, it would help me if you could share some possible
>>> reasons why the "failed to get broadcast" error may occur.
>>>
>>> Or, if you need more logs, I can share them.
>>>
>>> Thanks again, Spark User Group.
>>>
>>> Best Regards
>>> Palash Gupta
>>>
>>>
>>>
>>>
>>> On Thu, 29 Dec, 2016 at 2:57 pm, Marco Mistroni
>>> <mmistr...@gmail.com> wrote:
>>> Hi,
>>> pls try to read a CSV from the local filesystem instead of Hadoop. If you
>>> can read it successfully, then your Hadoop file is the issue and you can
>>> start debugging from there.
>>> Hth
>>>
>>> On 29 Dec 2016 6:26 am, "Palash Gupta" <spline_pal...@yahoo.com.
>>> invalid> wrote:
>>>
>>> Hi Apache Spark User team,
>>>
>>>
>>>
>>> Greetings!
>>>
>>> I started developing an application on Apache Hadoop and Spark using
>>> Python. My pyspark application randomly terminated saying "Failed to get
>>> broadcast_1*", and I have been searching for suggestions and support on
>>> Stack Overflow at: Failed to get broadcast_1_piece0 of broadcast_1 in
>>> pyspark application
>>> <http://stackoverflow.com/questions/41236661/failed-to-get-broadcast-1-piece0-of-broadcast-1-in-pyspark-application>
>>>
>>> Could you please suggest how to register myself in the Apache user list,
>>> or how I can get suggestions or support to debug the problem I am facing?
>>>
>>> Your response will be highly appreciated.
>>>
>>>
>>>
>>> Thanks & Best Regards,
>>> Engr. Palash Gupta
>>> WhatsApp/Viber: +8801817181502
>>> Skype: palash2494
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
