The performance I mentioned here is all local (on my laptop).
I have tried the same thing on a cluster (Elastic MapReduce) and have seen
even worse results.

Is there a way this can be done efficiently? Please share if any of you have
tried it.
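What I am planning to try next is to stay in the DataFrame writer instead of
dropping down to the RDD API: build the delimited line as one string column and
write it with the text writer. A rough, untested sketch of the idea (the pipe
delimiter and output path are just placeholders):

from pyspark.sql.functions import concat_ws

# Build one delimited string column out of all the columns, so the write can
# stay in the DataFrame API instead of going through myDF.rdd.
# (Non-string columns may need an explicit cast to string first.)
delimited = myDF.select(concat_ws("|", *myDF.columns).alias("value"))

# coalesce(1) still produces a single output file, but it avoids the full
# shuffle that repartition(1) triggers.
delimited.coalesce(1).write.text("output/text")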


On Wednesday, September 14, 2016, Jörn Franke <jornfra...@gmail.com> wrote:

> It could be that by using the rdd it converts the data from the internal
> format to Java objects (-> much more memory is needed), which may lead to
> spill over to disk. This conversion takes a lot of time. Then, you need to
> transfer these Java objects via network to one single node (repartition
> ...), which on a 1 Gbit network takes around 25 seconds for 3 GB under
> optimal conditions (no other transfers happening at the same time, jumbo
> frames activated, etc.), and possibly longer, since the serialized Java
> objects may be larger than the 3 GB on disk. On the destination node we may
> again have spill over to disk. Then you store the data to a single disk
> (potentially multiple if you have and use HDFS), which also takes time
> (assuming that no other process uses this disk).
>
> Btw spark-csv can be used with dataframes, and it supports delimiters other
> than a comma.
> As said, other options are compression, avoiding repartitioning (to avoid
> the network transfer), avoiding spilling to disk (provide enough memory in
> YARN etc.), and increasing network bandwidth ...
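> For example, with the spark-csv package something along these lines should
> keep the write in the dataframe API and compress the output (an untested
> sketch; the delimiter, codec and path are only examples):
>
> myDF.write.format("com.databricks.spark.csv") \
>     .option("delimiter", "|") \
>     .option("codec", "gzip") \
>     .save("output/delimited")
>
> This avoids the rdd conversion, skipping repartition(1) avoids the shuffle
> to a single node, and compression reduces what has to be written to disk.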
>
> On 14 Sep 2016, at 14:22, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
> These are not CSV files; they are UTF-8 files with a specific delimiter.
> I tried this out with a file (3 GB):
>
> myDF.write.json("output/myJson")
> Time taken: approximately 60 secs.
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken: 160 secs.
>
> That is where I am concerned: the time to write a text file is nearly three
> times that of the JSON write.
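> To separate how much of that 160 secs comes from the repartition(1) shuffle
> versus the rdd conversion itself, I may time the two pieces independently
> (a rough sketch on the same 3 GB file; output paths are just placeholders):
>
> # conversion to rdd only, no shuffle down to a single partition
> myDF.rdd.saveAsTextFile("output/text_norepart")
>
> # single output file, but staying in the dataframe writer
> myDF.repartition(1).write.json("output/json_single")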
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> These intermediate files, what sort of files are they? Are they CSV-type
>> files?
>>
>> I agree that a DF is more efficient than an RDD as it follows a tabular
>> format (I assume that is what you mean by "columnar" format). So if you
>> read these files in a batch process, you may not worry too much about
>> execution time?
>>
>> Saving as a text file is simply a one-to-one mapping from your DF to HDFS.
>> I think it is pretty efficient.
>>
>> For myself, I would do something like below
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <
>> patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>>    - I am writing a batch application using Spark SQL and Dataframes.
>>>    This application has a bunch of file joins, and there are intermediate
>>>    points where I need to drop a file for downstream applications to
>>>    consume.
>>>    - The problem is that all these downstream applications are still on
>>>    legacy systems, so they still require us to drop them a text file. As
>>>    you may know, a Dataframe stores its data in a columnar format
>>>    internally.
>>>
>>> The only way I have found to do this, and it looks awfully slow, is this:
>>>
>>> myDF = sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> P.S.: The other workaround would be to use RDDs for all my operations.
>>> But I am wary of using them, as the documentation says Dataframes are way
>>> faster because of the Catalyst engine running behind the scenes.
>>>
>>> Please share your suggestions if any of you have tried something similar.
>>>
>>
>>
>
> --
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
>
>

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424
