Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Gourav Sengupta
I genuinely do not think that using Scala for Spark requires us to be experts
in Scala. There is in fact a tutorial called "Just Enough Scala for Spark"
which, even with my IQ, does not take more than 40 minutes to go through.
Also, the syntax of Scala is almost always similar to that of Python.

Data processing is much more amenable to functional thinking, and therefore
Scala suits it best; Spark itself is also written in Scala.
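
As a rough illustration (a minimal PySpark sketch with made-up data; the
equivalent Scala version reads almost the same):

# A minimal sketch (hypothetical data): the classic functional word count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("functional-sketch").getOrCreate()

lines = spark.sparkContext.parallelize(["to be or not to be",
                                        "that is the question"])
counts = (lines
          .flatMap(lambda line: line.split(" "))  # split each line into words
          .map(lambda word: (word, 1))            # pair each word with a count
          .reduceByKey(lambda a, b: a + b))       # sum the counts per word
print(counts.collect())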

Regards,
Gourav

On Mon, Oct 29, 2018 at 11:33 PM kant kodali  wrote:

> When most people compare two different programming languages, 99% of the
> time it seems to boil down to syntactic sugar.
>
> Performance-wise, I doubt Scala is ever faster than Java, given that Scala
> uses the heap more heavily than Java. I had also written some pointless
> micro-benchmarking code (random string generation, hash computations,
> etc.) in Java, Scala, and Golang, and Java outperformed both Scala and
> Golang on many occasions.
>
> Now that Java 11 has been released, things seem to get even better, given
> that the startup time is also very low.
>
> I am happy to change my view as long as I can see some code and benchmarks!
>
>
>
> On Mon, Oct 29, 2018 at 1:58 PM Jean Georges Perrin  wrote:
>
>> I did not see anything, but I am curious if you find something.
>>
>> I think one of the big benefits of using Java for data engineering in the
>> context of Spark is that you do not have to train a lot of your team in
>> Scala. Now, if you want to do data science, Java is probably not the best
>> tool yet...
>>
>> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
>>
>> Hello,
>> is there a "performance" difference when using Java or Scala for Apache
>> Spark?
>>
>> I understand there are other obvious differences (less code with Scala,
>> easier to focus on logic, etc.), but with respect to performance I think
>> there would not be much of a difference since both of them are JVM based.
>> Please let me know if this is not the case.
>>
>> thanks!
>>
>>
>>


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread kant kodali
When most people compare two different programming languages, 99% of the
time it seems to boil down to syntactic sugar.

Performance-wise, I doubt Scala is ever faster than Java, given that Scala
uses the heap more heavily than Java. I had also written some pointless
micro-benchmarking code (random string generation, hash computations, etc.)
in Java, Scala, and Golang, and Java outperformed both Scala and Golang on
many occasions.

Now that Java 11 has been released, things seem to get even better, given
that the startup time is also very low.

I am happy to change my view as long as I can see some code and benchmarks!



On Mon, Oct 29, 2018 at 1:58 PM Jean Georges Perrin  wrote:

> I did not see anything, but I am curious if you find something.
>
> I think one of the big benefits of using Java for data engineering in the
> context of Spark is that you do not have to train a lot of your team in
> Scala. Now, if you want to do data science, Java is probably not the best
> tool yet...
>
> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
>
> Hello,
> is there a "performance" difference when using Java or Scala for Apache
> Spark?
>
> I understand there are other obvious differences (less code with Scala,
> easier to focus on logic, etc.), but with respect to performance I think
> there would not be much of a difference since both of them are JVM based.
> Please let me know if this is not the case.
>
> thanks!
>
>
>


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
I did not see anything, but I am curious if you find something.

I think one of the big benefits of using Java for data engineering in the
context of Spark is that you do not have to train a lot of your team in
Scala. Now, if you want to do data science, Java is probably not the best
tool yet...

> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
> 
> Hello,
> is there a "performance" difference when using Java or Scala for Apache
> Spark?
>
> I understand there are other obvious differences (less code with Scala,
> easier to focus on logic, etc.), but with respect to performance I think
> there would not be much of a difference since both of them are JVM based.
> Please let me know if this is not the case.
> 
> thanks!



Re: dremel paper example schema

2018-10-29 Thread Debasish Das
The open-source implementation of Dremel is Parquet!

On Mon, Oct 29, 2018, 8:42 AM Gourav Sengupta 
wrote:

> Hi,
>
> Why not just use Dremel?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev <
> lubomir.chorbadj...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to reproduce the example from the Dremel paper
>> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark
>> using pyspark, and I wonder if it is possible at all.
>>
>> Trying to follow the paper example as closely as possible, I created this
>> document type:
>>
>> from pyspark.sql.types import *
>>
>> links_type = StructType([
>>     StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>>     StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>>                 nullable=False),
>> ])
>>
>> language_type = StructType([
>>     StructField("Code", StringType(), nullable=False),
>>     StructField("Country", StringType())
>> ])
>>
>> names_type = StructType([
>>     StructField("Language", ArrayType(language_type, containsNull=False)),
>>     StructField("Url", StringType()),
>> ])
>>
>> document_type = StructType([
>>     StructField("DocId", LongType(), nullable=False),
>>     StructField("Links", links_type, nullable=True),
>>     StructField("Name", ArrayType(names_type, containsNull=False))
>> ])
>>
>> But when I store data in parquet using this type, the resulting parquet
>> schema is different from the one described in the paper:
>>
>> message spark_schema {
>>   required int64 DocId;
>>   optional group Links {
>> required group Backward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>> required group Forward (LIST) {
>>   repeated group list {
>> required int32 element;
>>   }
>> }
>>   }
>>   optional group Name (LIST) {
>> repeated group list {
>>   required group element {
>> optional group Language (LIST) {
>>   repeated group list {
>> required group element {
>>   required binary Code (UTF8);
>>   optional binary Country (UTF8);
>> }
>>   }
>> }
>> optional binary Url (UTF8);
>>   }
>> }
>>   }
>> }
>>
>> Moreover, if I create a parquet file with the schema described in the
>> Dremel paper using the Apache Parquet Java API and try to read it into
>> Apache Spark, I get an exception:
>>
>> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
>> while reading parquet files. One possible cause: Parquet column cannot be
>> converted in the corresponding files
>>
>> Is it possible to create the example schema described in the Dremel paper
>> using Apache Spark, and what is the correct approach to build this
>> example?
>>
>> Regards,
>> Lubomir Chorbadjiev
>>
>>
>>
>>
>>


Re: dremel paper example schema

2018-10-29 Thread Gourav Sengupta
Hi,

Why not just use Dremel?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:35 PM lchorbadjiev 
wrote:

> Hi,
>
> I'm trying to reproduce the example from the Dremel paper
> (https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
> pyspark, and I wonder if it is possible at all.
>
> Trying to follow the paper example as closely as possible, I created this
> document type:
>
> from pyspark.sql.types import *
>
> links_type = StructType([
>     StructField("Backward", ArrayType(IntegerType(), containsNull=False),
>                 nullable=False),
>     StructField("Forward", ArrayType(IntegerType(), containsNull=False),
>                 nullable=False),
> ])
>
> language_type = StructType([
>     StructField("Code", StringType(), nullable=False),
>     StructField("Country", StringType())
> ])
>
> names_type = StructType([
>     StructField("Language", ArrayType(language_type, containsNull=False)),
>     StructField("Url", StringType()),
> ])
>
> document_type = StructType([
>     StructField("DocId", LongType(), nullable=False),
>     StructField("Links", links_type, nullable=True),
>     StructField("Name", ArrayType(names_type, containsNull=False))
> ])
>
> But when I store data in parquet using this type, the resulting parquet
> schema is different from the one described in the paper:
>
> message spark_schema {
>   required int64 DocId;
>   optional group Links {
> required group Backward (LIST) {
>   repeated group list {
> required int32 element;
>   }
> }
> required group Forward (LIST) {
>   repeated group list {
> required int32 element;
>   }
> }
>   }
>   optional group Name (LIST) {
> repeated group list {
>   required group element {
> optional group Language (LIST) {
>   repeated group list {
> required group element {
>   required binary Code (UTF8);
>   optional binary Country (UTF8);
> }
>   }
> }
> optional binary Url (UTF8);
>   }
> }
>   }
> }
>
> Moreover, if I create a parquet file with the schema described in the
> Dremel paper using the Apache Parquet Java API and try to read it into
> Apache Spark, I get an exception:
>
> org.apache.spark.sql.execution.QueryExecutionException: Encounter error
> while reading parquet files. One possible cause: Parquet column cannot be
> converted in the corresponding files
>
> Is it possible to create the example schema described in the Dremel paper
> using Apache Spark, and what is the correct approach to build this example?
>
> Regards,
> Lubomir Chorbadjiev
>
>
>
>
>


dremel paper example schema

2018-10-29 Thread lchorbadjiev
Hi,

I'm trying to reproduce the example from the Dremel paper
(https://research.google.com/pubs/archive/36632.pdf) in Apache Spark using
pyspark, and I wonder if it is possible at all.

Trying to follow the paper example as closely as possible, I created this
document type:

from pyspark.sql.types import *

links_type = StructType([
    StructField("Backward", ArrayType(IntegerType(), containsNull=False),
                nullable=False),
    StructField("Forward", ArrayType(IntegerType(), containsNull=False),
                nullable=False),
])

language_type = StructType([
    StructField("Code", StringType(), nullable=False),
    StructField("Country", StringType())
])

names_type = StructType([
    StructField("Language", ArrayType(language_type, containsNull=False)),
    StructField("Url", StringType()),
])

document_type = StructType([
    StructField("DocId", LongType(), nullable=False),
    StructField("Links", links_type, nullable=True),
    StructField("Name", ArrayType(names_type, containsNull=False))
])
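
For reference, a minimal sketch of a possible write step (assuming a local
SparkSession bound to a variable named spark and a hypothetical output path;
the sample record roughly follows the paper's r1 record, mapped positionally
onto document_type):

# Hypothetical write step: one record shaped like the paper's r1,
# expressed as nested tuples that map positionally onto document_type.
sample = (
    10,                                   # DocId
    ([], [20, 40, 60]),                   # Links: Backward, Forward
    [                                     # Name entries
        ([("en-us", "us"), ("en", None)], "http://A"),
        ([], "http://B"),
        ([("en-gb", "gb")], None),
    ],
)
df = spark.createDataFrame([sample], schema=document_type)
df.write.mode("overwrite").parquet("/tmp/dremel_document")
df.printSchema()   # Spark-side schema, for comparison with the Parquet output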

But when I store data in parquet using this type, the resulting parquet
schema is different from the one described in the paper:

message spark_schema {
  required int64 DocId;
  optional group Links {
required group Backward (LIST) {
  repeated group list {
required int32 element;
  }
}
required group Forward (LIST) {
  repeated group list {
required int32 element;
  }
}
  }
  optional group Name (LIST) {
repeated group list {
  required group element {
optional group Language (LIST) {
  repeated group list {
required group element {
  required binary Code (UTF8);
  optional binary Country (UTF8);
}
  }
}
optional binary Url (UTF8);
  }
}
  }
}

Moreover, if I create a parquet file with the schema described in the Dremel
paper using the Apache Parquet Java API and try to read it into Apache Spark,
I get an exception:

org.apache.spark.sql.execution.QueryExecutionException: Encounter error
while reading parquet files. One possible cause: Parquet column cannot be
converted in the corresponding files

Is it possible to create the example schema described in the Dremel paper
using Apache Spark, and what is the correct approach to build this example?

Regards,
Lubomir Chorbadjiev






Re: Processing Flexibility Between RDD and Dataframe API

2018-10-29 Thread Gourav Sengupta
Hi,

I would recommend reading the book by Matei Zaharia. One of the main
differentiating factors between Spark 1.x and subsequent releases has been
optimization, and hence DataFrames; RDD is in no way going away, because
DataFrames are built on RDDs. The use of RDDs is allowed, and it is
recommended in scenarios where greater flexibility is required, and those
scenarios are explicitly and clearly stated.
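
As a rough sketch of that relationship (hypothetical data and column names,
not the only way to do it):

# A minimal sketch: every DataFrame is backed by an RDD of Row objects,
# so you can drop down for flexibility and lift the result back up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-df-interop").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

rdd = df.rdd.map(lambda row: (row.id * 10, row.label.upper()))  # RDD-level logic
df2 = rdd.toDF(["id_x10", "label_upper"])                       # back to a DataFrame
df2.show()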

But if someone tells me that they will not use DataFrames at all, it means
that the solutions they eventually deliver to companies will be suboptimal,
expensive to run, difficult to maintain, flaky when scaling, and will
introduce resource dependencies. I have no clue why someone would do that.


Regards,
Gourav Sengupta


On Mon, Oct 29, 2018 at 7:49 AM Jungtaek Lim  wrote:

> Just my 2 cents as one of the contributors: while SQL semantics can express
> the various use cases data scientists encounter, I also agree that end
> users who are more familiar with code than with SQL can feel it is not
> flexible.
>
> But countless efforts have been incorporated into Spark SQL (and Catalyst),
> so I guess it is clear that Spark SQL and Structured Streaming are the way
> to go if your workload fits into them; on the other hand, if it doesn't,
> just keep using RDD. RDD is still the layer underlying Spark SQL, so I
> don't expect it to be deprecated unless Spark renews the underlying
> architecture.
>
> -Jungtaek Lim
>
> On Mon, Oct 29, 2018 at 12:06 AM, Adrienne Kole  wrote:
>
>> Thanks for bringing this issue to the mailing list.
>> As an addition, I would also ask the same questions about the DStreams and
>> Structured Streaming APIs.
>> Structured Streaming is high level, and it makes it difficult to express
>> all business logic in it, although Databricks is pushing it and
>> recommending it for usage.
>> Moreover, there is some work going on regarding continuous streaming.
>> So, what is Spark's future vision: support all of them, or concentrate on
>> one, since all those paradigms have separate processing semantics?
>>
>>
>> Cheers,
>> Adrienne
>>
>> On Sun, Oct 28, 2018 at 3:50 PM Soheil Pourbafrani 
>> wrote:
>>
>>> Hi,
>>> There are functions like map, flatMap, reduce, and so on that constitute
>>> the basic data processing operations in big data (and Apache Spark). But
>>> Spark, in new versions, introduces the high-level Dataframe API and
>>> recommends using it, while there are no such functions in the Dataframe
>>> API; it just has many built-in functions and the UDF. It is very
>>> inflexible (at least to me), and at many points I have to convert
>>> Dataframes to RDDs and vice versa. My questions are:
>>> Is RDD going to become outdated, and if so, what is the correct road map
>>> for processing with Apache Spark while the Dataframe API doesn't support
>>> functions like map and reduce? How do UDF functions process the data; do
>>> they apply to every row, like map functions? Does converting a Dataframe
>>> to an RDD come with high costs?
>>>
>>


Re: Processing Flexibility Between RDD and Dataframe API

2018-10-29 Thread Jungtaek Lim
Just my 2 cents as one of the contributors: while SQL semantics can express
the various use cases data scientists encounter, I also agree that end users
who are more familiar with code than with SQL can feel it is not flexible.

But countless efforts have been incorporated into Spark SQL (and Catalyst),
so I guess it is clear that Spark SQL and Structured Streaming are the way to
go if your workload fits into them; on the other hand, if it doesn't, just
keep using RDD. RDD is still the layer underlying Spark SQL, so I don't
expect it to be deprecated unless Spark renews the underlying architecture.
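
To make the trade-off concrete, a rough sketch (hypothetical data; neither
form is the only way to express it):

# The same aggregation, declaratively (Catalyst can optimize the plan)
# and imperatively on the RDD that still underlies the DataFrame.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

df.groupBy("key").agg(F.sum("value").alias("total")).show()   # Spark SQL path

totals = df.rdd.map(lambda r: (r.key, r.value)) \
               .reduceByKey(lambda a, b: a + b)                # plain RDD path
print(totals.collect())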

-Jungtaek Lim

On Mon, Oct 29, 2018 at 12:06 AM, Adrienne Kole  wrote:

> Thanks for bringing this issue to the mailing list.
> As an addition, I would also ask the same questions about the DStreams and
> Structured Streaming APIs.
> Structured Streaming is high level, and it makes it difficult to express
> all business logic in it, although Databricks is pushing it and
> recommending it for usage.
> Moreover, there is some work going on regarding continuous streaming.
> So, what is Spark's future vision: support all of them, or concentrate on
> one, since all those paradigms have separate processing semantics?
>
>
> Cheers,
> Adrienne
>
> On Sun, Oct 28, 2018 at 3:50 PM Soheil Pourbafrani 
> wrote:
>
>> Hi,
>> There are functions like map, flatMap, reduce, and so on that constitute
>> the basic data processing operations in big data (and Apache Spark). But
>> Spark, in new versions, introduces the high-level Dataframe API and
>> recommends using it, while there are no such functions in the Dataframe
>> API; it just has many built-in functions and the UDF. It is very
>> inflexible (at least to me), and at many points I have to convert
>> Dataframes to RDDs and vice versa. My questions are:
>> Is RDD going to become outdated, and if so, what is the correct road map
>> for processing with Apache Spark while the Dataframe API doesn't support
>> functions like map and reduce? How do UDF functions process the data; do
>> they apply to every row, like map functions? Does converting a Dataframe
>> to an RDD come with high costs?
>>
>