How do I parallelize Spark Jobs at Executor Level.

2015-10-28 Thread Vinoth Sankar
Hi,

I'm reading and filtering a large number of files using Spark. It's getting
parallelized at the Spark Driver level only. How do I make it parallelize to
the Executor (Worker) level? Refer to the following sample. Is there any way to
iterate the localIterator in parallel?

Note : I use Java 1.7 version

JavaRDD files = javaSparkContext.parallelize(fileList)
Iterator localIterator = files.toLocalIterator();

Regards
Vinoth Sankar


Re: Spark Core Transitive Dependencies

2015-10-28 Thread Furkan KAMACI
Hi Deng,

Could you give an example of which libraries you include for your purpose?

Kind Regards,
Furkan KAMACI

On Wed, Oct 28, 2015 at 4:07 AM, Deng Ching-Mallete 
wrote:

> Hi,
>
> The spark assembly jar already includes the spark core libraries plus
> their transitive dependencies, so you don't need to include them in your
> jar. I found it easier to use inclusions instead of exclusions when
> creating an assembly jar of my spark job so I would recommend going with
> that.
>
> HTH,
> Deng
>
>
> On Wed, Oct 28, 2015 at 6:20 AM, Furkan KAMACI 
> wrote:
>
>> Hi,
>>
>> I use Spark for its newAPIHadoopRDD method and map/reduce etc. tasks.
>> When I include it I see that it has many transitive dependencies.
>>
>> Which of them should I exclude? I've included the dependency tree of
>> spark-core. Is there any documentation that explains why they are needed
>> (maybe all of them are necessary?)
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> PS: Dependency Tree:
>>
>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.4.1:compile
>> [INFO] |  +- com.twitter:chill_2.10:jar:0.5.0:compile
>> [INFO] |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
>> [INFO] |  | +-
>> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
>> [INFO] |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
>> [INFO] |  | \- org.objenesis:objenesis:jar:1.2:compile
>> [INFO] |  +- com.twitter:chill-java:jar:0.5.0:compile
>> [INFO] |  +- org.apache.spark:spark-launcher_2.10:jar:1.4.1:compile
>> [INFO] |  +- org.apache.spark:spark-network-common_2.10:jar:1.4.1:compile
>> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.10:jar:1.4.1:compile
>> [INFO] |  +- org.apache.spark:spark-unsafe_2.10:jar:1.4.1:compile
>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>> [INFO] |  |  \- org.apache.curator:curator-framework:jar:2.4.0:compile
>> [INFO] |  | \- org.apache.curator:curator-client:jar:2.4.0:compile
>> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.3.2:compile
>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:compile
>> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
>> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.6.6:compile
>> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.6.6:compile
>> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:compile
>> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.2.0:compile
>> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
>> [INFO] |  +- commons-net:commons-net:jar:2.2:compile
>> [INFO] |  +-
>> org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
>> [INFO] |  |  +-
>> org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
>> [INFO] |  |  |  \- com.typesafe:config:jar:1.2.1:compile
>> [INFO] |  |  +-
>> org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
>> [INFO] |  |  \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
>> [INFO] |  +-
>> org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
>> [INFO] |  +- org.scala-lang:scala-library:jar:2.10.4:compile
>> [INFO] |  +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
>> [INFO] |  |  \- org.json4s:json4s-core_2.10:jar:3.2.10:compile
>> [INFO] |  | +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
>> [INFO] |  | \- org.scala-lang:scalap:jar:2.10.0:compile
>> [INFO] |  |\- org.scala-lang:scala-compiler:jar:2.10.0:compile
>> [INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>> [INFO] |  |  \- asm:asm:jar:3.1:compile
>> [INFO] |  +- com.sun.jersey:jersey-core:jar:1.9:compile
>> [INFO] |  +- org.apache.mesos:mesos:jar:shaded-protobuf:0.21.1:compile
>> [INFO] |  +- io.netty:netty-all:jar:4.0.23.Final:compile
>> [INFO] |  +- com.clearspring.analytics:stream:jar:2.7.0:compile
>> [INFO] |  +- io.dropwizard.metrics:metrics-core:jar:3.1.0:compile
>> [INFO] |  +- io.dropwizard.metrics:metrics-jvm:jar:3.1.0:compile
>> [INFO] |  +- io.dropwizard.metrics:metrics-json:jar:3.1.0:compile
>> [INFO] |  +- io.dropwizard.metrics:metrics-graphite:jar:3.1.0:compile
>> [INFO] |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.4.4:compile
>> [INFO] |  |  +-
>> com.fasterxml.jackson.core:jackson-annotations:jar:2.4.0:compile
>> [INFO] |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.4.4:compile
>> [INFO] |  +-
>> com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.4:compile
>> [INFO] |  |  \- org.scala-lang:scala-reflect:jar:2.10.4:compile
>> [INFO] |  +- org.apache.ivy:ivy:jar:2.4.0:compile
>> [INFO] |  +- oro:oro:jar:2.0.8:compile
>> [INFO] |  +- org.tachyonproject:tachyon-client:jar:0.6.4:compile
>> [INFO] |  |  \- org.tachyonproject:tachyon:jar:0.6.4:compile
>> [INFO] |  +- net.razorvine:pyrolite:jar:4.4:compile
>> [INFO] |  +- net.sf.py4j:py4j:jar:0.8.2.1:compile
>> [INFO] |  \- org.spark-project.spark:unused:jar:1.0.0:compile
>>
>
>


Why is no predicate pushdown performed, when using Hive (HiveThriftServer2) ?

2015-10-28 Thread Martin Senne
Hi all,

# Program Sketch


   1. I create a HiveContext `hiveContext`
   2. With that context, I create a DataFrame `df` from a JDBC relational
   table.
   3. I register the DataFrame `df` via

   df.registerTempTable("TESTTABLE")

   4. I start a HiveThriftServer2 via

   HiveThriftServer2.startWithContext(hiveContext)
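
In code, this sketch looks roughly like the following (a minimal sketch only;
the JDBC URL and connection options are placeholders, not taken from the
original post):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)

// DataFrame over the relational table via JDBC (connection options are placeholders)
val df = hiveContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://dbhost/testdb",   // placeholder JDBC URL
  "dbtable" -> "test"
)).load()

df.registerTempTable("TESTTABLE")

HiveThriftServer2.startWithContext(hiveContext)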



The TESTTABLE contains 1,000,000 entries, columns are ID (INT) and NAME
(VARCHAR)

+-----+--------+
| ID  |  NAME  |
+-----+--------+
| 1   | Hello  |
| 2   | Hello  |
| 3   | Hello  |
| ... | ...    |

With Beeline I access the SQL Endpoint (at port 1) of the
HiveThriftServer and perform a query. E.g.

SELECT * FROM TESTTABLE WHERE ID='3'

When I inspect the query log of the DB for the SQL statements executed, I see

/*SQL #:100 t:657*/SELECT \"ID\",\"NAME\" FROM test;

So no predicate pushdown happens, as the WHERE clause is missing.


# Questions

This gives rise to the following questions:

   1. Why is no predicate pushdown performed?
   2. Can this be changed by not using registerTempTable? If so, how?
   3. Or is this a known restriction of the HiveThriftServer?


# Counterexample

If I create a DataFrame `df` in Spark SQLContext and call

df.filter( df("ID") === 3).show()

I observe

/*SQL #:1*/SELECT \"ID\",\"NAME\" FROM test WHERE ID = 3;

as expected.


SparkR 1.5.1 ClassCastException when working with CSV files

2015-10-28 Thread rporcio
Hi,

When I'm working with CSV files in R using SparkR, I get a ClassCastException
during the execution of SparkR methods. The process below works fine in
1.4.1, but it is broken from 1.5.0 onwards.

(I will use the flights csv file from the examples as a reference, but I can
reproduce this with any csv file.)

Steps to reproduce:
1. Init spark and sql contexts. Use spark package
"com.databricks:spark-csv_2.11:1.0.3" for spark context initialization.
2. Init DataFrame as df <- read.df(sqlContext, "path_to_flights.csv_file",
source = "com.databricks.spark.csv", header="true")
3. Run command head(df)
4. The following exception occurs:
ERROR CsvRelation$: Exception while parsing line: 2011-01-24
12:00:00,14,48,1448,1546,3,-1,"CO",1079,"SAT","N14214",0,37,191.
java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.spark.unsafe.types.UTF8String
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
...

I'm using CentOS.
On Windows, the exception does not occur, but the DataFrame contains 0 rows.

Do I miss something?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-1-5-1-ClassCastException-when-working-with-CSV-files-tp25217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Prevent partitions from moving

2015-10-28 Thread t3l
I have a cluster with 2 nodes (32 CPU cores each). My data is distributed
evenly, but the processing times for each partition can vary greatly. Now,
sometimes Spark seems to conclude from the current workload on both nodes
that it might be better to shift one partition from node1 to node2 (because
that node has cores waiting for work). Am I hallucinating or is that really
happening? Is there any way I can prevent this from happening?

Greetings,

T3L



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Prevent-partitions-from-moving-tp25216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting info from DecisionTreeClassificationModel

2015-10-28 Thread Yanbo Liang
AFAIK, you cannot traverse the tree from the rootNode of
DecisionTreeClassificationModel, because type Node does not have
information about its children. Type InternalNode has children information, but
it is private, so users cannot access it.
I think the best way to get the probability of each prediction is to select
the rawPredictionCol and probabilityCol of the transformed DataFrame, which
will contain the raw prediction and the probability for each prediction.
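
For example, a minimal sketch (assuming `model` is a fitted
DecisionTreeClassificationModel and `testData` is a DataFrame with a features
column; the column names below are the ML pipeline defaults):

val predictions = model.transform(testData)
// "rawPrediction" and "probability" are the default names of rawPredictionCol
// and probabilityCol; "prediction" is the predicted label.
predictions.select("prediction", "rawPrediction", "probability").show(5)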

2015-10-22 12:43 GMT+08:00 sethah :

> I believe this question will give you the answer you're looking for:
> Decision
> Tree Accuracy
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-Decision-Tree-Node-Accuracy-td24561.html#a24629
> >
>
> Basically, you can traverse the tree from the root node.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-info-from-DecisionTreeClassificationModel-tp25152p25159.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


nested select is not working in spark sql

2015-10-28 Thread Kishor Bachhav
Hi,

I am trying to execute the below query in Spark SQL, but it throws an exception:

select n_name from NATION where n_regionkey = (select r_regionkey from
REGION where r_name='ASIA')

Exception:
Exception in thread "main" java.lang.RuntimeException: [1.55] failure:
``)'' expected but identifier r_regionkey found

select n_name from NATION where n_regionkey = (select r_regionkey from
REGION where r_name='ASIA')
  ^
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at
org.apache.spark.sql.SnappyParserDialect.parse(snappyParsers.scala:65)
at
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)


The same query works in MySQL as well as MemSQL.

The expected result is:

memsql> select n_name from NATION where n_regionkey = (select r_regionkey
from REGION where r_name='ASIA');
+-----------+
| n_name    |
+-----------+
| INDIA     |
| INDONESIA |
| JAPAN     |
| CHINA     |
| VIETNAM   |
+-----------+
5 rows in set (0.71 sec)

How can I make this work in Spark SQL?
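
This is not from the thread, but one common workaround sketch while scalar
subqueries are unsupported: evaluate the inner query first and substitute its
result (assumes NATION and REGION are registered as tables and r_regionkey is
an integer column):

// Evaluate the inner query separately...
val regionKey = sqlContext
  .sql("select r_regionkey from REGION where r_name = 'ASIA'")
  .first()
  .getInt(0)   // assumes r_regionkey is an INT column

// ...and substitute its value into the outer query.
val names = sqlContext.sql(
  s"select n_name from NATION where n_regionkey = $regionKey")
names.show()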

Actually, the above query is a simplified version of the minimum cost supplier
query (Q2) of TPC-H, which has this nested-select nature. I am working on
these TPC-H queries. If anybody has a modified set of TPC-H queries for
Spark SQL, kindly let me know. It would be very useful for me.

select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = [SIZE]
and p_type like '%[TYPE]'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = '[REGION]'
and ps_supplycost = (
  select
min(ps_supplycost)
from
partsupp, supplier,
nation, region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = '[REGION]'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey;


Regards
  Kishor


Hive Version

2015-10-28 Thread Bryan Jeffrey
All,

I am using a HiveContext to create persistent tables from Spark. I am using
the Spark 1.4.1 (Scala 2.11) built-in Hive support.  What version of Hive
does Spark's built-in Hive support correspond to? I ask because the Avro
format and timestamps in Parquet do not appear to be supported.

I have searched a lot of the Spark documentation, but do not see the version
specified anywhere - it would be a good addition.

Thank you,

Bryan Jeffrey


Re: Hive Version

2015-10-28 Thread Michael Armbrust
Documented here:
http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore

In 1.4.1 we compile against 0.13.1
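
For reference, a hedged illustration of the knobs that guide describes (the
values shown are just the 1.4.1 defaults):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")  // version of the Hive metastore client
  .set("spark.sql.hive.metastore.jars", "builtin")    // use the Hive classes bundled with Spark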

On Wed, Oct 28, 2015 at 2:26 PM, Bryan Jeffrey 
wrote:

> All,
>
> I am using a HiveContext to create persistent tables from Spark. I am
> using the Spark 1.4.1 (Scala 2.11) built-in Hive support.  What version of
> Hive does the Spark Hive correspond to? I ask because AVRO format and
> Timestamps in Parquet do not appear to be supported.
>
> I have searched a lot of the Spark documentation, but do not see version
> specified anywhere - it would be a good addition.
>
> Thank you,
>
> Bryan Jeffrey
>


Apache Spark on Raspberry Pi Cluster with Docker

2015-10-28 Thread Mark Bonnekessel
Hi,

we are trying to set up Apache Spark on a Raspberry Pi cluster for educational
use.
Spark is installed in a Docker container and all necessary ports are exposed.

After we start the master and workers, all workers are listed as alive in the
master web UI (http://master:8080).

I want to run the SimpleApp example from the Spark homepage
(http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications)
on the cluster to verify that everything is working, but I cannot get it to run.

I built a jar file and submitted the application with spark-submit and get the
following output:
spark-submit --master spark://master:6066 --deploy-mode
cluster --class SimpleApp target/simple-project-1.0.jar
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/28 12:54:43 INFO RestSubmissionClient: Submitting a request to launch an
application in spark://localhost:6066.
15/10/28 12:54:43 INFO RestSubmissionClient: Submission successfully created as 
driver-20151028115443-0002. Polling submission state...
15/10/28 12:54:43 INFO RestSubmissionClient: Submitting a request for the
status of submission driver-20151028115443-0002 in spark://localhost:6066.
15/10/28 12:54:43 INFO RestSubmissionClient: State of driver 
driver-20151028115443-0002 is now SUBMITTED.
15/10/28 12:54:43 INFO RestSubmissionClient: Server responded with 
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151028115443-0002",
  "serverSparkVersion" : "1.5.1",
  "submissionId" : "driver-20151028115443-0002",
  "success" : true
}

The driver is created correctly, but it never starts the application.
What am I missing?

Regards,
Mark

Building spark-1.5.x and MQTT

2015-10-28 Thread Bob Corsaro
Has anyone successfully built this? I'm trying to determine if there is a
defect in the source package or something strange about my environment. I
get a FileNotFound exception on MQTTUtils.class during the build of the
MQTT module. The only workaround I've found is to remove the MQTT modules
from the pom.xml.


No way to supply hive-site.xml in yarn client mode?

2015-10-28 Thread Zoltan Fedor
Hi,
We have a shared CDH 5.3.3 cluster and trying to use Spark 1.5.1 on it in
yarn client mode with Hive.

I have compiled Spark 1.5.1 with SPARK_HIVE=true, but it seems I am not
able to make Spark SQL pick up the hive-site.xml when running pyspark.

hive-site.xml is located in $SPARK_HOME/hadoop-conf/hive-site.xml and also
in $SPARK_HOME/conf/hive-site.xml

When I start pyspark with the below command and then run some simple
Spark SQL, it fails; it seems it didn't pick up the settings in hive-site.xml.

$ HADOOP_CONF_DIR=$SPARK_HOME/hadoop-conf
YARN_CONF_DIR=$SPARK_HOME/yarn-conf HADOOP_USER_NAME=biapp MASTER=yarn
$SPARK_HOME/bin/pyspark --deploy-mode client

Python 2.6.6 (r266:84292, Jul 23 2015, 05:13:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/lib/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.5.0-cdh5.3.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/28 10:22:33 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.
15/10/28 10:22:35 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/10/28 10:22:59 WARN HiveConf: HiveConf of name hive.metastore.local does
not exist
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.6.6 (r266:84292, Jul 23 2015 05:13:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sqlContext2 = HiveContext(sc)
>>> sqlContext2.sql("show databases").first()
15/10/28 10:23:12 WARN HiveConf: HiveConf of name hive.metastore.local does
not exist
15/10/28 10:23:13 WARN ShellBasedUnixGroupsMapping: got exception trying to
get groups for user biapp: id: biapp: No such user

15/10/28 10:23:13 WARN UserGroupInformation: No groups available for user
biapp
Traceback (most recent call last):
  File "", line 1, in 
  File
"/usr/lib/spark-1.5.1-bin-without-hadoop/python/pyspark/sql/context.py",
line 552, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File
"/usr/lib/spark-1.5.1-bin-without-hadoop/python/pyspark/sql/context.py",
line 660, in _ssql_ctx
"build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and
run build/sbt assembly", Py4JJavaError(u'An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o20))
>>>


Note the warning above, "WARN HiveConf: HiveConf of name
hive.metastore.local does not exist", even though there actually is a
hive.metastore.local property in the hive-site.xml.

Any idea how to submit hive-site.xml in yarn client mode?

Thanks


Re: SparkR 1.5.1 ClassCastException when working with CSV files

2015-10-28 Thread rporcio
It seems that the cause of this exception was the wrong version of the
spark-csv package.
After I upgraded it to the latest (1.2.0) version, the exception is gone and
it works fine. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-1-5-1-ClassCastException-when-working-with-CSV-files-tp25217p25219.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Bob Corsaro
Built from
http://mirror.olnevhost.net/pub/apache/spark/spark-1.5.1/spark-1.5.1.tgz using
the following command:

build/mvn -DskipTests=true -Dhadoop.version=2.4.1
-P"hadoop-2.4,kinesis-asl,netlib-lgpl" package install

build/mvn is from the packaged source.

Tried on a couple of ubuntu boxen and a gentoo box.

On Wed, Oct 28, 2015 at 9:59 AM Ted Yu  wrote:

> MQTTUtils.class is generated from
> external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/MQTTUtils.scala
>
> What command did you use to build ?
> Which release / branch were you building ?
>
> Thanks
>
> On Wed, Oct 28, 2015 at 6:19 AM, Bob Corsaro  wrote:
>
>> Has anyone successful built this? I'm trying to determine if there is a
>> defect in the source package or something strange about my environment. I
>> get a FileNotFound exception on MQTTUtils.class during the build of the
>> MQTT module. The only work around I've found is to remove the MQTT modules
>> from the pom.xml.
>>
>
>


Re: How do I parallelize Spark Jobs at Executor Level.

2015-10-28 Thread Adrian Tanase
The first line is distributing your fileList variable in the cluster as an RDD,
partitioned using the default partitioner settings (e.g. the number of cores in
your cluster).

Each of your workers would get one or more slices of data (depending on how many
cores each executor has), and the abstraction is called a partition.

What is your use case? If you want to load the files and continue processing in
parallel, then a simple .map should work.
If you want to execute arbitrary code based on the list of files that each
executor received, then you need to use .foreach, which will get executed for
each of the entries on the workers.
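
A rough sketch of the two options (shown in Scala; the JavaRDD API has
equivalent map/foreach methods - the file-reading and filter logic below are
placeholders, and the files must be reachable from the workers):

val files = sc.parallelize(fileList)   // fileList: a Seq of file paths

// Option 1: transform each file on the executors, e.g. load and filter in parallel.
val matches = files.map { path =>
  val contents = scala.io.Source.fromFile(path).mkString
  (path, contents.contains("ERROR"))   // placeholder filter
}
matches.filter(_._2).count()           // an action triggers the distributed work

// Option 2: run arbitrary side-effecting code per entry on the workers.
files.foreach { path =>
  println(s"processing $path on an executor")
}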

-adrian

From: Vinoth Sankar
Date: Wednesday, October 28, 2015 at 2:49 PM
To: "user@spark.apache.org"
Subject: How do I parallelize Spark Jobs at Executor Level.

Hi,

I'm reading and filtering a large number of files using Spark. It's getting
parallelized at the Spark Driver level only. How do I make it parallelize to
the Executor (Worker) level? Refer to the following sample. Is there any way to
iterate the localIterator in parallel?

Note : I use Java 1.7 version

JavaRDD files = javaSparkContext.parallelize(fileList)
Iterator localIterator = files.toLocalIterator();

Regards
Vinoth Sankar


Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Ted Yu
MQTTUtils.class is generated from
external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/MQTTUtils.scala

What command did you use to build ?
Which release / branch were you building ?

Thanks

On Wed, Oct 28, 2015 at 6:19 AM, Bob Corsaro  wrote:

> Has anyone successful built this? I'm trying to determine if there is a
> defect in the source package or something strange about my environment. I
> get a FileNotFound exception on MQTTUtils.class during the build of the
> MQTT module. The only work around I've found is to remove the MQTT modules
> from the pom.xml.
>


Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
Hello.

I am working to get a simple solution working using Spark SQL.  I am
writing streaming data to persistent tables using a HiveContext.  Writing
to a persistent non-partitioned table works well - I update the table using
Spark streaming, and the output is available via Hive Thrift/JDBC.

I create a table that looks like the following:

0: jdbc:hive2://localhost:1> describe windows_event;
describe windows_event;
+--------------------------+-------------+----------+
| col_name                 | data_type   | comment  |
+--------------------------+-------------+----------+
| target_entity            | string      | NULL     |
| target_entity_type       | string      | NULL     |
| date_time_utc            | timestamp   | NULL     |
| machine_ip               | string      | NULL     |
| event_id                 | string      | NULL     |
| event_data               | map         | NULL     |
| description              | string      | NULL     |
| event_record_id          | string      | NULL     |
| level                    | string      | NULL     |
| machine_name             | string      | NULL     |
| sequence_number          | string      | NULL     |
| source                   | string      | NULL     |
| source_machine_name      | string      | NULL     |
| task_category            | string      | NULL     |
| user                     | string      | NULL     |
| additional_data          | map         | NULL     |
| windows_event_time_bin   | timestamp   | NULL     |
| # Partition Information  |             |          |
| # col_name               | data_type   | comment  |
| windows_event_time_bin   | timestamp   | NULL     |
+--------------------------+-------------+----------+


However, when I create a partitioned table and write data using the
following:

hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
})

The data is written as though the table is not partitioned (so everything
is written to /user/hive/warehouse/windows_event/file.gz.parquet).  Because
the data is not following the partition schema, it is not accessible (and
not partitioned).

Is there a straightforward way to write to partitioned tables using Spark
SQL?  I understand that the read performance for partitioned data is far
better - are there other performance improvements that might be better to
use instead of partitioning?

Regards,

Bryan Jeffrey


Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Colin Alstad
We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some
inconsistent behavior in persisting DataFrames.

df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
> 161,100,982

df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
> 67,498,706

join_df = df1.join(df2, 'id')
join_df.count()
> 160,608,147

join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
> 67,698,892

join_df.write.json("join.json")
join_json = sqlContext.read.parquet("join.json")
join_json.count()
> 67,695,663

The first major issue is that there is an order of magnitude difference
between the count of the join DataFrame and the persisted join DataFrame.
Secondly, persisting the same DataFrame into 2 different formats yields
different results.

Does anyone have any idea on what could be going on here?

-- 
Colin Alstad
Data Scientist
colin.als...@pokitdok.com




Re: Filter applied on merged Parquet schemas with new column fails.

2015-10-28 Thread Cheng Lian

Hey Hyukjin,

Sorry that I missed the JIRA ticket. Thanks for bringing this issue up
here, and for your detailed investigation.


From my side, I think this is a bug in Parquet. Parquet was designed to
support schema evolution. When scanning a Parquet file, if a column exists in
the requested schema but is missing from the file schema, that column is
filled with null. This should also hold for pushed-down predicate
filters. For example, if the filter "a = 1" is pushed down but column "a"
doesn't exist in the Parquet file being scanned, it's safe to assume "a"
is null in all records and drop all of them. On the contrary, if "a IS
NULL" is pushed down, all records should be preserved.


Apparently, before this issue is properly fixed on the Parquet side, we need
to work around this issue on the Spark side. Please see my comments on all
3 of your solutions inlined below. In short, I'd like to have approach 1
for branch-1.5 and approach 2 for master.


Cheng

On 10/28/15 10:11 AM, Hyukjin Kwon wrote:
When enabling mergedSchema and predicate filtering, this fails
since Parquet filters are pushed down regardless of the schema of each
split (or rather each file).


Dominic Ricard reported this
issue (https://issues.apache.org/jira/browse/SPARK-11103)


Even though this would work okay by setting
spark.sql.parquet.filterPushdown to false, the default value of this
is true. So this looks like an issue.


My questions are,
is this clearly an issue?
and if so, which way would this be handled?


I think this is an issue; I made three rough patches for this and
tested them, and they look fine.


The first approach looks simpler and appropriate, as I presume from
previous approaches such as
https://issues.apache.org/jira/browse/SPARK-11153.
However, in terms of safety and performance, I also want to ensure
which one would be a proper approach before trying to open a PR.


1. Simply set spark.sql.parquet.filterPushdown to false when using
mergeSchema.
This one is pretty simple and safe; I'd like to have this for 1.5.2, or
1.5.3 if we can't make it for 1.5.2.


2. If spark.sql.parquet.filterPushdown is true, retrieve the schema
of every part-file (and also the merged one) and check if each can
accept the given schema, and then apply the filter only when they all
can accept it, which I think is a bit over-implemented.
Actually, we only need to calculate the intersection of all file
schemata. We can make ParquetRelation.mergeSchemaInParallel return two
StructTypes: the first one is the original merged schema, the other is
the intersection of all file schemata, which only contains fields that
exist in all file schemata. Then we decide which filters to push down
according to the second StructType.


3. If spark.sql.parquet.filterPushdown is true, retrieve the schema
of every part-file (and also the merged one) and apply the filter
to each split (rather, file) that can accept the filter, which (I think
it's hacky) ends up with different configurations for each task in a job.
The idea I came up with at first was similar to this one. Instead of
pulling all file schemata to the driver side, we can push filter push-down
to the executor side. Namely, pass candidate filters to the executor side
and compute the Parquet predicate filter according to each file schema.
I haven't looked into this direction in depth, but we can probably put
this part into CatalystReadSupport, which is now initialized on the executor
side.


However, the correctness of this approach can only be guaranteed by the
defensive filtering we do in Spark SQL (i.e. apply all the filters no
matter whether they are pushed down or not), but we are considering removing it
because it imposes an unnecessary performance cost. This makes me hesitant
to go along this way.
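
For what it's worth, a user-side sketch of the workaround behind approach 1
(just the configuration discussed above, with a placeholder path; not a fix):

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/table")
df.filter("a = 1").show()   // no Parquet-level pushdown, so files missing column "a" are still scanned safely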


org.apache.spark.shuffle.FetchFailedException: Failed to connect to ..... on worker failure

2015-10-28 Thread kundan kumar
Hi,

I am running a Spark Streaming job. I was testing fault tolerance by
killing one of the workers using the kill -9 command.

What I understand is that when I kill a worker, the process should not die and
should resume the execution.

But I am getting the following error and my process is halted.

org.apache.spark.shuffle.FetchFailedException: Failed to connect to .



Now, when I restart the same worker (2 workers were running on the
machine and I killed just one of them), the execution resumes and the
process is completed.

Please help me understand why my process is not fault tolerant on a
worker failure. Am I missing something? Basically I need my process to
resume even if a worker is lost.



Regards,
Kundan


Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
Yeah, of course. Just create an RDD from jdbc, call cache()/persist(), then
force it to be evaluated using something like count(). Once it is cached,
you can use it in a StreamingContext. Because of the cache it should not
access JDBC any more.
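
A hedged sketch of that pattern (the JDBC URL, query, bounds and row mapping
are placeholders; `sc` is the SparkContext):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val lookup = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb"),  // placeholder URL
  "SELECT id, name FROM lookup WHERE id >= ? AND id <= ?",        // JdbcRDD needs the two bound markers
  1L, 1000000L, 4,                                                // lowerBound, upperBound, numPartitions
  rs => (rs.getInt(1), rs.getString(2)))

lookup.cache()
lookup.count()   // force evaluation once, so later uses read from the cache, not the DB

// Later, inside the streaming job, e.g. against a DStream of (Int, value) pairs:
// stream.transform(rdd => rdd.join(lookup))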

On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru 
wrote:

> I know it uses lazy model, which is why I was wondering.
>
> On 27 October 2015 at 19:02, Uthayan Suthakar 
> wrote:
>
>> Hello all,
>>
>> What I wanted to do is configure the spark streaming job to read the
>> database using JdbcRDD and cache the results. This should occur only once
>> at the start of the job. It should not make any further connection to DB
>>  afterwards. Is it possible to do that?
>>
>
>


Re: Spark-Testing-Base Q/A

2015-10-28 Thread Holden Karau
And now (before 1am California time :p) there is a new version of
spark-testing-base which adds a Java base class for streaming tests. I
noticed you were using 1.3, so I put in the effort to make this release for
Spark 1.3 to 1.5 (inclusive).

On Wed, Oct 21, 2015 at 4:16 PM, Holden Karau  wrote:

>
>
> On Wednesday, October 21, 2015, Mark Vervuurt 
> wrote:
>
>> Hi Holden,
>>
>> Thanks for the information, I think that a Java Base Class in order to
>> test SparkStreaming using Java would be useful for the community.
>> Unfortunately not all of our customers are willing to use Scala or Python.
>>
> Sounds reasonable, I'll add it this week.
>
>>
>> If i am not wrong it’s 4:00 AM for you in California ;)
>>
>> Yup, I'm not great a regular schedules but I make up for it by doing
> stuff when I've had too much coffee to sleep :p
>
>> Regards,
>> Mark
>>
>> On 21 Oct 2015, at 12:42, Holden Karau  wrote:
>>
>>
>>
>> On Wednesday, October 21, 2015, Mark Vervuurt 
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am busy trying out ‘Spark-Testing-Base
>>> ’. I have the following
>>> questions?
>>>
>>>
>>>- Can you test Spark Streaming Jobs using Java?
>>>
>>> The current base class for testing streaming jobs is implemented using a
>> Scala test library (and one in Python too), I can add one using a junit
>> base for streaming if it would be useful for you.
>>
>>>
>>>- Can I use Spark-Testing-Base 1.3.0_0.1.1 together with Spark 1.3.1?
>>>
>>>  You should be able to, the API changes were small enough I didn't
>> publish a seperate package, but if you run into any issues let me know.
>>
>>>
>>>
>>>
>>> Thanks.
>>>
>>> Greetings,
>>> Mark
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
However, if your executor dies, then it may reconnect to JDBC to
reconstruct the RDD partitions that were lost. To prevent that, you can
checkpoint the RDD to an HDFS-like filesystem (using rdd.checkpoint()). Then
you are safe; it won't reconnect to JDBC.
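
A tiny sketch of that extra safety net (the directory is a placeholder, and
`lookup` is the cached JDBC-backed RDD from the previous message):

sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")   // placeholder HDFS path
lookup.cache()
lookup.checkpoint()   // mark for checkpointing before the first job runs on this RDD
lookup.count()        // materializes both the cache and the checkpoint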


On Tue, Oct 27, 2015 at 11:17 PM, Tathagata Das  wrote:

> Yeah, of course. Just create an RDD from jdbc, call cache()/persist(),
> then force it to be evaluated using something like count(). Once it is
> cached, you can use it in a StreamingContext. Because of the cache it
> should not access JDBC any more.
>
> On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru <
> diplomaticg...@gmail.com> wrote:
>
>> I know it uses lazy model, which is why I was wondering.
>>
>> On 27 October 2015 at 19:02, Uthayan Suthakar > > wrote:
>>
>>> Hello all,
>>>
>>> What I wanted to do is configure the spark streaming job to read the
>>> database using JdbcRDD and cache the results. This should occur only once
>>> at the start of the job. It should not make any further connection to DB
>>>  afterwards. Is it possible to do that?
>>>
>>
>>
>


RE: SPARKONHBase checkpointing issue

2015-10-28 Thread Amit Hora
Thanks for sharing the link. Yes, I understand that accumulator and broadcast
variable state is not recovered from a checkpoint, but is there any way by which
I can say that the HBaseContext in this context should not be recovered from
the checkpoint but rather must be reinitialized?

-Original Message-
From: "Adrian Tanase" 
Sent: ‎27-‎10-‎2015 18:08
To: "Amit Singh Hora" ; "user@spark.apache.org" 

Subject: Re: SPARKONHBase checkpointing issue

Does this help?

https://issues.apache.org/jira/browse/SPARK-5206



On 10/27/15, 1:53 PM, "Amit Singh Hora"  wrote:

>Hi all ,
>
>I am using Cloudera's SparkObHbase to bulk insert in hbase ,Please find
>below code
>object test {
>  
>def main(args: Array[String]): Unit = {
>
>
>
>   val conf = ConfigFactory.load("connection.conf").getConfig("connection")
>val checkpointDirectory=conf.getString("spark.checkpointDir")
>val ssc = StreamingContext.getOrCreate(checkpointDirectory, ()=>{
>  functionToCreateContext(checkpointDirectory)
>})
> 
>
>ssc.start()
>ssc.awaitTermination()
>
> }
>
>def getHbaseContext(sc: SparkContext,conf: Config): HBaseContext={
>  println("always gets created")
>   val hconf = HBaseConfiguration.create();
>val timeout= conf.getString("hbase.zookeepertimeout")
>val master=conf.getString("hbase.hbase_master")
>val zk=conf.getString("hbase.hbase_zkquorum")
>val zkport=conf.getString("hbase.hbase_zk_port")
>
>  hconf.set("zookeeper.session.timeout",timeout);
>hconf.set("hbase.client.retries.number", Integer.toString(1));
>hconf.set("zookeeper.recovery.retry", Integer.toString(1));
>hconf.set("hbase.master", master);
>hconf.set("hbase.zookeeper.quorum",zk);
>hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
>hconf.set("hbase.zookeeper.property.clientPort",zkport );
>
>   
>val hbaseContext = new HBaseContext(sc, hconf);
>return hbaseContext
>}
>  def functionToCreateContext(checkpointDirectory: String): StreamingContext
>= {
>println("creating for frst time")
>val conf = ConfigFactory.load("connection.conf").getConfig("connection")
>val brokerlist = conf.getString("kafka.broker")
>val topic = conf.getString("kafka.topic")
>
>val Array(brokers, topics) = Array(brokerlist, topic)
>
>
>val sparkConf = new SparkConf().setAppName("HBaseBulkPutTimestampExample
>" )
>sparkConf.set("spark.cleaner.ttl", "2");
>sparkConf.setMaster("local[2]")
>
>
> val topicsSet = topic.split(",").toSet
>val batchduration = conf.getString("spark.batchduration").toInt
>val ssc: StreamingContext = new StreamingContext(sparkConf,
>Seconds(batchduration))
>  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
> val kafkaParams = Map[String, String]("metadata.broker.list" ->
>brokerlist, "auto.offset.reset" -> "smallest")
>val messages = KafkaUtils.createDirectStream[String, String,
>StringDecoder, StringDecoder](
>  ssc, kafkaParams, topicsSet)
>val lines=messages.map(_._2)
>   
>
>
>getHbaseContext(ssc.sparkContext,conf).streamBulkPut[String](lines,
>  "ecs_test",
>  (putRecord) => {
>if (putRecord.length() > 0) {
>  var maprecord = new HashMap[String, String];
>  val mapper = new ObjectMapper();
>
>  //convert JSON string to Map
>  maprecord = mapper.readValue(putRecord,
>new TypeReference[HashMap[String, String]]() {});
>  
>  var ts: Long = maprecord.get("ts").toLong
>  
>   var tweetID:Long= maprecord.get("id").toLong
>  val key=ts+"_"+tweetID;
>  
>  val put = new Put(Bytes.toBytes(key))
>  maprecord.foreach(kv => {
> 
> 
>put.add(Bytes.toBytes("test"),Bytes.toBytes(kv._1),Bytes.toBytes(kv._2))
>  
>
>  })
>
>
>  put
>} else {
>  null
>}
>  },
>  false);
>
>ssc
>  
>  }
>}
>
>i am not able to retrieve from checkpoint after restart ,always get 
>Unable to getConfig from broadcast
>
>after debugging more i can see that the method for creating the HbaseContext
>actually broadcasts the configuration ,context object passed
>
>as a solution i just want to recreate the hbase context in every condition
>weather the checkpoint exists or not
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/SPARKONHBase-checkpointing-issue-tp25211.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-28 Thread Cheng, Hao
Hi Jerry, I’ve filed a bug in JIRA, and also the fix:

https://issues.apache.org/jira/browse/SPARK-11364

It would be greatly appreciated if you can verify the PR with your case.

Thanks,
Hao

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Wednesday, October 28, 2015 8:51 AM
To: Jerry Lam; Marcelo Vanzin
Cc: user@spark.apache.org
Subject: RE: [Spark-SQL]: Unable to propagate hadoop configuration after 
SparkContext is initialized

After a quick glance, this seems to be a bug in Spark SQL; do you mind creating a
JIRA for this? Then I can start to fix it.

Thanks,
Hao

From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, October 28, 2015 3:13 AM
To: Marcelo Vanzin
Cc: user@spark.apache.org
Subject: Re: [Spark-SQL]: Unable to propagate hadoop configuration after 
SparkContext is initialized

Hi Marcelo,

I tried setting the properties before instantiating the Spark context via
SparkConf. It works fine.
Originally, the code read Hadoop configurations from hdfs-site.xml, which
works perfectly fine as well.
Therefore, can I conclude that sparkContext.hadoopConfiguration.set("key",
"value") does not propagate through all SQL jobs within the same SparkContext?
I haven't tried with Spark Core, so I cannot tell.

Is there a workaround, given it seems to be broken? I need to do this
programmatically after the SparkContext is instantiated, not before...

Best Regards,

Jerry

On Tue, Oct 27, 2015 at 2:30 PM, Marcelo Vanzin 
> wrote:
If setting the values in SparkConf works, there's probably some bug in
the SQL code; e.g. creating a new Configuration object instead of
using the one in SparkContext. But I'm not really familiar with that
code.

On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam 
> wrote:
> Hi Marcelo,
>
> Thanks for the advice. I understand that we could set the configurations
> before creating SparkContext. My question is
> SparkContext.hadoopConfiguration.set("key","value") doesn't seem to
> propagate to all subsequent SQLContext jobs. Note that I mentioned I can
> load the parquet file but I cannot perform a count on the parquet file
> because of the AmazonClientException. It means that the credential is used
> during the loading of the parquet but not when we are processing the parquet
> file. How this can happen?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Oct 27, 2015 at 2:05 PM, Marcelo Vanzin 
> > wrote:
>>
>> On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam 
>> > wrote:
>> > Anyone experiences issues in setting hadoop configurations after
>> > SparkContext is initialized? I'm using Spark 1.5.1.
>> >
>> > I'm trying to use s3a which requires access and secret key set into
>> > hadoop
>> > configuration. I tried to set the properties in the hadoop configuration
>> > from sparktcontext.
>> >
>> > sc.hadoopConfiguration.set("fs.s3a.access.key", AWSAccessKeyId)
>> > sc.hadoopConfiguration.set("fs.s3a.secret.key", AWSSecretKey)
>>
>> Try setting "spark.hadoop.fs.s3a.access.key" and
>> "spark.hadoop.fs.s3a.secret.key" in your SparkConf before creating the
>> SparkContext.
>>
>> --
>> Marcelo
>
>

--
Marcelo



Re: SPARKONHBase checkpointing issue

2015-10-28 Thread Tathagata Das
Yes, the workaround is the same as the one suggested in the JIRA for
accumulator and broadcast variables. Basically, make a singleton object
which lazily initializes the HBaseContext. Because it is a singleton, it won't
get serialized through the checkpoint. After recovering, it will be
reinitialized lazily. This is the exact same approach I used for
`SQLContext.getOrCreate()`.
Take a look at the code.
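
A hedged sketch of that lazily initialized singleton pattern applied to
HBaseContext (the import path is assumed from the Cloudera SparkOnHBase
project, and the HBase configuration details are placeholders):

import org.apache.spark.SparkContext
import org.apache.hadoop.hbase.HBaseConfiguration
import com.cloudera.spark.hbase.HBaseContext

object HBaseContextSingleton {
  @transient private var instance: HBaseContext = _

  def getOrCreate(sc: SparkContext): HBaseContext = synchronized {
    if (instance == null) {
      val hconf = HBaseConfiguration.create()   // placeholder: set ZK quorum etc. here
      instance = new HBaseContext(sc, hconf)
    }
    instance
  }
}

// In the streaming code, always go through the singleton instead of a
// checkpointed reference, e.g.:
// HBaseContextSingleton.getOrCreate(ssc.sparkContext).streamBulkPut(...)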

On Tue, Oct 27, 2015 at 11:19 PM, Amit Hora  wrote:

> Thanks for sharing the link.Yes I understand that accumulators and
> broadcast variables state are not recovered from checkpoint but is there
> any way by which I can say that the HBaseContext in this context should nt
> be recovered from checkpoint rather must be reinitialized
> --
> From: Adrian Tanase 
> Sent: ‎27-‎10-‎2015 18:08
> To: Amit Singh Hora ; user@spark.apache.org
> Subject: Re: SPARKONHBase checkpointing issue
>
> Does this help?
>
> https://issues.apache.org/jira/browse/SPARK-5206
>
>
>
> On 10/27/15, 1:53 PM, "Amit Singh Hora"  wrote:
>
> >Hi all ,
> >
> >I am using Cloudera's SparkObHbase to bulk insert in hbase ,Please find
> >below code
> >object test {
> >
> >def main(args: Array[String]): Unit = {
> >
> >
> >
> >   val conf =
> ConfigFactory.load("connection.conf").getConfig("connection")
> >val checkpointDirectory=conf.getString("spark.checkpointDir")
> >val ssc = StreamingContext.getOrCreate(checkpointDirectory, ()=>{
> >  functionToCreateContext(checkpointDirectory)
> >})
> >
> >
> >ssc.start()
> >ssc.awaitTermination()
> >
> > }
> >
> >def getHbaseContext(sc: SparkContext,conf: Config): HBaseContext={
> >  println("always gets created")
> >   val hconf = HBaseConfiguration.create();
> >val timeout= conf.getString("hbase.zookeepertimeout")
> >val master=conf.getString("hbase.hbase_master")
> >val zk=conf.getString("hbase.hbase_zkquorum")
> >val zkport=conf.getString("hbase.hbase_zk_port")
> >
> >  hconf.set("zookeeper.session.timeout",timeout);
> >hconf.set("hbase.client.retries.number", Integer.toString(1));
> >hconf.set("zookeeper.recovery.retry", Integer.toString(1));
> >hconf.set("hbase.master", master);
> >hconf.set("hbase.zookeeper.quorum",zk);
> >hconf.set("zookeeper.znode.parent", "/hbase-unsecure");
> >hconf.set("hbase.zookeeper.property.clientPort",zkport );
> >
> >
> >val hbaseContext = new HBaseContext(sc, hconf);
> >return hbaseContext
> >}
> >  def functionToCreateContext(checkpointDirectory: String):
> StreamingContext
> >= {
> >println("creating for frst time")
> >val conf =
> ConfigFactory.load("connection.conf").getConfig("connection")
> >val brokerlist = conf.getString("kafka.broker")
> >val topic = conf.getString("kafka.topic")
> >
> >val Array(brokers, topics) = Array(brokerlist, topic)
> >
> >
> >val sparkConf = new
> SparkConf().setAppName("HBaseBulkPutTimestampExample
> >" )
> >sparkConf.set("spark.cleaner.ttl", "2");
> >sparkConf.setMaster("local[2]")
> >
> >
> > val topicsSet = topic.split(",").toSet
> >val batchduration = conf.getString("spark.batchduration").toInt
> >val ssc: StreamingContext = new StreamingContext(sparkConf,
> >Seconds(batchduration))
> >  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
> > val kafkaParams = Map[String, String]("metadata.broker.list" ->
> >brokerlist, "auto.offset.reset" -> "smallest")
> >val messages = KafkaUtils.createDirectStream[String, String,
> >StringDecoder, StringDecoder](
> >  ssc, kafkaParams, topicsSet)
> >val lines=messages.map(_._2)
> >
> >
> >
> >getHbaseContext(ssc.sparkContext,conf).streamBulkPut[String](lines,
> >  "ecs_test",
> >  (putRecord) => {
> >if (putRecord.length() > 0) {
> >  var maprecord = new HashMap[String, String];
> >  val mapper = new ObjectMapper();
> >
> >  //convert JSON string to Map
> >  maprecord = mapper.readValue(putRecord,
> >new TypeReference[HashMap[String, String]]() {});
> >
> >  var ts: Long = maprecord.get("ts").toLong
> >
> >   var tweetID:Long= maprecord.get("id").toLong
> >  val key=ts+"_"+tweetID;
> >
> >  val put = new Put(Bytes.toBytes(key))
> >  maprecord.foreach(kv => {
> >
> >
> >put.add(Bytes.toBytes("test"),Bytes.toBytes(kv._1),Bytes.toBytes(kv._2))
> >
> >
> >  })
> >
> >
> >  put
> >} else {
> >  null
> >}
> >  },
> >  false);
> >
> >ssc
> >
> >  }
> >}
> >
> >i am not able to retrieve from checkpoint after restart ,always get
> >Unable to getConfig from broadcast
> >
> >after 

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Ted Yu
Using your command, I did get:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single
(test-jar-with-dependencies) on project spark-streaming-mqtt_2.10: Failed
to create assembly: Error creating assembly archive
test-jar-with-dependencies: Problem creating jar:
jar:file:/home/hbase/spark-1.5.2/external/mqtt/target/spark-streaming-mqtt_2.10-1.5.2.jar!/org/apache/spark/streaming/mqtt/MQTTReceiver$$anon$1.class:
JAR entry org/apache/spark/streaming/mqtt/MQTTReceiver$$anon$1.class not
found in
/home/hbase/spark-1.5.2/external/mqtt/target/spark-streaming-mqtt_2.10-1.5.2.jar
-> [Help 1]

But the following command passed:

build/mvn -DskipTests=true -Dhadoop.version=2.4.1
-P"hadoop-2.4,kinesis-asl,netlib-lgpl" package

FYI

On Wed, Oct 28, 2015 at 7:38 AM, Bob Corsaro  wrote:

> Built from
> http://mirror.olnevhost.net/pub/apache/spark/spark-1.5.1/spark-1.5.1.tgz using
> the following command:
>
> build/mvn -DskipTests=true -Dhadoop.version=2.4.1
> -P"hadoop-2.4,kinesis-asl,netlib-lgpl" package install
>
> build/mvn is from the packaged source.
>
> Tried on a couple of ubuntu boxen and a gentoo box.
>
> On Wed, Oct 28, 2015 at 9:59 AM Ted Yu  wrote:
>
>> MQTTUtils.class is generated from
>> external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/MQTTUtils.scala
>>
>> What command did you use to build ?
>> Which release / branch were you building ?
>>
>> Thanks
>>
>> On Wed, Oct 28, 2015 at 6:19 AM, Bob Corsaro  wrote:
>>
>>> Has anyone successful built this? I'm trying to determine if there is a
>>> defect in the source package or something strange about my environment. I
>>> get a FileNotFound exception on MQTTUtils.class during the build of the
>>> MQTT module. The only work around I've found is to remove the MQTT modules
>>> from the pom.xml.
>>>
>>
>>


Spark/Kafka Streaming Job Gets Stuck

2015-10-28 Thread Afshartous, Nick

Hi, we are load testing our Spark 1.3 streaming (reading from Kafka)  job and 
seeing a problem.  This is running in AWS/Yarn and the streaming batch interval 
is set to 3 minutes and this is a ten node cluster.

Testing at 30,000 events per second we are seeing the streaming job get stuck 
(stack trace below) for over an hour.

Thanks for any insights or suggestions.
--
  Nick

org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapPartitionsToPair(JavaDStreamLike.scala:43)
com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.runStream(StreamingKafkaConsumerDriver.java:125)
com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.main(StreamingKafkaConsumerDriver.java:71)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)

Notice: This communication is for the intended recipient(s) only and may 
contain confidential, proprietary, legally protected or privileged information 
of Turbine, Inc. If you are not the intended recipient(s), please notify the 
sender at once and delete this communication. Unauthorized use of the 
information in this communication is strictly prohibited and may be unlawful. 
For those recipients under contract with Turbine, Inc., the information in this 
communication is subject to the terms and conditions of any applicable 
contracts or agreements.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Steve Loughran

> On 28 Oct 2015, at 13:19, Bob Corsaro  wrote:
> 
> Has anyone successful built this? I'm trying to determine if there is a 
> defect in the source package or something strange about my environment. I get 
> a FileNotFound exception on MQTTUtils.class during the build of the MQTT 
> module. The only work around I've found is to remove the MQTT modules from 
> the pom.xml.

I saw this last week, and believe that the problem is a race condition between the 
compiler and the maven-assembly-plugin zip file for Python tests; they can 
apparently start at the same time, and one fails because the files it has listed 
aren't quite there yet. SPARK-5155 would be the cause of this.

Fix: move the assembly code to its own profile and only invoke it if you want 
that feature

Com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Saif.A.Ellafi
Hi, just a couple of cents.

Are your joining columns StringTypes (the id field)? I have recently reported a bug
where I get inconsistent results when filtering String fields in group
operations.

Saif

From: Colin Alstad [mailto:colin.als...@pokitdok.com]
Sent: Wednesday, October 28, 2015 12:39 PM
To: user@spark.apache.org
Subject: Inconsistent Persistence of DataFrames in Spark 1.5

We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some 
inconsistent behavior in persisting DataFrames.

df1 = sqlContext.read.parquet(“df1.parquet”)
df1.count()
> 161,100,982

df2 = sqlContext.read.parquet(“df2.parquet”)
df2.count()
> 67,498,706

join_df = df1.join(df2, ‘id’)
join_df.count()
> 160,608,147

join_df.write.parquet(“join.parquet”)
join_parquet = sqlContext.read.parquet(“join.parquet”)
join_parquet.count()
> 67,698,892

join_df.write.json(“join.json”)
join_json = sqlContext.read.parquet(“join.json”)
join_son.count()
> 67,695,663

The first major issue is that there is an order of magnitude difference between 
the count of the join DataFrame and the persisted join DataFrame.  Secondly, 
persisting the same DataFrame into 2 different formats yields different results.

Does anyone have any idea on what could be going on here?

--
Colin Alstad
Data Scientist
colin.als...@pokitdok.com



SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-28 Thread Anfernee Xu
Hi,

I just want to understand the cost of DataFrame.registerTempTable(String):
is it just a trivial operation (like creating an object reference) in the
master (Driver) JVM? And can I have multiple tables with different names
referencing the same DataFrame?
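
For what it's worth, registerTempTable only records a name-to-plan mapping in
the driver-side catalog (no data is computed or copied by the call itself), and
the same DataFrame can be registered under several names - a small sketch, with
`df` assumed to be an existing DataFrame:

df.registerTempTable("events_a")
df.registerTempTable("events_b")   // same DataFrame, second name

// Both names resolve to the same underlying logical plan.
sqlContext.sql("SELECT COUNT(*) FROM events_a").show()
sqlContext.sql("SELECT COUNT(*) FROM events_b").show()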

Thanks

-- 
--Anfernee


Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Susan Zhang
Have you tried partitionBy?

Something like

hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  eventsDataFrame.write.mode(SaveMode.Append)
    .partitionBy("windows_event_time_bin").saveAsTable("windows_event")
})



On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey 
wrote:

> Hello.
>
> I am working to get a simple solution working using Spark SQL.  I am
> writing streaming data to persistent tables using a HiveContext.  Writing
> to a persistent non-partitioned table works well - I update the table using
> Spark streaming, and the output is available via Hive Thrift/JDBC.
>
> I create a table that looks like the following:
>
> 0: jdbc:hive2://localhost:1> describe windows_event;
> describe windows_event;
> +--+-+--+
> | col_name |  data_type  | comment  |
> +--+-+--+
> | target_entity| string  | NULL |
> | target_entity_type   | string  | NULL |
> | date_time_utc| timestamp   | NULL |
> | machine_ip   | string  | NULL |
> | event_id | string  | NULL |
> | event_data   | map  | NULL |
> | description  | string  | NULL |
> | event_record_id  | string  | NULL |
> | level| string  | NULL |
> | machine_name | string  | NULL |
> | sequence_number  | string  | NULL |
> | source   | string  | NULL |
> | source_machine_name  | string  | NULL |
> | task_category| string  | NULL |
> | user | string  | NULL |
> | additional_data  | map  | NULL |
> | windows_event_time_bin   | timestamp   | NULL |
> | # Partition Information  | |  |
> | # col_name   | data_type   | comment  |
> | windows_event_time_bin   | timestamp   | NULL |
> +--+-+--+
>
>
> However, when I create a partitioned table and write data using the
> following:
>
> hiveWindowsEvents.foreachRDD( rdd => {
>   val eventsDataFrame = rdd.toDF()
>
> eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
> })
>
> The data is written as though the table is not partitioned (so everything
> is written to /user/hive/warehouse/windows_event/file.gz.paquet.  Because
> the data is not following the partition schema, it is not accessible (and
> not partitioned).
>
> Is there a straightforward way to write to partitioned tables using Spark
> SQL?  I understand that the read performance for partitioned data is far
> better - are there other performance improvements that might be better to
> use instead of partitioning?
>
> Regards,
>
> Bryan Jeffrey
>


Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
Susan,

I did give that a shot -- I'm seeing a number of oddities:

(1) 'partitionBy' appears to only accept alphanumeric, lower-case field names. It
will work for 'machinename', but not 'machineName' or 'machine_name' (see the
sketch below).
(2) When partitioning with maps included in the data I get odd string
conversion issues.
(3) When partitioning without maps I see frequent out-of-memory issues.
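
A rough, untested sketch of one way around (1), assuming it really is the column
names being rejected: lower-case the DataFrame's columns before writing.

val lowered = eventsDataFrame.columns.foldLeft(eventsDataFrame) { (df, c) =>
  df.withColumnRenamed(c, c.toLowerCase)
}
lowered.write
  .mode(SaveMode.Append)
  .partitionBy("windowseventtimebin")
  .saveAsTable("windows_event")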

I'll update this email when I've got a more concrete example of problems.

Regards,

Bryan Jeffrey



On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang  wrote:

> Have you tried partitionBy?
>
> Something like
>
> hiveWindowsEvents.foreachRDD( rdd => {
>   val eventsDataFrame = rdd.toDF()
>   eventsDataFrame.write.mode(SaveMode.Append).partitionBy("
> windows_event_time_bin").saveAsTable("windows_event")
> })
>
>
>
> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am working to get a simple solution working using Spark SQL.  I am
>> writing streaming data to persistent tables using a HiveContext.  Writing
>> to a persistent non-partitioned table works well - I update the table using
>> Spark streaming, and the output is available via Hive Thrift/JDBC.
>>
>> I create a table that looks like the following:
>>
>> 0: jdbc:hive2://localhost:1> describe windows_event;
>> describe windows_event;
>> +--+-+--+
>> | col_name |  data_type  | comment  |
>> +--+-+--+
>> | target_entity| string  | NULL |
>> | target_entity_type   | string  | NULL |
>> | date_time_utc| timestamp   | NULL |
>> | machine_ip   | string  | NULL |
>> | event_id | string  | NULL |
>> | event_data   | map  | NULL |
>> | description  | string  | NULL |
>> | event_record_id  | string  | NULL |
>> | level| string  | NULL |
>> | machine_name | string  | NULL |
>> | sequence_number  | string  | NULL |
>> | source   | string  | NULL |
>> | source_machine_name  | string  | NULL |
>> | task_category| string  | NULL |
>> | user | string  | NULL |
>> | additional_data  | map  | NULL |
>> | windows_event_time_bin   | timestamp   | NULL |
>> | # Partition Information  | |  |
>> | # col_name   | data_type   | comment  |
>> | windows_event_time_bin   | timestamp   | NULL |
>> +--+-+--+
>>
>>
>> However, when I create a partitioned table and write data using the
>> following:
>>
>> hiveWindowsEvents.foreachRDD( rdd => {
>>   val eventsDataFrame = rdd.toDF()
>>
>> eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>> })
>>
>> The data is written as though the table is not partitioned (so everything
>> is written to /user/hive/warehouse/windows_event/file.gz.paquet.  Because
>> the data is not following the partition schema, it is not accessible (and
>> not partitioned).
>>
>> Is there a straightforward way to write to partitioned tables using Spark
>> SQL?  I understand that the read performance for partitioned data is far
>> better - are there other performance improvements that might be better to
>> use instead of partitioning?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>
>


Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
All,

One issue I'm seeing is that I start the thrift server (for jdbc access)
via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
spark://master:7077 --hiveconf "spark.cores.max=2"

After about 40 seconds the Thrift server is started and available on the
default port (10000).

I then submit my application - and the application throws the following
error:

Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
with class loader
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721, see
the next exception for details.
at
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)
at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
Source)
... 86 more
Caused by: java.sql.SQLException: Another instance of Derby may have
already booted the database /spark/spark-1.4.1/metastore_db.
at
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)
at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
Source)
at
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown
Source)
... 83 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted
the database /spark/spark-1.4.1/metastore_db.

This also happens if I do the opposite (submit the application first, and
then start the thrift server).

It looks similar to the following issue -- but not quite the same:
https://issues.apache.org/jira/browse/SPARK-9776

It seems like this set of steps works fine if the metadata database is not
yet created - but once it's created this happens every time.  Is this a
known issue? Is there a workaround?

Regards,

Bryan Jeffrey

On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey 
wrote:

> Susan,
>
> I did give that a shot -- I'm seeing a number of oddities:
>
> (1) 'Partition By' appears only accepts alphanumeric lower case fields.
> It will work for 'machinename', but not 'machineName' or 'machine_name'.
> (2) When partitioning with maps included in the data I get odd string
> conversion issues
> (3) When partitioning without maps I see frequent out of memory issues
>
> I'll update this email when I've got a more concrete example of problems.
>
> Regards,
>
> Bryan Jeffrey
>
>
>
> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang  wrote:
>
>> Have you tried partitionBy?
>>
>> Something like
>>
>> hiveWindowsEvents.foreachRDD( rdd => {
>>   val eventsDataFrame = rdd.toDF()
>>   eventsDataFrame.write.mode(SaveMode.Append).partitionBy("
>> windows_event_time_bin").saveAsTable("windows_event")
>> })
>>
>>
>>
>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey 
>> wrote:
>>
>>> Hello.
>>>
>>> I am working to get a simple solution working using Spark SQL.  I am
>>> writing streaming data to persistent tables using a HiveContext.  Writing
>>> to a persistent non-partitioned table works well - I update the table using
>>> Spark streaming, and the output is available via Hive Thrift/JDBC.
>>>
>>> I create a table that looks like the following:
>>>
>>> 0: jdbc:hive2://localhost:1> describe windows_event;
>>> describe windows_event;
>>> +--+-+--+
>>> | col_name |  data_type  | comment  |
>>> +--+-+--+
>>> | target_entity| string  | NULL |
>>> | target_entity_type   | string  | NULL |
>>> | date_time_utc| timestamp   | NULL |
>>> | machine_ip   | string  | NULL |
>>> | event_id | string  | NULL |
>>> | event_data   | map  | NULL |
>>> | description  | string  | NULL |
>>> | event_record_id  | string  | NULL |
>>> | level| string  | NULL |
>>> | machine_name | string  | NULL |
>>> | sequence_number  | string  | NULL |
>>> | source   | string  | NULL |
>>> | source_machine_name  | string  | NULL |
>>> | task_category| string  | NULL |
>>> | user | string  | NULL |
>>> | additional_data  | map  | NULL |
>>> | windows_event_time_bin   | timestamp   | NULL |
>>> | # Partition Information  | |  |
>>> | # col_name   | data_type   | comment  |
>>> | windows_event_time_bin   | timestamp   | NULL |
>>> +--+-+--+
>>>
>>>
>>> However, when I create a partitioned 

Re: Spark Core Transitive Dependencies

2015-10-28 Thread Deng Ching-Mallete
Hi Furkan,

A few examples of libraries that we include are joda time, hbase libraries
and spark-kafka (for streaming). We use the maven-assembly-plugin to build
our assembly jar, btw.

Thanks,
Deng

On Wed, Oct 28, 2015 at 9:10 PM, Furkan KAMACI 
wrote:

> Hi Deng,
>
> Could you give an example of which libraries you include for your purpose?
>
> Kind Regards,
> Furkan KAMACI
>
> On Wed, Oct 28, 2015 at 4:07 AM, Deng Ching-Mallete 
> wrote:
>
>> Hi,
>>
>> The spark assembly jar already includes the spark core libraries plus
>> their transitive dependencies, so you don't need to include them in your
>> jar. I found it easier to use inclusions instead of exclusions when
>> creating an assembly jar of my spark job so I would recommend going with
>> that.
>>
>> HTH,
>> Deng
>>
>>
>> On Wed, Oct 28, 2015 at 6:20 AM, Furkan KAMACI 
>> wrote:
>>
>>> Hi,
>>>
>>> I use Spark for for its newAPIHadoopRDD method and map/reduce etc.
>>> tasks. When I include it I see that it has many transitive dependencies.
>>>
>>> Which of them I should exclude? I've included the dependency tree of
>>> spark-core. Is there any documentation that explains why they are needed
>>> (maybe all of them are necessary?)
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> PS: Dependency Tree:
>>>
>>> [INFO] +- org.apache.spark:spark-core_2.10:jar:1.4.1:compile
>>> [INFO] |  +- com.twitter:chill_2.10:jar:0.5.0:compile
>>> [INFO] |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
>>> [INFO] |  | +-
>>> com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
>>> [INFO] |  | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
>>> [INFO] |  | \- org.objenesis:objenesis:jar:1.2:compile
>>> [INFO] |  +- com.twitter:chill-java:jar:0.5.0:compile
>>> [INFO] |  +- org.apache.spark:spark-launcher_2.10:jar:1.4.1:compile
>>> [INFO] |  +- org.apache.spark:spark-network-common_2.10:jar:1.4.1:compile
>>> [INFO] |  +-
>>> org.apache.spark:spark-network-shuffle_2.10:jar:1.4.1:compile
>>> [INFO] |  +- org.apache.spark:spark-unsafe_2.10:jar:1.4.1:compile
>>> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>>> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
>>> [INFO] |  |  \- org.apache.curator:curator-framework:jar:2.4.0:compile
>>> [INFO] |  | \- org.apache.curator:curator-client:jar:2.4.0:compile
>>> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.3.2:compile
>>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:compile
>>> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
>>> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.6.6:compile
>>> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.6.6:compile
>>> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:compile
>>> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.2.0:compile
>>> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
>>> [INFO] |  +- commons-net:commons-net:jar:2.2:compile
>>> [INFO] |  +-
>>> org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
>>> [INFO] |  |  +-
>>> org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
>>> [INFO] |  |  |  \- com.typesafe:config:jar:1.2.1:compile
>>> [INFO] |  |  +-
>>> org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
>>> [INFO] |  |  \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
>>> [INFO] |  +-
>>> org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
>>> [INFO] |  +- org.scala-lang:scala-library:jar:2.10.4:compile
>>> [INFO] |  +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
>>> [INFO] |  |  \- org.json4s:json4s-core_2.10:jar:3.2.10:compile
>>> [INFO] |  | +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
>>> [INFO] |  | \- org.scala-lang:scalap:jar:2.10.0:compile
>>> [INFO] |  |\- org.scala-lang:scala-compiler:jar:2.10.0:compile
>>> [INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
>>> [INFO] |  |  \- asm:asm:jar:3.1:compile
>>> [INFO] |  +- com.sun.jersey:jersey-core:jar:1.9:compile
>>> [INFO] |  +- org.apache.mesos:mesos:jar:shaded-protobuf:0.21.1:compile
>>> [INFO] |  +- io.netty:netty-all:jar:4.0.23.Final:compile
>>> [INFO] |  +- com.clearspring.analytics:stream:jar:2.7.0:compile
>>> [INFO] |  +- io.dropwizard.metrics:metrics-core:jar:3.1.0:compile
>>> [INFO] |  +- io.dropwizard.metrics:metrics-jvm:jar:3.1.0:compile
>>> [INFO] |  +- io.dropwizard.metrics:metrics-json:jar:3.1.0:compile
>>> [INFO] |  +- io.dropwizard.metrics:metrics-graphite:jar:3.1.0:compile
>>> [INFO] |  +-
>>> com.fasterxml.jackson.core:jackson-databind:jar:2.4.4:compile
>>> [INFO] |  |  +-
>>> com.fasterxml.jackson.core:jackson-annotations:jar:2.4.0:compile
>>> [INFO] |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.4.4:compile
>>> [INFO] |  +-
>>> com.fasterxml.jackson.module:jackson-module-scala_2.10:jar:2.4.4:compile
>>> [INFO] |  |  \- org.scala-lang:scala-reflect:jar:2.10.4:compile
>>> [INFO] |  +- org.apache.ivy:ivy:jar:2.4.0:compile
>>> [INFO] |  +- 

NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-28 Thread Zhang, Jingyu
It is not a problem to use JavaRDD.cache() for 200 MB of data (all objects read
from JSON format). But when I try to use DataFrame.cache(), it throws the
exception below.

My machine can cache 1 GB of data in Avro format without any problem.

15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms

15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
27.832369 ms

15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)

java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at
org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
SQLContext.scala:500)

at
org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
SQLContext.scala:500)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.IndexedSeqOptimized$class.foreach(
IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
SQLContext.scala:500)

at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
SQLContext.scala:498)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:127)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:120)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
localhost): java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)


Thanks,


Jingyu



Re: How to implement zipWithIndex as a UDF?

2015-10-28 Thread Benyi Wang
Thanks Michael.

I should make my question clearer. This is the data type:

StructType(Seq(
  StructField("uid", LongType),
  StructField("infos", ArrayType(
    StructType(Seq(
      StructField("cid", LongType),
      StructField("cnt", LongType)
    ))
  ))
))

I want to explode “infos” to get three columns “uid”, “index” and “info”.
The only way I figured out is to explode the whole nested data type into a
tuple of primitive data types like this:

df.explode("infos") { (r: Row) =>
  val arr = r.getSeq[Row](0)
  arr.zipWithIndex.map {
    case (info, idx) =>
      (idx, info.getLong(0), info.getLong(1))
  }
}

What I really want is to keep info as a struct type:

df.explode("infos") { (r: Row) =>
  val arr = r.getSeq[Row](0)
  arr.zipWithIndex.map {
    case (info, idx) =>
      (idx, info)
  }
}

Unfortunately the current DataFrame API doesn’t support it: the explode
methods try to figure out the schema for the exploded data, but cannot
handle Any or Row types via reflection, and the caller has no way to pass
through a schema for the exploded data.
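
One workaround (a rough sketch, not something from this thread): do the
zipWithIndex at the RDD level and hand an explicit schema back to
createDataFrame, which lets “info” stay a struct. The schema below mirrors the
one given above; df and sqlContext are assumed from the snippet.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val infoType = StructType(Seq(
  StructField("cid", LongType),
  StructField("cnt", LongType)))

val explodedSchema = StructType(Seq(
  StructField("uid", LongType),
  StructField("index", IntegerType),
  StructField("info", infoType)))

// Explode at the RDD level so the schema can be supplied explicitly.
val explodedRows = df.select("uid", "infos").rdd.flatMap { row =>
  val uid = row.getLong(0)
  row.getSeq[Row](1).zipWithIndex.map { case (info, idx) => Row(uid, idx, info) }
}

val exploded = sqlContext.createDataFrame(explodedRows, explodedSchema)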
​

On Fri, Oct 23, 2015 at 12:44 PM, Michael Armbrust 
wrote:

> The user facing type mapping is documented here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
>
> On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang 
> wrote:
>
>> If I have two columns
>>
>> StructType(Seq(
>>   StructField("id", LongType),
>>   StructField("phones", ArrayType(StringType
>>
>> I want to add index for “phones” before I explode it.
>>
>> Can this be implemented as GenericUDF?
>>
>> I tried DataFrame.explode. It worked for simple types like string, but I
>> could not figure out how to handle a nested type like StructType.
>>
>> Can somebody shed a light?
>>
>> I’m using spark 1.5.1.
>> ​
>>
>
>


newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
Hi



I just created a new cluster using the spark-ec2 script from the
spark-1.5.1-bin-hadoop2.6 distribution. The master and slaves seem to be up
and running. I am having a heck of a time figuring out how to submit apps. As
a test I compiled the sample JavaSparkPi example. I have copied my jar file
to the master and want to run the application in cluster mode. My real app
will take a long time to complete. I do not want to wait around.



Any idea what the issue is?



Kind regards



Andy





http://spark.apache.org/docs/latest/submitting-applications.html


The following command works fine on my Mac; however, when I run it on my
master I get the following warning. The app works correctly.

[ec2-user@ip-172-31-29-60 ~]$ $SPARK_ROOT/bin/spark-submit --class
org.apache.spark.examples.JavaSparkPi --master local[4]
sparkPi-1.0-SNAPSHOT.jar 2>&1 | tee pi.out

15/10/28 21:07:10 INFO spark.SparkContext: Running Spark version 1.5.1

15/10/28 21:07:11 WARN spark.SparkConf:

SPARK_WORKER_INSTANCES was detected (set to '1').

This is deprecated in Spark 1.0+.



Please instead use:

 - ./spark-submit with --num-executors to specify the number of executors

 - Or set SPARK_EXECUTOR_INSTANCES

 - spark.executor.instances to configure the number of instances in the
spark config.



Adding --num-executors I still get the same warning. The app works correctly.



 $SPARK_ROOT/bin/spark-submit --class org.apache.spark.examples.JavaSparkPi
--master local[4] --num-executors 4 sparkPi-1.0-SNAPSHOT.jar 2>&1 | tee
pi.numExecutor4.out

15/10/28 21:09:41 INFO spark.SparkContext: Running Spark version 1.5.1

15/10/28 21:09:41 WARN spark.SparkConf:

SPARK_WORKER_INSTANCES was detected (set to '1').

This is deprecated in Spark 1.0+.



Please instead use:

 - ./spark-submit with --num-executors to specify the number of executors

 - Or set SPARK_EXECUTOR_INSTANCES

 - spark.executor.instances to configure the number of instances in the
spark config.



I also tried variations on [ec2-user@ip-172-31-29-60 ~]$
$SPARK_ROOT/bin/spark-submit --class org.apache.spark.examples.JavaSparkPi
--master spark://172.31.29.60:7077 --num-executors 4
sparkPi-1.0-SNAPSHOT.jar

15/10/28 21:14:48 INFO spark.SparkContext: Running Spark version 1.5.1

15/10/28 21:14:48 WARN spark.SparkConf:

SPARK_WORKER_INSTANCES was detected (set to '1').

This is deprecated in Spark 1.0+.



Please instead use:

 - ./spark-submit with --num-executors to specify the number of executors

 - Or set SPARK_EXECUTOR_INSTANCES

 - spark.executor.instances to configure the number of instances in the
spark config.



15/10/28 21:14:48 INFO spark.SecurityManager: Changing view acls to:
ec2-user

15/10/28 21:14:48 INFO spark.SecurityManager: Changing modify acls to:
ec2-user

15/10/28 21:14:48 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(ec2-user); users with modify permissions: Set(ec2-user)

15/10/28 21:14:49 INFO slf4j.Slf4jLogger: Slf4jLogger started

15/10/28 21:14:49 INFO Remoting: Starting remoting

15/10/28 21:14:50 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@172.31.29.60:52405]

15/10/28 21:14:50 INFO util.Utils: Successfully started service
'sparkDriver' on port 52405.

15/10/28 21:14:50 INFO spark.SparkEnv: Registering MapOutputTracker

15/10/28 21:14:50 INFO spark.SparkEnv: Registering BlockManagerMaster

15/10/28 21:14:50 INFO storage.DiskBlockManager: Created local directory at
/mnt/spark/blockmgr-e6197751-e3a2-40b7-8228-3512ffe2b69d

15/10/28 21:14:50 INFO storage.DiskBlockManager: Created local directory at
/mnt2/spark/blockmgr-9547279f-c011-44e2-9c6e-295f6b36b084

15/10/28 21:14:50 INFO storage.MemoryStore: MemoryStore started with
capacity 530.0 MB

15/10/28 21:14:50 INFO spark.HttpFileServer: HTTP File server directory is
/mnt/spark/spark-60c478cd-7adb-4d92-96e4-aad52eaaf8bf/httpd-71c01fdc-0e5f-4a
73-83f0-bac856bc3548

15/10/28 21:14:50 INFO spark.HttpServer: Starting HTTP Server

15/10/28 21:14:50 INFO server.Server: jetty-8.y.z-SNAPSHOT

15/10/28 21:14:50 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:48262

15/10/28 21:14:50 INFO util.Utils: Successfully started service 'HTTP file
server' on port 48262.

15/10/28 21:14:50 INFO spark.SparkEnv: Registering OutputCommitCoordinator

15/10/28 21:14:50 INFO server.Server: jetty-8.y.z-SNAPSHOT

15/10/28 21:14:50 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040

15/10/28 21:14:50 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.

15/10/28 21:14:50 INFO ui.SparkUI: Started SparkUI at
http://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:4040

15/10/28 21:14:50 INFO spark.SparkContext: Added JAR
file:/home/ec2-user/sparkPi-1.0-SNAPSHOT.jar at
http://172.31.29.60:48262/jars/sparkPi-1.0-SNAPSHOT.jar with timestamp
1446066890783

15/10/28 21:14:50 WARN metrics.MetricsSystem: Using default name
DAGScheduler for source because spark.app.id is 

Re: newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
I forgot to mention. I do not have a preference for the cluster manager. I
chose the spark-1.5.1-bin-hadoop2.6 distribution because I want to use
HDFS. I assumed this distribution would use YARN.

Thanks

Andy

From:  Andrew Davidson 
Date:  Wednesday, October 28, 2015 at 2:37 PM
To:  "user@spark.apache.org" 
Subject:  newbie trouble submitting java app to AWS cluster I created using
spark-ec2  script from spark-1.5.1-bin-hadoop2.6 distribution

> Hi
> 
> 
> 
> I just created new cluster using the spark-c2 script from the
> spark-1.5.1-bin-hadoop2.6 distribution. The master and slaves seem to be up
> and running. I am having a heck of time figuring out how to submit apps. As a
> test I compile the sample JavaSparkPi example. I have copied my jar file to
> the master and want to run the application in cluster mode. My real app will
> take a long time to complete. I do not want to wait around.
> 
> 
> 
> Any idea what the issue is?
> 
> 
> 
> Kind regards
> 
> 
> 
> Andy
> 
> 
> 
> 
> 
> http://spark.apache.org/docs/latest/submitting-applications.html
> 
> 
> The following command works fine on my Mac, how ever when I run it on my
> master I get the following warning. The app works correctly
> 
> [ec2-user@ip-172-31-29-60 ~]$ $SPARK_ROOT/bin/spark-submit --class
> org.apache.spark.examples.JavaSparkPi --master local[4]
> sparkPi-1.0-SNAPSHOT.jar 2>&1 | tee pi.out
> 
> 15/10/28 21:07:10 INFO spark.SparkContext: Running Spark version 1.5.1
> 
> 15/10/28 21:07:11 WARN spark.SparkConf:
> 
> SPARK_WORKER_INSTANCES was detected (set to '1').
> 
> This is deprecated in Spark 1.0+.
> 
> 
> 
> Please instead use:
> 
>  - ./spark-submit with --num-executors to specify the number of executors
> 
>  - Or set SPARK_EXECUTOR_INSTANCES
> 
>  - spark.executor.instances to configure the number of instances in the spark
> config.
> 
> 
> 
> Adding --num-executors I still get the same warning. The app works correctly
> 
> 
> 
>  $SPARK_ROOT/bin/spark-submit --class org.apache.spark.examples.JavaSparkPi
> --master local[4] --num-executors 4 sparkPi-1.0-SNAPSHOT.jar 2>&1 | tee
> pi.numExecutor4.out
> 
> 15/10/28 21:09:41 INFO spark.SparkContext: Running Spark version 1.5.1
> 
> 15/10/28 21:09:41 WARN spark.SparkConf:
> 
> SPARK_WORKER_INSTANCES was detected (set to '1').
> 
> This is deprecated in Spark 1.0+.
> 
> 
> 
> Please instead use:
> 
>  - ./spark-submit with --num-executors to specify the number of executors
> 
>  - Or set SPARK_EXECUTOR_INSTANCES
> 
>  - spark.executor.instances to configure the number of instances in the spark
> config.
> 
> 
> 
> I also tried variations on [ec2-user@ip-172-31-29-60 ~]$
> $SPARK_ROOT/bin/spark-submit --class org.apache.spark.examples.JavaSparkPi
> --master spark://172.31.29.60:7077 --num-executors 4 sparkPi-1.0-SNAPSHOT.jar
> 
> 15/10/28 21:14:48 INFO spark.SparkContext: Running Spark version 1.5.1
> 
> 15/10/28 21:14:48 WARN spark.SparkConf:
> 
> SPARK_WORKER_INSTANCES was detected (set to '1').
> 
> This is deprecated in Spark 1.0+.
> 
> 
> 
> Please instead use:
> 
>  - ./spark-submit with --num-executors to specify the number of executors
> 
>  - Or set SPARK_EXECUTOR_INSTANCES
> 
>  - spark.executor.instances to configure the number of instances in the spark
> config.
> 
> 
> 
> 15/10/28 21:14:48 INFO spark.SecurityManager: Changing view acls to: ec2-user
> 
> 15/10/28 21:14:48 INFO spark.SecurityManager: Changing modify acls to:
> ec2-user
> 
> 15/10/28 21:14:48 INFO spark.SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(ec2-user); users
> with modify permissions: Set(ec2-user)
> 
> 15/10/28 21:14:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 
> 15/10/28 21:14:49 INFO Remoting: Starting remoting
> 
> 15/10/28 21:14:50 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@172.31.29.60:52405]
> 
> 15/10/28 21:14:50 INFO util.Utils: Successfully started service 'sparkDriver'
> on port 52405.
> 
> 15/10/28 21:14:50 INFO spark.SparkEnv: Registering MapOutputTracker
> 
> 15/10/28 21:14:50 INFO spark.SparkEnv: Registering BlockManagerMaster
> 
> 15/10/28 21:14:50 INFO storage.DiskBlockManager: Created local directory at
> /mnt/spark/blockmgr-e6197751-e3a2-40b7-8228-3512ffe2b69d
> 
> 15/10/28 21:14:50 INFO storage.DiskBlockManager: Created local directory at
> /mnt2/spark/blockmgr-9547279f-c011-44e2-9c6e-295f6b36b084
> 
> 15/10/28 21:14:50 INFO storage.MemoryStore: MemoryStore started with capacity
> 530.0 MB
> 
> 15/10/28 21:14:50 INFO spark.HttpFileServer: HTTP File server directory is
> /mnt/spark/spark-60c478cd-7adb-4d92-96e4-aad52eaaf8bf/httpd-71c01fdc-0e5f-4a73
> -83f0-bac856bc3548
> 
> 15/10/28 21:14:50 INFO spark.HttpServer: Starting HTTP Server
> 
> 15/10/28 21:14:50 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 
> 15/10/28 21:14:50 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:48262
> 
> 15/10/28 21:14:50 

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
The second issue I'm seeing is an OOM issue when writing partitioned data.
I am running Spark 1.4.1, Scala 2.11, Hadoop 2.6.1 & using the Hive
libraries packaged with Spark.  Spark was compiled using the following:
mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive
-Phive-thriftserver package

Given a case class like the following:

case class HiveWindowsEvent(
 targetEntity: String,
 targetEntityType: String,
 dateTimeUtc: Timestamp,
 eventid: String,
 eventData: Map[String, String],
 description: String,
 eventRecordId: String,
 level: String,
 machineName: String,
 sequenceNumber: String,
 source: String,
 sourceMachineName: String,
 taskCategory: String,
 user: String,
 machineIp: String,
 additionalData: Map[String, String],
 windowseventtimebin: Long
 )

The command to write data works fine (and when queried via Beeline data is
correct):

val hc = new HiveContext(sc)
import hc.implicits._

val partitioner = new HashPartitioner(5)
hiveWindowsEvents.foreachRDD(rdd => {
  val eventsDF = rdd.toDF()
  eventsDF
.write
.mode(SaveMode.Append).saveAsTable("windows_event9")
})

Once I add the partitioning (few partitions - three or less):

val hc = new HiveContext(sc)
import hc.implicits._

val partitioner = new HashPartitioner(5)
hiveWindowsEvents.foreachRDD(rdd => {
  val eventsDF = rdd.toDF()
  eventsDF
.write
.partitionBy("windowseventtimebin")
.mode(SaveMode.Append).saveAsTable("windows_event9")
})

I see the following error when writing to (3) partitions:

15/10/28 20:23:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
10.0.0.6): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org
$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:270)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
at
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
at
parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
at
parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
at
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:48)
at
parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
at
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
at
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:178)
at
parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at
parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
at
parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at
org.apache.spark.sql.parquet.ParquetOutputWriter.(newParquet.scala:83)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
at
org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$outputWriterForRow$1.apply(commands.scala:530)
at
org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$outputWriterForRow$1.apply(commands.scala:525)
at

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-28 Thread Adrian Tanase
Does it work as expected with a smaller batch or a smaller load? Could it be that 
it's accumulating too many events over the 3-minute interval?

You could also try increasing the parallelism via repartition to ensure smaller 
tasks that can safely fit in working memory.
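
Something along these lines (a sketch only; "messages" and the partition count
are placeholders, not taken from the original job):

// Spread each batch over more, smaller tasks before the heavy work.
val repartitioned = messages.repartition(40)  // e.g. roughly 2-4x the total executor cores
repartitioned.foreachRDD { rdd =>
  // existing per-batch processing goes here
}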

Sent from my iPhone

> On 28 Oct 2015, at 17:45, Afshartous, Nick  wrote:
> 
> 
> Hi, we are load testing our Spark 1.3 streaming (reading from Kafka)  job and 
> seeing a problem.  This is running in AWS/Yarn and the streaming batch 
> interval is set to 3 minutes and this is a ten node cluster.
> 
> Testing at 30,000 events per second we are seeing the streaming job get stuck 
> (stack trace below) for over an hour.
> 
> Thanks on any insights or suggestions.
> --
>  Nick
> 
> org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapPartitionsToPair(JavaDStreamLike.scala:43)
> com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.runStream(StreamingKafkaConsumerDriver.java:125)
> com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.main(StreamingKafkaConsumerDriver.java:71)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan,

Did you read the email I sent a few days ago? There are more issues with 
partitionBy down the road: 
https://www.mail-archive.com/user@spark.apache.org/msg39512.html 


Best Regards,

Jerry

> On Oct 28, 2015, at 4:52 PM, Bryan Jeffrey  wrote:
> 
> The second issue I'm seeing is an OOM issue when writing partitioned data.  I 
> am running Spark 1.4.1, Scala 2.11, Hadoop 2.6.1 & using the Hive libraries 
> packaged with Spark.  Spark was compiled using the following:  mvn 
> -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive 
> -Phive-thriftserver package
> 
> Given a case class like the following:
> 
> case class HiveWindowsEvent(
>  targetEntity: String,
>  targetEntityType: String,
>  dateTimeUtc: Timestamp,
>  eventid: String,
>  eventData: Map[String, String],
>  description: String,
>  eventRecordId: String,
>  level: String,
>  machineName: String,
>  sequenceNumber: String,
>  source: String,
>  sourceMachineName: String,
>  taskCategory: String,
>  user: String,
>  machineIp: String,
>  additionalData: Map[String, String],
>  windowseventtimebin: Long
>  )
> 
> The command to write data works fine (and when queried via Beeline data is 
> correct):
> 
> val hc = new HiveContext(sc)
> import hc.implicits._
> 
> val partitioner = new HashPartitioner(5)
> hiveWindowsEvents.foreachRDD(rdd => {
>   val eventsDF = rdd.toDF()
>   eventsDF
> .write
> .mode(SaveMode.Append).saveAsTable("windows_event9")
> })
> 
> Once I add the partitioning (few partitions - three or less):
> 
> val hc = new HiveContext(sc)
> import hc.implicits._
> 
> val partitioner = new HashPartitioner(5)
> hiveWindowsEvents.foreachRDD(rdd => {
>   val eventsDF = rdd.toDF()
>   eventsDF
> .write
> .partitionBy("windowseventtimebin")
> .mode(SaveMode.Append).saveAsTable("windows_event9")
> })
> 
> I see the following error when writing to (3) partitions:
> 
> 15/10/28 20:23:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 
> 10.0.0.6): org.apache.spark.SparkException: Task failed while writing rows.
> at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org 
> $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:270)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at 
> parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
> at 
> parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
> at 
> parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
> at 
> parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:48)
> at 
> parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
> at 
> parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
> at 
> parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at 
> parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:178)
> at 
> parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at 
> parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at 
> parquet.hadoop.InternalParquetRecordWriter.(InternalParquetRecordWriter.java:94)
> at 
> parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:64)
> at 
> 

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
Jerry,

Thank you for the note. It sounds like you were able to get further than I have 
been - any insight? Is it just a Spark 1.4.1 vs. Spark 1.5 difference?

Regards,

Bryan Jeffrey

-Original Message-
From: "Jerry Lam" 
Sent: ‎10/‎28/‎2015 6:29 PM
To: "Bryan Jeffrey" 
Cc: "Susan Zhang" ; "user" 
Subject: Re: Spark -- Writing to Partitioned Persistent Table

Hi Bryan,


Did you read the email I sent few days ago. There are more issues with 
partitionBy down the road: 
https://www.mail-archive.com/user@spark.apache.org/msg39512.html


Best Regards,


Jerry


On Oct 28, 2015, at 4:52 PM, Bryan Jeffrey  wrote:


The second issue I'm seeing is an OOM issue when writing partitioned data.  I 
am running Spark 1.4.1, Scala 2.11, Hadoop 2.6.1 & using the Hive libraries 
packaged with Spark.  Spark was compiled using the following:  mvn 
-Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive 
-Phive-thriftserver package


Given a case class like the following:


case class HiveWindowsEvent(
 targetEntity: String,
 targetEntityType: String,
 dateTimeUtc: Timestamp,
 eventid: String,
 eventData: Map[String, String],
 description: String,
 eventRecordId: String,
 level: String,
 machineName: String,
 sequenceNumber: String,
 source: String,
 sourceMachineName: String,
 taskCategory: String,
 user: String,
 machineIp: String,
 additionalData: Map[String, String],
 windowseventtimebin: Long
 )


The command to write data works fine (and when queried via Beeline data is 
correct):


val hc = new HiveContext(sc)
import hc.implicits._


val partitioner = new HashPartitioner(5)
hiveWindowsEvents.foreachRDD(rdd => {
  val eventsDF = rdd.toDF()
  eventsDF
.write
.mode(SaveMode.Append).saveAsTable("windows_event9")
})


Once I add the partitioning (few partitions - three or less):


val hc = new HiveContext(sc)
import hc.implicits._


val partitioner = new HashPartitioner(5)
hiveWindowsEvents.foreachRDD(rdd => {
  val eventsDF = rdd.toDF()
  eventsDF
.write
.partitionBy("windowseventtimebin")
.mode(SaveMode.Append).saveAsTable("windows_event9")
})


I see the following error when writing to (3) partitions:


15/10/28 20:23:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 
10.0.0.6): org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:270)
at 
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
at 
org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
at 
parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
at 
parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
at 
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
at 
parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:48)
at 
parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
at 
parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
at 
parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at 
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.(MessageColumnIO.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at 
parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at 

How to check whether my Spark Jobs are parallelized or not

2015-10-28 Thread Vinoth Sankar
Hi,

I'm reading a large number of files (mostly in the thousands) and filtering them
through Spark based on some criteria, running the Spark application with two
workers (4 cores each). I enforced parallelism by calling
*sparkContext.parallelize(fileList)* in my Java code, but didn't see any
performance improvement, and I always see "Active Jobs" as 1 in the Spark UI.
Am I missing anything? How do I check whether my Spark jobs are parallelized
or not?
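
A quick way to check is to give parallelize an explicit slice count and then
confirm the partition count, plus the number of tasks per stage in the UI. A
sketch in Scala (the Java API is analogous; matchesCriteria is a placeholder
for the filtering logic):

val files = sc.parallelize(fileList, 16)  // 16 partitions -> up to 16 concurrent tasks
println(files.partitions.length)          // confirm how many partitions were created

val matched = files.filter(path => matchesCriteria(path))
println(matched.count())                  // the action is what actually runs as parallel tasks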

Regards
Vinoth Sankar


Re: [Spark Streaming] Why are some uncached RDDs are growing?

2015-10-28 Thread Tathagata Das
UpdateStateByKey automatically caches its RDDs.
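
For reference, a minimal sketch of what that looks like (the checkpoint path and
key/value types are placeholders); the state DStream produced by updateStateByKey
is persisted and checkpointed for you:

ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // required for stateful operations

val counts = pairs.updateStateByKey[Long] { (values: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + values.sum)
}
// "counts" will show up in the Storage tab even though persist() was never called on it.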

On Tue, Oct 27, 2015 at 8:05 AM, diplomatic Guru 
wrote:

>
> Hello All,
>
> When I checked my running Stream job on WebUI, I can see that some RDDs
> are being listed that were not requested to be cached. What more is that
> they are growing! I've not asked them to be cached. What are they? Are they
> the state (UpdateStateByKey)?
>
> Only the rows in white are being requested to be cached. But where are the
> RDDs  that are highlighted in yellow are from?
>
>
>
> ​
>


Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
Hey,

Is there some kind of "explain" feature implemented in mllib for the
algorithms based on tree ensembles?
Some method to which you would feed in a single feature vector and it would
return/print what features contributed to the decision or how much each
feature contributed "negatively" and "positively" to the decision.

This can be very useful to debug a model on some specific samples and for
feature engineering.

Thanks,
Eugen


Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Yanbo Liang
Spark ML/MLlib has provided featureImportances to estimate the importance of
each feature.
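
A minimal sketch against the 1.5 ML pipeline API (column names are placeholders).
Note that this gives global importances across the whole model, not a per-sample
explanation:

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)

val model = rf.fit(trainingDF)     // trainingDF: DataFrame with "label" and "features"
println(model.featureImportances)  // Vector of per-feature importance scores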

2015-10-28 18:29 GMT+08:00 Eugen Cepoi :

> Hey,
>
> Is there some kind of "explain" feature implemented in mllib for the
> algorithms based on tree ensembles?
> Some method to which you would feed in a single feature vector and it
> would return/print what features contributed to the decision or how much
> each feature contributed "negatively" and "positively" to the decision.
>
> This can be very useful to debug a model on some specific samples and for
> feature engineering.
>
> Thanks,
> Eugen
>


Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
I guess I will have to upgrade to spark 1.5, thanks!

2015-10-28 11:50 GMT+01:00 Yanbo Liang :

> Spark ML/MLlib has provided featureImportances to estimate the importance of
> each feature.
>
> 2015-10-28 18:29 GMT+08:00 Eugen Cepoi :
>
>> Hey,
>>
>> Is there some kind of "explain" feature implemented in mllib for the
>> algorithms based on tree ensembles?
>> Some method to which you would feed in a single feature vector and it
>> would return/print what features contributed to the decision or how much
>> each feature contributed "negatively" and "positively" to the decision.
>>
>> This can be very useful to debug a model on some specific samples and for
>> feature engineering.
>>
>> Thanks,
>> Eugen
>>
>
>


Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan,

I think they fixed some memory issues in 1.4 for the partitioned-table 
implementation; 1.5 does much better in terms of executor memory usage when 
generating partitioned tables. However, if your table has more than a few thousand 
partitions, reading them can be challenging: it takes a while to initialize the 
partitioned table and it requires a lot of memory from the driver. I would not 
use it if the number of partitions goes over a few hundred. 
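
On 1.4, one mitigation that sometimes helps with the writer-side OOM (a hedged
suggestion, not a guaranteed fix): dynamic-partition writes keep one open Parquet
writer per partition value per task, and each writer pre-allocates buffers, so
shrinking the Parquet buffer sizes reduces the per-writer footprint.

// Defaults are roughly 128 MB (block) and 1 MB (page); smaller values trade
// file layout efficiency for lower memory per open writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)
sc.hadoopConfiguration.setInt("parquet.page.size", 512 * 1024)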

Hope this helps,

Jerry

Sent from my iPhone

> On 28 Oct, 2015, at 6:33 pm, Bryan  wrote:
> 
> Jerry,
> 
> Thank you for the note. It sounds like you were able to get further than I 
> have been - any insight? Just a Spark 1.4.1 vs Spark 1.5?
> 
> Regards,
> 
> Bryan Jeffrey
> From: Jerry Lam
> Sent: ‎10/‎28/‎2015 6:29 PM
> To: Bryan Jeffrey
> Cc: Susan Zhang; user
> Subject: Re: Spark -- Writing to Partitioned Persistent Table
> 
> Hi Bryan,
> 
> Did you read the email I sent few days ago. There are more issues with 
> partitionBy down the road: 
> https://www.mail-archive.com/user@spark.apache.org/msg39512.html
> 
> Best Regards,
> 
> Jerry
> 
>> On Oct 28, 2015, at 4:52 PM, Bryan Jeffrey  wrote:
>> 
>> The second issue I'm seeing is an OOM issue when writing partitioned data.  
>> I am running Spark 1.4.1, Scala 2.11, Hadoop 2.6.1 & using the Hive 
>> libraries packaged with Spark.  Spark was compiled using the following:  mvn 
>> -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive 
>> -Phive-thriftserver package
>> 
>> Given a case class like the following:
>> 
>> case class HiveWindowsEvent(
>>  targetEntity: String,
>>  targetEntityType: String,
>>  dateTimeUtc: Timestamp,
>>  eventid: String,
>>  eventData: Map[String, String],
>>  description: String,
>>  eventRecordId: String,
>>  level: String,
>>  machineName: String,
>>  sequenceNumber: String,
>>  source: String,
>>  sourceMachineName: String,
>>  taskCategory: String,
>>  user: String,
>>  machineIp: String,
>>  additionalData: Map[String, String],
>>  windowseventtimebin: Long
>>  )
>> 
>> The command to write data works fine (and when queried via Beeline data is 
>> correct):
>> 
>> val hc = new HiveContext(sc)
>> import hc.implicits._
>> 
>> val partitioner = new HashPartitioner(5)
>> hiveWindowsEvents.foreachRDD(rdd => {
>>   val eventsDF = rdd.toDF()
>>   eventsDF
>> .write
>> .mode(SaveMode.Append).saveAsTable("windows_event9")
>> })
>> 
>> Once I add the partitioning (few partitions - three or less):
>> 
>> val hc = new HiveContext(sc)
>> import hc.implicits._
>> 
>> val partitioner = new HashPartitioner(5)
>> hiveWindowsEvents.foreachRDD(rdd => {
>>   val eventsDF = rdd.toDF()
>>   eventsDF
>> .write
>> .partitionBy("windowseventtimebin")
>> .mode(SaveMode.Append).saveAsTable("windows_event9")
>> })
>> 
>> I see the following error when writing to (3) partitions:
>> 
>> 15/10/28 20:23:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 
>> 10.0.0.6): org.apache.spark.SparkException: Task failed while writing rows.
>> at 
>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:270)
>> at 
>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
>> at 
>> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>> at 
>> parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
>> at 
>> parquet.bytes.CapacityByteArrayOutputStream.(CapacityByteArrayOutputStream.java:57)
>> at 
>> parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.(ColumnChunkPageWriteStore.java:68)
>> at 
>> 

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Yana Kadiyska
For this issue in particular (ERROR XSDB6: Another instance of Derby may
have already booted the database /spark/spark-1.4.1/metastore_db) -- I
think it depends on where you start your application and HiveThriftServer
from. I've run into a similar issue running a driver app first, which would
create a directory called metastore_db. If I then try to start SparkShell
from the same directory, I will see this exception. So it is like
SPARK-9776. It's not so much that the two are in the same process (as the
bug resolution states); I think you can't run two drivers which start a
HiveContext from the same directory.


On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey 
wrote:

> All,
>
> One issue I'm seeing is that I start the thrift server (for jdbc access)
> via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
> spark://master:7077 --hiveconf "spark.cores.max=2"
>
> After about 40 seconds the Thrift server is started and available on
> default port 1.
>
> I then submit my application - and the application throws the following
> error:
>
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
> with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
> Source)
> ... 86 more
> Caused by: java.sql.SQLException: Another instance of Derby may have
> already booted the database /spark/spark-1.4.1/metastore_db.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown
> Source)
> ... 83 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /spark/spark-1.4.1/metastore_db.
>
> This also happens if I do the opposite (submit the application first, and
> then start the thrift server).
>
> It looks similar to the following issue -- but not quite the same:
> https://issues.apache.org/jira/browse/SPARK-9776
>
> It seems like this set of steps works fine if the metadata database is not
> yet created - but once it's created this happens every time.  Is this a
> known issue? Is there a workaround?
>
> Regards,
>
> Bryan Jeffrey
>
> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey 
> wrote:
>
>> Susan,
>>
>> I did give that a shot -- I'm seeing a number of oddities:
>>
>> (1) 'Partition By' appears only accepts alphanumeric lower case fields.
>> It will work for 'machinename', but not 'machineName' or 'machine_name'.
>> (2) When partitioning with maps included in the data I get odd string
>> conversion issues
>> (3) When partitioning without maps I see frequent out of memory issues
>>
>> I'll update this email when I've got a more concrete example of problems.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>>
>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang 
>> wrote:
>>
>>> Have you tried partitionBy?
>>>
>>> Something like
>>>
>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>   val eventsDataFrame = rdd.toDF()
>>>   eventsDataFrame.write.mode(SaveMode.Append).partitionBy("
>>> windows_event_time_bin").saveAsTable("windows_event")
>>> })
>>>
>>>
>>>
>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey 
>>> wrote:
>>>
 Hello.

 I am working to get a simple solution working using Spark SQL.  I am
 writing streaming data to persistent tables using a HiveContext.  Writing
 to a persistent non-partitioned table works well - I update the table using
 Spark streaming, and the output is available via Hive Thrift/JDBC.

 I create a table that looks like the following:

 0: jdbc:hive2://localhost:10000> describe windows_event;
 describe windows_event;
 +---------------------+-------------+----------+
 | col_name            | data_type   | comment  |
 +---------------------+-------------+----------+
 | target_entity       | string      | NULL     |
 | target_entity_type  | string      | NULL     |
 | date_time_utc       | timestamp   | NULL     |
 | machine_ip          | string      | NULL     |
 | event_id            | string      | NULL     |
 | event_data          | map         | NULL     |
 | description         | string      | NULL     |
 | event_record_id     | string      | NULL     |
 | level               | string

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
Yana,

My basic use-case is that I want to process streaming data and publish it to a 
persistent Spark table. After that I want to make the published data (results) 
available via JDBC and Spark SQL to drive a web API. That would seem to require 
two drivers starting separate HiveContexts (one for Spark SQL/JDBC, one for 
streaming).

Is there a way to share a HiveContext between the driver for the Thrift Spark 
SQL instance and the streaming Spark driver? A better method to do this?
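
One approach that avoids the second driver entirely, assuming the Spark 1.4
thrift-server module is on the classpath: start HiveThriftServer2 from inside
the streaming driver, against the same HiveContext the stream writes with,
instead of running sbin/start-thriftserver.sh separately. A sketch (app name
and batch interval are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("streaming-with-jdbc"))
val ssc = new StreamingContext(sc, Seconds(30))
val hiveContext = new HiveContext(sc)

// Serve JDBC from this driver, sharing the one HiveContext (and its metastore).
HiveThriftServer2.startWithContext(hiveContext)

// ... build the DStream and write to the partitioned table via hiveContext ...

ssc.start()
ssc.awaitTermination()

Tables written through that hiveContext should then be visible to JDBC clients
on the Thrift port, and only one driver ever touches the embedded metastore_db.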

An alternate option might be to create the table in two separate metastores and 
simply use the same storage location for the data. That seems very hacky 
though, and likely to result in maintenance issues.

Regards,

Bryan Jeffrey

-Original Message-
From: "Yana Kadiyska" 
Sent: ‎10/‎28/‎2015 8:32 PM
To: "Bryan Jeffrey" 
Cc: "Susan Zhang" ; "user" 
Subject: Re: Spark -- Writing to Partitioned Persistent Table

For this issue in particular ( ERROR XSDB6: Another instance of Derby may have 
already booted the database /spark/spark-1.4.1/metastore_db) -- I think it 
depends on where you start your application and HiveThriftserver from. I've run 
into a similar issue running a driver app first, which would create a directory 
called metastore_db. If I then try to start SparkShell from the same directory, 
I will see this exception. So it is like SPARK-9776. It's not so much that the 
two are in the same process (as the bug resolution states); I think you can't 
run two drivers that start a HiveContext from the same directory.

Collect Column as Array in Grouped DataFrame

2015-10-28 Thread saurfang
Sorry if this functionality already exists or has been asked before, but I'm
looking for an aggregate function in SparkSQL that allows me to collect a
column into an array per group in a grouped dataframe.

For example, if I have the following table
user, score

user1, 1
user2, 2
user1, 3
user2, 4
user1, 5

I want to produce a dataframe that is like

user1, [1, 3, 5]
user2, [2, 4]

(possibly via select collect(score) from table group by user)


I realize I can probably implement this as a UDAF, but I just want to double-check
whether such a thing already exists. If not, would there be interest in having
this in SparkSQL?
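
For what it's worth, something close already exists when a HiveContext is in
play: Hive's collect_list UDAF can be called from SQL. A sketch against the toy
data above (assuming a HiveContext, since a plain SQLContext will not resolve
Hive UDAFs; the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("collect-as-array"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val scores = Seq(("user1", 1), ("user2", 2), ("user1", 3), ("user2", 4), ("user1", 5))
  .toDF("user", "score")
scores.registerTempTable("scores")

// collect_list is resolved through the Hive function registry.
sqlContext.sql(
  "SELECT user, collect_list(score) AS scores FROM scores GROUP BY user").show()
// Expected rows (order not guaranteed): user1 -> [1, 3, 5], user2 -> [2, 4]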

To give some context, I am trying to do a cogroup on datasets and then
persist as Parquet, but I want to take advantage of Tungsten. So the plan is
to collapse each row into (key, struct of values) => group by key and collect the
values as array[struct] => outer join the dataframes.

p.s. Here are some resources I have found so far, but all of them concern
top-K per key instead:
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-td20370.html
and
https://issues.apache.org/jira/browse/SPARK-5954 (where Reynold mentioned
this could be an API in dataframe)



