Re: CBO not working for Parquet Files

2018-09-06 Thread emlyn
rajat mishra wrote
> When I try to computed the statistics for a query where partition column
> is in where clause, the statistics returned contains only the sizeInBytes
> and not the no of rows count.

We are also having the same issue. We have our data in partitioned Parquet
files and were hoping to try out CBO, but haven't been able to get it
working: any query with a WHERE clause on the partition column(s) (which is
the majority of realistic queries) seems to lose/ignore the rowCount stats.
We've generated both overall table stats (ANALYZE TABLE db.table COMPUTE
STATISTICS;) and per-partition stats (ANALYZE TABLE db.table PARTITION
(col1, col2) COMPUTE STATISTICS;), and have verified that they are present
in the metastore.
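
In case it's useful, this is roughly what we're running (a sketch in Scala
with placeholder table/column names; the check at the end assumes Spark 2.3+,
where queryExecution.optimizedPlan.stats takes no arguments):

// Enable the cost-based optimizer (off by default).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Overall table statistics (sizeInBytes + rowCount):
spark.sql("ANALYZE TABLE db.table COMPUTE STATISTICS")

// Per-partition statistics:
spark.sql("ANALYZE TABLE db.table PARTITION (col1, col2) COMPUTE STATISTICS")

// Check what ended up in the metastore and what the optimizer sees for a filtered query:
spark.sql("DESCRIBE EXTENDED db.table").show(100, false)
val plan = spark.sql("SELECT * FROM db.table WHERE col1 = 'x'").queryExecution.optimizedPlan
println(plan.stats)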
 
I’ve also found this ticket:
https://issues.apache.org/jira/browse/SPARK-25185, but it has had no
response so far.
 
I suspect we must be missing something, as partitioned Parquet files seem
like a common use case, and if this were a bug in Spark I would have
expected it to be picked up sooner.
 
Has anybody managed to get CBO working with partitioned Parquet files? Is
this a known issue?
 
Thanks,
Emlyn






Re: Concurrent Spark jobs

2016-03-31 Thread emlyn
In case anyone else has the same problem and finds this: in my case it was
fixed by increasing spark.sql.broadcastTimeout (I set it to 9000 seconds).
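
For anyone searching later, a minimal sketch of how to set it (the value is
in seconds; this assumes a Spark 1.x SQLContext named sqlContext, but it can
equally be passed with --conf on spark-submit):

// Raise the broadcast-join timeout from the default 300 seconds.
sqlContext.setConf("spark.sql.broadcastTimeout", "9000")

// or at submit time:
//   spark-submit --conf spark.sql.broadcastTimeout=9000 ...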






Re: Concurrent Spark jobs

2016-01-25 Thread emlyn
Jean wrote:
> Have you considered using pools?
> http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
> 
> I haven't tried it myself, but it looks like the pool setting is applied
> per thread, so it should be possible to configure the fair scheduler so
> that more than one job runs at a time. Although each of them would
> probably use fewer workers...

Thanks for the tip, but I don't think that would work in this case: while
writing to Redshift, the cluster is sitting idle and the next tasks haven't
even appeared on the pending queue yet, so changing how queued jobs are
executed won't help.
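
(For completeness, the pool setup Jean is describing looks roughly like
this - a sketch based on the job-scheduling docs, with a placeholder pool
name and XML path:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")                                    // enable the fair scheduler
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")   // optional pool definitions
val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the named pool:
sc.setLocalProperty("spark.scheduler.pool", "pool1")
// ... trigger actions ...
sc.setLocalProperty("spark.scheduler.pool", null)                         // back to the default pool

But as above, it only changes how already-submitted jobs are scheduled, so
it doesn't address the idle gap.)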







Re: Concurrent Spark jobs

2016-01-21 Thread emlyn
Thanks for the responses (not sure why they aren't showing up on the list).

Michael wrote:
> The JDBC wrapper for Redshift should allow you to follow these
> instructions. Let me know if you run into any more issues. 
> http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html

I'm not sure that this solves my problem - if I understand it correctly,
this is about splitting a database write over multiple concurrent connections
(one from each partition), whereas what I want is to allow other tasks to
continue running on the cluster while the write to Redshift is taking place.
Also, I don't think it's good practice to load data into Redshift with INSERT
statements over JDBC - it is recommended to use the bulk load commands (COPY),
which can analyse the data and automatically set appropriate compression etc.
on the table.


Rajesh wrote:
> Just a thought: could we use Spark Job Server and trigger jobs through REST
> APIs? In that case, all jobs would share the same context and run in
> parallel.
> If anyone has other thoughts, please share.

I'm not sure this would work in my case, as they are not completely separate
jobs, but different outputs to Redshift that share intermediate results.
Running them as completely separate jobs would mean recalculating the
intermediate results for each output. I suppose it might be possible to
persist the intermediate results somewhere, and then delete them once all
the jobs have run, but that starts to add a lot of complication which I'm
not sure is justified.


Maybe some pseudocode would help clarify things, so here is a very
simplified view of our Spark application:

// load and transform data, then cache the result
df1 = transform1(sqlCtx.read().options(...).parquet('path/to/data'))
df1.cache()

// perform some further transforms of the cached data
df2 = transform2(df1)
df3 = transform3(df1)

// write the final data out to Redshift
df2.write().format("com.databricks.spark.redshift").options(...).save()
df3.write().format("com.databricks.spark.redshift").options(...).save()


When the application runs, the steps are executed in the following order:
- scan parquet folder
- transform1 executes
- df1 stored in cache
- transform2 executes
- df2 written to Redshift (while cluster sits idle)
- transform3 executes
- df3 written to Redshift

I would like transform3 to begin executing as soon as the cluster has
capacity, without having to wait for df2 to be written to Redshift, so I
tried rewriting the last two lines as (again pseudocode):

f1 = future { df2.write().format("com.databricks.spark.redshift").options(...).save() }
f2 = future { df3.write().format("com.databricks.spark.redshift").options(...).save() }
f1.get()
f2.get()

in the hope that the first write would no longer block the following steps.
Instead it fails with a TimeoutException (see stack trace in the previous
message). Is there a way to start the different writes concurrently, or is
that not possible in Spark?
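
For concreteness, this is a minimal Scala sketch of what I mean by starting
the writes concurrently (df2/df3 as in the pseudocode above; redshiftOptions
is a placeholder for our option map, and error handling is omitted):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Kick off both Redshift writes on separate threads, so Spark can schedule
// transform3's tasks while df2 is still being unloaded/copied into Redshift.
val w1 = Future {
  df2.write.format("com.databricks.spark.redshift").options(redshiftOptions).save()
}
val w2 = Future {
  df3.write.format("com.databricks.spark.redshift").options(redshiftOptions).save()
}
Await.result(w1, Duration.Inf)
Await.result(w2, Duration.Inf)

This is essentially what the pseudocode above attempts; judging by the stack
trace, the TimeoutException comes from Spark's broadcast join timing out
inside one of the writes, which is why spark.sql.broadcastTimeout is relevant.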






Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
As I understand it, Spark 1.6 changes the behaviour of the first and last
aggregate functions to take nulls into account (where they were ignored in
1.5). From SQL you can use "IGNORE NULLS" to get the old behaviour back. How
do I ignore nulls from the Java API? I can't see any way to pass that
option, so I suspect I may need to write a user defined aggregate function,
but I'd prefer not to if possible.

Thanks.






Re: Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
It turns out I can't use a user defined aggregate function, as they are not
supported in window operations. There must surely be some way to do a
last_value with ignoreNulls enabled in Spark 1.6? Any ideas for workarounds?
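
One possible workaround I'm looking at (an untested sketch; it assumes max
can order struct values in 1.6 and that window aggregates are available in
your setup - df and the columns id/ts/v are placeholders):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Running "last non-null v" per id, ordered by ts. when(...) without otherwise()
// is null for rows where v is null, and max skips nulls, so the window max picks
// the struct with the greatest ts among the non-null values seen so far.
val w = Window.partitionBy("id").orderBy("ts").rowsBetween(Long.MinValue, 0)

val withLastNonNull = df.withColumn(
  "last_v",
  max(when(col("v").isNotNull, struct(col("ts"), col("v")))).over(w).getField("v")
)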






Concurrent Spark jobs

2016-01-19 Thread emlyn
We have a Spark application that runs a number of ETL jobs, writing the
outputs to Redshift (using databricks/spark-redshift). This is triggered by
calling DataFrame.write.save on the different DataFrames one after another.
I noticed that while the output of one job is being loaded into Redshift
(which can take ~20 minutes for some jobs), the cluster is sitting idle.

In order to maximise the use of the cluster, we tried starting a thread for
each job so that they can all be submitted simultaneously, and therefore the
cluster can be utilised by another job while one is being written to
Redshift.

However, when this is run, it fails with a TimeoutException (see stack trace
below). Would it make sense to increase "spark.sql.broadcastTimeout"? I'm
not sure that would actually solve anything. Should it not be possible to
save multiple DataFrames simultaneously? Or any other hints on how to make
better use of the cluster's resources?

Thanks.


Stack trace:

Exception in thread "main" java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
...
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at
org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1676)
at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1673)
at org.apache.spark.sql.DataFrame.mapPartitions(DataFrame.scala:1465)
at
com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:264)
at
com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:374)
at
com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)







Merging compatible schemas on Spark 1.6.0

2016-01-13 Thread emlyn
I have a series of directories on S3 with parquet data, all with compatible
(but not identical) schemas. We verify that the schemas stay compatible when
they evolve using
org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility. On Spark
1.5, I could read these into a DataFrame with sqlCtx.read().parquet(path1,
path2), and Spark would take care of merging the compatible schemas.
I have just been trying to run on Spark 1.6, and that is now giving an
error, saying:

java.lang.AssertionError: assertion failed: Conflicting directory structures
detected. Suspicious paths:
s3n://bucket/data/app1/version1/event1
s3n://bucket/data/app2/version1/event1
If provided paths are partition directories, please set "basePath" in the
options of the data source to specify the root directory of the table. If
there are multiple root directories, please load them separately and then
union them.

Under these paths I have partitioned data, like
s3n://bucket/data/appN/versionN/eventN/dat_received=-MM-DD/fingerprint=/part-r--.lzo.parquet
If I load both paths into separate DataFrames and then try to union them, as
suggested in the error message, that fails with:

org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:203)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.unionAll(DataFrame.scala:1052)

How can I combine these data sets in Spark 1.6? Is there a way to union
DataFrames with different but compatible schemas?
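
The workaround I'm considering (an untested sketch; it assumes the schemas
only differ by missing top-level columns, with matching types for the
columns they share) is to pad each DataFrame out to the union of the column
sets before calling unionAll:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.functions.{col, lit}

// Align two compatible schemas by adding any missing columns as typed nulls, then
// select the columns in the same order so that unionAll lines them up positionally.
def unionCompatible(a: DataFrame, b: DataFrame): DataFrame = {
  val aFields = a.schema.fields.map(f => f.name -> f).toMap
  val bFields = b.schema.fields.map(f => f.name -> f).toMap
  val allNames = (a.schema.fieldNames ++ b.schema.fieldNames).distinct

  def pad(df: DataFrame, have: Map[String, StructField], other: Map[String, StructField]) =
    df.select(allNames.map { n =>
      if (have.contains(n)) col(n)
      else lit(null).cast(other(n).dataType).as(n)  // column missing here: fill with typed nulls
    }: _*)

  pad(a, aFields, bFields).unionAll(pad(b, bFields, aFields))
}

It doesn't handle evolution inside nested structs though, so I'd still
prefer a way to get Spark to merge the schemas itself.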






Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread Emlyn Corrin
JAVA_HOME is unset.
I've also tried setting it with:
export JAVA_HOME=$(/usr/libexec/java_home)
which sets it to
"/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home" and I
still get the same problem.

On 23 October 2015 at 14:37, Jonathan Coveney <jcove...@gmail.com> wrote:

> do you have JAVA_HOME set to a java 7 jdk?
>
> 2015-10-23 7:12 GMT-04:00 emlyn <em...@swiftkey.com>:
>
>> xjlin0 wrote
>> > I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
>> > or without Hadoop or home compiled with ant or maven).  There was no
>> error
>> > message in v1.4.x, system prompt nothing.  On v1.5.x, once I enter
>> > $SPARK_HOME/bin/pyspark or spark-shell, I got
>> >
>> > Error: Could not find or load main class org.apache.spark.launcher.Main
>>
>> I have the same problem (on MacOS X Yosemite, all spark versions since
>> 1.4,
>> installed both with homebrew and downloaded manually). I've been trying to
>> start the pyspark shell, but it also fails in the same way for spark-shell
>> and spark-sql and spark-submit. I've narrowed it down to the following
>> line
>> in the spark-class script:
>>
>> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main
>> "$@")
>>
>> (where $RUNNER is "java" and $LAUNCH_CLASSPATH is
>>
>> "/usr/local/Cellar/apache-spark/1.5.1/libexec/lib/spark-assembly-1.5.1-hadoop2.6.0.jar",
>> which does exist and does contain the org.apache.spark.launcher.Main
>> class,
>> despite the message that it can't be found)
>>
>> If I run it manually, using:
>>
>> SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1/libexec java -cp
>>
>> /usr/local/Cellar/apache-spark/1.5.1/libexec/lib/spark-assembly-1.5.1-hadoop2.6.0.jar
>> org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit
>> pyspark-shell-main --name PySparkShell
>>
>> It runs without that error, and instead prints out (where "\0" is a nul
>> character):
>>
>> env\0PYSPARK_SUBMIT_ARGS="--name" "PySparkShell" "pyspark-shell"\0python\0
>>
>> I'm not really sure what to try next, maybe with this extra information
>> someone has an idea what's going wrong, and how to fix it.
>>
>>
>>
>


-- 
*Emlyn Corrin*

Software Engineer | SwiftKey |
em...@swiftkey.com | www.swiftkey.com | @swiftkey
<http://www.twitter.com/swiftkey> | fb.com/swiftkey






Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
xjlin0 wrote
> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
> or without Hadoop or home compiled with ant or maven).  There was no error
> message in v1.4.x, system prompt nothing.  On v1.5.x, once I enter
> $SPARK_HOME/bin/pyspark or spark-shell, I got
> 
> Error: Could not find or load main class org.apache.spark.launcher.Main

I have the same problem (on Mac OS X Yosemite, with all Spark versions since
1.4, installed both with Homebrew and downloaded manually). I've been trying
to start the pyspark shell, but it fails in the same way for spark-shell,
spark-sql and spark-submit. I've narrowed it down to the following line in
the spark-class script:

done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main
"$@")

(where $RUNNER is "java" and $LAUNCH_CLASSPATH is
"/usr/local/Cellar/apache-spark/1.5.1/libexec/lib/spark-assembly-1.5.1-hadoop2.6.0.jar",
which does exist and does contain the org.apache.spark.launcher.Main class,
despite the message that it can't be found)

If I run it manually, using:

SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1/libexec java -cp
/usr/local/Cellar/apache-spark/1.5.1/libexec/lib/spark-assembly-1.5.1-hadoop2.6.0.jar
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit
pyspark-shell-main --name PySparkShell

It runs without that error, and instead prints out (where "\0" is a nul
character):

env\0PYSPARK_SUBMIT_ARGS="--name" "PySparkShell" "pyspark-shell"\0python\0

I'm not really sure what to try next, maybe with this extra information
someone has an idea what's going wrong, and how to fix it.






Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
emlyn wrote
> 
> xjlin0 wrote
>> I cannot enter REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1(with pre-built with
>> or without Hadoop or home compiled with ant or maven).  There was no
>> error message in v1.4.x, system prompt nothing.  On v1.5.x, once I enter
>> $SPARK_HOME/bin/pyspark or spark-shell, I got
>> 
>> Error: Could not find or load main class org.apache.spark.launcher.Main
> I have the same problem

In case anyone else has the same problem: I found that the problem only
occurred under my login, not under a new clean user. After some
investigation, I found that I had "GREP_OPTIONS='--color=always'" in my
environment, which was messing up the output of grep with colour codes. I
changed that to "GREP_OPTIONS='--color=auto'" and now it works.


