Re: Mathematical functions in spark sql

2015-01-26 Thread Alexey Romanchuk
I have tried select ceil(2/3), but I got "key not found: floor".
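
For anyone hitting the same thing, here is a minimal sketch of the casting workaround, assuming a Spark 1.2-era HiveContext (where Hive's math UDFs such as floor and ceil are registered) and a Hive dialect that allows SELECT without a FROM clause; the object name and query literals are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MathFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("math-functions-sketch"))
    val hive = new HiveContext(sc)

    // Integer division truncates; casting one operand to DOUBLE keeps the fraction.
    hive.sql("SELECT CAST(2 AS DOUBLE) / 3").collect().foreach(println)

    // floor/ceil come from Hive's function registry when going through HiveContext.
    hive.sql("SELECT FLOOR(CAST(2 AS DOUBLE) / 3), CEIL(CAST(2 AS DOUBLE) / 3)")
      .collect().foreach(println)
  }
}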

On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 Have you tried the floor() or ceil() functions?

 According to http://spark.apache.org/sql/, Spark SQL is compatible with
 Hive SQL.

 Cheers

 On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:

 Hello everyone!

 I tried to execute select 2/3 and I got 0. Is there any way to cast double
 to int or something similar?

 Also, it would be cool to get a list of functions supported by Spark SQL.

 Thanks!







Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-02 Thread Alexey Romanchuk
Any ideas? Has anyone else seen the same error?

On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk alexey.romanc...@gmail.com
 wrote:

 Hello spark users!

 I found lots of strange messages in the driver log. Here is one of them:

 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
 ERROR
 akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
 - AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] ->
 [akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
 akka.tcp://sparkExecutor@data1.hadoop:17372] [
 akka.remote.ShutDownAssociation: Shut down address:
 akka.tcp://sparkExecutor@data1.hadoop:17372
 Caused by: akka.remote.transport.Transport$InvalidAssociationException:
 The remote system terminated the association because it is shutting down.
 ]

 I get this message twice for every worker: first for driverPropsFetcher and
 then for sparkExecutor. It looks like Spark shuts down the remote Akka system
 incorrectly, or there is a race condition in this process: the driver sends
 data to a worker whose actor system is already shutting down.

 Except for this message everything works fine. But it is an ERROR level
 message, so it ends up in my ERROR-only log.

 Do you have any idea whether this is a configuration issue, a bug in Spark
 or Akka, or something else?

 Thanks!




akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-01 Thread Alexey Romanchuk
Hello spark users!

I found lots of strange messages in the driver log. Here is one of them:

2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25]
ERROR
akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter]
- AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] ->
[akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372] [
akka.remote.ShutDownAssociation: Shut down address:
akka.tcp://sparkExecutor@data1.hadoop:17372
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The
remote system terminated the association because it is shutting down.
]

I get this message twice for every worker: first for driverPropsFetcher and
then for sparkExecutor. It looks like Spark shuts down the remote Akka system
incorrectly, or there is a race condition in this process: the driver sends
data to a worker whose actor system is already shutting down.

Except for this message everything works fine. But it is an ERROR level
message, so it ends up in my ERROR-only log.

Do you have any idea whether this is a configuration issue, a bug in Spark
or Akka, or something else?

Thanks!
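
If the goal is only to keep these expected shutdown notices out of an ERROR-only log, one option is to raise the log threshold for the Akka endpoint writer on the driver. A sketch, assuming Spark's standard conf/log4j.properties and that Akka routes its logging through SLF4J/log4j under the class name shown in the message above:

# Silence the shutdown-time association errors from Akka's EndpointWriter;
# FATAL still lets anything more serious through.
log4j.logger.akka.remote.EndpointWriter=FATAL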


Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hello spark users and developers!

I am using HDFS + Spark SQL + Hive schema + Parquet as the storage format.
I have a lot of Parquet files - one file fits one HDFS block for one day.
The strange thing is that the first Spark SQL query is very slow.

To reproduce the situation I use only one core: the first query takes 97 sec
and every following query only 13 sec. Of course I query different data, but
it has the same structure and size. The situation can be reproduced again
after restarting the thrift server.

Here is the information about Parquet file reading from a worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1560251 records from 30 columns in 11686 ms:
133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1373 ms:
1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is 10x faster than the first. Most of the
query time is spent working with the Parquet file.

This problem is really annoying, because most of my Spark jobs contain just
one SQL query plus data processing, and to speed up my jobs I put a special
warmup query in front of every job.
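
For reference, a rough sketch of that warmup trick as I would write it in a standalone driver program, assuming a Spark 1.1-era HiveContext; the table and column names (my_table, day, col_a) are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object WarmupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("warmup-sketch"))
    val hive = new HiveContext(sc)

    // Warmup: touch the same Parquet-backed table with a cheap query so the
    // real query below does not pay the slow-first-read penalty.
    hive.sql("SELECT count(*) FROM my_table WHERE day = '2014-10-09'").collect()

    // The real query, timed separately.
    val start = System.currentTimeMillis()
    val rows = hive.sql("SELECT col_a, count(*) FROM my_table GROUP BY col_a").collect()
    println(s"query took ${System.currentTimeMillis() - start} ms, ${rows.length} rows")
  }
}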

My assumption is that the first read is slow because HotSpot is still
compiling/optimizing the code. Do you have any idea how to confirm or solve
this performance problem?

Thanks for advice!

P.S. -XX:+PrintCompilation shows billions of HotSpot compilations, but I
cannot figure out which ones are important and which are not.


Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hey Sean and spark users!

Thanks for the reply. I have just tried -Xcomp: the startup time was a few
minutes (as expected), but the first query was as slow as before:
Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 30 columns in 12897 ms:
121.64837 rec/ms, 3649.451 cell/ms

and the next one:

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1757 ms:
892.94196 rec/ms, 892.94196 cell/ms

I don't think it is caching or anything similar, because the CPU load on the
worker is 100% and jstack shows that the worker is reading from the Parquet
file.
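
(For anyone reproducing this: one way to pass the flag to the executors is via spark.executor.extraJavaOptions in conf/spark-defaults.conf -- a sketch, with the exact JVM flags left as whatever you want to experiment with:)

# Force eager JIT compilation on executors and print what HotSpot compiles.
spark.executor.extraJavaOptions  -Xcomp -XX:+PrintCompilation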

Any ideas?

Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen so...@cloudera.com wrote:

 You could try setting -Xcomp for executors to force JIT compilation
 upfront. I don't know if it's a good idea overall but might show
 whether the upfront compilation really helps. I doubt it.

 However, isn't this almost surely due to caching somewhere, in Spark SQL
 or HDFS? I really doubt hotspot makes a difference compared to these
 much larger factors.

 On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk
 alexey.romanc...@gmail.com wrote:
  Hello spark users and developers!

  I am using HDFS + Spark SQL + Hive schema + Parquet as the storage format.
  I have a lot of Parquet files - one file fits one HDFS block for one day.
  The strange thing is that the first Spark SQL query is very slow.

  To reproduce the situation I use only one core: the first query takes 97 sec
  and every following query only 13 sec. Of course I query different data, but
  it has the same structure and size. The situation can be reproduced again
  after restarting the thrift server.

  Here is the information about Parquet file reading from a worker node:

  Slow one:
  Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
  Assembled and processed 1560251 records from 30 columns in 11686 ms:
  133.51454 rec/ms, 4005.4363 cell/ms

  Fast one:
  Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
  Assembled and processed 1568899 records from 1 columns in 1373 ms:
  1142.6796 rec/ms, 1142.6796 cell/ms

  As you can see, the second read is 10x faster than the first. Most of the
  query time is spent working with the Parquet file.

  This problem is really annoying, because most of my Spark jobs contain just
  one SQL query plus data processing, and to speed up my jobs I put a special
  warmup query in front of every job.

  My assumption is that the first read is slow because HotSpot is still
  compiling/optimizing the code. Do you have any idea how to confirm or solve
  this performance problem?

  Thanks for advice!

  P.S. -XX:+PrintCompilation shows billions of HotSpot compilations, but I
  cannot figure out which ones are important and which are not.



Re: Log hdfs blocks sending

2014-09-26 Thread Alexey Romanchuk
Hello Andrew!

Thanks for the reply. Which logs, and at what level, should I check? Driver,
master or worker?

I found locality information on the master node, but it shows only the ANY
locality level. Here is the driver (Spark SQL) log -
https://gist.github.com/13h3r/c91034307caa33139001 - and one of the worker
logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2

Do you have any idea where to look?

Thanks!
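
One place the locality does show up is the driver log: the scheduler's task-launch lines include the locality level. A sketch of the log4j.properties setting to surface them, assuming they are still logged at INFO by org.apache.spark.scheduler.TaskSetManager:

# Driver-side conf/log4j.properties: task launch lines look roughly like
# "Starting task 0.0 in stage 1.0 (TID 3, data1.hadoop, NODE_LOCAL, 1234 bytes)"
log4j.logger.org.apache.spark.scheduler.TaskSetManager=INFO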

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash and...@andrewash.com wrote:

 Hi Alexey,

 You should see in the logs a locality measure like NODE_LOCAL,
 PROCESS_LOCAL, ANY, etc.  If your Spark workers each have an HDFS data node
 on them and you're reading out of HDFS, then you should be seeing almost
 all NODE_LOCAL accesses.  One cause I've seen for mismatches is if Spark
 uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't
 think the data is local and does remote reads which really kills
 performance.

 Hope that helps!
 Andrew

 On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk 
 alexey.romanc...@gmail.com wrote:

 Hello again spark users and developers!

 I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
 cluster consists of 4 datanodes and the replication factor of the files is 3.

 I use the thrift server to access Spark SQL and have one table with 30+
 partitions. When I run a query over the whole table (something simple like
 select count(*) from t), Spark produces a lot of network activity, filling
 the whole available 1 Gb link. It looks like Spark sends the data over the
 network instead of reading it locally.

 Is there any way to log which blocks were accessed locally and which were not?

 Thanks!





Log hdfs blocks sending

2014-09-25 Thread Alexey Romanchuk
Hello again spark users and developers!

I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
cluster consists of 4 datanodes and the replication factor of the files is 3.

I use the thrift server to access Spark SQL and have one table with 30+
partitions. When I run a query over the whole table (something simple like
select count(*) from t), Spark produces a lot of network activity, filling
the whole available 1 Gb link. It looks like Spark sends the data over the
network instead of reading it locally.

Is there any way to log which blocks were accessed locally and which were not?

Thanks!
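
One way to get a per-locality breakdown without grepping logs is to register a listener on the driver and count task localities. A sketch, assuming the Spark 1.1 listener API (SparkListenerTaskStart carrying a TaskInfo with taskLocality) and that the query is run from your own driver program rather than through the thrift server:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}
import scala.collection.mutable

// Counts how many tasks start at each locality level (PROCESS_LOCAL, NODE_LOCAL, ANY, ...).
class LocalityCountingListener extends SparkListener {
  val counts = mutable.Map[String, Int]().withDefaultValue(0)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    val info = taskStart.taskInfo
    counts(info.taskLocality.toString) += 1
    println(s"task ${info.taskId} on ${info.host}: ${info.taskLocality}")
  }
}

// Usage: val listener = new LocalityCountingListener
//        sc.addSparkListener(listener)
//        ... run the query, then inspect listener.counts.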