Re: Revisiting Online serving of Spark models?

2018-06-12 Thread Vadim Chelyshov
I've almost completed a library for speeding up the serving of current Spark
models: https://github.com/Hydrospheredata/fastserving. It depends on Spark,
but it provides a way to turn the Spark logical plan, obtained from a sample
DataFrame passed through a pipeline/transformer, into an alternative
transformer that works on a local data structure and yields a significant
performance speedup.
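
Roughly, usage looks like the sketch below. FastTransformer and LocalData
are illustrative names standing in for the library's entry points, and the
model path and column names are placeholders; this is the shape of the
idea, not the confirmed API.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object FastServingSketch extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // A fitted pipeline plus a small sample DataFrame; the sample exists
  // only to capture the schema / logical plan of the transformation.
  val model  = PipelineModel.load("/models/my-pipeline")
  val sample = Seq((0.0, 0.0)).toDF("f1", "f2")

  // Hypothetical entry point: interpret the captured plan against a
  // local data structure instead of going through the SparkSession.
  val fast = FastTransformer(model, sample)
  println(fast(LocalData("f1" -> Seq(1.0), "f2" -> Seq(2.0))))
}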

From a longer-term perspective, I think introducing a DataFrame-like
structure with an exposed Catalyst-like AST, plus different ways of
interpreting it (local or Spark), could solve the current problems with
only "minimal" rewriting.






Re: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Shivaram Venkataraman
#1 - Yes. It doesn't look like that is being honored. This is
something we should follow up on with CRAN.

#2 - Looking at it more closely, I'm not sure what the problem is. If
the version string is 1.8.0_144, then our parsing code does work
correctly. We might need to add more debug logging, or ask CRAN what
the output of `java -version` is on that machine. We can move this
discussion to the JIRA.
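
For context, the wrinkle is that pre-Java 9 runtimes report versions
like 1.8.0_144, while Java 9+ reports 9, 10.0.1, and so on. A sketch of
scheme-aware parsing (illustrative only, not SparkR's actual check):

object JavaVersionSketch extends App {
  // Pre-Java 9: "1.8.0_144" (the major version is the second component).
  // Java 9+:    "9", "10.0.1" (the major version is the first component).
  def majorVersion(versionString: String): Int = {
    val parts = versionString.split("[._]")
    if (parts.head == "1") parts(1).toInt else parts.head.toInt
  }

  assert(majorVersion("1.8.0_144") == 8)
  assert(majorVersion("10.0.1") == 10)
  assert(majorVersion("9") == 9)
}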

Shivaram
On Tue, Jun 12, 2018 at 3:21 PM Felix Cheung  wrote:
>
> For #1 is system requirements not honored?
>
> For #2 it looks like Oracle JDK?
>
> 
> From: Shivaram Venkataraman 
> Sent: Tuesday, June 12, 2018 3:17:52 PM
> To: dev
> Cc: Felix Cheung
> Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
>
> Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
> to CRAN yesterday. Unfortunately it looks like there are a couple of
> issues (the full message from CRAN is forwarded below).
>
> 1. There are some builds started with Java 10
> (http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
> which are right now counted as test failures. I wonder if we should
> somehow mark them as skipped? I can ping the CRAN team about this.
>
> 2. There is another issue with Java version parsing which
> unfortunately affects even Java 8 builds. I've created
> https://issues.apache.org/jira/browse/SPARK-24535 to track this.
>
> Thanks
> Shivaram
>
> [Forwarded CRAN message and test output trimmed; quoted in full in the
> forward later in this digest.]

Time for 2.1.3

2018-06-12 Thread Marcelo Vanzin
Hey all,

There are some fixes that went into 2.1.3 recently that probably
deserve a release. So, as usual, please take a look and see whether
there's anything else you'd like in that release; otherwise I'd like
to start the process by early next week.

I'll go through JIRA to see the status of things targeted at that
release, but last I checked there wasn't anything on the radar.

Thanks!

-- 
Marcelo




Re: Very slow complex type column reads from parquet

2018-06-12 Thread Ryan Blue
Jakub,

You're right that Spark currently doesn't use the vectorized read path for
nested data, but I'm not sure that's the problem here. With 50k elements in
the f1 array, it could easily be that the significant speed-up comes from
not reading or materializing that column at all. The non-vectorized path is
slower, but a slowdown this large more likely comes from the sheer volume
of data in that column.
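
For example, a comparison like the sketch below would show whether the
cost is in materializing the array column. The path and column names are
placeholders based on your description, with "data" standing in for the
{ f1, f2 } struct column.

import org.apache.spark.sql.SparkSession

object NestedColumnCost extends App {
  val spark = SparkSession.builder().getOrCreate()

  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  val rows = spark.read.parquet("/data/measurements.parquet")
    .where("entity_id = 42 AND timestamp >= 1000 AND timestamp <= 2000")

  // .rdd.count() forces the selected columns to be materialized, so the
  // difference isolates the cost of reading the nested column.
  time("primitive columns only") {
    rows.select("entity_id", "timestamp").rdd.count()
  }
  time("including nested column") {
    rows.select("entity_id", "timestamp", "data").rdd.count()
  }
}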

I'd be happy to see vectorization for nested Parquet data move forward, but
I think you should get an idea of how much it will help before investing in
it. Can you use Impala to test whether vectorization would help here?

rb



On Mon, Jun 11, 2018 at 6:16 AM, Jakub Wozniak 
wrote:

> Hello,
>
> We have stumbled upon quite degraded performance when reading complex
> (struct, array) type columns stored in Parquet.
> The Parquet file is around 600 MB (Snappy-compressed) with ~400k rows and
> a field of a complex type { f1: array of ints, f2: array of ints }, where
> the f1 array length is 50k elements.
> There are also other fields such as entity_id: long and timestamp: long.
>
> A simple query that selects rows using the predicates entity_id = X and
> timestamp >= T1 and timestamp <= T2, followed by ds.show(), takes 17
> minutes to execute.
> If we remove the complex type columns from the query, it executes in
> sub-second time.
>
> Now, looking at the implementation of the Parquet data source, the
> Vectorized* classes are used only if the read types are primitives.
> Otherwise the code falls back to the default parquet-mr implementation.
> In VectorizedParquetRecordReader there is a TODO to handle complex
> types that "should be efficient & easy with codegen".
>
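For reference, that fallback is visible in the physical plan: the Parquet
scan reports Batched: true only when every selected type is primitive. A
spark-shell sketch, with "data" again a placeholder for the nested column:

val df = spark.read.parquet("/data/measurements.parquet")

// Primitive-only projection: vectorized reader ("Batched: true").
df.select("entity_id", "timestamp").explain()

// Nested projection: falls back to parquet-mr ("Batched: false").
df.select("entity_id", "data").explain()
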
> For our CERN Spark usage the current execution times are pretty much
> prohibitive, as there is a lot of data stored as arrays / complex types…
> The 600 MB file represents one day of measurements, and our data
> scientists would sometimes like to process months or even years of them.
>
> Could you please let me know whether anybody is currently working on
> this, or whether it is on the roadmap for the future?
> Or could you give me some suggestions on how to avoid or resolve the
> problem? I'm using Spark 2.2.1.
>
> Best regards,
> Jakub Wozniak


-- 
Ryan Blue
Software Engineer
Netflix


Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Shivaram Venkataraman
Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
to CRAN yesterday. Unfortunately it looks like there are a couple of
issues (the full message from CRAN is forwarded below).

1. There are some builds started with Java 10
(http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
which are right now counted as test failures. I wonder if we should
somehow mark them as skipped? I can ping the CRAN team about this.

2. There is another issue with Java version parsing which
unfortunately affects even Java 8 builds. I've created
https://issues.apache.org/jira/browse/SPARK-24535 to track this.

Thanks
Shivaram


---------- Forwarded message ---------
From: 
Date: Mon, Jun 11, 2018 at 11:24 AM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
To: 
Cc: 


Dear maintainer,

package SparkR_2.3.1.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 

Status: 2 ERRORs, 1 NOTE
Debian: 

Status: 1 ERROR, 1 WARNING, 1 NOTE

Last released version's CRAN status: ERROR: 1, OK: 1
See: 

CRAN Web: 

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:

If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:

The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: NOTE
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'i386', Result: ERROR
Running 'run-all.R' [30s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:testthat':

describe, not

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz files or
> # directories left behind on CRAN machines. For Jenkins we should
> # already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar file and Spark
package directory if they exist.
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: http://apache.mirror.digionline.de/spark
Downloading spark-2.3.1 for Hadoop 2.7 from:
- 
http://apache.mirror.digionline.de/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
trying URL 

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Felix Cheung
For #1, is SystemRequirements not honored?

For #2, it looks like Oracle JDK?


From: Shivaram Venkataraman 
Sent: Tuesday, June 12, 2018 3:17:52 PM
To: dev
Cc: Felix Cheung
Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

[Quoted forwarded message trimmed; identical to the forward above.]