Re: Best way to process this dataset

2018-06-18 Thread Georg Heiler
Use pandas or dask.

If you do want to use Spark, store the dataset as Parquet / ORC first, and then
continue to perform analytical queries on that dataset.
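
For instance, a minimal Spark sketch of that approach, assuming placeholder
paths and a placeholder behavior_type column name based on the dataset
description below (spark is an assumed SparkSession):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
import spark.implicits._

// One-time conversion: read the 3.6GB CSV and persist it as columnar Parquet.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/user_behavior.csv")

df.write.mode("overwrite").parquet("/path/to/user_behavior.parquet")

// Subsequent analytical queries scan only the columns (and row groups) they need.
val pageViews = spark.read.parquet("/path/to/user_behavior.parquet")
  .filter($"behavior_type" === "pv")
  .count()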

Raymond Xie  wrote on Tue., June 19, 2018 at 04:29:

> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment
> has a 20GB SSD hard disk and 2GB of RAM.
>
> The dataset comes with
> User ID: 987,994
> Item ID: 4,162,024
> Category ID: 9,439
> Behavior type ('pv', 'buy', 'cart', 'fav')
> Unix Timestamp: spans from November 25 to December 3, 2017
>
> I would like to hear any suggestions from you on how I should process the
> dataset in my current environment.
>
> Thank you.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment has a
20GB SSD hard disk and 2GB of RAM.

The dataset comes with
User ID: 987,994
Item ID: 4,162,024
Category ID: 9,439
Behavior type ('pv', 'buy', 'cart', 'fav')
Unix Timestamp: spans from November 25 to December 3, 2017

I would like to hear any suggestions from you on how I should process the
dataset in my current environment.

Thank you.

**
*Sincerely yours,*


*Raymond*


Re: convert array of values column to string column (containing serialised json) (SPARK-21513)

2018-06-18 Thread summersk
Resending with formatting hopefully fixed:

Hello,

SPARK-21513 (https://issues.apache.org/jira/browse/SPARK-21513) proposes
supporting the to_json UDF on any column type; however, it fails with the
following error when operating on ArrayType columns of strings, ints, or other
non-struct data types:

org.apache.spark.sql.AnalysisException: cannot resolve
'structstojson(`item`.`messages`)' due to data type mismatch: Input type
array must be a struct, array of structs or a map or array of map.;;

Would it be possible for someone with access to raise an issue to include
this in
a future release?

Details are outlined on this StackOverflow post:
https://stackoverflow.com/questions/50195796/convert-array-of-values-column-to-string-column-containing-serialised-json,
and included below.

Thank you,

Kyle

Example datasets/schemas:

Given a dataset of string records such as:

{
  "item": {
"messages": [
  "test",
  "test2",
  "test3"
]
  }
}

Which when loaded with read().json(dataSetOfJsonStrings) produces a schema
like:

root
 |-- item: struct (nullable = true)
 ||-- messages: array (nullable = true)
 |||-- element: string (containsNull = true)

How might ArrayType columns be transformed to serialised JSON? E.g., to this
schema:

root
 |-- item: struct (nullable = true)
 ||-- messages: string (nullable = true)

Which might be written out in JSON format like:

{
  "item": {
"messages": "[\"test\",\"test2\",\"test3\"]"
  }
}

Note: Example output not flattened, just illustrating to_json() usage.
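
For context, a minimal sketch of the behaviour above plus one possible
workaround (wrapping the array in a single-field struct first -- an assumption
on my part, not something the ticket proposes; spark is an assumed
SparkSession):

import org.apache.spark.sql.functions.{col, struct, to_json}

val ds = spark.read.json(dataSetOfJsonStrings)

// Fails with the AnalysisException above: the argument is an array of strings,
// not a struct, array of structs, or map.
// ds.select(to_json(col("item.messages")))

// Workaround sketch: serialise a struct that wraps the array. Note this yields
// {"messages":["test","test2","test3"]} rather than the bare array string.
val serialised = ds.select(
  to_json(struct(col("item.messages").as("messages"))).as("messages_json"))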







Spark-Mongodb connector issue

2018-06-18 Thread ayan guha
Hi Guys

I have a large MongoDB collection with a complex document structure. I am
facing an issue where I get the error:

Can not cast Array to Struct. Value:BsonArray([])

The target column is indeed a struct. So the error makes sense.

I am able to successfully read from another collection with exactly the same
structure but a subset of the data.

I suspect some documents are corrupted in MongoDB.

Questions:
1. Is there any way to filter out such documents in the MongoDB connector?
2. I tried to exclude the column with a custom select statement, but it did not
work. Is that possible?
3. Is there any way to tolerate errors up to a certain threshold? I do not want
to stall the load of 1M records if 1 record is bad.

I know this might be a question for the MongoDB forum, but I started here as
there may be some generic solution I can use. I am going to post to SO and the
Mongo forum shortly.

Best
Ayan

-- 
Best Regards,
Ayan Guha


Re: Spark batch job: failed to compile: java.lang.NullPointerException

2018-06-18 Thread ARAVIND SETHURATHNAM
Spark version is 2.2, and I think I am running into this issue:
https://issues.apache.org/jira/browse/SPARK-18016 as the dataset schema is
pretty huge and nested.

From: ARAVIND SETHURATHNAM 
Date: Monday, June 18, 2018 at 4:00 PM
To: "user@spark.apache.org" 
Subject: Spark batch job: failed to compile: java.lang.NullPointerException


Hi,
We have a Spark job that reads Avro data from an S3 location, does some
processing, and writes it back to S3. Of late it has been failing with the
exception below:


Application application_1529346471665_0020 failed 1 times due to AM Container 
for appattempt_1529346471665_0020_01 exited with exitCode: -104
For more detailed output, check application tracking 
page:http://10.122.49.134:8088/proxy/application_1529346471665_0020/Then, click 
on links to logs of each attempt.
Diagnostics: Container 
[pid=14249,containerID=container_1529346471665_0020_01_01] is running 
beyond physical memory limits. Current usage: 23.4 GB of 22 GB physical memory 
used; 28.7 GB of 46.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1529346471665_0020_01_01 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 14255 14249 14249 14249 (java) 23834 8203 30684336128 6142485 
/usr/java/default/bin/java -server -Xmx20480m 
-Djava.io.tmpdir=/media/ephemeral0/yarn/local/usercache/asethurathnam/appcache/application_1529346471665_0020/container_1529346471665_0020_01_01/tmp
 -Dspring.profiles.active=stage 
-Dspark.yarn.app.container.log.dir=/media/ephemeral1/logs/yarn/application_1529346471665_0020/container_1529346471665_0020_01_01
 -XX:MaxPermSize=512m org.apache.spark.deploy.yarn.ApplicationMaster --class 
com.homeaway.omnihub.OmnitaskApp --jar 
/tmp/spark-9f42e005-e1b4-47c2-a6e8-ac0bc9fa595b/omnitask-spark-compaction-0.0.1.jar
 --arg 
--PATH=s3://ha-stage-datalake-landing-zone-us-east-1/avro-hourly/entityEventLodgingRate-2/
 --arg 
--OUTPUT_PATH=s3://ha-stage-datalake-landing-zone-us-east-1/avro-daily/entityEventLodgingRate-2/
 --arg --DB_NAME=tier1_landingzone --arg 
--TABLE_NAME=entityeventlodgingrate_2_daily --arg --TABLE_DESCRIPTION=data in: 
's3://ha-stage-datalake-landing-zone-us-east-1/avro-daily/entityEventLodgingRate-2'
 --arg --FORMAT=AVRO --arg --PARTITION_COLUMNS=dateid --arg --HOURLY=false 
--arg --START_DATE=20180616 --arg --END_DATE=20180616 --properties-file 
/media/ephemeral0/yarn/local/usercache/asethurathnam/appcache/application_1529346471665_0020/container_1529346471665_0020_01_01/__spark_conf__/__spark_conf__.properties
|- 14249 14247 14249 14249 (bash) 0 1 115826688 704 /bin/bash -c 
LD_LIBRARY_PATH=/usr/lib/hadoop2/lib/native::/usr/lib/qubole/packages/hadoop2-2.6.0/hadoop2/lib/native:/usr/lib/qubole/packages/hadoop2-2.6.0/hadoop2/lib/native
 /usr/java/default/bin/java -server -Xmx20480m 
-Djava.io.tmpdir=/media/ephemeral0/yarn/local/usercache/asethurathnam/appcache/application_1529346471665_0020/container_1529346471665_0020_01_01/tmp
 '-Dspring.profiles.active=stage' 
-Dspark.yarn.app.container.log.dir=/media/ephemeral1/logs/yarn/application_1529346471665_0020/container_1529346471665_0020_01_01
 -XX:MaxPermSize=512m org.apache.spark.deploy.yarn.ApplicationMaster --class 
'com.homeaway.omnihub.OmnitaskApp' --jar 
/tmp/spark-9f42e005-e1b4-47c2-a6e8-ac0bc9fa595b/omnitask-spark-compaction-0.0.1.jar
 --arg 
'--PATH=s3://ha-stage-datalake-landing-zone-us-east-1/avro-hourly/entityEventLodgingRate-2/'
 --arg 
'--OUTPUT_PATH=s3://ha-stage-datalake-landing-zone-us-east-1/avro-daily/entityEventLodgingRate-2/'
 --arg '--DB_NAME=tier1_landingzone' --arg 
'--TABLE_NAME=entityeventlodgingrate_2_daily' --arg '--TABLE_DESCRIPTION=data 
in: 
'\''s3://ha-stage-datalake-landing-zone-us-east-1/avro-daily/entityEventLodgingRate-2'\'''
 --arg '--FORMAT=AVRO' --arg '--PARTITION_COLUMNS=dateid' --arg 
'--HOURLY=false' --arg '--START_DATE=20180616' --arg '--END_DATE=20180616' 
--properties-file 
/media/ephemeral0/yarn/local/usercache/asethurathnam/appcache/application_1529346471665_0020/container_1529346471665_0020_01_01/__spark_conf__/__spark_conf__.properties
 1> 
/media/ephemeral1/logs/yarn/application_1529346471665_0020/container_1529346471665_0020_01_01/stdout
 2> 
/media/ephemeral1/logs/yarn/application_1529346471665_0020/container_1529346471665_0020_01_01/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.





and in one of the executor logs with a failed task I see the following. Can
someone please let me know what is causing the exception and the tasks to fail,
and what the class below is?


18/06/18 20:34:05 dispatcher-event-loop-6 INFO BlockManagerInfo: Added 
broadcast_0_piece0 in memory on 10.122.51.238:42797 (size: 30.7 KB, free: 10.5 
GB)
18/06/18 20:34:06 dispatcher-event-loop-0 INFO BlockManagerInfo: Added 
broadcast_0_piece0 in 

Repartition not working on a csv file

2018-06-18 Thread Abdeali Kothari
I am using Spark 2.3.0 and trying to read a CSV file which has 500 records.
When I try to read it, spark says that it has two stages: 10, 11 and then
they join into stage 12.

This makes sense and is what I would expect, as I have 30 map-based UDFs after
which I do a join, then run another 10 UDFs and save the result as Parquet.

Stages 10 and 11 have only 2 tasks according to Spark. I have a maximum of 20
executors available on my cluster, and I would like Spark to use all 20 for
this job.

*1csv+Repartition*: Right after reading the file, if I do a repartition, it
still takes *2 tasks*
*1csv+Repartition+count()*: Right after reading the file, if I do a
repartition and then an action like count(), it still takes *2
tasks*
*50csv*: If I split my 500line csv into 50 files with 10 lines each, it
takes *18 tasks*
*50csv+Repartition*: If I split my 500line csv into 50 files with 10 lines
each, and do a repartition and a count, it takes *19 tasks*
*500csv+Repartition*: If I split my 500line csv into 500 files with 1 line
each, and do a repartition and a count, it takes *19 tasks*

All repartitions above are: .repartition(200)
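
For reference, a minimal sketch of the pattern described above (spark is an
assumed SparkSession; the path is a placeholder):

// Read a small CSV, then ask for 200 partitions before the heavy UDFs run.
val df = spark.read
  .option("header", "true")
  .csv("/path/to/input.csv")
  .repartition(200)

// Force evaluation; the expectation is roughly 200 tasks for this stage.
df.count()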

I can't understand what it's trying to do.
I was expecting that if I do a .repartition(200) it would just create 200
tasks after shuffling the data, but it's not doing that.
I recollect this working fine on Spark 1.6.x.

PS: The reason I want more tasks is because those UDFs are very heavy and
slow - I'd like to use more executors to reduce computation time. I'm sure
they are parallelizable ...


load hbase data using spark

2018-06-18 Thread Lian Jiang
Hi,

I am considering tools to load hbase data using spark. One choice is
https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to
be out-of-date (e.g. "This version of 1.0.0 requires Spark 1.4.0."). Which
tool should I use for this purpose? Thanks for any hint.


Re: Dataframe vs Dataset dilemma: either Row parsing or no filter push-down

2018-06-18 Thread Koert Kuipers
We use DataFrame and RDD. Dataset not only has issues with predicate
pushdown, it also adds shuffles at times when it shouldn't, and there is
some overhead from the encoders themselves, because under the hood it is
still just Row objects.
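
To illustrate the pushdown point, a minimal sketch assuming a Parquet-backed
source with an id column (spark is an assumed SparkSession; the case class,
path, and names are placeholders):

case class Event(id: Long, payload: String)

val df = spark.read.parquet("/path/to/events")
import spark.implicits._
val ds = df.as[Event]

// Untyped Column expression: Catalyst understands the predicate and can push
// it down to the Parquet reader.
df.filter($"id" > 5)

// Typed lambda: an opaque function to Catalyst, so the filter generally runs
// only after rows have been read and deserialised.
ds.filter(_.id > 5)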


On Mon, Jun 18, 2018 at 5:00 PM, Valery Khamenya  wrote:

> Hi Spark gurus,
>
> I was surprised to read here:
> https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap
>
> that filters are not pushed down in typed Datasets and one should rather
> stick to Dataframes.
>
> But writing code for groupByKey + mapGroups is a headache with Dataframes
> compared to typed Datasets. Dataframes mostly don't force you to write any
> Encoders (unless you try to write generic transformations on a parametrized
> Dataset[T]), while a typed Dataset doesn't force you to do ugly Row parsing
> with getInteger, getString, etc., as is needed with Dataframes.
>
> So, what should the poor Spark user rely on by default, if the goal is to
> deliver a library of  data transformations -- Dataset or Dataframe?
>
> best regards
> --
> Valery
>


Dataframe vs Dataset dilemma: either Row parsing or no filter push-down

2018-06-18 Thread Valery Khamenya
Hi Spark gurus,

I was surprised to read here:
https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap

that filters are not pushed down in typed Datasets and one should rather
stick to Dataframes.

But writing code for groupByKey + mapGroups is a headache with Dataframes
compared to typed Datasets. Dataframes mostly don't force you to write any
Encoders (unless you try to write generic transformations on a parametrized
Dataset[T]), while a typed Dataset doesn't force you to do ugly Row parsing
with getInteger, getString, etc., as is needed with Dataframes.
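
As a rough sketch of that contrast (spark is an assumed SparkSession; the case
class, path, and aggregation logic are placeholders):

case class Order(customerId: Long, amount: Double)

import spark.implicits._
val ds = spark.read.parquet("/path/to/orders").as[Order]

// Typed Dataset: field access by name, encoders derived from the implicits.
val typedTotals = ds
  .groupByKey(_.customerId)
  .mapGroups((id, orders) => (id, orders.map(_.amount).sum))

// The same per-group logic on a DataFrame means parsing Rows by hand.
val untypedTotals = ds.toDF()
  .groupByKey(row => row.getAs[Long]("customerId"))
  .mapGroups((id, rows) => (id, rows.map(_.getAs[Double]("amount")).sum))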

So, what should the poor Spark user rely on by default, if the goal is to
deliver a library of  data transformations -- Dataset or Dataframe?

best regards
--
Valery


Re: Spark 2.4 release date

2018-06-18 Thread Jacek Laskowski
Hi,

What about https://issues.apache.org/jira/projects/SPARK/versions/12342385?

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jun 18, 2018 at 9:41 PM, Li Gao  wrote:

> Hello,
>
> Do we know the estimated date when Spark 2.4 will be GA?
> We are evaluating whether to backport some of the 2.4 fixes into our 2.3
> deployment.
>
> Thank you.
>


Re: best practices to implement library of custom transformations of Dataframe/Dataset

2018-06-18 Thread Georg Heiler
I believe explicit is better than implicits; however, as you mentioned, the
notation is very nice.

Therefore, I suggest
https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c
and using df.transform(myFunction).
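
A minimal sketch of that approach, reusing the method names from your example
(inputDataframe is assumed to exist; parameters and column logic are
placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Each transformation is a curried function; partially applied it becomes a
// DataFrame => DataFrame that transform can accept.
def myAdvancedFiter(minValue: Int)(df: DataFrame): DataFrame =
  df.filter(col("value") >= minValue)

def myColumnInjection(label: String)(df: DataFrame): DataFrame =
  df.withColumn("label", lit(label))

// transform keeps the fluent, left-to-right style without any implicits.
val result = inputDataframe
  .transform(myAdvancedFiter(10))
  .transform(myColumnInjection("prod"))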

Valery Khamenya  wrote on Mon., June 18, 2018 at 21:34:

> Dear Spark gurus,
>
> *Question*: what way would you recommend to shape a library of custom
> transformations for Dataframes/Datasets?
>
> *Details*: e.g., consider we need several custom transformations over the
> Dataset/Dataframe instances. For example injecting columns, apply specially
> tuned row filtering, lookup-table based replacements, etc.
>
> I'd consider basically 2 options:
>
> 1) implicits! Create a class that looks like it derives from
> Dataset/Dataframe and then implement the transformations as its methods
>
> or
>
> 2) implement the transformations as stand-alone functions
>
> The use of the first approach leads to such beautiful code:
>
> val result = inputDataframe
>   .myAdvancedFiter(params)
>   .myAdvancedReplacement(params)
>   .myColumnInjection(params)
>   .mySomethingElseTransformation(params)
>   .andTheFinalGoodies(params)
>
> nice!
>
> whereas the second option will lead to this:
>
> val result = andTheFinalGoodies(
>   mySomethingElseTransformation(
> myColumnInjection(
>   myAdvancedReplacement(
> inputDataframe.myAdvancedFiter(params),
> params),
>   params),
> params),
>   params)
>
> terrible! ;)
>
> So, ideally it would be nice to learn how to implement Option 1. Luckily
> there are different approaches for this:
> https://stackoverflow.com/questions/32585670/what-is-the-best-way-to-define-custom-methods-on-a-dataframe
>
> However in reality such transformations rely on
>
>   import spark.implicits._
>
> and I have never seen a solution for how to pass the SparkContext to such
> library classes and safely use it there. This article shows that it is not
> such a straightforward thing:
>
>
> https://docs.azuredatabricks.net/spark/latest/rdd-streaming/tips-for-running-streaming-apps-in-databricks.html
>
> That said, I still need the wisdom of the Spark community to get past this.
>
> P.S. A good Spark application "boilerplate" with a separately implemented
> library of Dataframe/Dataset transformations relying on "import
> spark.implicits._" is still badly wanted!
>
> best regards
> --
> Valery
>


Spark 2.4 release date

2018-06-18 Thread Li Gao
Hello,

Do we know the estimated date when Spark 2.4 will be GA?
We are evaluating whether to backport some of the 2.4 fixes into our 2.3
deployment.

Thank you.


best practices to implement library of custom transformations of Dataframe/Dataset

2018-06-18 Thread Valery Khamenya
Dear Spark gurus,

*Question*: what way would you recommend to shape a library of custom
transformations for Dataframes/Datasets?

*Details*: e.g., consider we need several custom transformations over the
Dataset/Dataframe instances. For example injecting columns, apply specially
tuned row filtering, lookup-table based replacements, etc.

I'd consider basically 2 options:

1) implicits! Create a class that looks like it derives from Dataset/Dataframe
and then implement the transformations as its methods

or

2) implement the transformations as stand-alone functions

The use of the first approach leads to such beautiful code:

val result = inputDataframe
  .myAdvancedFiter(params)
  .myAdvancedReplacement(params)
  .myColumnInjection(params)
  .mySomethingElseTransformation(params)
  .andTheFinalGoodies(params)

nice!

whereas the second option will lead to this:

val result = andTheFinalGoodies(
  mySomethingElseTransformation(
myColumnInjection(
  myAdvancedReplacement(
inputDataframe.myAdvancedFiter(params),
params),
  params),
params),
  params)

terrible! ;)

So, ideally it would be nice to learn how to implement Option 1. Luckily
there are different approaches for this:
https://stackoverflow.com/questions/32585670/what-is-the-best-way-to-define-custom-methods-on-a-dataframe
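
For reference, a minimal sketch of Option 1 along those lines, with the method
bodies and column names as placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object MyTransformations {
  // Option 1: extension methods on DataFrame via an implicit value class.
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    def myAdvancedFiter(minValue: Int): DataFrame =
      df.filter(col("value") >= minValue)

    def myColumnInjection(label: String): DataFrame =
      df.withColumn("label", lit(label))
  }
}

// Usage: import MyTransformations._ and the chained style above compiles:
// inputDataframe.myAdvancedFiter(10).myColumnInjection("prod")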

However in reality such transformations rely on

  import spark.implicits._

and I have never seen a solution for how to pass the SparkContext to such
library classes and safely use it there. This article shows that it is not
such a straightforward thing:

https://docs.azuredatabricks.net/spark/latest/rdd-streaming/tips-for-running-streaming-apps-in-databricks.html

That said, I still need the wisdom of the Spark community to get past this.

P.S. A good Spark application "boilerplate" with a separately implemented
library of Dataframe/Dataset transformations relying on "import
spark.implicits._" is still badly wanted!

best regards
--
Valery


Zstd codec for writing dataframes

2018-06-18 Thread Nikhil Goyal
Hi guys,

I was wondering if there is a way to compress output files using zstd. It seems
zstd compression can be used for shuffle data; is there a way to use it for
output data as well?

Thanks
Nikhil


Fwd: StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
*Correction, 60C2 * 3*


-- Forwarded message --
From: Aakash Basu 
Date: Mon, Jun 18, 2018 at 4:15 PM
Subject: StackOverFlow ERROR - Bulk interaction for many columns fail
To: user 


Hi,

When doing bulk interactions on around 60 columns, I want 3 new columns to be
created out of each pair of them; since there are 3 operations per combination,
that becomes 60C2 * 3, which creates a lot of columns.

So, for fewer than 50-60 columns, even though it takes time, it still works
fine, but for a slightly larger number of columns it throws this error:

  File "/usr/local/lib/python3.5/dist-packages/backend/feature_
> extraction/cont_bulk_interactions.py", line 100, in bulktransformer_pairs
> df = df.withColumn(col_name, each_op(df[var_2], df[var_1]))
>   File 
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
> line 1849, in withColumn
>   File "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.
> 10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>   File 
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 63, in deco
>   File "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.
> 10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o26887.withColumn.
> : java.lang.StackOverflowError
> at scala.collection.generic.GenericTraversableTemplate$
> class.genericBuilder(GenericTraversableTemplate.scala:70)
> at scala.collection.AbstractTraversable.genericBuilder(Traversable.
> scala:104)
> at scala.collection.generic.GenTraversableFactory$
> GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
> at scala.collection.generic.GenTraversableFactory$
> GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
> at scala.collection.TraversableLike$class.builder$
> 1(TraversableLike.scala:229)
> at scala.collection.TraversableLike$class.map(
> TraversableLike.scala:233)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>

What to do?

Thanks,
Aakash.


StackOverFlow ERROR - Bulk interaction for many columns fail

2018-06-18 Thread Aakash Basu
Hi,

When doing bulk interactions on around 60 columns, I want 3 new columns to be
created out of each pair of them; since there are 3 operations per combination,
that becomes 60N2 * 3, which creates a lot of columns.

So, for fewer than 50-60 columns, even though it takes time, it still works
fine, but for a slightly larger number of columns it throws this error:

  File
> "/usr/local/lib/python3.5/dist-packages/backend/feature_extraction/cont_bulk_interactions.py",
> line 100, in bulktransformer_pairs
> df = df.withColumn(col_name, each_op(df[var_2], df[var_1]))
>   File
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
> line 1849, in withColumn
>   File
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>   File
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 63, in deco
>   File
> "/appdata/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o26887.withColumn.
> : java.lang.StackOverflowError
> at
> scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70)
> at
> scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104)
> at
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
> at
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
> at
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>

What to do?

Thanks,
Aakash.


is spark stream-stream joins in update mode targeted for 2.4?

2018-06-18 Thread kant kodali
Hi All,

Are stream-stream joins in update mode targeted for Spark 2.4?

Thanks!