Thanks for your response. What do you mean when you said "immediately
return"?
On Wed, Mar 28, 2018, 10:33 PM Jörn Franke wrote:
> I don’t think select * is a good benchmark. You should do a more complex
> operation, otherwise optimizers might see that you don’t do anything in the
> query and immediately return (similarly count might immediately return by
> using some statistics).
I don’t think select * is a good benchmark. You should do a more complex
operation, otherwise optimizers might see that you don’t do anything in the
query and immediately return (similarly count might immediately return by using
some statistics).
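A hedged example of such an operation, assuming a SparkSession named spark (the
names store_sales, ss_item_sk, and ss_ext_sales_price come from the TPC-DS
schema the benchmark uses, not from the original query): aggregate and sort so
the engine must actually scan and shuffle, then force execution:

val top = spark.sql("""
  SELECT ss_item_sk, SUM(ss_ext_sales_price) AS revenue
  FROM store_sales
  GROUP BY ss_item_sk
  ORDER BY revenue DESC
  LIMIT 100""")
top.collect()  // materialize the result so the whole plan actually runs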
> On 29. Mar 2018, at 02:03, Tin Vu wrote:
Hi Team,
I am new to Spark. My requirement: I have a huge list, which is converted
to a Spark Dataset. I need to operate on this Dataset, store the computed
values in another object/Dataset, and keep them in memory for further
processing.
The approach I tried is: the list is retrieved from a third party in
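A minimal sketch of that pipeline, assuming a SparkSession named spark; the
Record type, fetchFromThirdParty, and transform are hypothetical stand-ins for
the types and logic the mail does not show:

import spark.implicits._

val hugeList: Seq[Record] = fetchFromThirdParty()  // hypothetical third-party call
val ds = hugeList.toDS()                           // list -> Dataset
val computed = ds.map(transform)                   // hypothetical per-record computation
computed.cache()                                   // keep the derived Dataset in memory
computed.count()                                   // materialize the cache once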
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructType}

// `schema` holds the schema as a JSON string (e.g. produced by df.schema.json)
val test_schema = DataType.fromJson(schema).asInstanceOf[StructType]
val session = SparkHelper.getSparkSession
val df1: DataFrame = session.read
  .format("json")
  .schema(test_schema)             // use the supplied schema
  .option("inferSchema", "false")  // and skip inference
  .option("mode", "FAILFAST")      // fail on malformed records
  .load("src/test/resources/*.gz")
df1.show(80)
On
I've had more success exporting the schema to JSON and importing that.
Something like:
// export: serialize the source DataFrame's schema to a JSON string
// (df0 is a stand-in for the DataFrame whose schema you are exporting)
val schemaJson = df0.schema.json
// import: rebuild the StructType and supply it to the reader
val test_schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df1: DataFrame = session.read
  .format("json")
  .schema(test_schema)
  .option("inferSchema", "false")
  .option("mode", "FAILFAST")
  .load("src/test/resources/*.gz")
df1.show(80)
On Wed, Mar 28, 2018 at
Hi,
I am executing a benchmark to compare the performance of Spark SQL, Apache
Drill, and Presto. My experimental setup:
- TPC-DS dataset with scale factor 100 (size 100 GB).
- Spark, Drill, and Presto have the same number of workers: 12.
- Each worker has the same allocated amount of memory: 4 GB.
- Da
The toString representation looks like the following, where each "someName" is
unique:
StructType(
  StructField("someName", StringType, true),
  StructField("someName", StructType(
    StructField("someName", StructType(
      StructField("someName", StringType, true),
      StructField("someName", StringType, true)), true),
    StructField("someName
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (keyed by user id), which stores the state of each
user and emits events every minute until an expire event is received (a
minimal sketch of this step follows below).
2. The next step i
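A hedged sketch of step 1, assuming a streaming Dataset[Event] named events and
a SparkSession named spark; the case classes and field names are illustrative
stand-ins, not the original code:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class Event(userId: String, expire: Boolean)
case class SessionUpdate(userId: String, count: Long, closed: Boolean)

val updates = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState[Long, SessionUpdate](
      OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout) {
    (userId, batch, state) =>
      val count = state.getOption.getOrElse(0L)
      val evs = batch.toSeq
      if (state.hasTimedOut || evs.exists(_.expire)) {
        state.remove()                        // expire event or timeout: drop state
        Iterator(SessionUpdate(userId, count, closed = true))
      } else {
        state.update(count + evs.size)        // accumulate per-user state
        state.setTimeoutDuration("1 minute")  // fire again in one minute
        Iterator(SessionUpdate(userId, count + evs.size, closed = false))
      }
  }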
I've been learning Spark SQL and have been trying to export and import
some of the generated schemas to edit them. I've been writing the
schemas to strings with df1.schema.toString() and
df.schema.catalogString, but I've been having trouble loading the
schemas back. Does anyone know if it's poss
I suppose that it's hardly possible that this issue is connected with
the string encoding, because
- "pr^?files.10056.10040" should be "profiles.10056.10040" and is
defined as constant in the source code
-
"profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@
Encoding issue with the data? E.g. Spark uses UTF-8, but the source encoding is
different?
> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote:
>
> Hello guys,
>
> I'm using Spark 2.2.0, and from time to time my job fails, printing the
> following errors into the log:
>
> scala.MatchError:
> profiles
Hello guys,
I'm using Spark 2.2.0, and from time to time my job fails, printing the
following errors into the log:
scala.MatchError:
profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
sc
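A hedged illustration (not the job's actual code) of how such a failure arises:
a non-exhaustive pattern match throws scala.MatchError when fed a key corrupted
with NUL (^@) or DEL (^?) bytes like the ones in the log above:

def bucket(metric: String): String = metric match {
  case "profiles.total"               => "total"
  case m if m.startsWith("profiles.") => "per-profile"
  // no catch-all case, so any other string throws scala.MatchError
}

bucket("pr\u007Ffiles.10056.10040")  // scala.MatchError (of class java.lang.String)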
Hi:
I am using Spark Structured Streaming 2.2.1 with flatMapGroupsWithState
and a groupBy count operator.
In the StreamExecution logs I see two entries for stateOperators:
"stateOperators" : [ {
  "numRowsTotal" : 1617339,
  "numRowsUpdated" : 9647
}, {
  "numRowsTotal" : 1326355,
I have been running into this as well, but I am using S3 for checkpointing
so I chalked it up to network partitioning with s3-isnt-hdfs as my storage
location. But it seems that you are indeed using hdfs, so I wonder if there
is another underlying issue.
On Wed, Mar 28, 2018 at 8:21 AM, Jone Zhang
The Spark Streaming job had been running for a few days, then failed as below.
What is the possible reason?
*18/03/25 07:58:37 ERROR yarn.ApplicationMaster: User class threw
exception: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 16 in stage 80018.0 failed 4 times, most recent failur
Quick comment:
Excel CSV (a very special case, though) supports arrays in CSV using "\n"
inside quotes, but you have to use "\r\n" (Windows EOL) as the row terminator;
a sketch follows below.
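A hedged way to produce that from Spark (the "tags" array column and the output
path are hypothetical): join the array into a newline-separated string and
quote every field so the embedded "\n" stays inside the cell:

import org.apache.spark.sql.functions.{col, concat_ws}

val flattened = df.withColumn("tags", concat_ws("\n", col("tags")))
flattened.write
  .option("quoteAll", "true")  // quote all fields so embedded newlines survive
  .csv("out/dir")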
Cheers,
Jiri
2018-03-28 14:14 GMT+02:00 Yong Zhang :
> Your dataframe has array data type, which is NOT supported by CSV. How csv
> f
Your DataFrame has an array data type, which is NOT supported by CSV. How can a
CSV file include an array or other nested structure?
If you want your data to be human-readable text, then write it out as JSON in
your case.
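For example (a minimal sketch; "output/dir" is a placeholder path):

df.write.json("output/dir")  // arrays and nested structs serialize to JSON natively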
Yong
From: Mina Aslani
Sent: Wednesday, March 28
I'm using spark-unit-test and I can't get the code to compile.
test("Testing") {
  val inputInsert = A("data2")
  val inputDelete = A("data1")
  val outputInsert = B(1)
  val outputDelete = C(1)
  val input = List(List(inputInsert), List(inputDelete))
  val output = (List(List(outp
Hi Michael,
I think that is what I am trying to show here, as the documentation mentions:
"NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS
(Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the
cluster manager."
So, in a way, I am supporting your statement
Hi,
this property will be used in YARN mode only by the driver.
Executors will use the properties coming from YARN for storing temporary
files.
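A hedged sketch of the driver-side setting (the path is illustrative); on YARN,
executors take their scratch directories from the NodeManager's
yarn.nodemanager.local-dirs, which YARN exposes to containers as LOCAL_DIRS:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark-tmp")  // honored by the client-mode driver only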
Best,
Michael
On Wed, Mar 28, 2018 at 7:37 AM, Gourav Sengupta
wrote:
> Hi,
>
>
> As per documentation in: https://spark.apache.org/
> docs/latest/co
Hi,
Here is an example snippet in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, to_date}

// Convert to a Date type (df must be a var, or assign to a new val)
val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
df = df.withColumn("date", timestamp2datetype(col("end_date")))
Hope this helps!
Thanks,
Divya
On 28 March 2018 at 15:16, Junfeng Chen wrote:
> I am working on
I am working on adding a date-transformed field to an existing dataset.
The current dataset contains a column named timestamp in ISO format. I want
to parse this field to a Joda time type, and then extract the year, month,
day, and hour info as new columns attached to the original dataset.
I have tried df.withC
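A hedged sketch using Spark's built-in date functions rather than Joda time
(only the "timestamp" column name comes from the mail; the rest is
illustrative):

import org.apache.spark.sql.functions._

val enriched = df
  .withColumn("ts", to_timestamp(col("timestamp")))  // parse the ISO-8601 string
  .withColumn("year", year(col("ts")))
  .withColumn("month", month(col("ts")))
  .withColumn("day", dayofmonth(col("ts")))
  .withColumn("hour", hour(col("ts")))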