Spark Streaming Custom Receiver Anomaly

2018-02-20 Thread Thakrar, Jayesh
Hi All, I am trying to "test" a very simple custom receiver and am a little puzzled. Using the Spark 2.2.0 shell on my laptop, I am running the code below. I was expecting the code to time out since my timeout wait period is 1 ms and I have a sleep in the class that is much longer (1200 ms). Is this n
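
A minimal sketch of the kind of receiver and timeout being described (the class name, the 1200 ms sleep, and the spark-shell usage are assumptions based on the snippet, not the original code):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that sleeps longer (1200 ms) than the 1 ms wait below
class SlowReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    new Thread("slow-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          Thread.sleep(1200)                           // simulated slow source
          store("tick " + System.currentTimeMillis())  // hand data to Spark
        }
      }
    }.start()
  }
  override def onStop(): Unit = {}  // generator thread checks isStopped() and exits
}

// In spark-shell, sc is the existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))
ssc.receiverStream(new SlowReceiver).print()
ssc.start()
ssc.awaitTerminationOrTimeout(1)   // the 1 ms wait that was expected to time out
```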

SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Thakrar, Jayesh
Hi All, I was just toying with creating a very rudimentary RDD datasource to understand the inner workings of RDDs. It seems that one of the constructors for RDD has a parameter of type SparkContext, but it (apparently) exists on the driver only and is not serializable. Consequently, any atte
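
For illustration, a minimal custom RDD along the lines described; the class and field names are hypothetical. The RDD base class itself marks the SparkContext constructor argument as @transient, so it is usable on the driver (e.g. in getPartitions) but must not be referenced inside compute():

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition carrying only serializable state
case class MyPartition(index: Int, rowCount: Int) extends Partition

// The SparkContext passed to the superclass lives on the driver only;
// compute() runs on executors and relies solely on the Partition contents
class MyRDD(sc: SparkContext, numPartitions: Int, rowsPerPartition: Int)
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numPartitions).map(i => MyPartition(i, rowsPerPartition)).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[MyPartition]
    (1 to p.rowCount).iterator.map(r => s"Partition ${p.index}, row $r of ${p.rowCount}")
  }
}
```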

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Thakrar, Jayesh
ows = myDataSourcePartition.rowCount val partitionData = 1 to rows map(r => Row(s"Partition: ${partitionId}, row ${r} of ${rows}")) partitionData.iterator } } From: Wenchen Fan Date: Wednesday, February 28, 2018 at 12:25 PM To: "Thakrar, Jayesh" Cc: "dev@

Cannot create custom streaming source in Spark 2.3.0

2018-03-19 Thread Thakrar, Jayesh
I am trying to create a custom streaming source in Spark 2.3.0 and getting the following error: scala> 2018-03-19 17:43:20 ERROR MicroBatchExecution:91 - Query [id = 48bb7a4c-7c66-4ad3-926b-81f8369a6efb, runId = 50800f9b-434d-43df-8d6a-3e0fdc865aeb] terminated with error java.lang.AssertionErro

Re: "Spark.jars not adding jars to classpath"

2018-03-22 Thread Thakrar, Jayesh
Is this in spark-shell or a spark-submit job? If it is a spark-submit job, is it running in local or cluster mode? One reliable way of adding jars is to use the command-line option "--jars". See http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management for more info. If you add ja
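
A hedged sketch of the approaches mentioned above (the paths and class name are placeholders, not from the thread):

```scala
// Recommended: pass dependencies at submit time; a comma-separated --jars list is
// shipped to the driver and executors, e.g.
//   spark-submit --class com.example.MyApp --jars /path/to/dep1.jar,/path/to/dep2.jar myapp.jar

// From spark-shell or a running application a jar can also be added at runtime.
// Note: addJar makes the jar available to tasks on the executors; it does not
// put the classes on the driver's classpath.
spark.sparkContext.addJar("/path/to/dep1.jar")
```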

Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
Because these are not exposed in the usual API, it is not possible (or at least difficult) to create custom structured streaming sources. Consequently, one has to create streaming sources in packages under org.apache.spark.sql. Any pointers or info would be greatly appreciated.

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
.get(0).asInstanceOf[String]))) sqlContext.internalCreateDataFrame(internalRow, schema, isStreaming = true) } } From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Thursday, March 22, 2018 at 1:45 PM To: "Thakrar, Jayesh" Cc: "dev@spark.apache.org" Subject: Re: Any reason for no

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Thakrar, Jayesh
g" to true. But I should confess that I don't know the source code very well, so will appreciate if you can point me to any other pointers/examples please. From: Wenchen Fan Date: Thursday, March 22, 2018 at 2:52 PM To: "Thakrar, Jayesh" Cc: "rb...@netflix.com" , &

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-23 Thread Thakrar, Jayesh
mentioned and that requires the data sources to generate RDD[Row], which is what I was looking for. And yes, I understand that the API is still in flux and subject to change :) Thanks again to both you and Ryan! Jayesh From: Wenchen Fan Date: Thursday, March 22, 2018 at 6:59 PM To: "Thakrar, Jayesh

Spark 2.3 V2 Datasource API questions

2018-04-06 Thread Thakrar, Jayesh
First of all, thank you to the Spark dev team for coming up with the standardized and intuitive API interfaces. I am sure it will encourage a lot more new datasource integrations. I have been playing with the API and have some questions on the continuous streaming API (see htt

Re: Spark 2.3 V2 Datasource API questions

2018-04-06 Thread Thakrar, Jayesh
Thank you Jose for the quick reply! I have made myself a watcher on them. From: Joseph Torres Date: Friday, April 6, 2018 at 10:41 AM To: "Thakrar, Jayesh" Cc: "dev@spark.apache.org" Subject: Re: Spark 2.3 V2 Datasource API questions Thanks for trying it out! We haven

V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
In browsing through the API docs, the links to GitHub source code seem to be pointing to a dev branch rather than the release branch. Here's one example: go to the API doc page below and click on the "ProcessingTime.scala" link, which points to Sameer's dev branch. http://spark.apache.org/docs/lat

Re: V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
Thanks Sameer! From: Sameer Agarwal Date: Sunday, April 15, 2018 at 10:02 PM To: "Thakrar, Jayesh" Cc: "dev@spark.apache.org" , Hyukjin Kwon Subject: Re: V2.3 Scala API to Github Links Incorrect [+Hyukjin] Thanks for flagging this Jayesh. https://github.com/apache/spark

Datasource API V2 and checkpointing

2018-04-23 Thread Thakrar, Jayesh
I was wondering when checkpointing is enabled, who does the actual work? The streaming datasource or the execution engine/driver? I have written a small/trivial datasource that just generates strings. After enabling checkpointing, I do see a folder being created under the checkpoint folder, but t

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Wondering if this issue is related to SPARK-23323? Any pointers will be greatly appreciated…. Thanks, Jayesh From: "Thakrar, Jayesh" Date: Monday, April 23, 2018 at 9:49 PM To: "dev@spark.apache.org" Subject: Datasource API V2 and checkpointing I was wondering when chec

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Thanks Joseph! From: Joseph Torres Date: Friday, April 27, 2018 at 11:23 AM To: "Thakrar, Jayesh" Cc: "dev@spark.apache.org" Subject: Re: Datasource API V2 and checkpointing The precise interactions with the DataSourceV2 API haven't yet been hammered out in desig

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Thakrar, Jayesh
From: Joseph Torres Sent: Tuesday, May 1, 2018 1:58:54 PM To: Ryan Blue Cc: Thakrar, Jayesh; dev@spark.apache.org Subject: Re: Datasource API V2 and checkpointing I agree that Spark should fully handle state serialization and recovery for most sources. This is how

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
One option is to use plain JDBC to interrogate the PostgreSQL catalog for the source table and generate the DDL to create the destination table. Then, using plain JDBC again, create the table at the destination. See the link below for some pointers….. https://stackoverflow.com/questions/2593803/how-t
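
A rough sketch of that idea, assuming hypothetical connection details and a deliberately simplified type mapping (information_schema is queried here; the PostgreSQL pg_catalog tables would work just as well):

```scala
import java.sql.DriverManager

// Placeholder connection details for the source database
val url  = "jdbc:postgresql://source-host:5432/sourcedb"
val conn = DriverManager.getConnection(url, "user", "password")

val stmt = conn.prepareStatement(
  """SELECT column_name, data_type
    |FROM information_schema.columns
    |WHERE table_schema = ? AND table_name = ?
    |ORDER BY ordinal_position""".stripMargin)
stmt.setString(1, "public")
stmt.setString(2, "source_table")

val rs = stmt.executeQuery()
val cols = Iterator.continually(rs)
  .takeWhile(_.next())
  .map { r =>
    val name = r.getString("column_name")
    val typ  = r.getString("data_type")
    s"$name $typ"
  }
  .toList

// Build the destination DDL and run it over a second plain JDBC connection
val ddl = cols.mkString("CREATE TABLE target_table (\n  ", ",\n  ", "\n)")
conn.close()
```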

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
data. I have lots of tables to transport from one environment to another and I don’t want to create unnecessary load on the DB. On 7/12/18, 10:09 AM, "Thakrar, Jayesh" wrote: One option is to use plain JDBC to interrogate Postgresql catalog for the source tab

Re: data source api v2 refactoring

2018-09-07 Thread Thakrar, Jayesh
Ryan et al, I am wondering if the existing Spark-based data sources (e.g. for HDFS, Kafka) have been ported to V2. I remember reading threads discussing the inefficiency/overhead of converting from Row to InternalRow that was preventing certain porting efforts, etc. I ask beca

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Thakrar, Jayesh
So if Spark and the destination datastore are both non-transactional, you will have to resort to an external mechanism for “transactionality”. Here are some options for both RDBMS and non-transactional datastore destinations. For now, assuming that Spark is used in batch mode (and not streaming mode)

Re: DataSourceWriter V2 Api questions

2018-09-13 Thread Thakrar, Jayesh
connector may not be good and flexible enough. From: Russell Spitzer Date: Tuesday, September 11, 2018 at 9:58 AM To: "Thakrar, Jayesh" Cc: Arun Mahadevan , Jungtaek Lim , Wenchen Fan , Reynold Xin , Ross Lawley , Ryan Blue , dev , "dbis...@us.ibm.com" Subject: Re: Dat

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread Thakrar, Jayesh
I am not involved with the design or development of the V2 API - so these could be naïve comments/thoughts. Just as Dataset is meant to abstract away from RDD, which otherwise requires a little more intimate knowledge of Spark internals, I am guessing the absence of partition operations is either d

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-18 Thread Thakrar, Jayesh
Totally agree with you, Dale, that there are situations where, for efficiency, performance, and better control/visibility/manageability, we need to expose partition management. So as described, I suggested two things - the ability to do it in the current V2 API form via options and appropriate imple

Re: data source api v2 refactoring

2018-09-19 Thread Thakrar, Jayesh
Thanks for the info Ryan – very helpful! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 3:17 PM To: "Thakrar, Jayesh" Cc: Wenchen Fan , Hyukjin Kwon , Spark Dev List Subject: Re: data source api v2 refactoring Hi Jayesh, The exis

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Thakrar, Jayesh
platform. Similar to the static/dynamic partition loading in Hive and Oracle. So in short, I agree that partition management should be an optional interface. From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 2:58 PM To: "Thakrar, Jayesh

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-20 Thread Thakrar, Jayesh
t; Date: Wednesday, September 19, 2018 at 4:35 PM To: "Thakrar, Jayesh" Cc: "tigerqu...@outlook.com" , Spark Dev List Subject: Re: [Discuss] Datasource v2 support for manipulating partitions What does partition management look like in those systems and what are the optio

Double pass over ORC data files even after supplying schema and setting inferSchema = false

2018-11-21 Thread Thakrar, Jayesh
Hi All, We have some batch processing where we read hundreds of thousands of ORC files. What I found is that this was taking too much time AND that there was a long pause between the point where the read begins in the code and when the executors get into action. That period is about 1.5+ hours, where only the d
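
A minimal sketch of supplying the schema up front so the reader does not have to derive it from the files; the schema, column names, and path below are placeholders, not the thread's actual data:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema matching the ORC files
val orcSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("event_time", TimestampType),
  StructField("payload", StringType)))

// spark is the SparkSession (available in spark-shell); providing the schema
// explicitly is intended to avoid a driver-side pass over the file footers
val df = spark.read
  .schema(orcSchema)
  .orc("hdfs:///data/events/*.orc")   // placeholder path
```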

Re: Double pass over ORC data files even after supplying schema and setting inferSchema = false

2018-11-21 Thread Thakrar, Jayesh
Thank you for the quick reply, Dongjoon. This sounds interesting and it might be the resolution for our issue. Let me do some tests and I will update the thread. Thanks, Jayesh From: Dongjoon Hyun Date: Wednesday, November 21, 2018 at 11:46 AM To: "Thakrar, Jayesh" Cc: dev Subject:

Re: DataSourceV2 community sync #3

2018-12-01 Thread Thakrar, Jayesh
Just curious on the need for a catalog within Spark. Spark interfaces with other systems – many of which have a catalog of their own (e.g. RDBMSes, HBase, Cassandra, etc.) and some of which don’t (e.g. HDFS, filesystems, etc.). So what is the purpose of having this catalog within Spark for tables defin

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
To: "Thakrar, Jayesh" Cc: Ryan Blue , "u...@spark.apache.org" Subject: Re: DataSourceV2 community sync #3 Hi, Jayesh, This is a good question. Spark is a unified analytics engine for various data sources. We are able to get the table schema from the underlying data sources

Re: DataSourceV2 community sync #3

2018-12-03 Thread Thakrar, Jayesh
Thank you Ryan and Xiao – sharing all this info really gives a very good insight! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Monday, December 3, 2018 at 12:05 PM To: "Thakrar, Jayesh" Cc: Xiao Li , Spark Dev List Subject: Re: DataSourceV2 community sync #3 Ja

Re: Feature request: split dataset based on condition

2019-02-04 Thread Thakrar, Jayesh
Just wondering if this is what you are implying, Ryan (example only): val data = (dataset to be partitioned) val splitCondition = s""" CASE WHEN …. THEN …. WHEN …. THEN ….. END partition_condition """ val partitionedData = data.withColumn("partitionColumn", expr(splitCondi
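
A completed sketch of the idea in the snippet above; the conditions, column names, and output path are placeholders:

```scala
import org.apache.spark.sql.functions.expr

val data = spark.range(100).toDF("value")   // stand-in for the dataset to be partitioned

val splitCondition =
  """CASE
    |  WHEN value < 30 THEN 'low'
    |  WHEN value < 70 THEN 'mid'
    |  ELSE 'high'
    |END""".stripMargin

// Tag each row with the split it belongs to
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

// Each split can then be selected with a filter, or written out partitioned by the new column
val low = partitionedData.filter("partitionColumn = 'low'")
partitionedData.write.partitionBy("partitionColumn").parquet("/tmp/split-output")
```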

Re: Can I add a new method to RDD class?

2016-12-05 Thread Thakrar, Jayesh
Teng, before you go down the path of creating your own custom Spark system, do give some thought to what Holden and others are suggesting, viz. using implicit methods. If you want real, concrete examples, have a look at the Spark Cassandra Connector - there you will see an example of "extending" SparkContex
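
A small sketch of the implicit-method approach being suggested; the object, class, and method names are hypothetical, not taken from the connector:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MySparkExtensions {
  // Implicit value class: greetingRDD appears as if it were a method on SparkContext itself
  implicit class RichSparkContext(val sc: SparkContext) extends AnyVal {
    def greetingRDD(n: Int): RDD[String] =
      sc.parallelize(1 to n).map(i => s"hello $i")
  }
}

// Usage:
//   import MySparkExtensions._
//   val rdd = sc.greetingRDD(10)
```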

Re: Expand the Spark SQL programming guide?

2016-12-15 Thread Thakrar, Jayesh
I too am interested in expanding the documentation for Spark SQL. For my work I needed to get some info/examples/guidance on window functions and have been using https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html . How about divide and conquer? From: Michael
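
A small illustration of the window functions covered in the blog post referenced above; the sample data and column names are made up:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, sum}
// import spark.implicits._   // already in scope in spark-shell

val sales = Seq(
  ("east", "alice", 100), ("east", "bob", 80),
  ("west", "carol", 120), ("west", "dave", 90)
).toDF("region", "rep", "amount")

// Rank reps within each region by amount, and compute each region's total
val byRegion = Window.partitionBy("region").orderBy($"amount".desc)

sales
  .withColumn("rank_in_region", rank().over(byRegion))
  .withColumn("region_total", sum("amount").over(Window.partitionBy("region")))
  .show()
```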

Re: Expand the Spark SQL programming guide?

2016-12-16 Thread Thakrar, Jayesh
this part. Jayesh, shall I skip the window functions part since you are going to work on that? 2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh <jthak...@conversantmedia.com>: I too am interested in expanding the documentation for Spark SQL. For my work I needed to get some info/exam