Re: can't download 2.4.1 sourcecode

2019-04-22 Thread 1101300123
I got it from GitHub and am building it now, but I hope someone can fix the website
so more people can use it.


1101300123
Email: hdxg1101300...@163.com

On 04/23/2019 11:56, Andrew Melo wrote:
On Mon, Apr 22, 2019 at 10:54 PM yutaochina  wrote:
>
>  
>
> When I want to download the source code, I find that it does not work.
>

In the interim -- https://github.com/apache/spark/archive/v2.4.1.tar.gz

>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: can't download 2.4.1 sourcecode

2019-04-22 Thread Andrew Melo
On Mon, Apr 22, 2019 at 10:54 PM yutaochina  wrote:
>
> 
> 
>
>
> When I want to download the source code, I find that it does not work.
>

In the interim -- https://github.com/apache/spark/archive/v2.4.1.tar.gz

>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello Jason, thank you for the reply. My use case is that the first time I do a
full load with transformations/aggregations/joins and write to Parquet (as
staging), but from the next run onwards my source is MSSQL Server. I want to
pull only the records that were changed / updated and, if possible, update them
in Parquet as well, without side effects.
https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/work-with-change-tracking-sql-server?view=sql-server-2017
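
For reference, a rough sketch of the kind of merge-and-rewrite pass this could
look like in Spark. The connection details, the change-tracking query, the `id`
key column and all paths below are hypothetical, so treat it as a sketch rather
than a recipe:

```scala
// Hypothetical merge of change-tracked MSSQL rows into a Parquet staging area.
// Parquet files are immutable, so the "update" is really a rewrite.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parquet-incremental-merge").getOrCreate()

// Rows changed since the last sync. The inner query is a placeholder for a
// SQL Server change-tracking query (CHANGETABLE / CHANGE_TRACKING_CURRENT_VERSION).
val changes = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=mydb")
  .option("dbtable", "(SELECT * FROM dbo.orders_changed_rows) AS src")
  .option("user", "spark_user")
  .option("password", sys.env.getOrElse("MSSQL_PASSWORD", ""))
  .load()

val existing = spark.read.parquet("/staging/orders")

// Drop the old versions of the changed rows, then append the new versions.
val merged = existing
  .join(changes.select("id"), Seq("id"), "left_anti")
  .unionByName(changes)

// Write to a fresh location and swap it in afterwards, rather than trying to
// mutate the existing files in place.
merged.write.mode("overwrite").parquet("/staging/orders_v2")
```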

On Tue, Apr 23, 2019 at 3:02 AM Jason Nerothin 
wrote:

> Hi Chetan,
>
> Do you have to use Parquet?
>
> It just feels like it might be the wrong sink for a high-frequency change
> scenario.
>
> What are you trying to accomplish?
>
> Thanks,
> Jason
>
> On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri 
> wrote:
>
>> Hello All,
>>
>> If I am doing an incremental / delta load and would like to update / delete
>> records in Parquet, I understand that Parquet is immutable and in theory can
>> only be appended to / overwritten, not updated / deleted. But I can see
>> utility tools which claim to add value for that.
>>
>> https://github.com/Factual/parquet-rewriter
>>
>> Please shed some light on this.
>>
>> Thanks
>>
>
>
> --
> Thanks,
> Jason
>


can't download 2.4.1 sourcecode

2019-04-22 Thread yutaochina
When I want to download the source code, I find that it does not work.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-22 Thread Qian He
Hi all,

I'm using the Spark-provided LogisticRegression to fit a dataset. Each row of
the data has 1.7 million columns, but it is sparse, with only hundreds of 1s.
The Spark UI reports high GC time while the model is being trained, and my
Spark application gets stuck without any response. I have allocated 100
executors with 8g each.

Is there anything I should do to make the training process succeed?


Re: Connecting to Spark cluster remotely

2019-04-22 Thread Andrew Melo
Hi Rishikesh,

On Mon, Apr 22, 2019 at 4:26 PM Rishikesh Gawade
 wrote:
>
> To put it simply, what are the configurations that need to be done on the
> client machine so that it can run the driver on itself and the executors on
> the Spark-on-YARN cluster nodes?

TBH, if it were me, I would simply SSH to the cluster and start the
spark-shell there. I don't think there's any special spark
configuration you need, but depending on what address space your
cluster is using/where you're connecting from, it might be really hard
to get all the networking components lined up.
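
If you do try to run the driver off-cluster, these are the settings I'd expect
to matter (hostnames and ports below are placeholders; HADOOP_CONF_DIR /
YARN_CONF_DIR still have to point at the cluster's config, as described in the
original mail):

```scala
// Rough sketch of a yarn client-mode session whose driver runs on the client
// machine. The cluster nodes must be able to reach the driver on these
// addresses/ports, which is exactly the networking part that tends to hurt.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("remote-yarn-client")
  .master("yarn")
  // Address the executors and the application master use to call back into the driver.
  .config("spark.driver.host", "client.example.com")
  // Fixed ports so they can be opened in whatever firewall/NAT sits between.
  .config("spark.driver.port", "40000")
  .config("spark.blockManager.port", "40001")
  .getOrCreate()

// A trivial action: if this hangs, the executors cannot reach the driver.
spark.range(1000).count()
```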

>
> On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade  
> wrote:
>>
>> Hi.
>> I have been experiencing trouble while trying to connect to a Spark cluster 
>> remotely. This Spark cluster is configured to run using YARN.
>> Can anyone guide me or provide any step-by-step instructions for connecting 
>> remotely via spark-shell?
>> Here's the setup that I am using:
>> The Spark cluster is running with each node as a docker container hosted on 
>> a VM. It is using YARN for scheduling resources for computations.
>> I have a dedicated docker container acting as a Spark client, on which I
>> have spark-shell installed (Spark binaries in a standalone setup) and also
>> the Hadoop and YARN config directories set, so that spark-shell can
>> coordinate with the RM for resources.
>> With all of this set, I tried using the following command:
>>
>> spark-shell --master yarn --deploy-mode client
>>
>> This results in spark-shell giving me a Scala-based console; however,
>> when I check the Resource Manager UI on the cluster, there seems to be no
>> application / Spark session running.
>> I have been expecting the driver to be running on the client machine and the 
>> executors running in the cluster. But that doesn't seem to happen.
>>
>> How can I achieve this?
>> Is whatever I am trying feasible, and if so, a good practice?
>>
>> Thanks & Regards,
>> Rishikesh

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan,

Do you have to use Parquet?

It just feels like it might be the wrong sink for a high-frequency change
scenario.

What are you trying to accomplish?

Thanks,
Jason

On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri 
wrote:

> Hello All,
>
> If I am doing an incremental / delta load and would like to update / delete
> records in Parquet, I understand that Parquet is immutable and in theory can
> only be appended to / overwritten, not updated / deleted. But I can see
> utility tools which claim to add value for that.
>
> https://github.com/Factual/parquet-rewriter
>
> Please shed some light on this.
>
> Thanks
>


-- 
Thanks,
Jason


Re: Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
To put it simply, what are the configurations that need to be done on the
client machine so that it can run the driver on itself and the executors on
the Spark-on-YARN cluster nodes?

On Mon, Apr 22, 2019, 8:22 PM Rishikesh Gawade 
wrote:

> Hi.
> I have been experiencing trouble while trying to connect to a Spark
> cluster remotely. This Spark cluster is configured to run using YARN.
> Can anyone guide me or provide any step-by-step instructions for
> connecting remotely via spark-shell?
> Here's the setup that I am using:
> The Spark cluster is running with each node as a docker container hosted
> on a VM. It is using YARN for scheduling resources for computations.
> I have a dedicated docker container acting as a Spark client, on which I
> have spark-shell installed (Spark binaries in a standalone setup) and also
> the Hadoop and YARN config directories set, so that spark-shell can
> coordinate with the RM for resources.
> With all of this set, I tried using the following command:
>
> spark-shell --master yarn --deploy-mode client
>
> This results in spark-shell giving me a Scala-based console; however,
> when I check the Resource Manager UI on the cluster, there seems to be no
> application / Spark session running.
> I have been expecting the driver to be running on the client machine and
> the executors running in the cluster. But that doesn't seem to happen.
>
> How can I achieve this?
> Is whatever I am trying feasible, and if so, a good practice?
>
> Thanks & Regards,
> Rishikesh
>


Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello All,

If I am doing an incremental / delta load and would like to update / delete
records in Parquet, I understand that Parquet is immutable and in theory can
only be appended to / overwritten, not updated / deleted. But I can see
utility tools which claim to add value for that.

https://github.com/Factual/parquet-rewriter

Please shed some light on this.

Thanks


Re: How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-22 Thread Tathagata Das
SQL windows with the 'over' syntax do not work in Structured Streaming. They
are very hard to incrementalize in the general case, hence non-time windows
are not supported.
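
For contrast, this is roughly what the supported, time-based form looks like
(a sketch assuming a streaming DataFrame `flights` that already carries
flightTime, Origin and OnTimeDepPct columns):

```scala
// Sketch of a time-based window aggregation, the only kind of window that
// Structured Streaming can incrementalize. `flights` is assumed to be a
// streaming DataFrame with flightTime (timestamp), Origin and OnTimeDepPct.
import org.apache.spark.sql.functions.{avg, col, window}

val byWindow = flights
  .withWatermark("flightTime", "1 hour")
  .groupBy(window(col("flightTime"), "10 minutes"), col("Origin"))
  .agg(avg(col("OnTimeDepPct")).as("avgOnTimeDepPct"))

byWindow.writeStream
  .outputMode("update")
  .format("console")
  .start()
```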

On Sat, Apr 20, 2019, 2:16 PM Stephen Boesch  wrote:

> Consider the following *intended* sql:
>
> select row_number()
>   over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,*
>   from flights
>
> This will *not* work in *structured streaming* : The culprit is:
>
>  partition by Origin
>
> The requirement is to use a timestamp-typed field such as
>
>  partition by flightTime
>
> Tathagata Das (core committer for *Spark Streaming*) replied to that in
> a Nabble thread:
>
>  The traditional SQL windows with `over` is not supported in streaming.
> Only time-based windows, that is, `window("timestamp", "10 minutes")` is
> supported in streaming
>
> *What then* for my query above, which *must* be based on the *Origin* field?
> What is the closest equivalent to that query? Or what would be a workaround
> or different approach to achieve the same results?
>


Re: Use derived column for other derived column in the same statement

2019-04-22 Thread Vipul Rajan
Hi Rishi,

TL;DR Using Scala, this would work:
df.withColumn("derived1", lit("something")).withColumn("derived2",
col("derived1") === "something")

Just note that I used three equals signs instead of two. That should be
enough; if you want to understand why, read further.

Scala's "==" returns a Boolean, which is not what you want; that's why you
wrap your string "something" in lit() in the first withColumn statement.
lit() turns your String into an org.apache.spark.sql.Column, which is the type
the withColumn function accepts.
Alternatively, lit(col("derived1") == "something") would be syntactically valid
and not throw any errors, but it would always be false: you are not checking
the values in the column derived1, you are merely testing whether
col("derived1"), which is of type org.apache.spark.sql.Column, is the same as
"something", which is of type String, and that is obviously false.

below is the output of my spark shell:
scala> col("asdf") == col("asdf")
res5: Boolean = true

scala> col("derived1") == "something"
res6: Boolean = false

What you want is for your expression to return an org.apache.spark.sql.Column.
Take a look at the Scaladoc below and scroll down to the "===" function; you
will see that it returns an org.apache.spark.sql.Column.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column@===(other:Any):org.apache.spark.sql.Column

It doesn't explicitly say so, but with === you actually compare the values
in column "derived1" against the string "something".
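
A small runnable version of the same point, assuming a spark-shell session
(so `spark.implicits._` is available):

```scala
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val df = Seq("a", "b").toDF("raw")
  .withColumn("derived1", lit("something"))                 // Column literal, not a Boolean
  .withColumn("derived2", col("derived1") === "something")  // Column expression, evaluated per row

df.show()
// +---+---------+--------+
// |raw| derived1|derived2|
// +---+---------+--------+
// |  a|something|    true|
// |  b|something|    true|
// +---+---------+--------+

// For contrast, Scala's == compares the Column object itself, not row values:
val wrong: Boolean = col("derived1") == "something"   // false, evaluated immediately
```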

Hope it helps

Regards

On Mon, Apr 22, 2019 at 8:56 AM Shraddha Shah 
wrote:

> Also, the same applies to a groupBy/agg operation: how can we use one
> aggregated result (say min(amount)) to derive another aggregated column?
>
> On Sun, Apr 21, 2019 at 11:24 PM Rishi Shah 
> wrote:
>
>> Hello All,
>>
>> How can we use a derived column, column1, to derive another column in the
>> same dataframe operation statement?
>>
>> something like:
>>
>> df = df.withColumn('derived1', lit('something'))
>> .withColumn('derived2', col('derived1') == 'something')
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>


Re: Structured Streaming initialized with cached data or others

2019-04-22 Thread Vipul Rajan
Please look into arbitrary stateful aggregation. I do not completely
understand your problem, though; if you could give me an example, I'd be
happy to help.
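
If it is the "start from a previously computed value" part that matters, one
pattern is to load the initial values in a batch read and fall back to them the
first time a key shows up. A sketch with made-up paths, a made-up Event schema
and a placeholder source, using only the public mapGroupsWithState API:

```scala
// Sketch: seed per-key state from a batch-computed snapshot.
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String, amount: Long)
case class Running(key: String, total: Long)

val spark = SparkSession.builder.appName("seeded-state").getOrCreate()
import spark.implicits._

// Initial totals produced by an earlier batch job; assumed small enough to
// collect to the driver and capture in the update function's closure.
val initialTotals: Map[String, Long] =
  spark.read.parquet("/state/initial_totals").as[Running]
    .collect().map(r => r.key -> r.total).toMap

def update(key: String, events: Iterator[Event], state: GroupState[Long]): Running = {
  // The first time a key is seen, start from the batch-computed value (or 0).
  val start = state.getOption.getOrElse(initialTotals.getOrElse(key, 0L))
  val total = start + events.map(_.amount).sum
  state.update(total)
  Running(key, total)
}

// Placeholder streaming source standing in for the real input.
val events: Dataset[Event] = spark.readStream.format("rate").load()
  .select($"value".cast("string").as("key"), $"value".as("amount"))
  .as[Event]

val totals = events.groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(update _)

totals.writeStream.outputMode("update").format("console").start()
```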

On Mon, 22 Apr 2019, 15:31 shicheng31...@gmail.com, 
wrote:

> Hi all,
> As we all know, Structured Streaming is used to handle incremental
> problems. However, if I need to compute an increment based on an initial
> value, I need to get a previous state value when the program is
> initialized.
> Is there any way to assign an initial value to the 'state'? Or are
> there other solutions?
> Thanks!
>
> --
> shicheng31...@gmail.com
>


Connecting to Spark cluster remotely

2019-04-22 Thread Rishikesh Gawade
Hi.
I have been experiencing trouble while trying to connect to a Spark cluster
remotely. This Spark cluster is configured to run using YARN.
Can anyone guide me or provide any step-by-step instructions for connecting
remotely via spark-shell?
Here's the setup that I am using:
The Spark cluster is running with each node as a docker container hosted on
a VM. It is using YARN for scheduling resources for computations.
I have a dedicated docker container acting as a Spark client, on which I
have spark-shell installed (Spark binaries in a standalone setup) and also
the Hadoop and YARN config directories set, so that spark-shell can
coordinate with the RM for resources.
With all of this set, I tried using the following command:

spark-shell --master yarn --deploy-mode client

This results in spark-shell giving me a Scala-based console; however,
when I check the Resource Manager UI on the cluster, there seems to be no
application / Spark session running.
I have been expecting the driver to be running on the client machine and
the executors running in the cluster. But that doesn't seem to happen.

How can I achieve this?
Is whatever I am trying feasible, and if so, a good practice?

Thanks & Regards,
Rishikesh


Structured Streaming initialized with cached data or others

2019-04-22 Thread shicheng31...@gmail.com
Hi all,
As we all know, Structured Streaming is used to handle incremental
problems. However, if I need to compute an increment based on an initial value,
I need to get a previous state value when the program is initialized.
Is there any way to assign an initial value to the 'state'? Or are there
other solutions?
Thanks!



shicheng31...@gmail.com