Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Jörn Franke
Well, it could also depend on the receiving database. You should also check the 
executors. Updating to the latest version of the JDBC driver, and to JDK 8 if 
supported by the JDBC driver, could help.

> On 20 Apr 2016, at 00:14, Jonathan Gray  wrote:
> 
> Hi,
> 
> I'm trying to write ~60 million rows from a DataFrame to a database using 
> JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
> 
> The write seems to not be performing well.  Profiling the application with a 
> master of local[*] it appears there is not much socket write activity and 
> also not much CPU.
> 
> I would expect there to be an almost continuous block of socket write 
> activity showing up somewhere in the profile.
> 
> I can see that the top hot method involves 
> apache.spark.unsafe.platform.CopyMemory all from calls within 
> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly 
> stressed so I'm guessing this isn't the cause of the problem.
> 
> Are there any best practices, or has anyone come across a case like this before 
> where a write to a database seems to perform poorly?
> 
> Thanks,
> Jon

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Charles Nnamdi Akalugwu
I get the same error for fields which are not null unfortunately.

Can't translate null value for field
StructField(density,DecimalType(4,2),true)
On Apr 21, 2016 1:37 AM, "Ted Yu"  wrote:

> The weight field is not nullable.
>
> Looks like your source table had null value for this field.
>
> On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu <
> cprenzb...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using spark 1.4.1 and trying to copy all rows from a table in one
>> MySQL Database to a Amazon RDS table using spark SQL.
>>
>> Some columns in the source table are defined as DECIMAL type and are
>> nullable. Others are not.  When I run my spark job,
>>
>> val writeData = sqlContext.read.format("jdbc").option("url",
>>   sourceUrl).option("driver", "com.mysql.jdbc.Driver").option("dbtable",
>>   table).option("user", sourceUsername).option("password",
>>   sourcePassword).load()
>>
>> writeData.write.format("com.databricks.spark.redshift").option("url",
>>   String.format(targetUrl, targetUsername, targetPassword)).option("dbtable",
>>   table).option("tempdir", redshiftTempDir+table).mode("append").save()
>>
>> it fails with the following exception
>>
>> Can't translate null value for field
>>> StructField(weight,DecimalType(5,2),false)
>>
>>
>> Any insights about this exception would be very helpful.
>>
>
>


Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
Sorry, I accidentally sent the previous message before finishing it.
How about trying to increase the `batchsize` JDBC option to improve write
performance?
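
For illustration, a rough sketch of what that could look like in Scala (the URL, credentials, table name, and partition count are placeholders, and this assumes the running Spark version honours the 'batchsize' write option):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")        // placeholder credentials
props.setProperty("password", "dbpass")
props.setProperty("batchsize", "10000")    // rows per JDBC batch insert

// more partitions means more concurrent JDBC connections writing in parallel
df.repartition(16)
  .write
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "target_table", props)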

// maropu

On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> How about trying to increate 'batchsize
>
> On Wed, Apr 20, 2016 at 7:14 AM, Jonathan Gray 
> wrote:
>
>> Hi,
>>
>> I'm trying to write ~60 million rows from a DataFrame to a database using
>> JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
>>
>> The write seems to not be performing well.  Profiling the application
>> with a master of local[*] it appears there is not much socket write
>> activity and also not much CPU.
>>
>> I would expect there to be an almost continuous block of socket write
>> activity showing up somewhere in the profile.
>>
>> I can see that the top hot method involves
>> apache.spark.unsafe.platform.CopyMemory all from calls within
>> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
>> stressed so I'm guessing this isn't the cause of the problem.
>>
>> Are there any best practices, or has anyone come across a case like this
>> before where a write to a database seems to perform poorly?
>>
>> Thanks,
>> Jon
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
---
Takeshi Yamamuro


Re: VectorAssembler handling null values

2016-04-20 Thread Koert Kuipers
thanks for that, it's good to know that functionality exists.

but shouldn't a decision tree be able to handle missing (aka null) values
more intelligently than simply using replacement values?

see for example here:
http://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algorithms-deal-with-missing-values-under-the-hoo


On Thu, Apr 21, 2016 at 12:29 AM, John Trengrove <
john.trengr...@servian.com.au> wrote:

> You could handle null values by using the DataFrame.na functions in a
> preprocessing step like DataFrame.na.fill().
>
> For reference:
>
>
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
>
> John
>
> On 21 April 2016 at 03:41, Andres Perez  wrote:
>
>> so the missing data could be on a one-off basis, or from fields that are
>> in general optional, or from, say, a count that is only relevant for
>> certain cases (very sparse):
>>
>> f1|f2|f3|optF1|optF2|sparseF1
>> a|15|3.5|cat1|142L|
>> b|13|2.4|cat2|64L|catA
>> c|2|1.6|||
>> d|27|5.1||0|
>>
>> -Andy
>>
>> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath > > wrote:
>>
>>> Could you provide an example of what your input data looks like?
>>> Supporting missing values in a sparse result vector makes sense.
>>>
>>> On Tue, 19 Apr 2016 at 23:55, Andres Perez  wrote:
>>>
 Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
 cannot handle null values. This presents a problem for us as we wish to run
 a decision tree classifier on sometimes sparse data. Is there a particular
 reason VectorAssembler is implemented in this way, and can anyone recommend
 the best path for enabling VectorAssembler to build vectors for data that
 will contain empty values?

 Thanks!

 -Andres


>>
>


Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values by using the DataFrame.na functions in a
preprocessing step like DataFrame.na.fill().

For reference:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
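
A minimal sketch of that preprocessing step in Scala (the column names come from the sample data quoted below, and the fill values are arbitrary placeholders):

import org.apache.spark.ml.feature.VectorAssembler

// replace nulls in the numeric input columns before assembling
val filled = df.na.fill(Map("f2" -> 0, "f3" -> 0.0, "optF2" -> 0))

val assembler = new VectorAssembler()
  .setInputCols(Array("f2", "f3", "optF2"))
  .setOutputCol("features")

val assembled = assembler.transform(filled)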

John

On 21 April 2016 at 03:41, Andres Perez  wrote:

> so the missing data could be on a one-off basis, or from fields that are
> in general optional, or from, say, a count that is only relevant for
> certain cases (very sparse):
>
> f1|f2|f3|optF1|optF2|sparseF1
> a|15|3.5|cat1|142L|
> b|13|2.4|cat2|64L|catA
> c|2|1.6|||
> d|27|5.1||0|
>
> -Andy
>
> On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath 
> wrote:
>
>> Could you provide an example of what your input data looks like?
>> Supporting missing values in a sparse result vector makes sense.
>>
>> On Tue, 19 Apr 2016 at 23:55, Andres Perez  wrote:
>>
>>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently
>>> cannot handle null values. This presents a problem for us as we wish to run
>>> a decision tree classifier on sometimes sparse data. Is there a particular
>>> reason VectorAssembler is implemented in this way, and can anyone recommend
>>> the best path for enabling VectorAssembler to build vectors for data that
>>> will contain empty values?
>>>
>>> Thanks!
>>>
>>> -Andres
>>>
>>>
>


bisecting kmeans tree

2016-04-20 Thread roni
Hi ,
 I want to get the bisecting kmeans tree structure to show on the heatmap I
am generating based on the hierarchical clustering of data.
 How do I get that using MLlib?
Thanks
-R


HBase Spark Module

2016-04-20 Thread Benjamin Kim
I see that the new CDH 5.7 has been released with the HBase Spark module 
built-in. I was wondering if I could just download it and use the hbase-spark 
jar file for CDH 5.5. Has anyone tried this yet?

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Ted Yu
The weight field is not nullable.

Looks like your source table had a null value for this field.
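
If nulls in that column are actually expected, one possible workaround (just a sketch, not tested against spark-redshift) is to relax the nullability of the schema before writing, where writeData is the DataFrame from the code quoted below:

import org.apache.spark.sql.types.StructType

// rebuild the DataFrame with every field marked nullable
val nullableSchema = StructType(writeData.schema.map(_.copy(nullable = true)))
val relaxed = sqlContext.createDataFrame(writeData.rdd, nullableSchema)

// then write `relaxed` (or fill the nulls with writeData.na.fill) instead of writeData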

On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu <
cprenzb...@gmail.com> wrote:

> Hi,
>
> I am using spark 1.4.1 and trying to copy all rows from a table in one
> MySQL Database to a Amazon RDS table using spark SQL.
>
> Some columns in the source table are defined as DECIMAL type and are
> nullable. Others are not.  When I run my spark job,
>
> val writeData = sqlContext.read.format("jdbc").option("url",
>   sourceUrl).option("driver", "com.mysql.jdbc.Driver").option("dbtable",
>   table).option("user", sourceUsername).option("password",
>   sourcePassword).load()
>
> writeData.write.format("com.databricks.spark.redshift").option("url",
>   String.format(targetUrl, targetUsername, targetPassword)).option("dbtable",
>   table).option("tempdir", redshiftTempDir+table).mode("append").save()
>
> it fails with the following exception
>
> Can't translate null value for field
>> StructField(weight,DecimalType(5,2),false)
>
>
> Any insights about this exception would be very helpful.
>


Re: spark on yarn

2016-04-20 Thread Mail.com
I get an error with a message that states what the maximum number of cores allowed is.


> On Apr 20, 2016, at 11:21 AM, Shushant Arora  
> wrote:
> 
> I am running a spark application on yarn cluster.
> 
> say I have 100 vcores available in the cluster, and I start the Spark application 
> with --num-executors 200 --num-cores 2 (so I need a total of 200*2=400 vcores) but 
> in my cluster only 100 are available.
> 
> What will happen? Will the job abort, or will it be submitted successfully 
> with 100 vcores allocated to 50 executors and the rest of the executors 
> started as soon as vcores become available?
> 
> Please note dynamic allocation is not enabled in the cluster. I have the old version 
> 1.2.
> 
> Thanks
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



StructField Translation Error with Spark SQL

2016-04-20 Thread Charles Nnamdi Akalugwu
Hi,

I am using spark 1.4.1 and trying to copy all rows from a table in one
MySQL Database to a Amazon RDS table using spark SQL.

Some columns in the source table are defined as DECIMAL type and are
nullable. Others are not.  When I run my spark job,

val writeData = sqlContext.read.format("jdbc").option("url",
  sourceUrl).option("driver", "com.mysql.jdbc.Driver").option("dbtable",
  table).option("user", sourceUsername).option("password",
  sourcePassword).load()

writeData.write.format("com.databricks.spark.redshift").option("url",
  String.format(targetUrl, targetUsername, targetPassword)).option("dbtable",
  table).option("tempdir", redshiftTempDir+table).mode("append").save()

it fails with the following exception

Can't translate null value for field
> StructField(weight,DecimalType(5,2),false)


Any insights about this exception would be very helpful.


Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Actually you are correct. It will be considered a non-logged operation,
which the DBAs probably won't allow in production. The only option for
the thread owner is to perform smaller batches with frequent commits in
MSSQL.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 21:52, Strange, Nick  wrote:

> Nologging means no redo log is generated (or minimal redo). However undo
> is still generated and the transaction will still be rolled back in the
> event of an issue.
>
>
>
> Nick
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Wednesday, April 20, 2016 4:08 PM
> *To:* Andrés Ivaldi
> *Cc:* user @spark
> *Subject:* Re: Spark SQL Transaction
>
>
>
> Well Oracle will allow that if the underlying table is in NOLOGGING mode :)
>
>
>
> mtale...@mydb12.mich.LOCAL> create table testme(col1 int);
>
> Table created.
>
> mtale...@mydb12.mich.LOCAL> *alter table testme NOLOGGING;*
>
> Table altered.
>
> mtale...@mydb12.mich.LOCAL> insert into testme values(1);
>
> 1 row created.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 20 April 2016 at 20:45, Andrés Ivaldi  wrote:
>
> I think the same, and I don't think reducing batches size improves speed
> but will avoid loosing all data when rollback.
>
>
>
>
>
> Thanks for the help..
>
>
>
>
>
> On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> yep. I think it is not possible to make SQL Server do a non logged
> transaction. Other alternative is doing inserts in small batches if
> possible. Or write to a CSV type file and use Bulk copy to load the file
> into MSSQL with frequent commits like every 50K rows?
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 20 April 2016 at 19:42, Andrés Ivaldi  wrote:
>
> Yes, I know that behavior , but there is not explicit Begin Transaction in
> my code, so, maybe Spark or the same driver is adding the begin
> transaction, or implicit transaction is configured. If spark is'n adding a
> Begin transaction on each insertion, then probably is database or Driver
> configuration...
>
>
>
> On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> You will see what is happening in SQL Server. First create a test table
> called  testme
>
>
>
> 1> use tempdb
> 2> go
> 1> create table testme(col1 int)
> 2> go
>
> -- Now explicitly begin a transaction and insert 1 row and select from
> table
> 1> *begin tran*
> 2> insert into testme values(1)
> 3> select * from testme
> 4> go
> (1 row affected)
>  col1
>  ---
>1
>
> -- That value col1=1 is there
>
> --
>
> (1 row affected)
>
> -- Now rollback that transaction meaning in your case by killing your
> Spark process!
>
> --
> 1> rollback tran
> 2> select * from testme
> 3> go
>  col1
>  ---
>
> (0 rows affected)
>
>
>
> -- You can see that record has gone as it rolled back!
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:
>
> Sorry I'cant answer before, I want to know if spark is the responsible to
> add the Begin Tran, The point is to speed up insertion over losing data,
>  Disabling Transaction will speed up the insertion and we dont care about
> consistency... I'll disable te implicit_transaction and see what happens.
>
>
>
> thanks
>
>
>
> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Assuming that you are using JDBC for putting data into any ACID compliant
> database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly
>  adding BEGIN TRAN to INSERT statement in a distributed transaction. MSSQL
> does not know or care where data is coming from. If your connection
> completes OK a COMMIT TRAN will be sent and that will tell MSQL to commit
> transaction. If yoy kill Spark transaction before MSSQL receive COMMIT
> TRAN, the transaction will be rolled back.
>
>
>
> The only option is that if you don't care about full data getting to
> MSSQL,to break your insert into chunks at source and send data to MSSQL 

RE: Spark SQL Transaction

2016-04-20 Thread Strange, Nick
Nologging means no redo log is generated (or minimal redo). However undo is 
still generated and the transaction will still be rolled back in the event of 
an issue.

Nick

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, April 20, 2016 4:08 PM
To: Andrés Ivaldi
Cc: user @spark
Subject: Re: Spark SQL Transaction

Well Oracle will allow that if the underlying table is in NOLOGGING mode :)

mtale...@mydb12.mich.LOCAL> create table testme(col1 int);
Table created.
mtale...@mydb12.mich.LOCAL> alter table testme NOLOGGING;
Table altered.
mtale...@mydb12.mich.LOCAL> insert into testme values(1);
1 row created.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 April 2016 at 20:45, Andrés Ivaldi 
> wrote:
I think the same, and I don't think reducing batches size improves speed but 
will avoid loosing all data when rollback.


Thanks for the help..


On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh 
> wrote:
yep. I think it is not possible to make SQL Server do a non logged transaction. 
Other alternative is doing inserts in small batches if possible. Or write to a 
CSV type file and use Bulk copy to load the file into MSSQL with frequent 
commits like every 50K rows?




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 April 2016 at 19:42, Andrés Ivaldi 
> wrote:
Yes, I know that behavior , but there is not explicit Begin Transaction in my 
code, so, maybe Spark or the same driver is adding the begin transaction, or 
implicit transaction is configured. If spark is'n adding a Begin transaction on 
each insertion, then probably is database or Driver configuration...

On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh 
> wrote:


You will see what is happening in SQL Server. First create a test table called  
testme

1> use tempdb
2> go
1> create table testme(col1 int)
2> go
-- Now explicitly begin a transaction and insert 1 row and select from table
1> begin tran
2> insert into testme values(1)
3> select * from testme
4> go
(1 row affected)
 col1
 ---
   1
-- That value col1=1 is there
--
(1 row affected)
-- Now rollback that transaction meaning in your case by killing your Spark 
process!
--
1> rollback tran
2> select * from testme
3> go
 col1
 ---
(0 rows affected)

-- You can see that record has gone as it rolled back!



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 April 2016 at 18:42, Andrés Ivaldi 
> wrote:
Sorry I'cant answer before, I want to know if spark is the responsible to add 
the Begin Tran, The point is to speed up insertion over losing data,  Disabling 
Transaction will speed up the insertion and we dont care about consistency... 
I'll disable te implicit_transaction and see what happens.

thanks

On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh 
> wrote:
Assuming that you are using JDBC for putting data into any ACID compliant 
database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly  adding 
BEGIN TRAN to INSERT statement in a distributed transaction. MSSQL does not 
know or care where data is coming from. If your connection completes OK a 
COMMIT TRAN will be sent and that will tell MSQL to commit transaction. If yoy 
kill Spark transaction before MSSQL receive COMMIT TRAN, the transaction will 
be rolled back.

The only option is that if you don't care about full data getting to MSSQL,to 
break your insert into chunks at source and send data to MSSQL in small 
batches. In that way you will not lose all data in MSSQL because of rollback.

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 April 2016 at 07:33, Mich Talebzadeh 
> wrote:
Are you using JDBC to push data to MSSQL?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 19 April 2016 at 23:41, Andrés Ivaldi 
> 

Spark Streaming Job Question about retries and failover

2016-04-20 Thread map reduced
Hi,

I have a simple Spark Streaming application which reads data from Kafka and
then sends this data, after transformation, to an HTTP endpoint (or another
Kafka topic; for this question let's consider HTTP). I am submitting jobs
using job-server.

I am currently starting the consumption from the source Kafka with
"auto.offset.reset"="smallest" and interval=3s. In the happy case everything
looks good. Here's an excerpt:

kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreach(item => {
  //This will throw exception if http endpoint isn't reachable
  httpProcessor.process(item._1, item._2)
  })
})

Since "auto.offset.reset"="smallest", this processes about 200K messages in
one job. If I stop http server mid-job (simulating some issue in POSTing)
and httpProcessor.process throws exception, that Job fails and whatever is
unprocessed is lost. I see it keeps on polling every 3 seconds after that.

So my question is:

   1. Is my assumption right that if the next 3-second job got X messages
   and only Y could be processed before hitting an error, the remaining X-Y
   will not be processed?
   2. Is there a way to pause the stream/consumption from Kafka? For
   instance, in case there's an intermittent network issue, most likely all
   messages consumed in that time will be lost. Something which keeps on
   retrying (maybe with exponential backoff) and, whenever the HTTP endpoint
   is UP, starts consuming again (a rough retry sketch follows this list).
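
For what it's worth, here is a rough sketch of such a retry wrapper around the HTTP call (httpProcessor is from the excerpt above; the retry count and base delay are arbitrary assumptions):

object HttpRetry extends Serializable {
  def withRetries[T](maxRetries: Int = 5, baseDelayMs: Long = 1000L)(f: => T): T = {
    def attempt(n: Int): T =
      try f catch {
        case _: Exception if n < maxRetries =>
          Thread.sleep(baseDelayMs * (1L << n))   // exponential backoff: 1s, 2s, 4s, ...
          attempt(n + 1)
      }
    attempt(0)
  }
}

kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreach(item => {
    // retry the POST with backoff instead of failing the task on the first error
    HttpRetry.withRetries() { httpProcessor.process(item._1, item._2) }
  })
})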

I see the failed job rerun after about an hour; can this be configured?

Failed Jobs (2)
Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
4253 | Streaming job from [output operation 1, batch time 12:54:44] foreach at HTTPStream.scala:73 | 2016/04/20 12:54:44 | 0.3 s | 0/1 (1 failed) | 2/3 (20 failed)
0 | Streaming job from [output operation 1, batch time 11:43:51] foreach at HTTPStream.scala:73 | 2016/04/20 11:43:51 | 3 s | 0/1 (1 failed) | 0/3 (46 failed)


Thanks,

K


Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Well Oracle will allow that if the underlying table is in NOLOGGING mode :)

mtale...@mydb12.mich.LOCAL> create table testme(col1 int);
Table created.
mtale...@mydb12.mich.LOCAL> *alter table testme NOLOGGING;*
Table altered.
mtale...@mydb12.mich.LOCAL> insert into testme values(1);
1 row created.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 20:45, Andrés Ivaldi  wrote:

> I think the same, and I don't think reducing batches size improves speed
> but will avoid loosing all data when rollback.
>
>
> Thanks for the help..
>
>
> On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> yep. I think it is not possible to make SQL Server do a non logged
>> transaction. Other alternative is doing inserts in small batches if
>> possible. Or write to a CSV type file and use Bulk copy to load the file
>> into MSSQL with frequent commits like every 50K rows?
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 April 2016 at 19:42, Andrés Ivaldi  wrote:
>>
>>> Yes, I know that behavior , but there is not explicit Begin Transaction
>>> in my code, so, maybe Spark or the same driver is adding the begin
>>> transaction, or implicit transaction is configured. If spark is'n adding a
>>> Begin transaction on each insertion, then probably is database or Driver
>>> configuration...
>>>
>>> On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 You will see what is happening in SQL Server. First create a test table
 called  testme

 1> use tempdb
 2> go
 1> create table testme(col1 int)
 2> go
 -- Now explicitly begin a transaction and insert 1 row and select from
 table
 1>
 *begin tran*2> insert into testme values(1)
 3> select * from testme
 4> go
 (1 row affected)
  col1
  ---
1
 -- That value col1=1 is there
 --
 (1 row affected)
 -- Now rollback that transaction meaning in your case by killing your
 Spark process!
 --
 1> rollback tran
 2> select * from testme
 3> go
  col1
  ---
 (0 rows affected)

 -- You can see that record has gone as it rolled back!


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:

> Sorry I'cant answer before, I want to know if spark is the responsible
> to add the Begin Tran, The point is to speed up insertion over losing 
> data,
>  Disabling Transaction will speed up the insertion and we dont care about
> consistency... I'll disable te implicit_transaction and see what happens.
>
> thanks
>
> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Assuming that you are using JDBC for putting data into any ACID
>> compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
>> explicitly  adding BEGIN TRAN to INSERT statement in a distributed
>> transaction. MSSQL does not know or care where data is coming from. If 
>> your
>> connection completes OK a COMMIT TRAN will be sent and that will tell 
>> MSQL
>> to commit transaction. If yoy kill Spark transaction before MSSQL receive
>> COMMIT TRAN, the transaction will be rolled back.
>>
>> The only option is that if you don't care about full data getting to
>> MSSQL,to break your insert into chunks at source and send data to MSSQL 
>> in
>> small batches. In that way you will not lose all data in MSSQL because of
>> rollback.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 April 2016 at 07:33, Mich Talebzadeh > > wrote:
>>
>>> Are you using JDBC to push data to MSSQL?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> 

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
I think the same, and I don't think reducing the batch size improves speed,
but it will avoid losing all the data when there is a rollback.


Thanks for the help..


On Wed, Apr 20, 2016 at 4:03 PM, Mich Talebzadeh 
wrote:

> yep. I think it is not possible to make SQL Server do a non logged
> transaction. Other alternative is doing inserts in small batches if
> possible. Or write to a CSV type file and use Bulk copy to load the file
> into MSSQL with frequent commits like every 50K rows?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 19:42, Andrés Ivaldi  wrote:
>
>> Yes, I know that behavior , but there is not explicit Begin Transaction
>> in my code, so, maybe Spark or the same driver is adding the begin
>> transaction, or implicit transaction is configured. If spark is'n adding a
>> Begin transaction on each insertion, then probably is database or Driver
>> configuration...
>>
>> On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> You will see what is happening in SQL Server. First create a test table
>>> called  testme
>>>
>>> 1> use tempdb
>>> 2> go
>>> 1> create table testme(col1 int)
>>> 2> go
>>> -- Now explicitly begin a transaction and insert 1 row and select from
>>> table
>>> 1>
>>> *begin tran*2> insert into testme values(1)
>>> 3> select * from testme
>>> 4> go
>>> (1 row affected)
>>>  col1
>>>  ---
>>>1
>>> -- That value col1=1 is there
>>> --
>>> (1 row affected)
>>> -- Now rollback that transaction meaning in your case by killing your
>>> Spark process!
>>> --
>>> 1> rollback tran
>>> 2> select * from testme
>>> 3> go
>>>  col1
>>>  ---
>>> (0 rows affected)
>>>
>>> -- You can see that record has gone as it rolled back!
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:
>>>
 Sorry I'cant answer before, I want to know if spark is the responsible
 to add the Begin Tran, The point is to speed up insertion over losing data,
  Disabling Transaction will speed up the insertion and we dont care about
 consistency... I'll disable te implicit_transaction and see what happens.

 thanks

 On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Assuming that you are using JDBC for putting data into any ACID
> compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
> explicitly  adding BEGIN TRAN to INSERT statement in a distributed
> transaction. MSSQL does not know or care where data is coming from. If 
> your
> connection completes OK a COMMIT TRAN will be sent and that will tell MSQL
> to commit transaction. If yoy kill Spark transaction before MSSQL receive
> COMMIT TRAN, the transaction will be rolled back.
>
> The only option is that if you don't care about full data getting to
> MSSQL,to break your insert into chunks at source and send data to MSSQL in
> small batches. In that way you will not lose all data in MSSQL because of
> rollback.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 07:33, Mich Talebzadeh 
> wrote:
>
>> Are you using JDBC to push data to MSSQL?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:
>>
>>> I mean local transaction, We've ran a Job that writes into SQLServer
>>> then we killed spark JVM just for testing purpose and we realized that
>>> SQLServer did a rollback.
>>>
>>> Regards
>>>
>>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 What do you mean by *without transaction*? do you mean forcing SQL
 Server to accept a non logged operation?


Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Yep. I think it is not possible to make SQL Server do a non-logged
transaction. Another alternative is doing inserts in small batches if
possible. Or write to a CSV-type file and use bulk copy to load the file
into MSSQL with frequent commits, like every 50K rows?
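
Purely as an illustration of that small-batches idea over JDBC, something along these lines could work (connection details, table, and column are placeholders; adjust the INSERT to the real schema):

import java.sql.DriverManager

df.foreachPartition { rows =>
  // one connection per partition; placeholders for the real connection details
  val conn = DriverManager.getConnection(
    "jdbc:sqlserver://dbhost:1433;databaseName=mydb", "user", "pass")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO target_table (col1) VALUES (?)")
  try {
    var count = 0L
    rows.foreach { row =>
      stmt.setInt(1, row.getInt(0))
      stmt.addBatch()
      count += 1
      if (count % 50000 == 0) {   // commit every 50K rows, as suggested above
        stmt.executeBatch()
        conn.commit()
      }
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}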



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 19:42, Andrés Ivaldi  wrote:

> Yes, I know that behavior , but there is not explicit Begin Transaction in
> my code, so, maybe Spark or the same driver is adding the begin
> transaction, or implicit transaction is configured. If spark is'n adding a
> Begin transaction on each insertion, then probably is database or Driver
> configuration...
>
> On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> You will see what is happening in SQL Server. First create a test table
>> called  testme
>>
>> 1> use tempdb
>> 2> go
>> 1> create table testme(col1 int)
>> 2> go
>> -- Now explicitly begin a transaction and insert 1 row and select from
>> table
>> 1>
>> *begin tran*2> insert into testme values(1)
>> 3> select * from testme
>> 4> go
>> (1 row affected)
>>  col1
>>  ---
>>1
>> -- That value col1=1 is there
>> --
>> (1 row affected)
>> -- Now rollback that transaction meaning in your case by killing your
>> Spark process!
>> --
>> 1> rollback tran
>> 2> select * from testme
>> 3> go
>>  col1
>>  ---
>> (0 rows affected)
>>
>> -- You can see that record has gone as it rolled back!
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:
>>
>>> Sorry I'cant answer before, I want to know if spark is the responsible
>>> to add the Begin Tran, The point is to speed up insertion over losing data,
>>>  Disabling Transaction will speed up the insertion and we dont care about
>>> consistency... I'll disable te implicit_transaction and see what happens.
>>>
>>> thanks
>>>
>>> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Assuming that you are using JDBC for putting data into any ACID
 compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
 explicitly  adding BEGIN TRAN to INSERT statement in a distributed
 transaction. MSSQL does not know or care where data is coming from. If your
 connection completes OK a COMMIT TRAN will be sent and that will tell MSQL
 to commit transaction. If yoy kill Spark transaction before MSSQL receive
 COMMIT TRAN, the transaction will be rolled back.

 The only option is that if you don't care about full data getting to
 MSSQL,to break your insert into chunks at source and send data to MSSQL in
 small batches. In that way you will not lose all data in MSSQL because of
 rollback.

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 20 April 2016 at 07:33, Mich Talebzadeh 
 wrote:

> Are you using JDBC to push data to MSSQL?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:
>
>> I mean local transaction, We've ran a Job that writes into SQLServer
>> then we killed spark JVM just for testing purpose and we realized that
>> SQLServer did a rollback.
>>
>> Regards
>>
>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> What do you mean by *without transaction*? do you mean forcing SQL
>>> Server to accept a non logged operation?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 April 2016 at 21:18, Andrés Ivaldi 

Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Yes, I know that behavior, but there is no explicit Begin Transaction in
my code, so maybe Spark or the driver itself is adding the begin
transaction, or an implicit transaction is configured. If Spark isn't adding
a Begin Transaction on each insertion, then it is probably the database or
driver configuration...

On Wed, Apr 20, 2016 at 3:33 PM, Mich Talebzadeh 
wrote:

>
> You will see what is happening in SQL Server. First create a test table
> called  testme
>
> 1> use tempdb
> 2> go
> 1> create table testme(col1 int)
> 2> go
> -- Now explicitly begin a transaction and insert 1 row and select from
> table
> 1>
> *begin tran*2> insert into testme values(1)
> 3> select * from testme
> 4> go
> (1 row affected)
>  col1
>  ---
>1
> -- That value col1=1 is there
> --
> (1 row affected)
> -- Now rollback that transaction meaning in your case by killing your
> Spark process!
> --
> 1> rollback tran
> 2> select * from testme
> 3> go
>  col1
>  ---
> (0 rows affected)
>
> -- You can see that record has gone as it rolled back!
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:
>
>> Sorry I'cant answer before, I want to know if spark is the responsible to
>> add the Begin Tran, The point is to speed up insertion over losing data,
>>  Disabling Transaction will speed up the insertion and we dont care about
>> consistency... I'll disable te implicit_transaction and see what happens.
>>
>> thanks
>>
>> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Assuming that you are using JDBC for putting data into any ACID
>>> compliant database (MSSQL, Sybase, Oracle etc), you are implicitly or
>>> explicitly  adding BEGIN TRAN to INSERT statement in a distributed
>>> transaction. MSSQL does not know or care where data is coming from. If your
>>> connection completes OK a COMMIT TRAN will be sent and that will tell MSQL
>>> to commit transaction. If yoy kill Spark transaction before MSSQL receive
>>> COMMIT TRAN, the transaction will be rolled back.
>>>
>>> The only option is that if you don't care about full data getting to
>>> MSSQL,to break your insert into chunks at source and send data to MSSQL in
>>> small batches. In that way you will not lose all data in MSSQL because of
>>> rollback.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 20 April 2016 at 07:33, Mich Talebzadeh 
>>> wrote:
>>>
 Are you using JDBC to push data to MSSQL?

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:

> I mean local transaction, We've ran a Job that writes into SQLServer
> then we killed spark JVM just for testing purpose and we realized that
> SQLServer did a rollback.
>
> Regards
>
> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> What do you mean by *without transaction*? do you mean forcing SQL
>> Server to accept a non logged operation?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 April 2016 at 21:18, Andrés Ivaldi  wrote:
>>
>>> Hello, is possible to execute a SQL write without Transaction? we
>>> dont need transactions to save our data and this adds an overhead to the
>>> SQLServer.
>>>
>>> Regards.
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>
>


-- 
Ing. Ivaldi Andres


Fwd: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
You will see what is happening in SQL Server. First create a test table
called  testme

1> use tempdb
2> go
1> create table testme(col1 int)
2> go
-- Now explicitly begin a transaction and insert 1 row and select from table
1> begin tran
2> insert into testme values(1)
3> select * from testme
4> go
(1 row affected)
 col1
 ---
   1
-- That value col1=1 is there
--
(1 row affected)
-- Now rollback that transaction meaning in your case by killing your Spark
process!
--
1> rollback tran
2> select * from testme
3> go
 col1
 ---
(0 rows affected)

-- You can see that record has gone as it rolled back!


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 18:42, Andrés Ivaldi  wrote:

> Sorry I'cant answer before, I want to know if spark is the responsible to
> add the Begin Tran, The point is to speed up insertion over losing data,
>  Disabling Transaction will speed up the insertion and we dont care about
> consistency... I'll disable te implicit_transaction and see what happens.
>
> thanks
>
> On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Assuming that you are using JDBC for putting data into any ACID compliant
>> database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly
>>  adding BEGIN TRAN to INSERT statement in a distributed transaction. MSSQL
>> does not know or care where data is coming from. If your connection
>> completes OK a COMMIT TRAN will be sent and that will tell MSQL to commit
>> transaction. If yoy kill Spark transaction before MSSQL receive COMMIT
>> TRAN, the transaction will be rolled back.
>>
>> The only option is that if you don't care about full data getting to
>> MSSQL,to break your insert into chunks at source and send data to MSSQL in
>> small batches. In that way you will not lose all data in MSSQL because of
>> rollback.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 April 2016 at 07:33, Mich Talebzadeh 
>> wrote:
>>
>>> Are you using JDBC to push data to MSSQL?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:
>>>
 I mean local transaction, We've ran a Job that writes into SQLServer
 then we killed spark JVM just for testing purpose and we realized that
 SQLServer did a rollback.

 Regards

 On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> What do you mean by *without transaction*? do you mean forcing SQL
> Server to accept a non logged operation?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 21:18, Andrés Ivaldi  wrote:
>
>> Hello, is possible to execute a SQL write without Transaction? we
>> dont need transactions to save our data and this adds an overhead to the
>> SQLServer.
>>
>> Regards.
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


 --
 Ing. Ivaldi Andres

>>>
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Append is not working with data frame

2016-04-20 Thread Anil Langote
Hi All,

We are pulling data from Oracle tables and writing it out as partitioned
parquet files. This is a daily process and it works fine up to the 18th day
(18 days of loads work fine). However, on the 19th day's load the DataFrame
load process hangs and the load action is called more than once. If we
remove the existing data and run just the 19th day on its own, it loads
successfully. Why does the load fail for the 19th day in APPEND mode when
the same day works fine in isolation? On the Spark UI we can see the first
load job takes around 5 minutes and the duplicate load jobs take just a few
seconds. We are stuck with this and we want to process 60 days of data.
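
For context, a bare-bones sketch of the kind of daily partitioned append described above (the path and partition column are assumptions):

// oracleDf is the DataFrame pulled from the Oracle table for one day
oracleDf.write
  .mode("append")
  .partitionBy("load_date")
  .parquet("hdfs:///data/daily_load")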

Thank you 
Anil Langote


> On Apr 20, 2016, at 1:12 PM, Wei Chen  wrote:
> 
> Found it. In case someone else if looking for this:
> cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights
> 
> On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen  > wrote:
> Hi All,
> 
> I am using the example of model selection via cross-validation from the 
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html 
> . After I get the 
> "cvModel", I would like to see the weights for each feature for the best 
> logistic regression model. I've been looking at the methods and attributes of 
> this "cvModel" and "cvModel.bestModel" and still can't figure out where these 
> weights are referred. It must be somewhere since we can use "cvModel" to 
> transform a new dataframe. Your help is much appreciated.
> 
> 
> Thank you,
> Wei
> 
> 
> 
> -- 
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.ri...@gmail.com 
> LinkedIn: https://www.linkedin.com/in/weichen1984 
> 


Unable to improve ListStatus performance of ParquetRelation

2016-04-20 Thread Ditesh Kumar
Hi,

When creating a DataFrame from a partitioned file structure (
sqlContext.read.parquet("s3://bucket/path/to/partitioned/parquet/filles")
), it takes a lot of time to get the list of files recursively from S3 when
a large number of files is involved.
To circumvent this I wanted to override the FileStatusCache class in
HadoopFsRelation to create a new Relation which can fetch the FileStatus
list from a cached source (e.g. MySQL). Currently this is not possible, so
my questions are these:

   1. Is there any other way to do what I want to do ?
   2. If no to above, then can this extensibility be included by
   making FileStatusCache & related variable protected instead of private ?
   3. If yes to above, then can I help ?

Regards,
Ditesh Kumar


Re: Spark SQL Transaction

2016-04-20 Thread Andrés Ivaldi
Sorry I couldn't answer before. I want to know if Spark is responsible for
adding the Begin Tran. The point is to speed up insertion even at the cost
of losing data: disabling transactions will speed up the insertion and we
don't care about consistency... I'll disable the implicit_transaction
setting and see what happens.

thanks

On Wed, Apr 20, 2016 at 12:09 PM, Mich Talebzadeh  wrote:

> Assuming that you are using JDBC for putting data into any ACID compliant
> database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly
>  adding BEGIN TRAN to INSERT statement in a distributed transaction. MSSQL
> does not know or care where data is coming from. If your connection
> completes OK a COMMIT TRAN will be sent and that will tell MSQL to commit
> transaction. If yoy kill Spark transaction before MSSQL receive COMMIT
> TRAN, the transaction will be rolled back.
>
> The only option is that if you don't care about full data getting to
> MSSQL,to break your insert into chunks at source and send data to MSSQL in
> small batches. In that way you will not lose all data in MSSQL because of
> rollback.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 April 2016 at 07:33, Mich Talebzadeh 
> wrote:
>
>> Are you using JDBC to push data to MSSQL?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:
>>
>>> I mean local transaction, We've ran a Job that writes into SQLServer
>>> then we killed spark JVM just for testing purpose and we realized that
>>> SQLServer did a rollback.
>>>
>>> Regards
>>>
>>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 What do you mean by *without transaction*? do you mean forcing SQL
 Server to accept a non logged operation?

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 19 April 2016 at 21:18, Andrés Ivaldi  wrote:

> Hello, is possible to execute a SQL write without Transaction? we dont
> need transactions to save our data and this adds an overhead to the
> SQLServer.
>
> Regards.
>
> --
> Ing. Ivaldi Andres
>


>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>


-- 
Ing. Ivaldi Andres


Re: VectorAssembler handling null values

2016-04-20 Thread Andres Perez
so the missing data could be on a one-off basis, or from fields that are in
general optional, or from, say, a count that is only relevant for certain
cases (very sparse):

f1|f2|f3|optF1|optF2|sparseF1
a|15|3.5|cat1|142L|
b|13|2.4|cat2|64L|catA
c|2|1.6|||
d|27|5.1||0|

-Andy

On Wed, Apr 20, 2016 at 1:38 AM, Nick Pentreath 
wrote:

> Could you provide an example of what your input data looks like?
> Supporting missing values in a sparse result vector makes sense.
>
> On Tue, 19 Apr 2016 at 23:55, Andres Perez  wrote:
>
>> Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot
>> handle null values. This presents a problem for us as we wish to run a
>> decision tree classifier on sometimes sparse data. Is there a particular
>> reason VectorAssembler is implemented in this way, and can anyone recommend
>> the best path for enabling VectorAssembler to build vectors for data that
>> will contain empty values?
>>
>> Thanks!
>>
>> -Andres
>>
>>


Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Hi,

you do not need to do anything with the RDD at all. Just follow the
instructions on this site https://github.com/databricks/spark-csv and
everything will be super fast and smooth.

Remember that when the data is large, converting an RDD to a DataFrame
takes a very long time. Therefore just use the Spark CSV package and the
data is instantly available as a Spark DataFrame.

In case you need any help just let me know.

Regards,
Gourav

On Wed, Apr 20, 2016 at 4:30 PM, Wei Chen  wrote:

> Let's assume K is String, and V is Integer,
> schema = StructType([StructField("K", StringType(), True),
> StructField("V", IntegerType(), True)])
> df = sqlContext.createDataFrame(rdd, schema=schema)
> udf1 = udf(lambda x: [x], ArrayType(IntegerType()))
> df1 = df.select("K", udf1("V").alias("arrayV"))
> df1.show()
>
>
> On Tue, Apr 19, 2016 at 12:51 PM, pth001 
> wrote:
>
>> Hi,
>>
>> How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in
>> Pyspark?
>>
>> Best,
>> Patcharee
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.ri...@gmail.com
> LinkedIn: https://www.linkedin.com/in/weichen1984
>


Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin



  From: Sourav Mazumder 
 To: user  
 Sent: Wednesday, April 20, 2016 11:07 AM
 Subject: Spark 2.0 forthcoming features
   
Hi All,

Is there somewhere we can get idea of the upcoming features in Spark 2.0.

I got a list for Spark ML from here 
https://issues.apache.org/jira/browse/SPARK-12626.

Is there other links where I can similar enhancements planned for Sparl SQL, 
Spark Core, Spark Streaming. GraphX etc. ?

Thanks in advance.

Regards,
Sourav


   

Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Found it. In case someone else is looking for this:
cvModel.bestModel.asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel].weights
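
For completeness, a minimal sketch of where that cast fits, assuming the CrossValidator's estimator is the LogisticRegression itself rather than a Pipeline, and that training is a DataFrame with "label" and "features" columns:

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setMaxIter(10)

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

val bestLr = cvModel.bestModel.asInstanceOf[LogisticRegressionModel]
println(bestLr.weights)    // per-feature weights of the best model
println(bestLr.intercept)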

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen  wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights for each feature
> for the best logistic regression model. I've been looking at the methods
> and attributes of this "cvModel" and "cvModel.bestModel" and still can't
> figure out where these weights are referred. It must be somewhere since we
> can use "cvModel" to transform a new dataframe. Your help is much
> appreciated.
>
>
> Thank you,
> Wei
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


Spark 2.0 forthcoming features

2016-04-20 Thread Sourav Mazumder
Hi All,

Is there somewhere we can get an idea of the upcoming features in Spark 2.0?

I got a list for Spark ML from here
https://issues.apache.org/jira/browse/SPARK-12626.

Are there other links where I can find similar enhancements planned for
Spark SQL, Spark Core, Spark Streaming, GraphX, etc.?

Thanks in advance.

Regards,
Sourav


Re: Invoking SparkR from Spark shell

2016-04-20 Thread Ted Yu
Please take a look at:
https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes

On Wed, Apr 20, 2016 at 9:50 AM, Ashok Kumar 
wrote:

> Hi,
>
> I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R
> with Spark.
>
> Is there a shell similar to spark-shell that supports R besides Scala
> please?
>
>
> Thanks
>


Invoking SparkR from Spark shell

2016-04-20 Thread Ashok Kumar
Hi,
I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R with 
Spark.
Is there a shell similar to spark-shell that supports R besides Scala please?

Thanks

custom transformer pipeline sample code

2016-04-20 Thread Andy Davidson

Someone recently asked me for a code example of how to to write a custom
pipeline transformer in Java

Enjoy, Share

Andy



https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spark/SparseVectorToLogicalTranformer.java#L52

https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spark/NaiveBayesStanfordExampleTest.java#L207




Executor still on the UI even if the worker is dead

2016-04-20 Thread kundan kumar
Hi TD/Cody,

Why does it happen in Spark Streaming that the executors are still shown
on the UI even when the worker has been killed and is no longer in the cluster?

This severely impacts my running jobs, which take much longer, with stages
failing with the exception

java.io.IOException: Failed to connect to --- (dead worker)

Is this a bug in Spark ??

Version is 1.4.0

This is entirely against the fault tolerance of the workers. Killing a
worker in a cluster of 5 impacts the entire job.

Thanks,
Kundan


Re: pyspark split pair rdd to multiple

2016-04-20 Thread Wei Chen
Let's assume K is String, and V is Integer:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.functions import udf

schema = StructType([StructField("K", StringType(), True), StructField("V",
IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema=schema)
udf1 = udf(lambda x: [x], ArrayType(IntegerType()))  # wrap each value in a list
df1 = df.select("K", udf1("V").alias("arrayV"))
df1.show()


On Tue, Apr 19, 2016 at 12:51 PM, pth001  wrote:

> Hi,
>
> How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in
> Pyspark?
>
> Best,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


Re: how to get weights of logistic regression model inside cross validator model?

2016-04-20 Thread Wei Chen
Forgot to mention, I am using 1.5.2 Scala version.

On Tue, Apr 19, 2016 at 1:12 PM, Wei Chen  wrote:

> Hi All,
>
> I am using the example of model selection via cross-validation from the
> documentation here: http://spark.apache.org/docs/latest/ml-guide.html.
> After I get the "cvModel", I would like to see the weights for each feature
> for the best logistic regression model. I've been looking at the methods
> and attributes of this "cvModel" and "cvModel.bestModel" and still can't
> figure out where these weights are referred. It must be somewhere since we
> can use "cvModel" to transform a new dataframe. Your help is much
> appreciated.
>
>
> Thank you,
> Wei
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


spark on yarn

2016-04-20 Thread Shushant Arora
I am running a spark application on yarn cluster.

Say I have 100 vcores available in the cluster, and I start the Spark
application with --num-executors 200 --num-cores 2 (so I need a total of
200*2=400 vcores), but in my cluster only 100 are available.

What will happen? Will the job abort, or will it be submitted successfully
with 100 vcores allocated to 50 executors and the rest of the executors
started as soon as vcores become available?

Please note dynamic allocation is not enabled in the cluster. I have the old
version 1.2.

Thanks


Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Assuming that you are using JDBC for putting data into any ACID-compliant
database (MSSQL, Sybase, Oracle etc), you are implicitly or explicitly
adding BEGIN TRAN to the INSERT statement in a distributed transaction.
MSSQL does not know or care where data is coming from. If your connection
completes OK, a COMMIT TRAN will be sent and that will tell MSSQL to commit
the transaction. If you kill the Spark job before MSSQL receives COMMIT
TRAN, the transaction will be rolled back.

The only option, if you don't care about the full data set getting to
MSSQL, is to break your insert into chunks at source and send data to MSSQL
in small batches. In that way you will not lose all the data in MSSQL
because of a rollback.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 April 2016 at 07:33, Mich Talebzadeh 
wrote:

> Are you using JDBC to push data to MSSQL?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:
>
>> I mean local transaction, We've ran a Job that writes into SQLServer then
>> we killed spark JVM just for testing purpose and we realized that SQLServer
>> did a rollback.
>>
>> Regards
>>
>> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> What do you mean by *without transaction*? do you mean forcing SQL
>>> Server to accept a non logged operation?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 April 2016 at 21:18, Andrés Ivaldi  wrote:
>>>
 Hello, is possible to execute a SQL write without Transaction? we dont
 need transactions to save our data and this adds an overhead to the
 SQLServer.

 Regards.

 --
 Ing. Ivaldi Andres

>>>
>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


Re: pyspark split pair rdd to multiple

2016-04-20 Thread patcharee

I can also use dataframe. Any suggestions?

Best,
Patcharee

On 20. april 2016 10:43, Gourav Sengupta wrote:

Is there any reason why you are not using data frames?


Regards,
Gourav

On Tue, Apr 19, 2016 at 8:51 PM, pth001 > wrote:


Hi,

How can I split pair rdd [K, V] to map [K, Array(V)] efficiently
in Pyspark?

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org







Re: 回复:Spark sql and hive into different result with same sql

2016-04-20 Thread Ted Yu
Do you mind trying out build from master branch ?

1.5.3 is a bit old.
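
As a quick sanity check (just a sketch; the table and column names below are
placeholders for your own), it may also be worth confirming what type Spark SQL
actually infers for the column and what an explicit cast returns:

val df = sqlContext.table("your_table")
df.printSchema()   // does the column really show up as decimal(38,18)?
df.selectExpr("sum(cast(your_column as decimal(38,18)))").show()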

On Wed, Apr 20, 2016 at 5:25 AM, FangFang Chen 
wrote:

> I found that spark sql loses precision and handles the data as int according
> to some rule. Following is the data obtained via the hive shell and spark sql,
> with the same sql against the same hive table:
> Hive:
> 0.4
> 0.5
> 1.8
> 0.4
> 0.49
> 1.5
> Spark sql:
> 1
> 2
> 2
> It seems the rule is: when the fractional part is < 0.5 it is rounded to 0, and
> when it is >= 0.5 it is rounded to 1.
>
> Is this a bug or some configuration thing? Please give some suggestions.
> Thanks
>
> Sent from NetEase Mail Master 
> On 2016-04-20 18:45, FangFang Chen  wrote:
>
> The output is:
> Spark SQL: 6828127
> Hive:6980574.1269
>
> Sent from NetEase Mail Master 
> On 2016-04-20 18:06, FangFang Chen  wrote:
>
> Hi all,
> Please give some suggestions. Thanks
>
> With the same sql below, spark sql and hive give different results. The sql
> sums decimal(38,18) columns.
> Select sum(column) from table;
> column is defined as decimal(38,18).
>
> Spark version:1.5.3
> Hive version:2.0.0
>
> Sent from NetEase Mail Master 
>
>
>
>
>
>
>


reading EOF exception while reading parquet ile from hadoop

2016-04-20 Thread Naveen Kumar Pokala
Hi,

I am trying to read a parquet file (for example: one.parquet) and create an RDD
out of it.

My Scala program is like below...

val data = sqlContext.read.parquet("hdfs://machine:port/home/user/one.parquet")
  .rdd.map { x => (x.getString(0), x) }

data.count()


I am using spark 1.4 and Hadoop 2.4


java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
parquet.hadoop.ParquetInputSplit.readArray(ParquetInputSplit.java:240)
at parquet.hadoop.ParquetInputSplit.readUTF8(ParquetInputSplit.java:230)
at 
parquet.hadoop.ParquetInputSplit.readFields(ParquetInputSplit.java:197)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at 
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at 
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:45)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
at 
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Thanks,
Naveen Kumar Pokala



Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-20 Thread pierre lacave
Hi


I am trying to use spark to write to a protected zone in hdfs, I am
able to create and list file using the hdfs client but when writing
via Spark I get this exception.

I could not find any mention of CryptoProtocolVersion in the spark doc.


Any idea what could have gone wrong?


spark (1.5.0), hadoop (2.6.1)


Thanks


org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException):
No crypto protocol versions provided by the client are supported.
Client provided: [] NameNode supports:
[CryptoProtocolVersion{description='Unknown', version=1,
unknownValue=null}, CryptoProtocolVersion{description='Encryption
zones', version=2, unknownValue=null}]
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.chooseProtocolVersion(FSNamesystem.java:2468)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2600)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)

at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy13.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:264)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1612)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1488)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1413)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:387)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:383)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:383)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:327)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Re: Exceeding spark.akka.frameSize when saving Word2VecModel

2016-04-20 Thread Stefan Falk

Nobody here who can help me on this? :/

On 19/04/16 13:15, Stefan Falk wrote:

Hello Sparklings!

I am trying to train a word vector model, but when I call 
Word2VecModel#save() I get an org.apache.spark.SparkException 
saying that this would exceed the frameSize limit (stackoverflow 
question [1]).


Increasing the frameSize would only help me in this particular case I 
guess, but as soon as the model exceeds 2GB the problem is going to 
occur again.


I am not aware of my options here - what can I do to save a 
Word2VecModel of arbitrary size?


Thanks for any help!

BR; Stefan

[1] 
http://stackoverflow.com/questions/36692386/exceeding-spark-akka-framesize-when-saving-word2vecmodel
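
One direction that might be worth trying (a sketch only, assuming the mllib 
Word2VecModel; whether it fully sidesteps the frame-size limit depends on your 
setup) is to persist the raw word vectors yourself instead of calling save():

// model and sc are assumed to exist in the job; the output path is a placeholder
val vectors = model.getVectors           // Map[String, Array[Float]]
sc.parallelize(vectors.toSeq, 100)       // many partitions keep each task payload small
  .map { case (word, vec) => word + "\t" + vec.mkString(",") }
  .saveAsTextFile("hdfs:///tmp/word2vec-vectors")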

--
Stefan R. Falk

Know-Center Graz
Inffeldgasse 13 / 6. Stock
8010 Graz, Austria
Email :sf...@know-center.at
Tel: +43 316 873 30869
http://www.know-center.at


--
Stefan R. Falk

Know-Center Graz
Inffeldgasse 13 / 6. Stock
8010 Graz, Austria
Email : sf...@know-center.at
Tel: +43 316 873 30869
http://www.know-center.at



Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Jörn Franke
Python can access the JVM - this is how it interfaces with Spark. Some of the 
components do not have a wrapper for the corresponding Java API yet and thus 
are not accessible in Python.

Same for Elasticsearch. You need to write a more or less simple wrapper.

> On 20 Apr 2016, at 09:53, "kramer2...@126.com"  wrote:
> 
> I am using python and spark. 
> 
> I think one problem might be to communicate spark with third product. For
> example, combine spark with elasticsearch. You have to use java or scala.
> Python is not supported
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805p26806.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any NLP lib could be used on spark?

2016-04-20 Thread Chris Fregly
this took me a bit to get working, but I finally got it up and running with the 
package that Burak pointed out.

here's some relevant links to my project that should give you some clues:

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/spark/ml/src/main/scala/com/advancedspark/ml/nlp/ItemDescriptionsDF.scala

https://github.com/fluxcapacitor/pipeline/blob/master/myapps/spark/ml/build.sbt

https://github.com/fluxcapacitor/pipeline/blob/master/config/bash/pipeline.bashrc
 (look for the SPARK_SUBMIT_PACKAGES export var)

there are also a few Zeppelin notebooks in that repo that will show you how to 
use it.  I'm doing sentiment analysis in one - as well as entity recognition 
and other fun stuff.

it's a pain to set up, unfortunately.  not sure why.  lots of missing pieces 
that had to be manually cobbled together.

> On Apr 19, 2016, at 5:00 PM, Burak Yavuz  wrote:
> 
> A quick search on spark-packages returns: 
> http://spark-packages.org/package/databricks/spark-corenlp.
> 
> You may need to build it locally and add it to your session by --jars.
> 
> On Tue, Apr 19, 2016 at 10:47 AM, Gavin Yue  wrote:
>> Hey, 
>> 
>> Want to try the NLP on the spark. Could anyone recommend any easy to run NLP 
>> open source lib on spark?
>> 
>> Also is there any recommended semantic network? 
>> 
>> Thanks a lot.  
> 


Re: Spark support for Complex Event Processing (CEP)

2016-04-20 Thread Mario Ds Briggs

I did see your earlier post about Stratio Decision. Will read up on it.


thanks
Mario



From:   Alonso Isidoro Roman 
To: Mich Talebzadeh 
Cc: Mario Ds Briggs/India/IBM@IBMIN, Luciano Resende
, "user @spark" 
Date:   20/04/2016 02:24 pm
Subject:Re: Spark support for Complex Event Processing (CEP)



Stratio decision could do the job

https://github.com/Stratio/Decision



Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-04-20 7:55 GMT+02:00 Mich Talebzadeh :
  Thanks a lot Mario. Will have a look.

  Regards,


  Dr Mich Talebzadeh

  LinkedIn
  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

  http://talebzadehmich.wordpress.com




  On 20 April 2016 at 06:53, Mario Ds Briggs 
  wrote:
   Hi Mich,

   Info is here - https://issues.apache.org/jira/browse/SPARK-14745

   overview is in the pdf -
   
https://issues.apache.org/jira/secure/attachment/12799670/SparkStreamingCEP.pdf


   Usage examples not in the best place for now (will make it better) -
   
https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532


   Your feedback is appreciated.


   thanks
   Mario


   From: Mich Talebzadeh 
   To: Mario Ds Briggs/India/IBM@IBMIN
   Cc: "user @spark" , Luciano Resende <
   luckbr1...@gmail.com>
   Date: 19/04/2016 12:45 am
   Subject: Re: Spark support for Complex Event Processing (CEP)




   great stuff Mario. Much appreciated.

   Mich

   Dr Mich Talebzadeh

   LinkedIn
   
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


   http://talebzadehmich.wordpress.com




   On 18 April 2016 at 20:08, Mario Ds Briggs 
   wrote:


 Hey Mich, Luciano

 Will provide links with docs by tomorrow

 thanks
 Mario

 - Message from Mich Talebzadeh on Sun, 17 Apr 2016 19:17:38 +0100 -
 To: Luciano Resende
 cc: "user @spark"
 Subject: Re: Spark support for Complex Event Processing (CEP)

 Thanks Luciano. Appreciated.

 Regards

 Dr Mich Talebzadeh

 LinkedIn
 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


 http://talebzadehmich.wordpress.com




 On 17 April 2016 at 17:32, Luciano Resende 
 wrote:


 Hi Mitch,

 I know some folks that were investigating/prototyping
 on this area, let me see if I can get them to reply
 here with more details.

 On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:
 Hi,

 Has Spark got libraries for CEP using Spark Streaming
 with Kafka by any chance?

 I am looking at Flink that supposed to have these
 libraries for CEP but I find Flink itself very much
 work in progress.

 Thanks

 Dr Mich Talebzadeh

 LinkedIn
 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


 http://talebzadehmich.wordpress.com






 --
 Luciano Resende
 http://twitter.com/lresende1975
 http://lresende.blogspot.com/


Re: pyspark split pair rdd to multiple

2016-04-20 Thread Gourav Sengupta
Is there any reason why you are not using data frames?


Regards,
Gourav

On Tue, Apr 19, 2016 at 8:51 PM, pth001  wrote:

> Hi,
>
> How can I split pair rdd [K, V] to map [K, Array(V)] efficiently in
> Pyspark?
>
> Best,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re:Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi 


the input data size is less than 10M. The task result size should be smaller, I 
think, because I am doing aggregation on the data.






At 2016-04-20 16:18:31, "Jeff Zhang"  wrote:

Do you mean the input data size as 10M or the task result size ?


>>> But my way is to setup a forever loop to handle continued income data. Not 
>>> sure if it is the right way to use spark
Not sure what this mean, do you use spark-streaming, for doing batch job in the 
forever loop ?






On Wed, Apr 20, 2016 at 3:55 PM, 李明伟  wrote:

Hi Jeff


The total size of my data is less than 10M. I already set the driver memory to 
4GB.











On 2016-04-20 13:42:25, "Jeff Zhang"  wrote:

Seems it is OOM in driver side when fetching task result. 


You can try to increase spark.driver.memory and spark.driver.maxResultSize


On Tue, Apr 19, 2016 at 4:06 PM, 李明伟  wrote:

Hi Zhan Zhang




Please see the exception trace below. It is saying some GC overhead limit error
I am not a java or scala developer so it is hard for me to understand these 
infor.
Also reading coredump is too difficult to me..


I am not sure if the way I am using spark is correct. I understand that spark 
can do batch or stream calculation. But my way is to setup a forever loop to 
handle continued income data. 
Not sure if it is the right way to use spark




16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at 
org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC 
overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at 

Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
Do you mean the input data size is 10M, or the task result size?

>>> But my way is to setup a forever loop to handle continued income data. Not
sure if it is the right way to use spark
Not sure what this means. Do you use spark-streaming, or do you run a batch job
in the forever loop?
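
For reference, a rough sketch of raising the driver-side limits discussed in this
thread (the values are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("aggregation-job")            // hypothetical app name
  .set("spark.driver.maxResultSize", "2g")  // cap on results fetched back to the driver
// spark.driver.memory normally has to be set before the driver JVM starts,
// e.g. via `spark-submit --driver-memory 4g` or spark-defaults.conf.
val sc = new SparkContext(conf)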



On Wed, Apr 20, 2016 at 3:55 PM, 李明伟  wrote:

> Hi Jeff
>
> The total size of my data is less than 10M. I already set the driver
> memory to 4GB.
>
>
>
>
>
>
>
> On 2016-04-20 13:42:25, "Jeff Zhang"  wrote:
>
> Seems it is OOM in driver side when fetching task result.
>
> You can try to increase spark.driver.memory and spark.driver.maxResultSize
>
> On Tue, Apr 19, 2016 at 4:06 PM, 李明伟  wrote:
>
>> Hi Zhan Zhang
>>
>>
>> Please see the exception trace below. It is saying some GC overhead limit
>> error
>> I am not a java or scala developer so it is hard for me to understand
>> these infor.
>> Also reading coredump is too difficult to me..
>>
>> I am not sure if the way I am using spark is correct. I understand that
>> spark can do batch or stream calculation. But my way is to setup a forever
>> loop to handle continued income data.
>> Not sure if it is the right way to use spark
>>
>>
>> 16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread
>> task-result-getter-2
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at
>> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>> at
>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>> at
>> org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>> at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>> at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>> at
>> org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>> at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC
>> overhead limit exceeded
>> at
>> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>> at
>> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>> at
>> 

Re:Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi Jeff


The total size of my data is less than 10M. I already set the driver memory to 
4GB.











On 2016-04-20 13:42:25, "Jeff Zhang"  wrote:

Seems it is OOM in driver side when fetching task result. 


You can try to increase spark.driver.memory and spark.driver.maxResultSize


On Tue, Apr 19, 2016 at 4:06 PM, 李明伟  wrote:

Hi Zhan Zhang




Please see the exception trace below. It is saying some GC overhead limit error
I am not a java or scala developer so it is hard for me to understand these 
infor.
Also reading coredump is too difficult to me..


I am not sure if the way I am using spark is correct. I understand that spark 
can do batch or stream calculation. But my way is to setup a forever loop to 
handle continued income data. 
Not sure if it is the right way to use spark




16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at 
org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC 
overhead limit exceeded
at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
at 
scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at 
org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread kramer2...@126.com
I am using python and spark. 

I think one problem might be integrating spark with third-party products. For
example, combining spark with elasticsearch: you have to use java or scala.
Python is not supported.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805p26806.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Zhang, Jingyu
GraphX does not support Python yet.
http://spark.apache.org/docs/latest/graphx-programming-guide.html

The workaround is to use graphframes (a 3rd-party API),
https://issues.apache.org/jira/browse/SPARK-3789

but some features in Python are not the same as in Scala,
https://github.com/graphframes/graphframes/issues/57
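
For a flavour of the Scala side, a minimal GraphFrames sketch (assuming the
graphframes package is on the classpath, e.g. added via --packages; the data
below is made up):

import org.graphframes.GraphFrame

val vertices = sqlContext.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol"))).toDF("id", "name")
val edges = sqlContext.createDataFrame(Seq(
  ("a", "b", "follows"), ("b", "c", "follows"))).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)   // expects "id" on vertices, "src"/"dst" on edges
g.inDegrees.show()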

Jingyu

On 20 April 2016 at 16:52, sujeet jog  wrote:

> It depends on the trade off's you wish to have,
>
> Python being a interpreted language, speed of execution will be lesser,
> but it being a very common language used across, people can jump in hands
> on quickly
>
> Scala programs run in java environment,  so it's obvious you will get good
> execution speed,  although it's not common for people to know this language
> readily.
>
>
> Pyspark API's i believe will have everything which Scala Spark API's offer
> in long run.
>
>
>
> On Wed, Apr 20, 2016 at 12:14 PM, berkerkozan 
> wrote:
>
>> I know scala better than python but my team (2 other my friend) knows only
>> python. We want to use graphx or maybe try graphframes.
>> What will be the future of these 2 languages for spark ecosystem? Will
>> python cover everything scala can in short time periods? what do you
>> advice?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>



Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread sujeet jog
It depends on the trade-offs you wish to make.

Python being an interpreted language, the speed of execution will be lower, but
since it is a very commonly used language, people can get hands-on
quickly.

Scala programs run on the JVM, so you will obviously get good
execution speed, although it's less common for people to know this language
already.


I believe the PySpark APIs will, in the long run, have everything the Scala
Spark APIs offer.



On Wed, Apr 20, 2016 at 12:14 PM, berkerkozan  wrote:

> I know scala better than python but my team (2 other my friend) knows only
> python. We want to use graphx or maybe try graphframes.
> What will be the future of these 2 languages for spark ecosystem? Will
> python cover everything scala can in short time periods? what do you
> advice?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Scala vs Python for Spark ecosystem

2016-04-20 Thread berkerkozan
I know scala better than python but my team (2 other friends) knows only
python. We want to use graphx or maybe try graphframes. 
What will be the future of these 2 languages in the spark ecosystem? Will
python cover everything scala can in a short time period? What do you advise?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Transaction

2016-04-20 Thread Mich Talebzadeh
Are you using JDBC to push data to MSSQL?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 April 2016 at 23:41, Andrés Ivaldi  wrote:

> I mean local transaction, We've ran a Job that writes into SQLServer then
> we killed spark JVM just for testing purpose and we realized that SQLServer
> did a rollback.
>
> Regards
>
> On Tue, Apr 19, 2016 at 5:27 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> What do you mean by *without transaction*? do you mean forcing SQL Server
>> to accept a non logged operation?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 April 2016 at 21:18, Andrés Ivaldi  wrote:
>>
>>> Hello, is possible to execute a SQL write without Transaction? we dont
>>> need transactions to save our data and this adds an overhead to the
>>> SQLServer.
>>>
>>> Regards.
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>