Re: Continue reading dataframe from file despite errors

2017-09-12 Thread Suresh Thalamati
Try the CSV option("mode", "DROPMALFORMED"); that might skip the error records.
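
A rough sketch of what that would look like with the read statement from your message below (untested; SomeSchema and somepath are the placeholders from your own snippet):

val df = spark.read
  .schema(SomeSchema)                      // your schema
  .option("sep", "\t")
  .option("mode", "DROPMALFORMED")         // drop rows that fail to parse against the schema
  .format("csv")
  .load("somepath")                        // your input path

Another option worth looking at is "PERMISSIVE" mode, which keeps the rows and sets the fields it cannot parse to null instead of dropping the whole record.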


> On Sep 12, 2017, at 2:33 PM, jeff saremi  wrote:
> 
> should have added some of the exception to be clear:
> 
> 17/09/12 14:14:17 ERROR TaskSetManager: Task 0 in stage 15.0 failed 1 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 
> (TID 15, localhost, executor driver): java.lang.NumberFormatException: For 
> input string: "south carolina"
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:250)
> 
> From: jeff saremi 
> Sent: Tuesday, September 12, 2017 2:32:03 PM
> To: user@spark.apache.org
> Subject: Continue reading dataframe from file despite errors
>  
> I'm using a statement like the following to load my dataframe from some text 
> file. Upon encountering the first error, the whole thing throws an exception 
> and processing stops. I'd like to continue loading even if that results in 
> zero rows in my dataframe. How can I do that?
> thanks
> 
> spark.read.schema(SomeSchema).option("sep", 
> "\t").format("csv").load("somepath")



Re: Dataset count on database or parquet

2017-02-09 Thread Suresh Thalamati
If you have to get the data into parquet format for other reasons anyway, then I 
think count() on the parquet should be better. If it is just the count you need, 
sending dbtable = "(select count(*) from ...)" to the database might be quicker; 
it will avoid unnecessary data transfer from the database to Spark.
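
For example, a rough sketch of pushing the count to the database via the dbtable option (untested; jdbcUrl, jdbcDriver and some_table are placeholders for your own settings):

val countOnly = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)          // your JDBC url
  .option("driver", jdbcDriver)    // your JDBC driver class
  .option("dbtable", "(select count(*) as cnt from some_table) t")  // count runs on the database
  .load()

val rowCount = countOnly.collect().head.get(0)   // single row, single column holding the count

That said, if the real goal is the count of rows actually persisted to parquet, reading the parquet back and calling count() (your approach 2) is the safer check.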


Hope that helps
-suresh

> On Feb 8, 2017, at 2:58 AM, Rohit Verma  wrote:
> 
> Hi, which of the following is the better approach when the database holds a very large number of rows?
> 
>   final Dataset<Row> dataset = spark.sqlContext().read()
> .format("jdbc")
> .option("url", params.getJdbcUrl())
> .option("driver", params.getDriver())
> .option("dbtable", params.getSqlQuery())
> //.option("partitionColumn", hashFunction)
> //.option("lowerBound", 0)
> //.option("upperBound", 10)
> //.option("numPartitions", 10)
> //.option("oracle.jdbc.timezoneAsRegion", "false")
> .option("fetchSize", 10)
> .load();
> dataset.write().parquet(params.getPath());
> 
> // target is to get count of persisted rows.
> 
> 
> // approach 1 i.e. getting count directly from the dataset
> // as I understood, this count will be translated to jdbcRdd.count and 
> // could be done on the database
> long count = dataset.count();
> //approach 2 i.e read back saved parquet and get count from it. 
> long count = spark.read().parquet(params.getPath()).count();
> 
> 
> Regards
> Rohit



Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-26 Thread Suresh Thalamati
I notice the columns are quoted with double quotes in the error message 
('"user","age","state"'). By any chance did you override the MySQL JDBC dialect? 
By default, MySQL identifiers are quoted with backticks:
override def quoteIdentifier(colName: String): String = {
  s"`$colName`"
}
Just wondering if the error you are running into is related to quotes. 
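
For reference, Spark's built-in MySQLDialect already quotes identifiers with backticks, so a custom dialect should only be needed if something is overriding it. If you do register one, a minimal version would look roughly like this (untested sketch; MyMySQLDialect is just an illustrative name):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object MyMySQLDialect extends JdbcDialect {
  // Claim only MySQL JDBC urls
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  // MySQL quotes identifiers with backticks, not double quotes
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

JdbcDialects.registerDialect(MyMySQLDialect)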

Thanks
-suresh


> On Jan 26, 2017, at 1:28 AM, Didac Gil  wrote:
> 
> Are you sure that "age" is a numeric field?
> 
> Even if it is numeric, you could pass the "44" between quotes: 
> 
> INSERT INTO your_table ("user","age","state") VALUES ('user3','44','CT')
> 
> Are you sure there are no more fields that are specified as NOT NULL, and 
> that you did not provide a value (besides user, age and state)?
> 
> 
>> On 26 Jan 2017, at 04:42, Xuan Dzung Doan  
>> wrote:
>> 
>> Hi,
>> 
>> Spark version 2.1.0
>> MySQL community server version 5.7.17
>> MySQL Connector Java 5.1.40
>> 
>> I need to save a dataframe to a MySQL table. In spark shell, the following 
>> statement succeeds:
>> 
>> scala> df.write.mode(SaveMode.Append).format("jdbc").option("url", 
>> "jdbc:mysql://127.0.0.1:3306/mydb").option("dbtable", 
>> "person").option("user", "username").option("password", "password").save()
>> 
>> I write an app that basically does the same thing, issuing the same 
>> statement saving the same dataframe to the same MySQL table. I run it using 
>> spark-submit, but it fails, reporting some error in the SQL syntax. Here's 
>> the detailed stack trace:
>> 
>> 17/01/25 16:06:02 INFO DAGScheduler: Job 2 failed: save at 
>> DataIngestionJob.scala:119, took 0.159574 s
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
>> to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: 
>> Lost task 0.0 in stage 2.0 (TID 3, localhost, executor driver): 
>> java.sql.BatchUpdateException: You have an error in your SQL syntax; check 
>> the manual that corresponds to your MySQL server version for the right 
>> syntax to use near '"user","age","state") VALUES ('user3',44,'CT')' at line 1
>>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>  at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>  at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
>>  at com.mysql.jdbc.Util.getInstance(Util.java:408)
>>  at 
>> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1162)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1773)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1257)
>>  at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:958)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
>>  at 
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You 
>> have an error in your SQL syntax; check the manual that corresponds to your 
>> MySQL server version for the right syntax to use near '"user","age","state") 
>> VALUES ('user3',44,'CT')' at line 1
>>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>  at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>  at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
>>  at com.mysql.jdbc.Util.getInstance(Util.java:408)
>>  at 

Re: Deep learning libraries for scala

2016-09-30 Thread Suresh Thalamati
TensorFrames

https://spark-packages.org/package/databricks/tensorframes 


Hope that helps
-suresh

> On Sep 30, 2016, at 8:00 PM, janardhan shetty  wrote:
> 
> Looking for scala dataframes in particular ?
> 
> On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue  > wrote:
> Skymind you could try. It is java
> 
> I never test though.
> 
> > On Sep 30, 2016, at 7:30 PM, janardhan shetty  > > wrote:
> >
> > Hi,
> >
> > Are there any good libraries which can be used for scala deep learning 
> > models ?
> > How can we integrate tensorflow with scala ML ?
> 



Re: Spark_JDBC_Partitions

2016-09-13 Thread Suresh Thalamati
There is also another jdbc method in the DataFrame reader API to specify your 
own predicates for each partition. Using this you can control what is included 
in each partition.

import java.util.Properties

// Each element of the predicates array becomes the WHERE clause of one partition.
val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
  urlWithUserAndPass,            // your JDBC url with credentials
  "TEST.PEOPLE",
  predicates = jdbcPartitionWhereClause,
  connectionProperties = new Properties())
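
If the table also has a reasonably evenly distributed numeric column, the column-based overload of the same API is another option (rough sketch; the column name and bounds below are placeholders you would replace with your own):

// The bounds only set the partition stride; rows outside them are still read (into the first/last partitions).
val partitionedDF = spark.read.jdbc(
  urlWithUserAndPass,
  "TEST.PEOPLE",
  columnName = "id",            // numeric column to split on (placeholder)
  lowerBound = 0L,              // approximate min of the column (placeholder)
  upperBound = 1000000L,        // approximate max of the column (placeholder)
  numPartitions = 10,
  connectionProperties = new java.util.Properties())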


Hope that helps. 
-suresh


> On Sep 13, 2016, at 9:44 AM, Rabin Banerjee  
> wrote:
> 
> Trust me, the only thing that can help you in your situation is the Sqoop Oracle 
> direct connector, which is known as OraOop. Spark cannot do everything; you 
> need an Oozie workflow which will trigger a Sqoop job with the Oracle direct 
> connector to pull the data, then a Spark batch to process it.
> 
> Hope it helps !!
> 
> On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic  > wrote:
> Hi, 
> 
> One way can be to use NTILE function to partition data. 
> Example:
> 
> REM Creating test table
> create table Test_part as select * from ( select rownum rn from all_tables t1 
> ) where rn <= 1000;
> 
> REM Partition lines by Oracle block number, 11 partitions in this example. 
> select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt 
> from Test_part
> 
> 
> Let's see distribution: 
> 
> select nt, count(*) from ( select ntile(11) over( order by 
> dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part) group by nt;
> 
> NT   COUNT(*)
> -- --
>  1 10
>  6 10
> 11  9
>  2 10
>  4 10
>  5 10
>  8 10
>  3 10
>  7 10
>  9  9
> 10  9
> 
> 11 rows selected.
> ^^ It looks good. Feel free to choose any other condition to order your 
> lines, whichever best suits your case
> 
> So you can 
> 1) have one session reading and then decide where line goes (1 reader )
> 2) Or do multiple reads by specifying the partition number. Note that in this 
> case you read the whole table n times (in parallel), which is more intensive on 
> the read side. (multiple readers)
> 
> Regards, 
> Igor
> 
> 
> 
> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh  >:
> Good points
> 
> Unfortunately datapump, exp and imp use binary formats for export and import, 
> which cannot be used to load data into HDFS in a suitable way.
> 
> One can use a flat-file extract script to get the data out as tab- or 
> comma-separated files, etc.
> 
> ROWNUM is a pseudocolumn (not a real column) that is available in a query. 
> The issue is that in a table of 280 million rows, getting the position of a 
> row requires a table scan, since no index can be built on it (assuming there 
> is no other suitable index). Not ideal, but it can be done.
> 
> I think a better alternative is to use datapump to take that table to 
> DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a unique 
> index on the sequence column and do the partitioning there.
> 
> HTH
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 10 September 2016 at 22:37, ayan guha  > wrote:
> In Oracle, something called ROWNUM is present in every row. You can create an 
> even distribution using that column. If it is one-time work, try using Sqoop. 
> Are you using Oracle's own appliance? Then you can use the Data Pump format.
> 
> On 11 Sep 2016 01:59, "Mich Talebzadeh"  > wrote:
> creating an Oracle sequence for a table of 200million is not going to be that 
> easy without changing the schema. It is possible to export that table from 
> prod and import it to DEV/TEST and create the sequence there.
> 
> If it is a FACT table then the foreign keys from the Dimension tables will be 
> bitmap indexes on the FACT table so they can be potentially used.
> 
> HTH
> 

Re: JDBC SQL Server RDD

2016-05-17 Thread Suresh Thalamati
What is the error you are getting?

At least on the main code line, I see JDBCRDD is marked as private[sql]. 
A simple alternative might be to call SQL Server using the DataFrame API, and 
get an RDD from the DataFrame.

eg:

val df = sqlContext.read.jdbc(
  "jdbc:sqlserver://usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com;database=ProdAWS;user=sa;password=?s3iY2mv6.H",
  "(select CTRY_NA, CTRY_SHRT_NA from dbo.CTRY) t",   // SQL Server needs an alias on the subquery
  new java.util.Properties())

val rdd = df.rdd


Hope that helps
-suresh

> On May 15, 2016, at 12:05 PM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi ,
> 
> I am trying to test sql server connection with JDBC RDD but unable to connect.
> 
> val myRDD = new JdbcRDD( sparkContext, () => 
> DriverManager.getConnection(sqlServerConnectionString) ,
>   "select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY limit ?, ?",
>   0, 5, 1, r => r.getString("CTRY_NA") + ", " + 
> r.getString("CTRY_SHRT_NA"))
> 
> 
> sqlServerConnectionString here is 
> jdbc:sqlserver://usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com;database=ProdAWS;user=sa;password=?s3iY2mv6.H
> 
> 
> Can you please let me know what I am doing wrong. I tried solutions from all 
> the forums but didn't have any luck.
> 
> Thanks,
> Asmath.



Re: Microsoft SQL dialect issues

2016-03-15 Thread Suresh Thalamati
You should be able to register your own dialect if the default mappings are  
not working for your scenario.

import org.apache.spark.sql.jdbc.JdbcDialects

JdbcDialects.registerDialect(MyDialect)
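
Here MyDialect is your own implementation. For SQL Server, the piece you usually need is the type mapping; a minimal sketch might look like the following (untested; the mappings below are only an illustration, adjust them for your server version):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object MyDialect extends JdbcDialect {
  // Claim only SQL Server JDBC urls
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Example type overrides; the generic mapping does not always produce valid SQL Server DDL
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType    => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
    case BooleanType   => Some(JdbcType("BIT", java.sql.Types.BIT))
    case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
    case _             => None   // fall back to the default mapping
  }
}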

Please refer to JdbcDialects.scala to find examples of the existing default 
dialects for your database or other databases:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
 

https://github.com/apache/spark/tree/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/jdbc
 


 


> On Mar 15, 2016, at 12:41 PM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> Can you please clarify what you are trying to achieve? I guess you mean 
> Transact-SQL for MSSQL?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 15 March 2016 at 19:09, Andrés Ivaldi  > wrote:
> Hello, I'm trying to use MSSQL, storing data on MSSQL, but I'm having dialect 
> problems.
> I found this
> https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E
>  
> 
> 
> That is what is happening to me. Is it possible to define the dialect, so I 
> can override the default for SQL Server?
> 
> Regards. 
> 
> -- 
> Ing. Ivaldi Andres
> 



Re: Error reading a CSV

2016-02-24 Thread Suresh Thalamati
Try creating the /user/hive/warehouse/ directory if it does not exist, and 
check that it has write permission for the user. Note the lower case 'user' in 
the path.

> On Feb 24, 2016, at 2:42 PM, skunkwerk  wrote:
> 
> I have downloaded the Spark binary with Hadoop 2.6.
> When I run the spark-sql program like this with the CSV library:
> ./bin/spark-sql --packages com.databricks:spark-csv_2.11:1.3.0
> 
> I get into the console for spark-sql.
> However, when I try to import a CSV file from my local filesystem:
> 
> CREATE TABLE customerview USING com.databricks.spark.csv OPTIONS (path
> "/Users/imran/Downloads/test.csv", header "true", inferSchema "true");
> 
> I get the following error:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/test is not a directory or
> unable to create one)
> 
> http://pastebin.com/BfyVv14U
> 
> How can I fix this?
> 
> thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-a-CSV-tp26329.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 




