Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you share your initialization code along with the command you are
currently running? I think the main problem is in your code.
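One thing worth checking in the meantime: by default the JDBC source reads the
whole table through a single partition and connection, which is often why a
large PostgreSQL table feels this slow. A minimal sketch of a partitioned read
(the column name "id" and its bounds are hypothetical and must be adjusted to
an indexed numeric column in the real table):

val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
  "dbtable" -> "schema.table",
  "partitionColumn" -> "id",      // hypothetical numeric column
  "lowerBound" -> "1",            // assumed value range of that column
  "upperBound" -> "1000000",
  "numPartitions" -> "10")).load()

With these options Spark issues ten range queries in parallel instead of one
full scan over a single connection.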
On Sep 16, 2016, 8:26 PM, "Benjamin Kim" 
wrote:

> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
> data from PostgreSQL using JDBC? I can get to the table and see the schema,
> but when I do a show, it takes very long or keeps timing out.
>
> The code is simple.
>
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>       "dbtable" -> "schema.table")).load()
>
> jdbcDF.show
>
>
> If anyone can help, please let me know.
>
> Thanks,
> Ben
>
>


Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Yes, Spark always tries to ship the code to the data (not vice versa). But you
should realize that if you run a groupBy or join on a large dataset, the
locally grouped intermediate data still has to be moved from one worker node
to another (that is the shuffle operation, as far as I know). At the end of
the batch processing you can fetch your grouped dataset, but under the hood
there is a lot of network traffic between worker nodes, because your 2 TB of
data was split into 128 MB blocks written to different HDFS DataNodes.

For example: you analyze your workflow and realize that in most cases you
group your data by date (yyyy-mm-dd). In that case you can keep all of a
day's data in one Region Server (if you use the Spark-on-HBase DataFrame
connector), and your "group by date" operation can then be done on the local
worker node without shuffling temporary data to the other worker nodes.
Maybe this article can be useful:
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
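The same idea works without HBase. A minimal sketch, assuming a DataFrame df
with a date column (here hypothetically named event_date): writing the data
partitioned by that column keeps each day's rows together, so later per-day
work reads only the matching directories instead of the full 2 TB.

import org.apache.spark.sql.functions.col

// Co-locate each day's rows on write (df and event_date are hypothetical)
df.write.partitionBy("event_date").parquet("hdfs:///data/events")

// A later per-day aggregation only scans the matching partition directories
val daily = sqlContext.read.parquet("hdfs:///data/events")
  .filter(col("event_date") === "2016-08-01")
  .groupBy(col("event_date"))
  .count()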

2016-08-01 18:56 GMT+03:00 Jestin Ma <jestinwith.a...@gmail.com>:

> Hi Nikolay, I'm looking at data locality improvements for Spark, and I
> have conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere (
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place
> tasks next to HDFS blocks.
>
> Can anyone verify/support one side or the other?
>
> Thank you,
> Jestin
>
> On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <phpap...@gmail.com> wrote:
>
>> Hi.
>> Maybe "data locality" can help you.
>> If you use groupBy and joins, then most likely you will see a lot of network
>> operations. This can be very slow. You can try to prepare and transform your
>> data in a way that minimizes moving temporary data between worker nodes.
>>
>> Try googling "data locality in Hadoop".
>>
>>
>> 2016-08-01 4:41 GMT+03:00 Jestin Ma <jestinwith.a...@gmail.com>:
>>
>>> It seems that the number of tasks being this large do not matter. Each
>>> task was set default by the HDFS as 128 MB (block size) which I've heard to
>>> be ok. I've tried tuning the block (task) size to be larger and smaller to
>>> no avail.
>>>
>>> I tried coalescing to 50 but that introduced large data skew and slowed
>>> down my job a lot.
>>>
>>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <and...@aehrlich.com>
>>> wrote:
>>>
>>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>>> .coalesce(50) placed right after loading the data. It will probably either
>>>> run faster or crash with out of memory errors.
>>>>
>>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com>
>>>> wrote:
>>>>
>>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>>>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>>>> leading to about 15000 tasks.
>>>>
>>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>>> to disk.
>>>>
>>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>>
>>>> I read on the Spark Tuning guide that:
>>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>>
>>>> This means that I should have about 30-50 tasks instead of 15000, and
>>>> each task would be much bigger in size. Is my understanding correct, and is
>>>> this suggested? I've read from difference sources to decrease or increase
>>>> parallelism, or even keep it default.
>>>>
>>>> Thank you for your help,
>>>> Jestin
>>>>
>>>>
>>>>
>>>
>>
>


Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread Nikolay Zhebet
Your exception says that you have a connection problem with the Spark master.

Check that it is reachable from the environment where you are trying to run the
job. On Linux, suitable commands are "telnet 127.0.0.1 7077", "netstat -ntpl |
grep 7077", or "nmap 127.0.0.1 | grep 7077".

Try the Windows equivalents of these commands (for example, the checks sketched
below) and confirm that the Spark master is reachable from the environment you
are running in.
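For example, the following generic Windows checks should work from a command
prompt (the Telnet Client feature may need to be enabled for the second one):

netstat -an | findstr 7077
telnet 127.0.0.1 7077

The first shows whether anything is listening on port 7077; the second checks
whether a connection can actually be established.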

2016-08-01 14:35 GMT+03:00 ayan guha <guha.a...@gmail.com>:

> No, I confirmed the master is running via the Spark UI at localhost:8080
> On 1 Aug 2016 18:22, "Nikolay Zhebet" <phpap...@gmail.com> wrote:
>
>> I think you haven't started the Spark master yet, or maybe 7077 is not the
>> default port of your Spark master.
>>
>> 2016-08-01 4:24 GMT+03:00 ayan guha <guha.a...@gmail.com>:
>>
>>> Hi
>>>
>>> I just downloaded Spark 2.0 on my windows 7 to check it out. However,
>>> not able to set up a standalone cluster:
>>>
>>>
>>> Step 1: master set up (Successful)
>>>
>>> bin/spark-class org.apache.spark.deploy.master.Master
>>>
>>> It did throw an error about not able to find winutils, but started
>>> successfully.
>>>
>>> Step II: Set up Worker (Failed)
>>>
>>> bin/spark-class org.apache.spark.deploy.worker.Worker
>>> spark://localhost:7077
>>>
>>> This step fails with following error:
>>>
>>> 16/08/01 11:21:27 INFO Worker: Connecting to master localhost:7077...
>>> 16/08/01 11:21:28 WARN Worker: Failed to connect to master localhost:7077
>>> org.apache.spark.SparkException: Exception thrown in awaitResult
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>>> la:77)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>>> la:75)
>>> at
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.s
>>> cala:36)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>>> rElse(RpcTimeout.scala:59)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>>> rElse(RpcTimeout.scala:59)
>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>> at
>>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>>> at
>>> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>>> at
>>> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deplo
>>> y$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>>> Source)
>>> at java.util.concurrent.FutureTask.run(Unknown Source)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
>>> Source)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
>>> Source)
>>> at java.lang.Thread.run(Unknown Source)
>>> Caused by: java.io.IOException: Failed to connect to localhost/
>>> 127.0.0.1:7077
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>>> ransportClientFactory.java:228)
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>>> ransportClientFactory.java:179)
>>> at
>>> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala
>>> :197)
>>> at
>>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
>>> at
>>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
>>> ... 4 more
>>> Caused by: java.net.ConnectException: Connection refused: no further
>>> information
>>> : localhost/127.0.0.1:7077
>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>>> at
>>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocke
>>> tChannel.java:224)
>>> at
>>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConne
>>> ct(AbstractNioChannel.java:289)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
>>> a:528)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
>>> ntLoop.java:468)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
>>> va:382)
>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
>>> EventExecutor.java:111)
>>> ... 1 more
>>>
>>> Am I doing something wrong?
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>


Re: multiple spark streaming contexts

2016-08-01 Thread Nikolay Zhebet
You can always save data in HDFS wherever you need, and you can control
parallelism in your app by configuring --driver-cores and --driver-memory.
With this approach the Spark master handles your failure issues, data
locality, and so on. But if you want to control it yourself with
"Executors.newFixedThreadPool(threadNum)" or other means, I think you may run
into problems with the YARN/Mesos job recovery and failure mechanisms.
I wish you good luck in your struggle with parallelism )) This is an
interesting question!)
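For what it's worth, here is a minimal sketch of the single-context approach
that matches the per-topic HDFS requirement mentioned below: one
StreamingContext, one direct stream per topic, each persisted to its own path.
The topic names, paths, and kafkaParams are hypothetical.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("multi-topic-streams")
val ssc = new StreamingContext(conf, Seconds(10))   // one batch duration for all streams

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
val topics = Seq("orders", "clicks")                              // hypothetical topics

topics.foreach { topic =>
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
  // each topic is persisted under its own HDFS prefix
  stream.map(_._2).saveAsTextFiles(s"hdfs:///data/$topic/batch")
}

ssc.start()
ssc.awaitTermination()

Note that all of these streams share the context's single batch duration;
truly different batch durations per topic would still require separate
contexts or separate applications, as discussed in the Stack Overflow links
quoted further down.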

2016-08-01 10:41 GMT+03:00 Sumit Khanna <sumit.kha...@askme.in>:

> Hey Nikolay,
>
> I know the approach, but this pretty much doesn't fit the bill for my
> use case, wherein each topic needs to be logged / persisted to a separate
> hdfs location.
>
> I am looking for something where a streaming context pertains to a topic
> and that topic only, and was wondering if I could have them all in parallel
> in one app / jar run.
>
> Thanks,
>
> On Mon, Aug 1, 2016 at 1:08 PM, Nikolay Zhebet <phpap...@gmail.com> wrote:
>
>> Hi. If you want to read several Kafka topics in a Spark Streaming job, you
>> can pass the topic names separated by commas and then read all messages
>> from all topics in one stream:
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>>
>> val lines = KafkaUtils.createStream[String, String, StringDecoder, 
>> StringDecoder](ssc, kafkaParams, topicMap, 
>> StorageLevel.MEMORY_ONLY).map(_._2)
>>
>>
>> After that you can use the ".filter" function to split the stream by topic and
>> iterate over the messages separately.
>>
>> val orders_paid = lines.filter(x => { x("table_name") == 
>> "kismia.orders_paid"})
>>
>> orders_paid.foreachRDD( rdd => { 
>>
>>
>> Or you can use an if..else construction to split your messages by name
>> inside foreachRDD:
>>
>> lines.foreachRDD((recrdd, time: Time) => {
>>
>>recrdd.foreachPartition(part => {
>>
>>   part.foreach(item_row => {
>>
>>  if (item_row("table_name") == "kismia.orders_paid") { ...} else if 
>> (...) {...}
>>
>> 
>>
>>
>> 2016-08-01 9:39 GMT+03:00 Sumit Khanna <sumit.kha...@askme.in>:
>>
>>> Any ideas guys? What are the best practices for multiple streams to be
>>> processed?
>>> I could trace a few Stack overflow comments wherein they better
>>> recommend a jar separate for each stream / use case. But that isn't pretty
>>> much what I want, as in it's better if one / multiple spark streaming
>>> contexts can all be handled well within a single jar.
>>>
>>> Guys please reply,
>>>
>>> Awaiting,
>>>
>>> Thanks,
>>> Sumit
>>>
>>> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna <sumit.kha...@askme.in>
>>> wrote:
>>>
>>>> Any ideas on this one guys ?
>>>>
>>>> I can do a sample run but can't be sure of imminent problems if any?
>>>> How can I ensure different batchDuration etc etc in here, per
>>>> StreamingContext.
>>>>
>>>> Thanks,
>>>>
>>>> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna <sumit.kha...@askme.in>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Was wondering if I could create multiple spark stream contexts in my
>>>>> application (e.g instantiating a worker actor per topic and it has its own
>>>>> streaming context its own batch duration everything).
>>>>>
>>>>> What are the caveats if any?
>>>>> What are the best practices?
>>>>>
>>>>> Have googled half heartedly on the same but the air isn't pretty much
>>>>> demystified yet. I could skim through something like
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>>>>>
>>>>> Thanks in Advance!
>>>>> Sumit
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread Nikolay Zhebet
I think you haven't started the Spark master yet, or maybe 7077 is not the
default port of your Spark master.

2016-08-01 4:24 GMT+03:00 ayan guha :

> Hi
>
> I just downloaded Spark 2.0 on my windows 7 to check it out. However, not
> able to set up a standalone cluster:
>
>
> Step 1: master set up (Successful)
>
> bin/spark-class org.apache.spark.deploy.master.Master
>
> It did throw an error about not able to find winutils, but started
> successfully.
>
> Step II: Set up Worker (Failed)
>
> bin/spark-class org.apache.spark.deploy.worker.Worker
> spark://localhost:7077
>
> This step fails with following error:
>
> 16/08/01 11:21:27 INFO Worker: Connecting to master localhost:7077...
> 16/08/01 11:21:28 WARN Worker: Failed to connect to master localhost:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
> la:77)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
> la:75)
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.s
> cala:36)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
> rElse(RpcTimeout.scala:59)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
> rElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
> at
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deplo
> y$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: Failed to connect to localhost/
> 127.0.0.1:7077
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(T
> ransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(T
> ransportClientFactory.java:179)
> at
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala
> :197)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
> ... 4 more
> Caused by: java.net.ConnectException: Connection refused: no further
> information
> : localhost/127.0.0.1:7077
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocke
> tChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConne
> ct(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
> a:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
> ntLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
> va:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
> EventExecutor.java:111)
> ... 1 more
>
> Am I doing something wrong?
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Hi.
Maybe "data locality" can help you.
If you use groupBy and joins, then most likely you will see a lot of network
operations. This can be very slow. You can try to prepare and transform your
data in a way that minimizes moving temporary data between worker nodes.

Try googling "data locality in Hadoop".


2016-08-01 4:41 GMT+03:00 Jestin Ma :

> It seems that the number of tasks being this large do not matter. Each
> task was set default by the HDFS as 128 MB (block size) which I've heard to
> be ok. I've tried tuning the block (task) size to be larger and smaller to
> no avail.
>
> I tried coalescing to 50 but that introduced large data skew and slowed
> down my job a lot.
>
> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich 
> wrote:
>
>> 15000 seems like a lot of tasks for that size. Test it out with a
>> .coalesce(50) placed right after loading the data. It will probably either
>> run faster or crash with out of memory errors.
>>
>> On Jul 29, 2016, at 9:02 AM, Jestin Ma  wrote:
>>
>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>> leading to about 15000 tasks.
>>
>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>> I'm performing groupBy, count, and an outer-join with another DataFrame
>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>> to disk.
>>
>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>
>> I read on the Spark Tuning guide that:
>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>
>> This means that I should have about 30-50 tasks instead of 15000, and
>> each task would be much bigger in size. Is my understanding correct, and is
>> this suggested? I've read from difference sources to decrease or increase
>> parallelism, or even keep it default.
>>
>> Thank you for your help,
>> Jestin
>>
>>
>>
>


Re: multiple spark streaming contexts

2016-08-01 Thread Nikolay Zhebet
Hi. If you want to read several Kafka topics in a Spark Streaming job, you can
pass the topic names separated by commas and then read all messages from all
topics in one stream:

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

val lines = KafkaUtils.createStream[String, String, StringDecoder,
StringDecoder](ssc, kafkaParams, topicMap,
StorageLevel.MEMORY_ONLY).map(_._2)


After that you can use the ".filter" function to split the stream by topic
and iterate over the messages separately.

val orders_paid = lines.filter(x => { x("table_name") == "kismia.orders_paid"})

orders_paid.foreachRDD( rdd => { 


Or you can use an if..else construction to split your messages by name inside
foreachRDD:

lines.foreachRDD((recrdd, time: Time) => {
   recrdd.foreachPartition(part => {
      part.foreach(item_row => {
         if (item_row("table_name") == "kismia.orders_paid") { ... } else if (...) { ... }
      })
   })
})




2016-08-01 9:39 GMT+03:00 Sumit Khanna :

> Any ideas guys? What are the best practices for multiple streams to be
> processed?
> I could trace a few Stack overflow comments wherein they better recommend
> a jar separate for each stream / use case. But that isn't pretty much what
> I want, as in it's better if one / multiple spark streaming contexts can
> all be handled well within a single jar.
>
> Guys please reply,
>
> Awaiting,
>
> Thanks,
> Sumit
>
> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna 
> wrote:
>
>> Any ideas on this one guys ?
>>
>> I can do a sample run but can't be sure of imminent problems if any? How
>> can I ensure different batchDuration etc etc in here, per StreamingContext.
>>
>> Thanks,
>>
>> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna 
>> wrote:
>>
>>> Hey,
>>>
>>> Was wondering if I could create multiple spark stream contexts in my
>>> application (e.g instantiating a worker actor per topic and it has its own
>>> streaming context its own batch duration everything).
>>>
>>> What are the caveats if any?
>>> What are the best practices?
>>>
>>> Have googled half heartedly on the same but the air isn't pretty much
>>> demystified yet. I could skim through something like
>>>
>>>
>>> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>>>
>>>
>>> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>>>
>>> Thanks in Advance!
>>> Sumit
>>>
>>
>>
>


Re: spark.read.format("jdbc")

2016-08-01 Thread Nikolay Zhebet
You should specify the classpath for your JDBC connection.
For example, if you want to connect to Impala, you can try this snippet:



import java.util.Properties
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import java.sql.Connection
import java.sql.DriverManager

// Register the Impala JDBC driver
Class.forName("com.cloudera.impala.jdbc41.Driver")

// Open a plain JDBC connection
var conn: java.sql.Connection = null
conn = DriverManager.getConnection(
  "jdbc:impala://127.0.0.1:21050/default;auth=noSasl", "", "")
val statement = conn.createStatement()

// Read
val result = statement.executeQuery("SELECT * FROM users limit 10")
result.next()
result.getString("user_id")

// Write
val sql_insert = "INSERT INTO users VALUES('user_id','email','gender')"
statement.executeUpdate(sql_insert)


Also, you should pass the path to your JDBC jar file via the
--driver-class-path option when you run spark-submit or spark-shell:

spark-shell --master "local[2]" --driver-class-path /opt/cloudera/parcels/CDH/jars/ImpalaJDBC41.jar

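For the DataFrame API itself, here is a minimal Spark 2.0 sketch (PostgreSQL
and the connection details are hypothetical; the "driver" option names the
class inside the jar passed via --driver-class-path). It assumes the duplicate
Spark jars suggested below have been cleaned up, since having both the 1.x and
2.0 JDBC sources on the classpath is what the "Multiple sources found for
jdbc" error points to:

val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbserver:5432/database")
  .option("dbtable", "schema.table")
  .option("user", "user")
  .option("password", "password")
  .option("driver", "org.postgresql.Driver")   // driver class from the jar on the classpath
  .load()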

2016-08-01 9:37 GMT+03:00 kevin :

> maybe there is another version of Spark on the classpath?
>
> 2016-08-01 14:30 GMT+08:00 kevin :
>
>> hi,all:
>>    I try to load data from a jdbc datasource, but I got this error:
>> java.lang.RuntimeException: Multiple sources found for jdbc
>> (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider,
>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please
>> specify the fully qualified class name.
>>
>> spark version is 2.0
>>
>>
>