Spark Application Log

2016-09-21 Thread Divya Gehlot
Hi,
I have initialised the logging in my spark App

/*Initialize Logging */
val log = Logger.getLogger(getClass.getName)
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

log.warn("Some text"+Somemap.size)


When I run my Spark job using spark-submit as below

spark-submit \
--master yarn-client \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 1 \
--num-executors 2 \
--class MainClass /home/hadoop/Spark-assembly-1.0.jar

I can see the log in the terminal itself:

16/09/22 03:45:31 WARN MainClass$: SomeText  : 10


When I set up this job in a scheduler, where can I see these logs?


Thanks,

Divya


Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Sean Owen
No, Xmx only controls the maximum size of on-heap allocated memory.
The JVM doesn't manage/limit off-heap (how could it? it doesn't know
when it can be released).

The answer is that YARN will kill the process because it's using more
memory than it asked for. A JVM is always going to use a little
off-heap memory by itself, so setting a max heap size of 2GB means the
JVM process may use a bit more than 2GB of memory. With an off-heap
intensive app like Spark it can be a lot more.

There's a built-in 10% overhead, so that if you ask for a 3GB executor
it will ask for 3.3GB from YARN. You can increase the overhead.
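
For reference, a minimal sketch of raising that overhead from code (assuming the
spark.yarn.executor.memoryOverhead setting of this era, which is given in
megabytes; the 1024 value is illustrative and can also be passed with --conf on
spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Ask YARN for extra off-heap headroom on top of the 3GB executor heap.
val conf = new SparkConf()
  .setAppName("tungsten-heavy-job")
  .set("spark.executor.memory", "3g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB; default is max(384, 10% of executor memory)
val sc = new SparkContext(conf)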

On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke  wrote:
> All off-heap memory is still managed by the JVM process. If you limit the
> memory of this process then you limit the memory. I think the memory of the
> JVM process could be limited via the xms/xmx parameter of the JVM. This can
> be configured via spark options for yarn (be aware that they are different
> in cluster and client mode), but i recommend to use the spark options for
> the off heap maximum.
>
> https://spark.apache.org/docs/latest/running-on-yarn.html
>
>
> On 21 Sep 2016, at 22:02, Michael Segel  wrote:
>
> I’ve asked this question a couple of times from a friend who didn’t know
> the answer… so I thought I would try here.
>
>
> Suppose we launch a job on a cluster (YARN) and we have set up the
> containers to be 3GB in size.
>
>
> What does that 3GB represent?
>
> I mean what happens if we end up using 2-3GB of off heap storage via
> tungsten?
> What will Spark do?
> Will it try to honor the container’s limits and throw an exception or will
> it allow my job to grab that amount of memory and exceed YARN’s
> expectations since its off heap?
>
> Thx
>
> -Mike
>




Re: Sqoop vs spark jdbc

2016-09-21 Thread Don Drake
We just had this conversation at work today.  We have a long sqoop pipeline
and I argued to keep it in sqoop since we can take advantage of OraOop
(direct mode) for performance and spark can't match that AFAIK.  Sqoop also
allows us to write directly into parquet format, which then Spark can read
optimally.
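
The hand-off described above is then just a plain Parquet read on the Spark
side; a minimal sketch, with a made-up HDFS path:

// Spark reads the Parquet files Sqoop wrote and gets column pruning and
// predicate pushdown from the format; the path below is hypothetical.
val salesDF = sqlContext.read.parquet("hdfs:///warehouse/sales_parquet") // or spark.read.parquet in 2.0
salesDF.printSchema()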

In regards to the MB_MILLIS_MAP exception, I ran into this same exception a
while ago running Titan DB map/reduce jobs.  The cause was a mismatch in
.jars in the distribution.  Your CLASSPATH probably contains some old .jar
files.

-Don

On Wed, Sep 21, 2016 at 6:17 PM, Mich Talebzadeh 
wrote:

> I do not know why this is happening.
>
> Trying to load an Hbase table at command line
>
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> -Dimporttsv.separator=',' -Dimporttsv.columns="HBASE_ROW_KEY,c1,c2" t2
> hdfs://rhes564:9000/tmp/crap.txt
>
> Comes back with this error
>
>
> 2016-09-22 00:12:46,576 INFO  [main] mapreduce.JobSubmitter: Submitting
> tokens for job: job_1474455325627_0052
> 2016-09-22 00:12:46,755 INFO  [main] impl.YarnClientImpl: Submitted
> application application_1474455325627_0052 to ResourceManager at rhes564/
> 50.140.197.217:8032
> 2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: The url to track the
> job: http://http://rhes564:8088/proxy/application_1474455325627_0052/
> 2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: Running job:
> job_1474455325627_0052
> 2016-09-22 00:12:55,913 INFO  [main] mapreduce.Job: Job
> job_1474455325627_0052 running in uber mode : false
> 2016-09-22 00:12:55,915 INFO  [main] mapreduce.Job:  map 0% reduce 0%
> 2016-09-22 00:13:01,994 INFO  [main] mapreduce.Job:  map 100% reduce 0%
> 2016-09-22 00:13:03,008 INFO  [main] mapreduce.Job: Job
> job_1474455325627_0052 completed successfully
> Exception in thread "main" java.lang.IllegalArgumentException: No enum
> constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
> at java.lang.Enum.valueOf(Enum.java:238)
> at org.apache.hadoop.mapreduce.counters.
> FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
> at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.
> findCounter(FrameworkCounterGroup.java:182)
> at org.apache.hadoop.mapreduce.counters.AbstractCounters.
> findCounter(AbstractCounters.java:154)
> at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(
> TypeConverter.java:240)
> at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(
> ClientServiceDelegate.java:370)
> at org.apache.hadoop.mapred.YARNRunner.getJobCounters(
> YARNRunner.java:511)
> at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
> at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1491)
> at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
> at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.
> java:1361)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.
> java:1289)
> at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(
> ImportTsv.java:680)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(
> ImportTsv.java:684)
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 September 2016 at 21:47, Jörn Franke  wrote:
>
>> I think there might be still something messed up with the classpath. It
>> complains in the logs about deprecated jars and deprecated configuration
>> files.
>>
>> On 21 Sep 2016, at 22:21, Mich Talebzadeh 
>> wrote:
>>
>> Well I am left to use Spark for importing data from RDBMS table to Hadoop.
>>
>> You may argue why and it is because Spark does it in one process and no
>> errors
>>
>> With sqoop I am getting this error message which leaves the RDBMS table
>> data on HDFS file but stops there.
>>
>> 2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data
>> Connector for Oracle and Hadoop is disabled.
>> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using
>> default fetchSize of 1000
>> 2016-09-21 

Re: Hbase Connection not serializable in Spark -> foreachrdd

2016-09-21 Thread Tathagata Das
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
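
In short, the pattern from that guide is to create the connection inside
foreachPartition on the executors instead of capturing it in the closure. A
minimal sketch using the names from the original post, assuming
hBaseEntityManager itself is usable on the executors:

appEventDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreachPartition { records =>
      // Created here, on the executor, once per partition -- nothing
      // non-serializable has to be shipped with the closure.
      val connection = hBaseEntityManager.getConnection()
      records.foreach { entity =>
        generatePut(hBaseEntityManager, connection,
          entity.getClass.getSimpleName, entity.asInstanceOf[DataPoint])
      }
      connection.close() // assuming the connection type is Closeable
    }
  }
}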

On Wed, Sep 21, 2016 at 4:26 PM, ayan guha  wrote:

> Connection object is not serialisable. You need to implement a getorcreate
> function which would run on each executors to create hbase connection
> locally.
> On 22 Sep 2016 08:34, "KhajaAsmath Mohammed" 
> wrote:
>
>> Hello Everyone,
>>
>> I am running spark application to push data from kafka. I am able to get
>> hbase kerberos connection successfully outside of functon before calling
>> foreachrdd on Dstream.
>>
>> Job fails inside foreachrdd stating that hbaseconnection object is not
>> serialized. could you please let me now  how toresolve this.
>>
>> @transient val hbaseConnection=hBaseEntityManager.getConnection()
>>
>> appEventDStream.foreachRDD(rdd => {
>>   if (!rdd.isEmpty()) {
>> rdd.foreach { entity =>
>>   {
>>   
>> generatePut(hBaseEntityManager,hbaseConnection,entity.getClass.getSimpleName,entity.asInstanceOf[DataPoint])
>>
>> }
>>
>> }
>>
>>
>> Error is thrown exactly at connection object inside foreachRdd saying it is 
>> not serialize. could anyone provide solution for it
>>
>> Asmath
>>
>>


Re: Hbase Connection not serializable in Spark -> foreachrdd

2016-09-21 Thread ayan guha
Connection object is not serialisable. You need to implement a getorcreate
function which would run on each executors to create hbase connection
locally.
On 22 Sep 2016 08:34, "KhajaAsmath Mohammed" 
wrote:

> Hello Everyone,
>
> I am running spark application to push data from kafka. I am able to get
> hbase kerberos connection successfully outside of functon before calling
> foreachrdd on Dstream.
>
> Job fails inside foreachrdd stating that hbaseconnection object is not
> serialized. could you please let me now  how toresolve this.
>
> @transient val hbaseConnection=hBaseEntityManager.getConnection()
>
> appEventDStream.foreachRDD(rdd => {
>   if (!rdd.isEmpty()) {
> rdd.foreach { entity =>
>   {
>   
> generatePut(hBaseEntityManager,hbaseConnection,entity.getClass.getSimpleName,entity.asInstanceOf[DataPoint])
>
> }
>
> }
>
>
> Error is thrown exactly at connection object inside foreachRdd saying it is 
> not serialize. could anyone provide solution for it
>
> Asmath
>
>


Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
I do not know why this is happening.

Trying to load an Hbase table at command line

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=','
-Dimporttsv.columns="HBASE_ROW_KEY,c1,c2" t2
hdfs://rhes564:9000/tmp/crap.txt

Comes back with this error


2016-09-22 00:12:46,576 INFO  [main] mapreduce.JobSubmitter: Submitting
tokens for job: job_1474455325627_0052
2016-09-22 00:12:46,755 INFO  [main] impl.YarnClientImpl: Submitted
application application_1474455325627_0052 to ResourceManager at rhes564/
50.140.197.217:8032
2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: The url to track the
job: http://http://rhes564:8088/proxy/application_1474455325627_0052/
2016-09-22 00:12:46,783 INFO  [main] mapreduce.Job: Running job:
job_1474455325627_0052
2016-09-22 00:12:55,913 INFO  [main] mapreduce.Job: Job
job_1474455325627_0052 running in uber mode : false
2016-09-22 00:12:55,915 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-09-22 00:13:01,994 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-09-22 00:13:03,008 INFO  [main] mapreduce.Job: Job
job_1474455325627_0052 completed successfully
Exception in thread "main" java.lang.IllegalArgumentException: No enum
constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
at java.lang.Enum.valueOf(Enum.java:238)
at
org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
at
org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
at
org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
at
org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
at
org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
at
org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:680)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at
org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:684)




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 September 2016 at 21:47, Jörn Franke  wrote:

> I think there might be still something messed up with the classpath. It
> complains in the logs about deprecated jars and deprecated configuration
> files.
>
> On 21 Sep 2016, at 22:21, Mich Talebzadeh 
> wrote:
>
> Well I am left to use Spark for importing data from RDBMS table to Hadoop.
>
> You may argue why and it is because Spark does it in one process and no
> errors
>
> With sqoop I am getting this error message which leaves the RDBMS table
> data on HDFS file but stops there.
>
> 2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data
> Connector for Oracle and Hadoop is disabled.
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using
> default fetchSize of 1000
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:CodeGenTool@92] - Beginning
> code generation
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/
> StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/
> slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.
> jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/data6/hduser/hbase-
> 0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/
> impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 

RE: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Arif,Mubaraka
We installed it but the kernel dies.
Any clue why?

thanks for the link :)-

~muby


From: Jakob Odersky [ja...@odersky.com]
Sent: Wednesday, September 21, 2016 4:54 PM
To: Arif,Mubaraka
Cc: User; Toivola,Sami
Subject: Re: Has anyone installed the scala kernel for Jupyter notebook

One option would be to use Apache Toree. A quick setup guide can be
found here https://toree.incubator.apache.org/documentation/user/quick-start

On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka  wrote:
> Has anyone installed the scala kernel for Jupyter notebook.
>
>
>
> Any blogs or steps to follow in appreciated.
>
>
>
> thanks,
>
> Muby
>



Re: Apache Spark JavaRDD pipe() need help

2016-09-21 Thread Jakob Odersky
Can you provide more details? It's unclear what you're asking

On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
 wrote:
> Hi All,
>
> I am trying to use the JavaRDD.pipe() API.
>
> I have one object with me from the JavaRDD




Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Jörn Franke
All off-heap memory is still managed by the JVM process. If you limit the 
memory of this process then you limit the memory. I think the memory of the JVM 
process could be limited via the Xms/Xmx parameters of the JVM. This can be 
configured via Spark options for YARN (be aware that they are different in 
cluster and client mode), but I recommend using the Spark options for the 
off-heap maximum.

https://spark.apache.org/docs/latest/running-on-yarn.html


> On 21 Sep 2016, at 22:02, Michael Segel  wrote:
> 
> I’ve asked this question a couple of times from a friend who didn’t know 
> the answer… so I thought I would try here. 
> 
> 
> Suppose we launch a job on a cluster (YARN) and we have set up the containers 
> to be 3GB in size.
> 
> 
> What does that 3GB represent? 
> 
> I mean what happens if we end up using 2-3GB of off heap storage via 
> tungsten? 
> What will Spark do? 
> Will it try to honor the container’s limits and throw an exception or will 
> it allow my job to grab that amount of memory and exceed YARN’s 
> expectations since its off heap? 
> 
> Thx
> 
> -Mike
> 


Hbase Connection not serializable in Spark -> foreachrdd

2016-09-21 Thread KhajaAsmath Mohammed
Hello Everyone,

I am running a Spark application to push data from Kafka. I am able to get
an HBase Kerberos connection successfully outside of the function, before
calling foreachRDD on the DStream.

The job fails inside foreachRDD stating that the HBase connection object is not
serializable. Could you please let me know how to resolve this?

@transient val hbaseConnection=hBaseEntityManager.getConnection()

appEventDStream.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
rdd.foreach { entity =>
  {
  
generatePut(hBaseEntityManager,hbaseConnection,entity.getClass.getSimpleName,entity.asInstanceOf[DataPoint])

}

}


The error is thrown exactly at the connection object inside foreachRDD, saying
it is not serializable. Could anyone provide a solution for it?

Asmath


Re: Task Deserialization Error

2016-09-21 Thread Chawla,Sumit
Thanks guys. It was a classloader issue. Rather than linking to
SPARK_HOME/assembly/target/scala-2.11/jars/, I was linking the individual
jars. Linking to the folder instead solved the issue for me.

Regards
Sumit Chawla


On Wed, Sep 21, 2016 at 2:51 PM, Jakob Odersky  wrote:

> Your app is fine, I think the error has to do with the way inttelij
> launches applications. Is your app forked in a new jvm when you run it?
>
> On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D 
> wrote:
>
>> Hello Sumit -
>>
>> I could see that SparkConf() specification is not being mentioned in your
>> program. But rest looks good.
>>
>>
>>
>> Output:
>>
>>
>> By the way, I have used the README.md template
>> https://gist.github.com/jxson/1784669
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>> On Tue, Sep 20, 2016 at 2:15 AM, Chawla,Sumit 
>> wrote:
>>
>>> Hi All
>>>
>>> I am trying to test a simple Spark APP using scala.
>>>
>>>
>>> import org.apache.spark.SparkContext
>>>
>>> object SparkDemo {
>>>   def main(args: Array[String]) {
>>> val logFile = "README.md" // Should be some file on your system
>>>
>>> // to run in local mode
>>> val sc = new SparkContext("local", "Simple App", 
>>> ""PATH_OF_DIRECTORY_WHERE_COMPILED_SPARK_PROJECT_FROM_GIT")
>>>
>>> val logData = sc.textFile(logFile).cache()
>>> val numAs = logData.filter(line => line.contains("a")).count()
>>> val numBs = logData.filter(line => line.contains("b")).count()
>>>
>>>
>>> println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
>>>
>>>   }
>>> }
>>>
>>>
>>> When running this demo in IntelliJ, i am getting following error:
>>>
>>>
>>> java.lang.IllegalStateException: unread block data
>>> at 
>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
>>> at 
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>>> at 
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>>> at 
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>>> at 
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> I guess its associated with task not being deserializable.  Any help will 
>>> be appreciated.
>>>
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>


Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Chawla,Sumit
+1 Jakob.  Thanks for the link

Regards
Sumit Chawla


On Wed, Sep 21, 2016 at 2:54 PM, Jakob Odersky  wrote:

> One option would be to use Apache Toree. A quick setup guide can be
> found here https://toree.incubator.apache.org/documentation/user/
> quick-start
>
> On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka 
> wrote:
> > Has anyone installed the scala kernel for Jupyter notebook.
> >
> >
> >
> > Any blogs or steps to follow in appreciated.
> >
> >
> >
> > thanks,
> >
> > Muby
> >
>
>


Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Jakob Odersky
One option would be to use Apache Toree. A quick setup guide can be
found here https://toree.incubator.apache.org/documentation/user/quick-start

On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka  wrote:
> Has anyone installed the scala kernel for Jupyter notebook.
>
>
>
> Any blogs or steps to follow in appreciated.
>
>
>
> thanks,
>
> Muby
>



Re: Task Deserialization Error

2016-09-21 Thread Jakob Odersky
Your app is fine; I think the error has to do with the way IntelliJ
launches applications. Is your app forked in a new JVM when you run it?

On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D 
wrote:

> Hello Sumit -
>
> I could see that SparkConf() specification is not being mentioned in your
> program. But rest looks good.
>
>
>
> Output:
>
>
> By the way, I have used the README.md template https://
> gist.github.com/jxson/1784669
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Sep 20, 2016 at 2:15 AM, Chawla,Sumit 
> wrote:
>
>> Hi All
>>
>> I am trying to test a simple Spark APP using scala.
>>
>>
>> import org.apache.spark.SparkContext
>>
>> object SparkDemo {
>>   def main(args: Array[String]) {
>> val logFile = "README.md" // Should be some file on your system
>>
>> // to run in local mode
>> val sc = new SparkContext("local", "Simple App", 
>> ""PATH_OF_DIRECTORY_WHERE_COMPILED_SPARK_PROJECT_FROM_GIT")
>>
>> val logData = sc.textFile(logFile).cache()
>> val numAs = logData.filter(line => line.contains("a")).count()
>> val numBs = logData.filter(line => line.contains("b")).count()
>>
>>
>> println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
>>
>>   }
>> }
>>
>>
>> When running this demo in IntelliJ, i am getting following error:
>>
>>
>> java.lang.IllegalStateException: unread block data
>>  at 
>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>>  at 
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I guess its associated with task not being deserializable.  Any help will be 
>> appreciated.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>


Re: Task Deserialization Error

2016-09-21 Thread Gokula Krishnan D
Hello Sumit -

I can see that the SparkConf() specification is not mentioned in your
program, but the rest looks good.



Output:


By the way, I have used the README.md template
https://gist.github.com/jxson/1784669
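
For reference, a minimal sketch of the SparkConf-based setup mentioned above
(equivalent to the three-argument SparkContext constructor in the original
program; the Spark home path is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("Simple App")
  .setSparkHome("/path/to/compiled/spark") // mirrors the third constructor argument
val sc = new SparkContext(conf)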

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Sep 20, 2016 at 2:15 AM, Chawla,Sumit 
wrote:

> Hi All
>
> I am trying to test a simple Spark APP using scala.
>
>
> import org.apache.spark.SparkContext
>
> object SparkDemo {
>   def main(args: Array[String]) {
> val logFile = "README.md" // Should be some file on your system
>
> // to run in local mode
> val sc = new SparkContext("local", "Simple App", 
> ""PATH_OF_DIRECTORY_WHERE_COMPILED_SPARK_PROJECT_FROM_GIT")
>
> val logData = sc.textFile(logFile).cache()
> val numAs = logData.filter(line => line.contains("a")).count()
> val numBs = logData.filter(line => line.contains("b")).count()
>
>
> println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
>
>   }
> }
>
>
> When running this demo in IntelliJ, i am getting following error:
>
>
> java.lang.IllegalStateException: unread block data
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> I guess its associated with task not being deserializable.  Any help will be 
> appreciated.
>
>
>
> Regards
> Sumit Chawla
>
>


Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Arif,Mubaraka




Has anyone installed the Scala kernel for Jupyter notebook?
 
Any blogs or steps to follow would be appreciated.
 
thanks,
Muby







Re: Sqoop vs spark jdbc

2016-09-21 Thread Jörn Franke
I think there might still be something messed up with the classpath. It
complains in the logs about deprecated jars and deprecated configuration files.

> On 21 Sep 2016, at 22:21, Mich Talebzadeh  wrote:
> 
> Well I am left to use Spark for importing data from RDBMS table to Hadoop.
> 
> You may argue why and it is because Spark does it in one process and no errors
> 
> With sqoop I am getting this error message which leaves the RDBMS table data 
> on HDFS file but stops there.
> 
> 2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data Connector 
> for Oracle and Hadoop is disabled.
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using default 
> fetchSize of 1000
> 2016-09-21 21:00:15,095 [myid:] - INFO  [main:CodeGenTool@92] - Beginning 
> code generation
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> 2016-09-21 21:00:15,681 [myid:] - INFO  [main:OracleManager@417] - Time zone 
> has been set to GMT
> 2016-09-21 21:00:15,717 [myid:] - INFO  [main:SqlManager@757] - Executing SQL 
> statement: select * from sh.sales where(1 = 0)
> 2016-09-21 21:00:15,727 [myid:] - INFO  [main:SqlManager@757] - Executing SQL 
> statement: select * from sh.sales where(1 = 0)
> 2016-09-21 21:00:15,748 [myid:] - INFO  [main:CompilationManager@94] - 
> HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
> Note: 
> /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java 
> uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 2016-09-21 21:00:17,354 [myid:] - INFO  [main:CompilationManager@330] - 
> Writing jar file: 
> /tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
> 2016-09-21 21:00:17,366 [myid:] - INFO  [main:ImportJobBase@237] - Beginning 
> query import.
> 2016-09-21 21:00:17,511 [myid:] - WARN  [main:NativeCodeLoader@62] - Unable 
> to load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> 2016-09-21 21:00:17,516 [myid:] - INFO  [main:Configuration@840] - mapred.jar 
> is deprecated. Instead, use mapreduce.job.jar
> 2016-09-21 21:00:17,993 [myid:] - INFO  [main:Configuration@840] - 
> mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 2016-09-21 21:00:18,094 [myid:] - INFO  [main:RMProxy@56] - Connecting to 
> ResourceManager at rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,441 [myid:] - INFO  [main:DBInputFormat@192] - Using read 
> commited transaction isolation
> 2016-09-21 21:00:23,442 [myid:] - INFO  [main:DataDrivenDBInputFormat@147] - 
> BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from 
> sh.sales where(1 = 1) ) t1
> 2016-09-21 21:00:23,540 [myid:] - INFO  [main:JobSubmitter@394] - number of 
> splits:4
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - 
> mapred.job.name is deprecated. Instead, use mapreduce.job.name
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - 
> mapred.cache.files.timestamps is deprecated. Instead, use 
> mapreduce.job.cache.files.timestamps
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - 
> mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - 
> mapreduce.inputformat.class is deprecated. Instead, use 
> mapreduce.job.inputformat.class
> 2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] - 
> mapreduce.outputformat.class is deprecated. Instead, use 
> mapreduce.job.outputformat.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.output.value.class is deprecated. Instead, use 
> mapreduce.job.output.value.class
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.output.dir is deprecated. Instead, use 
> mapreduce.output.fileoutputformat.outputdir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
Well, I am left to use Spark for importing data from an RDBMS table to Hadoop.

You may ask why: it is because Spark does it in one process and without
errors.

With Sqoop I am getting this error message, which leaves the RDBMS table
data in an HDFS file but stops there.

2016-09-21 21:00:15,084 [myid:] - INFO  [main:OraOopLog@103] - Data
Connector for Oracle and Hadoop is disabled.
2016-09-21 21:00:15,095 [myid:] - INFO  [main:SqlManager@98] - Using
default fetchSize of 1000
2016-09-21 21:00:15,095 [myid:] - INFO  [main:CodeGenTool@92] - Beginning
code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-hive.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/phoenix-4.8.0-HBase-0.98-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/data6/hduser/hbase-0.98.21-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/hduser/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
2016-09-21 21:00:15,681 [myid:] - INFO  [main:OracleManager@417] - Time
zone has been set to GMT
2016-09-21 21:00:15,717 [myid:] - INFO  [main:SqlManager@757] - Executing
SQL statement: select * from sh.sales where(1 = 0)
2016-09-21 21:00:15,727 [myid:] - INFO  [main:SqlManager@757] - Executing
SQL statement: select * from sh.sales where(1 = 0)
2016-09-21 21:00:15,748 [myid:] - INFO  [main:CompilationManager@94] -
HADOOP_MAPRED_HOME is /home/hduser/hadoop-2.7.3/share/hadoop/mapreduce
Note:
/tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.java
uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

2016-09-21 21:00:17,354 [myid:] - INFO  [main:CompilationManager@330] -
Writing jar file:
/tmp/sqoop-hduser/compile/82dcf5975118b5e271b442e547201fdf/QueryResult.jar
2016-09-21 21:00:17,366 [myid:] - INFO  [main:ImportJobBase@237] - Beginning query
import.
2016-09-21 21:00:17,511 [myid:] - WARN  [main:NativeCodeLoader@62] - Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable
2016-09-21 21:00:17,516 [myid:] - INFO  [main:Configuration@840] -
mapred.jar is deprecated. Instead, use mapreduce.job.jar
2016-09-21 21:00:17,993 [myid:] - INFO  [main:Configuration@840] -
mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2016-09-21 21:00:18,094 [myid:] - INFO  [main:RMProxy@56] - Connecting to
ResourceManager at rhes564/50.140.197.217:8032
2016-09-21 21:00:23,441 [myid:] - INFO  [main:DBInputFormat@192] - Using
read commited transaction isolation
2016-09-21 21:00:23,442 [myid:] - INFO  [main:DataDrivenDBInputFormat@147]
- BoundingValsQuery: SELECT MIN(prod_id), MAX(prod_id) FROM (select * from
sh.sales where(1 = 1) ) t1
2016-09-21 21:00:23,540 [myid:] - INFO  [main:JobSubmitter@394] - number of
splits:4
2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
mapred.job.name is deprecated. Instead, use mapreduce.job.name
2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
mapred.cache.files.timestamps is deprecated. Instead, use
mapreduce.job.cache.files.timestamps
2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
mapreduce.inputformat.class is deprecated. Instead, use
mapreduce.job.inputformat.class
2016-09-21 21:00:23,547 [myid:] - INFO  [main:Configuration@840] -
mapreduce.outputformat.class is deprecated. Instead, use
mapreduce.job.outputformat.class
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.output.value.class is deprecated. Instead, use
mapreduce.job.output.value.class
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.output.dir is deprecated. Instead, use
mapreduce.output.fileoutputformat.outputdir
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.job.classpath.files is deprecated. Instead, use
mapreduce.job.classpath.files
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - user.name
is deprecated. Instead, use mapreduce.job.user.name
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -

Equivalent to --files for driver?

2016-09-21 Thread Everett Anderson
Hi,

I'm running Spark 1.6.2 on YARN and I often use the cluster deploy mode
with spark-submit. While the --files param is useful for getting files onto
the cluster in the working directories of the executors, the driver's
working directory doesn't get them.

Is there some equivalent to --files for the driver program when run in this
mode?


Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Michael Segel
I’ve asked this question of a friend a couple of times, and he didn’t know the
answer… so I thought I would try here.


Suppose we launch a job on a cluster (YARN) and we have set up the containers 
to be 3GB in size.


What does that 3GB represent? 

I mean what happens if we end up using 2-3GB of off heap storage via tungsten? 
What will Spark do? 
Will it try to honor the container’s limits and throw an exception or will it 
allow my job to grab that amount of memory and exceed YARN’s expectations since 
it's off heap?

Thx

-Mike



Re: How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
>
> There's a bit of confusion setting in here; the FileSystem implementations
> spark uses are subclasses of org.apache.hadoop.fs.FileSystem; the nio
> class with the same name is different.
> grab the google cloud storage connector and put it on your classpath


I was using the gs:// filesystem as an example. I should have mentioned
that I'm aware of the workaround for that one.

I'm not asking how to read from Google Cloud Storage from Spark.

What I'm interested in is Java's built-in extension mechanism for its
"Path" objects, aka custom filesystem providers
.
What if I want to use my own different custom filesystem provider?
Something that allow me to take a funky-looking string like "foo://bar/baz"
and open it like a regular file, even though this results in a TCP
connection to the bar server and ask it to give me the "baz" file out of
its holographic quantum entangled storage (or other unspecified future
technology that can provide file-like objects).
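
For the record, this is what that mechanism looks like through the java.nio API
once a provider for a made-up "foo" scheme is registered via META-INF/services;
a minimal Scala sketch:

import java.net.URI
import java.nio.file.{Files, Paths}
import java.nio.file.spi.FileSystemProvider
import scala.collection.JavaConverters._

// List the providers the JVM actually picked up (the check that comes up empty on Dataproc).
FileSystemProvider.installedProviders().asScala.foreach(p => println(p.getScheme))

// With a provider registered for the hypothetical "foo" scheme, this resolves
// through it and reads the remote object like a regular file.
val path  = Paths.get(URI.create("foo://bar/baz"))
val bytes = Files.readAllBytes(path)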


Re: How to use a custom filesystem provider?

2016-09-21 Thread Steve Loughran

On 21 Sep 2016, at 20:10, Jean-Philippe Martin 
> wrote:


The full source for my example is available on 
github.

I'm using maven to depend on 
gcloud-java-nio,
 which provides a Java FileSystem for Google Cloud Storage, via "gs://" URLs. 
My Spark project uses maven-shade-plugin to create one big jar with all the 
source in it.

The big jar correctly includes a 
META-INF/services/java.nio.file.spi.FileSystemProviderfile, containing the 
correct name for the class 
(com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider). I 
checked and that class is also correctly included in the jar file.

The program uses FileSystemProvider.installedProviders() to list the filesystem 
providers it finds. "gs" should be listed (and it is if I run the same function 
in a non-Spark context), but when running with Spark on Dataproc, that 
provider's gone.

I'd like to know: How can I use a custom filesystem in my Spark program?



There's a bit of confusion setting in here; the FileSystem implementations 
spark uses are subclasses of org.apache.hadoop.fs.FileSystem; the nio class 
with the same name is different.

grab the google cloud storage connector and put it on your classpath

https://cloud.google.com/hadoop/google-cloud-storage-connector
https://github.com/GoogleCloudPlatform/bigdata-interop



How to use a custom filesystem provider?

2016-09-21 Thread Jean-Philippe Martin
The full source for my example is available on github.

I'm using maven to depend on gcloud-java-nio,
which provides a Java FileSystem for Google Cloud Storage, via "gs://"
URLs. My Spark project uses maven-shade-plugin to create one big jar with
all the source in it.

The big jar correctly includes a
META-INF/services/java.nio.file.spi.FileSystemProvider file, containing the
correct name for the class
(com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider). I
checked and that class is also correctly included in the jar file.

The program uses FileSystemProvider.installedProviders() to list the
filesystem providers it finds. "gs" should be listed (and it is if I run
the same function in a non-Spark context), but when running with Spark on
Dataproc, that provider's gone.

I'd like to know: *How can I use a custom filesystem in my Spark program*?
(asked earlier on Stack Overflow, but I didn't get any traction there)


Spark writing to elasticsearch asynchronously

2016-09-21 Thread Sunita Arvind
Hello Experts,

Is there a way to get spark to write to elasticsearch asynchronously?
Below are the details
http://stackoverflow.com/questions/39624538/spark-savetoes-asynchronously

regards
Sunita


Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
This is happening with sqoop and also putting data into Hbase table with
command line


Sqoop 1.4.6
Hadoop 2.7.3
Hive 2.0.1

I am still getting this error when using sqoop to get a simple table
data from Oracle.


 sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username
sh -P \
 --query "select * from sh.sales where \
  \$CONDITIONS" \
  --split-by prod_id \
   --target-dir "sales"

Note that it does everything and puts data on directory on hdfs and then
sends that error


2016-09-21 19:10:31,447 [myid:] - INFO  [main:Job@1317] - Running job:
job_1474455325627_0041
2016-09-21 19:10:39,696 [myid:] - INFO  [main:Job@1338] - Job
job_1474455325627_0041 running in uber mode : false
2016-09-21 19:10:39,707 [myid:] - INFO  [main:Job@1345] -  map 0% reduce 0%
2016-09-21 19:10:53,844 [myid:] - INFO  [main:Job@1345] -  map 25% reduce 0%
2016-09-21 19:11:01,903 [myid:] - INFO  [main:Job@1345] -  map 50% reduce 0%
2016-09-21 19:11:05,924 [myid:] - INFO  [main:Job@1345] -  map 75% reduce 0%
2016-09-21 19:11:13,966 [myid:] - INFO  [main:Job@1345] -  map 100% reduce
0%
2016-09-21 19:11:14,977 [myid:] - INFO  [main:Job@1356] - Job
job_1474455325627_0041 completed successfully
2016-09-21 19:11:15,138 [myid:] - ERROR [main:ImportTool@607] - Imported
Failed: No enum constant
org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

Any ideas?


Thanks





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 11:51, Mich Talebzadeh 
wrote:

> Hi,
>
> I am using Hadoop 2.6
>
> hduser@rhes564: /home/hduser/dba/bin>
> *hadoop version*Hadoop 2.6.0
>
> Thanks
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 11:48, Bhaskar Dutta  wrote:
>
>> This constant was added in Hadoop 2.3. Maybe you are using an older
>> version?
>>
>> ~bhaskar
>>
>> On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Actually I started using Spark to import data from RDBMS (in this case
>>> Oracle) after upgrading to Hive 2, running an import like below
>>>
>>> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>> --username scratchpad -P \
>>> --query "select * from scratchpad.dummy2 where \
>>>  \$CONDITIONS" \
>>>   --split-by ID \
>>>--hive-import  --hive-table "test.dumy2" --target-dir
>>> "/tmp/dummy2" *--direct*
>>>
>>> This gets the data into HDFS and then throws this error
>>>
>>> ERROR [main] tool.ImportTool: Imported Failed: No enum constant
>>> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>>>
>>> I can easily get the data into Hive from the file on HDFS or dig into
>>> the problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark
>>> trouble free like below
>>>
>>>  val df = HiveContext.read.format("jdbc").options(
>>>  Map("url" -> dbURL,
>>>  "dbtable" -> "scratchpad.dummy)",
>>>  "partitionColumn" -> partitionColumnName,
>>>  "lowerBound" -> lowerBoundValue,
>>>  "upperBound" -> upperBoundValue,
>>>  "numPartitions" -> numPartitionsValue,
>>>  "user" -> dbUserName,
>>>  "password" -> dbPassword)).load
>>>
>>> It does work, opens parallel connections to Oracle DB and creates DF
>>> with the specified number of partitions.
>>>
>>> One thing I am not sure or tried if Spark supports direct mode yet.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this 

Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread Kevin Mellott
The "unresolved dependency" error is stating that the datastax dependency
could not be located in the Maven repository. I believe that this should
work if you change that portion of your command to the following.

--packages com.datastax.spark:spark-cassandra-connector_2.10:2.0.0-M3

You can verify the available versions by searching Maven at
http://search.maven.org.

Thanks,
Kevin

On Wed, Sep 21, 2016 at 3:38 AM, muhammet pakyürek 
wrote:

> while i run the spark-shell as below
>
> spark-shell --jars '/home/ktuser/spark-cassandra-
> connector/target/scala-2.11/root_2.11-2.0.0-M3-20-g75719df.jar'
> --packages datastax:spark-cassandra-connector:2.0.0-s_2.11-M3-20-g75719df
> --conf spark.cassandra.connection.host=localhost
>
> i get the error
> unresolved dependency: datastax#spark-cassandra-
> connector;2.0.0-s_2.11-M3-20-g75719df.
>
>
> the second question even if i added
>
> libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
> "2.0.0-M3"
>
> to spark-cassandra-connector/sbt/sbt file jar files are
> root_2.11-2.0.0-M3-20-g75719df
>
>
> the third question: after building the connector for Scala 2.11, how do I
> integrate it with pyspark?
>
>


Re: problems with checkpoint and spark sql

2016-09-21 Thread Dhimant
Hi David,

You got any solution for this ? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/problems-with-checkpoint-and-spark-sql-tp26080p27773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Bizarre behavior using Datasets/ML on Spark 2.0

2016-09-21 Thread Miles Crawford
Hello folks. I recently migrated my application to Spark 2.0, and
everything worked well, except for one function that uses "toDS" and the ML
libraries.

This stage used to complete in 15 minutes or so on 1.6.2, and now takes
almost two hours.

The UI shows very strange behavior - completed stages still being worked
on, concurrent work on tons of stages, including ones from downstream jobs:
https://dl.dropboxusercontent.com/u/231152/spark.png

Anyone know what might be going on? The only source change I made was
changing "toDF" to "toDS()" before handing my RDDs to the ML libraries.

Thanks,
-miles


Apache Spark JavaRDD pipe() need help

2016-09-21 Thread shashikant.kulka...@gmail.com
Hi All,

I am trying to use the JavaRDD.pipe() API.

I have one object with me from the JavaRDD and not the complete RDD. I mean
I am operating on one object inside the RDD. In my object I have some
attribute values, using which I create one string like "param1 param2 param3
param4". I have one C binary file with me which does some complex mathematical
calculations. Now I want to invoke the C binary using the JavaRDD.pipe() API,
but I do not have the RDD with me. I just have a string which I want to pass to
the C binary.

How do I do this? My code is not in driver program. It is in some Java class
which is part of JavaRDD.

In driver.java class
{
 //create config
 //create context
  // generate RDD using data from Cassandra
 // Do a map operation on RDD and for each object in RDD do some operation
in helper Java class
 //Get the updated objects and save them back in Cassandra
}

In Helper.java class
{
//Get the data from object
//do some processing
//get some attributes string values
//Create one string out of those attributes

//Now invoke the C library and pass this one string as input parameter  <--- How to do this

//Read the output and update the object
}
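
For context on the API being discussed: pipe() runs the external command once
per partition of an RDD, feeding each element to its stdin as a line and
returning its stdout lines as a new RDD (JavaRDD.pipe behaves the same way). A
minimal Scala sketch, with a made-up binary path:

// Each element becomes one stdin line for the C binary; each stdout line
// becomes one element of `results`.
val inputs  = sc.parallelize(Seq("param1 param2 param3 param4"))
val results = inputs.pipe("/path/to/c_binary") // hypothetical path
results.collect().foreach(println)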
Let me know if you need more inputs from me.
Thanks in advance. 
Shashi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-JavaRDD-pipe-need-help-tp27772.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: The coming data on Spark Streaming

2016-09-21 Thread pcandido
Anybody?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-coming-data-on-Spark-Streaming-tp27720p27771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: OutOfMemory while calculating window functions

2016-09-21 Thread Jeremy Davis

Here is a unit test that will OOM a 10G heap
--

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.junit.Test
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

import scala.collection.mutable.ArrayBuffer


/**
 * A Small Unit Test to demonstrate Spark Window Functions OOM
 */
class SparkTest {


  @Test
  def testWindows() {
val sparkSession = 
SparkSession.builder().master("local[7]").appName("tests").getOrCreate()
import sparkSession.implicits._

println("Init Dataset")

val partitions = (0 until 4)
val entries = (0 until 6500)

//val windows = (5 to 15 by 5) //Works
val windows = (5 to 65 by 5)   //OOM 10G

val testData = new ArrayBuffer[(String,Timestamp,Double)]


for( p <- partitions) {
  for( e <- entries ) yield {
testData += (("Key"+p,new Timestamp(6*e),e*2.0))
  }
}

val ds = testData.toDF("key","datetime","value")
ds.show()


var resultFrame = ds
resultFrame.schema.fields.foreach(println)


val baseWin = Window.partitionBy("key").orderBy("datetime")
for( win <- windows ) {
  resultFrame = 
resultFrame.withColumn("avg"+(-win),avg("value").over(baseWin.rowsBetween(-win,0)))

.withColumn("stddev"+(-win),stddev("value").over(baseWin.rowsBetween(-win,0)))

.withColumn("min"+(-win),min("value").over(baseWin.rowsBetween(-win,0)))

.withColumn("max"+(-win),max("value").over(baseWin.rowsBetween(-win,0)))
}
resultFrame.show()

  }

}



> On Sep 20, 2016, at 10:26 PM, Jeremy Davis  wrote:
> 
>  Hello all,
> I ran in to a weird OOM issue when using the sliding windows. (Spark 2.0, 
> Scala 2.11.7, Java 1.8.0_11, OSX 10.10.5)
> I’m using the Dataframe API and calling various:
> Window.partitionBy(...).orderBy(...).rowsBetween(…) etc.
> with just 4 types of aggregations:(avg,min,max,stddev),  … avg.over(window)  
> etc..
> All parameterized over several window sizes 
> (-3,-4,-5,-8,-9,-13,-18,-20,-21,-34,-55)
> -3 meaning (-3,0)
> 
> I’m using a Timestamp as the order column, and aggregating over a Double.
> In a given partition, there are only around 6500 samples of data that are 
> being aggregated. For some reason I hit some sort of non-linear memory use 
> around 5000 samples per partition (Perhaps doubling an array somewhere?).
> I shrank the dataset down to just 4 partitions, but I still OOM a 10G heap 
> while running in Local Mode. It all seems odd when my input is on the order 
> of a just a few MB for this test case.
> 
> I’m wondering if this sounds like expected behavior? Seems like my use case 
> is reasonable.
> 
> Also, it will OOM when I run multithreaded with 2 threads and up, but seems 
> to work single threaded  (“local[1]”).
> 
> I will try to put together a simple repro case tomorrow.
> 
> 
> Attached is a Yourkit Screen Shot. I suspect the long[] arrays double once 
> more before OOM.
> 
> 





spark stream based deduplication

2016-09-21 Thread backtrack5
I want to do a hash-based comparison to find duplicate records. Each record I
receive from the stream will have hashid and recordid fields in it.

1. I want to have all the historic records (hashid, recordid --> key,value)
in an in-memory RDD
2. When a new record is received in the Spark DStream RDD, I want to compare it
against the historic records (hashid, recordid)
3. Also add the new records into the existing historic records (hashid, recordid
--> key,value) in-memory RDD

My thoughts:


1. Join the time-based RDDs and cache them in memory (historic lookup)
2. When a new RDD comes, for each record compare against the historic
lookup
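
A minimal sketch of keeping that historic lookup as streaming state, assuming
recordStream is a DStream[(String, String)] of (hashid, recordid) pairs and
that checkpointing is enabled (mapWithState requires it):

import org.apache.spark.streaming.{State, StateSpec}

// State per hashid: has this hash been seen in any earlier batch?
val dedupSpec = StateSpec.function(
  (hashId: String, recordId: Option[String], state: State[Boolean]) => {
    val seenBefore = state.exists()
    if (!seenBefore) state.update(true)          // remember the hash for future batches
    (hashId, recordId.getOrElse(""), seenBefore) // duplicates are flagged downstream
  })

val flagged = recordStream.mapWithState(dedupSpec)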

What I have done:


1. I have created the stream and am able to consume the records.
2. But I am not sure how to store them in memory

I have the following questions:


1. How can I achieve this, or work around it?
2. Can I do this using MLlib? Or does Spark Streaming fit my use case?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-stream-based-deduplication-tp27770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Get profile from sbt

2016-09-21 Thread Bedrytski Aliaksandr
Hi Saurabh,

you may use BuildInfo[1] sbt plugin to access values defined in
build.sbt
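
A minimal sketch of the usual wiring (plugin version and package name are
illustrative):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.6.1")

// build.sbt
lazy val root = (project in file("."))
  .enablePlugins(BuildInfoPlugin)
  .settings(
    buildInfoKeys := Seq[BuildInfoKey](name, version, scalaVersion, sbtVersion),
    buildInfoPackage := "com.example.build" // generates com.example.build.BuildInfo
  )

// Scala code can then read, e.g., com.example.build.BuildInfo.version at runtime.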

Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Mon, Sep 19, 2016, at 18:28, Saurabh Malviya (samalviy) wrote:
> Hi,
>
> Is there any way equivalent to profiles in maven in sbt. I want spark
> build to pick up endpoints based on environment jar is built for
>
> In build.sbt we are ingesting variable dev,stage etc and pick up all
> dependencies. Similar way I need a way to pick up config for external
> dependencies like endpoints etc.
>
> Or another approach is there any way I can access variable defined in
> built.sbt in scala code.
>
> -Saurabh


Links:

  1. https://github.com/sbt/sbt-buildinfo


Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL

I think we should try the crystal ball to answer this question.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 September 2016 at 13:14, Jörn Franke  wrote:

> Do you mind sharing what your software does? What is the input data size?
> What is the spark version and apis used? How many nodes? What is the input
> data format? Is compression used?
>
> On 21 Sep 2016, at 13:37, Trinadh Kaja  wrote:
>
> Hi all,
>
> how to increase spark performance ,i am using pyspark.
>
> cluster info :
>
> Total memory :600gb
> Cores:96
>
> command :
> spark-submit --master  yarn-client --executor-memory 10G --num-executors
> 50 --executor-cores 2 --driver-memory 10g --queue thequeue
>
>
> please help on this
>
> --
> Thanks
> K.Trinadh
> Ph-7348826118
>
>


Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us.  I
couldn't figure out how to use `map` either.
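
One map-based alternative that should work (an untested sketch, assuming the
Spark 2.x shell where `spark` and its implicits are available): return a tuple
rather than a bare Vector. The "Unable to find encoder" error comes from the
fact that a bare mllib Vector has no implicit Encoder in scope, while tuples
and case classes containing one do, which is also why the Seq(...).toDF
example succeeds. A udf sidesteps the issue entirely because it operates on
Columns rather than on the whole row.

import org.apache.spark.mllib.linalg.Vectors
import spark.implicits._   // encoders for primitives, tuples and case classes

val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF

// Map each Row to an (Int, Vector) tuple; the tuple encoder handles the Vector UDT
val dataVec = dataStr.map(row => (row.getInt(0), Vectors.parse(row.getString(1))))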

On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai)  wrote:

> Thanks, Peter.
> It works!
>
> Why udf is needed?
>
>
>
>
> On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <
> pete.figlio...@gmail.com> wrote:
>
>> Hi Yan, I agree, it IS really confusing.  Here is the technique for
>> transforming a column.  It is very general because you can make "myConvert"
>> do whatever you want.
>>
>> import org.apache.spark.mllib.linalg.Vectors
>> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>
>> df.show()
>> // The columns were named "_1" and "_2"
>> // Very confusing, because it looks like a Scala wildcard when we refer
>> to it in code
>>
>> val myConvert = (x: String) => { Vectors.parse(x) }
>> val myConvertUDF = udf(myConvert)
>>
>> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>>
>> newDf.show()
>>
>> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai)  wrote:
>>
>>> Hi, all.
>>> I find that it's really confuse.
>>>
>>> I can use Vectors.parse to create a DataFrame contains Vector type.
>>>
>>> scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
>>> Vectors.parse("[2,4,6]"))).toDF
>>> dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>>
>>>
>>> But using map to convert String to Vector throws an error:
>>>
>>> scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>> dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>>
>>> scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>>> :30: error: Unable to find encoder for type stored in a
>>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>>> classes) are supported by importing spark.implicits._  Support for
>>> serializing other types will be added in future releases.
>>>   dataStr.map(row => Vectors.parse(row.getString(1)))
>>>
>>>
>>> Dose anyone can help me,
>>> thanks very much!
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <
>>> pete.figlio...@gmail.com> wrote:
>>>
 Hi Yan, I think you'll have to map the features column to a new
 numerical features column.

 Here's one way to do the individual transform:

 scala> val x = "[1, 2, 3, 4, 5]"
 x: String = [1, 2, 3, 4, 5]

 scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "")
 split(" ") map(_.toInt)
 y: Array[Int] = Array(1, 2, 3, 4, 5)

 If you don't know about the Scala command line, just type "scala" in a
 terminal window.  It's a good place to try things out.

 You can make a function out of this transformation and apply it to your
 features column to make a new column.  Then add this with
 Dataset.withColumn.

 See here
 
 on how to apply a function to a Column to make a new column.

 On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) 
 wrote:

> Hi,
> I have a csv file like:
> uid  mid  features   label
> 1235231[0, 1, 3, ...]True
>
> Both  "features" and "label" columns are used for GBTClassifier.
>
> However, when I read the file:
> Dataset samples = sparkSession.read().csv(file);
> The type of samples.select("features") is String.
>
> My question is:
> How to map samples.select("features") to Vector or any appropriate
> type,
> so I can use it to train like:
> GBTClassifier gbdt = new GBTClassifier()
> .setLabelCol("label")
> .setFeaturesCol("features")
> .setMaxIter(2)
> .setMaxDepth(7);
>
> Thanks.
>


>>>
>>
>


Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What 
is the spark version and apis used? How many nodes? What is the input data 
format? Is compression used?

> On 21 Sep 2016, at 13:37, Trinadh Kaja  wrote:
> 
> Hi all,
> 
> how to increase spark performance ,i am using pyspark.
> 
> cluster info :
> 
> Total memory :600gb
> Cores:96
> 
> command :
> spark-submit --master  yarn-client --executor-memory 10G --num-executors 50 
> --executor-cores 2 --driver-memory 10g --queue thequeue
>  
> 
> please help on this 
> 
> -- 
> Thanks
> K.Trinadh
> Ph-7348826118


SPARK PERFORMANCE TUNING

2016-09-21 Thread Trinadh Kaja
Hi all,

How can I increase Spark performance? I am using PySpark.

Cluster info:

Total memory: 600 GB
Cores: 96

command :
spark-submit --master  yarn-client --executor-memory 10G --num-executors 50
--executor-cores 2 --driver-memory 10g --queue thequeue


please help on this

-- 
Thanks
K.Trinadh
Ph-7348826118


increase spark performance

2016-09-21 Thread Trinadh Kaja
Hi all,

How can I increase Spark performance?

Cluster info:

Total memory: 600 GB
Cores

-- 
Thanks
K.Trinadh
Ph-7348826118


Re: Spark tasks blockes randomly on standalone cluster

2016-09-21 Thread bogdanbaraila
Does anyone have any ideas on what may be happening?

Regards,
Bogdan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-tasks-blockes-randomly-on-standalone-cluster-tp27693p27769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Similar Items

2016-09-21 Thread Nick Pentreath
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors

On Wed, 21 Sep 2016 at 13:09 Nick Pentreath 
wrote:

> I should also point out another library I had not come across before :
> https://github.com/sethah/spark-neighbors
>
>
> On Tue, 20 Sep 2016 at 21:03 Kevin Mellott 
> wrote:
>
>> Using the Soundcloud implementation of LSH, I was able to process a 22K
>> product dataset in a mere 65 seconds! Thanks so much for the help!
>>
>> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott > > wrote:
>>
>>> Thanks Nick - those examples will help a ton!!
>>>
>>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 A few options include:

 https://github.com/marufaytekin/lsh-spark - I've used this a bit and
 it seems quite scalable too from what I've looked at.
 https://github.com/soundcloud/cosine-lsh-join-spark - not used this
 but looks like it should do exactly what you need.
 https://github.com/mrsqueeze/*spark*-hash
 


 On Tue, 20 Sep 2016 at 18:06 Kevin Mellott 
 wrote:

> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
> products at a time (as an isolated set of products). Within this set of
> products (which represents all products for a particular supplier), I am
> also analyzing each category separately. The largest categories typically
> have around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come
> out with roughly 12K unique tokens. In other words, our vectors are 12K
> columns wide (although they are being represented using SparseVectors). We
> have a step that is attempting to locate all documents that share the same
> tokens, and for those items we will calculate the cosine similarity.
> However, the part that attempts to identify documents with shared tokens 
> is
> the bottleneck.
>
> For this portion, we map our data down to the individual tokens
> contained by each document. For example:
>
> DocumentId   |   Description
>
> 
> 1   Easton Hockey Stick
> 2   Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this is does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
> nick.pentre...@gmail.com> wrote:
>
>> How many products do you have? How large are your vectors?
>>
>> It could be that SVD / LSA could be helpful. But if you have many
>> products then trying to compute all-pair similarity with brute force is 
>> not
>> going to be scalable. In this case you may want to investigate hashing
>> (LSH) techniques.
>>
>>
>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <
>> kevin.r.mell...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to write a Spark application that will detect similar
>>> items (in this case products) based on their descriptions. I've got an 
>>> ML
>>> pipeline that transforms the product data to TF-IDF representation, 
>>> using
>>> the following components.
>>>
>>>- *RegexTokenizer* - strips out non-word characters, results in
>>>a list of tokens
>>>- *StopWordsRemover* - removes common "stop words", such as
>>>"the", "and", etc.
>>>- *HashingTF* - assigns a numeric "hash" to each token and
>>>calculates the term frequency
>>>- *IDF* - computes the inverse document frequency
>>>
>>> After this pipeline evaluates, I'm left with a SparseVector that
>>> represents the inverse document frequency of tokens for each product. 
>>> As a
>>> next step, I'd like to be able to compare each vector to one another, to
>>> detect similarities.
>>>
>>> Does anybody know of a straightforward way to do this in Spark? I
>>> tried creating a UDF (that used the Breeze linear algebra methods
>>> internally); however, that did not scale well.
>>>
>>> Thanks,
>>> Kevin
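
For reference, a bare-bones RDD sketch of the token-based candidate
generation Kevin describes above (document ids and tokens are the assumed
example data; `sc` is the SparkContext):

import org.apache.spark.rdd.RDD

// (documentId, token) pairs, as in the example above
val tokens: RDD[(Long, String)] = sc.parallelize(Seq(
  (1L, "Easton"), (1L, "Hockey"), (1L, "Stick"),
  (2L, "Bauer"), (2L, "Hockey"), (2L, "Gloves")))

val inverted = tokens.map { case (docId, token) => (token, docId) }   // (token, docId)

val candidatePairs = inverted.join(inverted)              // all doc pairs sharing a token
  .filter { case (_, (a, b)) => a < b }                   // keep each unordered pair once
  .map { case (token, pair) => (pair, token) }
  .groupByKey()                                           // ((1,2), Iterable("Hockey"))

Note that very common tokens make this self-join blow up quadratically, which
is exactly the bottleneck described above and why the LSH-based libraries
scale better.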

Re: Similar Items

2016-09-21 Thread Nick Pentreath
I should also point out another library I had not come across before :
https://github.com/sethah/spark-neighbors

On Tue, 20 Sep 2016 at 21:03 Kevin Mellott 
wrote:

> Using the Soundcloud implementation of LSH, I was able to process a 22K
> product dataset in a mere 65 seconds! Thanks so much for the help!
>
> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott 
> wrote:
>
>> Thanks Nick - those examples will help a ton!!
>>
>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> A few options include:
>>>
>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
>>> seems quite scalable too from what I've looked at.
>>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
>>> looks like it should do exactly what you need.
>>> https://github.com/mrsqueeze/*spark*-hash
>>> 
>>>
>>>
>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott 
>>> wrote:
>>>
 Thanks for the reply, Nick! I'm typically analyzing around 30-50K
 products at a time (as an isolated set of products). Within this set of
 products (which represents all products for a particular supplier), I am
 also analyzing each category separately. The largest categories typically
 have around 10K products.

 That being said, when calculating IDFs for the 10K product set we come
 out with roughly 12K unique tokens. In other words, our vectors are 12K
 columns wide (although they are being represented using SparseVectors). We
 have a step that is attempting to locate all documents that share the same
 tokens, and for those items we will calculate the cosine similarity.
 However, the part that attempts to identify documents with shared tokens is
 the bottleneck.

 For this portion, we map our data down to the individual tokens
 contained by each document. For example:

 DocumentId   |   Description

 
 1   Easton Hockey Stick
 2   Bauer Hockey Gloves

 In this case, we'd map to the following:

 (1, 'Easton')
 (1, 'Hockey')
 (1, 'Stick')
 (2, 'Bauer')
 (2, 'Hockey')
 (2, 'Gloves')

 Our goal is to aggregate this data as follows; however, our code that
 currently does this is does not perform well. In the realistic 12K product
 scenario, this resulted in 430K document/token tuples.

 ((1, 2), ['Hockey'])

 This then tells us that documents 1 and 2 need to be compared to one
 another (via cosine similarity) because they both contain the token
 'hockey'. I will investigate the methods that you recommended to see if
 they may resolve our problem.

 Thanks,
 Kevin

 On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> How many products do you have? How large are your vectors?
>
> It could be that SVD / LSA could be helpful. But if you have many
> products then trying to compute all-pair similarity with brute force is 
> not
> going to be scalable. In this case you may want to investigate hashing
> (LSH) techniques.
>
>
> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott 
> wrote:
>
>> Hi all,
>>
>> I'm trying to write a Spark application that will detect similar
>> items (in this case products) based on their descriptions. I've got an ML
>> pipeline that transforms the product data to TF-IDF representation, using
>> the following components.
>>
>>- *RegexTokenizer* - strips out non-word characters, results in a
>>list of tokens
>>- *StopWordsRemover* - removes common "stop words", such as
>>"the", "and", etc.
>>- *HashingTF* - assigns a numeric "hash" to each token and
>>calculates the term frequency
>>- *IDF* - computes the inverse document frequency
>>
>> After this pipeline evaluates, I'm left with a SparseVector that
>> represents the inverse document frequency of tokens for each product. As 
>> a
>> next step, I'd like to be able to compare each vector to one another, to
>> detect similarities.
>>
>> Does anybody know of a straightforward way to do this in Spark? I
>> tried creating a UDF (that used the Breeze linear algebra methods
>> internally); however, that did not scale well.
>>
>> Thanks,
>> Kevin
>>
>

>>
>


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
Well, I have already tried that.
You are talking about a command similar to this, right? yarn logs
-applicationId application_Number
That gives me the processing logs, which contain information about the
tasks, RDD blocks etc.

What I really need is the output that gets generated as part of the
Spark job itself. That is, the job writes some output to a file named in the
job, and this file currently ends up in the containers' appcache. Is there a
way I can get it once the job is over?
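
One common workaround (a sketch, assuming `output` is the RDD of per-file
results built in the quoted code below): write the results to HDFS through
the RDD API instead of to executor-local files, so they survive after the
containers' appcache is cleaned up.

// Scala sketch; the Java equivalent is output.map(...).saveAsTextFile(...)
output
  .map(_.mkString(","))                                      // serialize each result array
  .saveAsTextFile("hdfs://ip:8020/user/cdhuser/outputFolder/")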



On Wed, Sep 21, 2016 at 4:00 PM, ayan guha  wrote:

> On yarn, logs are aggregated from each containers to hdfs. You can use
> yarn CLI or ui to view. For spark, you would have a history server which
> consolidate s the logs
> On 21 Sep 2016 19:03, "Nisha Menon"  wrote:
>
>> I looked at the driver logs, that reminded me that I needed to look at
>> the executor logs. There the issue was that the spark executors were not
>> getting a configuration file. I broadcasted the file and now the processing
>> happens. Thanks for the suggestion.
>> Currently my issue is that the log file generated independently by the
>> executors goes to the respective containers' appcache, and then it gets
>> lost. Is there a recommended way to get the output files from the
>> individual executors?
>>
>> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal 
>> wrote:
>>
>>> Are you looking at the worker logs or the driver?
>>>
>>>
>>> On Thursday, September 8, 2016, Nisha Menon 
>>> wrote:
>>>
 I have an RDD created as follows:

 JavaPairRDD<String, String> inputDataFiles =
     sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

 On this RDD I perform a map to process individual files and invoke a
 foreach to trigger the same map.

 JavaRDD<Object[]> output = inputDataFiles.map(
     new Function<Tuple2<String, String>, Object[]>() {

         private static final long serialVersionUID = 1L;

         @Override
         public Object[] call(Tuple2<String, String> v1) throws Exception {
             System.out.println("in map!");
             // do something with v1.
             return new Object[] {};   // placeholder: return the real result here
         }
 });

 output.foreach(new VoidFunction<Object[]>() {

     private static final long serialVersionUID = 1L;

     @Override
     public void call(Object[] t) throws Exception {
         // do nothing!
         System.out.println("in foreach!");
     }
 });

 This code works perfectly fine for standalone setup on my local laptop
 while accessing both local files as well as remote HDFS files.

 In cluster the same code produces no results. My intuition is that the
 data has not reached the individual executors and hence both the `map` and
 `foreach` does not work. It might be a guess. But I am not able to figure
 out why this would not work in cluster. I dont even see the print
 statements in `map` and `foreach` getting printed in cluster mode of
 execution.

 I notice a particular line in standalone output that I do NOT see in
 cluster execution.

 *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
 Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*

 I had a similar code with textFile() that worked earlier for individual
 files on cluster. The issue is with wholeTextFiles() only.

 Please advise what is the best way to get this working or other
 alternate ways.

 My setup is cloudera 5.7 distribution with Spark Service. I used the
 master as `yarn-client`.

 The action can be anything. Its just a dummy step to invoke the map. I
 also tried *System.out.println("Count is:"+output.count());*, for
 which I got the correct answer of `10`, since there were 10 files in the
 folder, but still the map refuses to work.

 Thanks.


>>>
>>> --
>>> Thanks,
>>> Sonal
>>> Nube Technologies 
>>>
>>> 
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nisha Menon
>> BTech (CS) Sahrdaya CET,
>> MTech (CS) IIIT Banglore.
>>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Banglore.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread ayan guha
On YARN, logs are aggregated from each container to HDFS. You can use the YARN
CLI or UI to view them. For Spark, you would have a history server which
consolidates the logs.
On 21 Sep 2016 19:03, "Nisha Menon"  wrote:

> I looked at the driver logs, that reminded me that I needed to look at the
> executor logs. There the issue was that the spark executors were not
> getting a configuration file. I broadcasted the file and now the processing
> happens. Thanks for the suggestion.
> Currently my issue is that the log file generated independently by the
> executors goes to the respective containers' appcache, and then it gets
> lost. Is there a recommended way to get the output files from the
> individual executors?
>
> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal 
> wrote:
>
>> Are you looking at the worker logs or the driver?
>>
>>
>> On Thursday, September 8, 2016, Nisha Menon 
>> wrote:
>>
>>> I have an RDD created as follows:
>>>
>>> *JavaPairRDD inputDataFiles =
>>> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>>>
>>> On this RDD I perform a map to process individual files and invoke a
>>> foreach to trigger the same map.
>>>
>>>* JavaRDD output = inputDataFiles.map(new
>>> Function,Object[]>()*
>>> *{*
>>>
>>> *private static final long serialVersionUID = 1L;*
>>>
>>> * @Override*
>>> * public Object[] call(Tuple2 v1) throws Exception *
>>> *{ *
>>> *  System.out.println("in map!");*
>>> *   //do something with v1. *
>>> *  return Object[]*
>>> *} *
>>> *});*
>>>
>>> *output.foreach(new VoidFunction() {*
>>>
>>> * private static final long serialVersionUID = 1L;*
>>>
>>> * @Override*
>>> * public void call(Object[] t) throws Exception {*
>>> * //do nothing!*
>>> * System.out.println("in foreach!");*
>>> * }*
>>> * }); *
>>>
>>> This code works perfectly fine for standalone setup on my local laptop
>>> while accessing both local files as well as remote HDFS files.
>>>
>>> In cluster the same code produces no results. My intuition is that the
>>> data has not reached the individual executors and hence both the `map` and
>>> `foreach` does not work. It might be a guess. But I am not able to figure
>>> out why this would not work in cluster. I dont even see the print
>>> statements in `map` and `foreach` getting printed in cluster mode of
>>> execution.
>>>
>>> I notice a particular line in standalone output that I do NOT see in
>>> cluster execution.
>>>
>>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>>
>>> I had a similar code with textFile() that worked earlier for individual
>>> files on cluster. The issue is with wholeTextFiles() only.
>>>
>>> Please advise what is the best way to get this working or other
>>> alternate ways.
>>>
>>> My setup is cloudera 5.7 distribution with Spark Service. I used the
>>> master as `yarn-client`.
>>>
>>> The action can be anything. Its just a dummy step to invoke the map. I
>>> also tried *System.out.println("Count is:"+output.count());*, for which
>>> I got the correct answer of `10`, since there were 10 files in the folder,
>>> but still the map refuses to work.
>>>
>>> Thanks.
>>>
>>>
>>
>> --
>> Thanks,
>> Sonal
>> Nube Technologies 
>>
>> 
>>
>>
>>
>>
>
>
> --
> Nisha Menon
> BTech (CS) Sahrdaya CET,
> MTech (CS) IIIT Banglore.
>


How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
I looked at the driver logs, which reminded me that I needed to look at the
executor logs. There the issue was that the Spark executors were not
getting a configuration file. I broadcasted the file and now the processing
happens. Thanks for the suggestion.
Currently my issue is that the log file generated independently by the
executors goes to the respective containers' appcache, and then it gets
lost. Is there a recommended way to get the output files from the
individual executors?

On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal  wrote:

> Are you looking at the worker logs or the driver?
>
>
> On Thursday, September 8, 2016, Nisha Menon 
> wrote:
>
>> I have an RDD created as follows:
>>
>> *JavaPairRDD inputDataFiles =
>> sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");*
>>
>> On this RDD I perform a map to process individual files and invoke a
>> foreach to trigger the same map.
>>
>>* JavaRDD output = inputDataFiles.map(new
>> Function,Object[]>()*
>> *{*
>>
>> *private static final long serialVersionUID = 1L;*
>>
>> * @Override*
>> * public Object[] call(Tuple2 v1) throws Exception *
>> *{ *
>> *  System.out.println("in map!");*
>> *   //do something with v1. *
>> *  return Object[]*
>> *} *
>> *});*
>>
>> *output.foreach(new VoidFunction() {*
>>
>> * private static final long serialVersionUID = 1L;*
>>
>> * @Override*
>> * public void call(Object[] t) throws Exception {*
>> * //do nothing!*
>> * System.out.println("in foreach!");*
>> * }*
>> * }); *
>>
>> This code works perfectly fine for standalone setup on my local laptop
>> while accessing both local files as well as remote HDFS files.
>>
>> In cluster the same code produces no results. My intuition is that the
>> data has not reached the individual executors and hence both the `map` and
>> `foreach` does not work. It might be a guess. But I am not able to figure
>> out why this would not work in cluster. I dont even see the print
>> statements in `map` and `foreach` getting printed in cluster mode of
>> execution.
>>
>> I notice a particular line in standalone output that I do NOT see in
>> cluster execution.
>>
>> *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>
>> I had a similar code with textFile() that worked earlier for individual
>> files on cluster. The issue is with wholeTextFiles() only.
>>
>> Please advise what is the best way to get this working or other alternate
>> ways.
>>
>> My setup is cloudera 5.7 distribution with Spark Service. I used the
>> master as `yarn-client`.
>>
>> The action can be anything. Its just a dummy step to invoke the map. I
>> also tried *System.out.println("Count is:"+output.count());*, for which
>> I got the correct answer of `10`, since there were 10 files in the folder,
>> but still the map refuses to work.
>>
>> Thanks.
>>
>>
>
> --
> Thanks,
> Sonal
> Nube Technologies 
>
> 
>
>
>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Banglore.


Re: Missing output partition file in S3

2016-09-21 Thread Steve Loughran

On 19 Sep 2016, at 18:54, Chen, Kevin 
> wrote:

Hi Steve,

Our S3 is on US east. But this issue also occurred when we using a S3 bucket on 
US west. We are using S3n. We use Spark standalone deployment. We run the job 
in EC2. The datasets are about 25GB. We did not have speculative execution 
turned on. We did not use DirectCommiter.

Thanks,
Kevin

the closest thing I know to that on the version of Spark you are using is : 
https://issues.apache.org/jira/browse/SPARK-4879 , but that's assuming 
speculative exection is on


From: Steve Loughran
Date: Friday, September 16, 2016 at 3:46 AM
To: Chen Kevin
Cc: "user@spark.apache.org"
Subject: Re: Missing output partition file in S3


On 15 Sep 2016, at 19:37, Chen, Kevin 
> wrote:

Hi,

Has any one encountered an issue of missing output partition file in S3 ? My 
spark job writes output to a S3 location. Occasionally, I noticed one partition 
file is missing. As a result, one chunk of data was lost. If I rerun the same 
job, the problem usually goes away. This has been happening pretty random. I 
observed once or twice a week on a daily run job. I am using Spark 1.2.1.

Very much appreciated on any input, suggestion of fix/workaround.




This doesn't sound good

Without making any promises about being able to fix this,  I would like to 
understand the setup to see if there is something that could be done to address 
this

  1.  Which S3 installation? US East or elsewhere
  2.  Which s3 client: s3n or s3a. If on hadoop 2.7+, can you switch to S3a if 
you haven't already (exception, if you are using AWS EMR you have to stick with 
their s3:// client)
  3.  Are you running in-EC2 or remotely?
  4.  How big are the datasets being generated?
  5.  Do you have speculative execution turned on
  6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop 
FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 
algorithm (hadoop.mapreduce.fileoutputcommitter.algorithm.version 2)?
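
For reference, one way to set the v2 algorithm mentioned in point 6 from the
Spark side (a sketch; the spark.hadoop.* prefix forwards the property into
the Hadoop Configuration, and it only has an effect when running against
Hadoop 2.7+):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

// or on the command line:
//   spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...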

I should warn that the stance of myself and colleagues is "dont commit direct 
to S3", write to HDFS and do a distcp when you finally copy out the data. S3 
itself doesn't have enough consistency for committing output to work in the 
presence of all race conditions and failure modes. At least here you've noticed 
the problem; the thing people fear is not noticing that a problem has arisen

-Steve



unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread muhammet pakyürek
When I run spark-shell as below,

spark-shell --jars 
'/home/ktuser/spark-cassandra-connector/target/scala-2.11/root_2.11-2.0.0-M3-20-g75719df.jar'
 --packages datastax:spark-cassandra-connector:2.0.0-s_2.11-M3-20-g75719df 
--conf spark.cassandra.connection.host=localhost

I get the error:
unresolved dependency: 
datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df.


The second question: even if I add

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
"2.0.0-M3"

to the spark-cassandra-connector/sbt/sbt file, the built jar files are still named
root_2.11-2.0.0-M3-20-g75719df.


The third question: after building the connector for Scala 2.11, how do I integrate it
with PySpark?



Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Yan Facai
Thanks, Peter.
It works!

Why udf is needed?




On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi 
wrote:

> Hi Yan, I agree, it IS really confusing.  Here is the technique for
> transforming a column.  It is very general because you can make "myConvert"
> do whatever you want.
>
> import org.apache.spark.mllib.linalg.Vectors
> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>
> df.show()
> // The columns were named "_1" and "_2"
> // Very confusing, because it looks like a Scala wildcard when we refer to
> it in code
>
> val myConvert = (x: String) => { Vectors.parse(x) }
> val myConvertUDF = udf(myConvert)
>
> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>
> newDf.show()
>
> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai)  wrote:
>
>> Hi, all.
>> I find that it's really confuse.
>>
>> I can use Vectors.parse to create a DataFrame contains Vector type.
>>
>> scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
>> Vectors.parse("[2,4,6]"))).toDF
>> dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>
>>
>> But using map to convert String to Vector throws an error:
>>
>> scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>> dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>> scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>> :30: error: Unable to find encoder for type stored in a
>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>> classes) are supported by importing spark.implicits._  Support for
>> serializing other types will be added in future releases.
>>   dataStr.map(row => Vectors.parse(row.getString(1)))
>>
>>
>> Dose anyone can help me,
>> thanks very much!
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi > > wrote:
>>
>>> Hi Yan, I think you'll have to map the features column to a new
>>> numerical features column.
>>>
>>> Here's one way to do the individual transform:
>>>
>>> scala> val x = "[1, 2, 3, 4, 5]"
>>> x: String = [1, 2, 3, 4, 5]
>>>
>>> scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "")
>>> split(" ") map(_.toInt)
>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>
>>> If you don't know about the Scala command line, just type "scala" in a
>>> terminal window.  It's a good place to try things out.
>>>
>>> You can make a function out of this transformation and apply it to your
>>> features column to make a new column.  Then add this with
>>> Dataset.withColumn.
>>>
>>> See here
>>> 
>>> on how to apply a function to a Column to make a new column.
>>>
>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai)  wrote:
>>>
 Hi,
 I have a csv file like:
 uid  mid  features   label
 1235231[0, 1, 3, ...]True

 Both  "features" and "label" columns are used for GBTClassifier.

 However, when I read the file:
 Dataset samples = sparkSession.read().csv(file);
 The type of samples.select("features") is String.

 My question is:
 How to map samples.select("features") to Vector or any appropriate type,
 so I can use it to train like:
 GBTClassifier gbdt = new GBTClassifier()
 .setLabelCol("label")
 .setFeaturesCol("features")
 .setMaxIter(2)
 .setMaxDepth(7);

 Thanks.

>>>
>>>
>>
>


How to write multiple outputs in avro format in spark(java)?

2016-09-21 Thread Mahebub Sayyed
Hello,

Currently I am writing multiple text files based on keys.
Code :

public class RDDMultipleTextOutputFormat extends
        MultipleTextOutputFormat<String, String> {

    @Override
    protected String generateFileNameForKeyValue(String key, String value,
            String name) {
        // write each record to a file named after its key
        return key;
    }
}

rddPairs.saveAsHadoopFile(args[1], String.class, String.class,
        RDDMultipleTextOutputFormat.class);

How do I write multiple outputs in Avro format?

Also, I have checked "Avro Data Source for Apache Spark" by Databricks.

Is it possible to write multiple outputs using this library?
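
A possible sketch with that package (assumes Spark DataFrames; partitionBy
writes one sub-directory per key value, which is close to, though not
identical to, the per-key files produced by MultipleTextOutputFormat; `df`
and the "key" column name are assumptions):

import org.apache.spark.sql.SaveMode

// df has a "key" column plus the payload columns to write as Avro
df.write
  .format("com.databricks.spark.avro")
  .partitionBy("key")
  .mode(SaveMode.Overwrite)
  .save("/path/to/output")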

-- 
*Regards,*
*Mahebub Sayyed*


Re: Israel Spark Meetup

2016-09-21 Thread Sean Owen
Done.

On Wed, Sep 21, 2016 at 5:53 AM, Romi Kuntsman  wrote:
> Hello,
> Please add a link in Spark Community page
> (https://spark.apache.org/community.html)
> To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/)
> We're an active meetup group, unifying the local Spark user community, and
> having regular meetups.
> Thanks!
> Romi K.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org