Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try like this in your sbt:


val spark_version = "1.5.2"
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact 
= "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %%  "spark-sql"  % spark_version % "provided” 
excludeAll(excludeServletApi, excludeEclipseJetty),
  "org.apache.spark" %%  "spark-hive" % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty)
)
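Since the error points at a Jackson version clash between what the AWS SDK expects and what ends up on the classpath, another thing worth trying (a sketch only; the 2.6.3 version is taken from the original post) is pinning a single Jackson version in sbt:

val jacksonVersion = "2.6.3"
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind"    % jacksonVersion,
  "com.fasterxml.jackson.core" % "jackson-core"        % jacksonVersion,
  "com.fasterxml.jackson.core" % "jackson-annotations" % jacksonVersion
)

If the conflicting Jackson still comes from Spark's own runtime jars, shading Jackson in an assembly jar is the usual fallback.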



> On Dec 8, 2015, at 2:26 PM, Sunil Tripathy  wrote:
> 
> I am getting the following exception when I use spark-submit to submit a 
> spark streaming job.
> 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
> at 
> com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)
> 
> I tried with different versions of the jackson libraries but that does not
> seem to help.
>  libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % 
> "2.6.3"
> libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.3"
> libraryDependencies += "com.fasterxml.jackson.core" % "jackson-annotations" % 
> "2.6.3"
> 
> Any pointers to resolve the issue?
> 
> Thanks



HiveContext creation failed with Kerberos

2015-12-07 Thread Neal Yin
Hi

I am using Spark 1.5.1 with CDH 5.4.2.  My cluster is kerberos protected.

Here is pseudocode for what I am trying to do:

val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI("foo", "…")
ugi.doAs(new PrivilegedExceptionAction[Unit]() {
  override def run(): Unit = {
    val sparkConf: SparkConf = createSparkConf(…)
    val sparkContext = new JavaSparkContext(sparkConf)
    new HiveContext(sparkContext.sc)  // fails here
  }
})

The Spark context boots up fine with the UGI, but HiveContext creation fails with
the following message. If I manually do kinit within the same shell, this code works.
Any thoughts?
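For what it's worth, one sketch of a workaround (the principal and keytab path below are placeholders) is to make the keytab login the process-wide Hadoop login user before the HiveContext is created, since the Hive metastore client may not see credentials that only live inside the doAs UGI:

import org.apache.hadoop.security.UserGroupInformation

// Sets the static login user for this JVM, which Hadoop/Hive clients pick up.
UserGroupInformation.loginUserFromKeytab("foo@EXAMPLE.COM", "/path/to/foo.keytab")
val hiveContext = new HiveContext(sparkContext.sc)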



15/12/08 04:12:27 INFO hive.HiveContext: Initializing execution hive, version 
1.2.1
15/12/08 04:12:27 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0
15/12/08 04:12:27 INFO client.ClientWrapper: Loaded 
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
15/12/08 04:12:27 INFO cluster.YarnClientSchedulerBackend: Registered executor: 
AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@ip-10-222-0-230.us-west-2.compute.internal:48544/user/Executor#684269808])
 with ID 5
15/12/08 04:12:27 INFO hive.metastore: Trying to connect to metastore with URI 
thrift://ip-10-222-0-145.us-west-2.compute.internal:9083
15/12/08 04:12:27 INFO hive.metastore: Connected to metastore.
15/12/08 04:12:27 INFO session.SessionState: Created local directory: 
/tmp/01726939-f7f4-45a4-a027-feb0ab9b0b68_resources
15/12/08 04:12:27 INFO session.SessionState: Created HDFS directory: 
/tmp/hive/dev.baseline/01726939-f7f4-45a4-a027-feb0ab9b0b68
15/12/08 04:12:27 INFO session.SessionState: Created local directory: 
/tmp/developer/01726939-f7f4-45a4-a027-feb0ab9b0b68
15/12/08 04:12:27 INFO cluster.YarnClientSchedulerBackend: Registered executor: 
AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@ip-10-222-0-230.us-west-2.compute.internal:47786/user/Executor#-468362093])
 with ID 9
15/12/08 04:12:27 INFO session.SessionState: Created HDFS directory: 
/tmp/hive/dev.baseline/01726939-f7f4-45a4-a027-feb0ab9b0b68/_tmp_space.db
15/12/08 04:12:27 INFO storage.BlockManagerMasterEndpoint: Registering block 
manager ip-10-222-0-230.us-west-2.compute.internal:38220 with 530.0 MB RAM, 
BlockManagerId(1, ip-10-222-0-230.us-west-2.compute.internal, 38220)
15/12/08 04:12:28 INFO hive.HiveContext: default warehouse location is 
/user/hive/warehouse
15/12/08 04:12:28 INFO hive.HiveContext: Initializing HiveMetastoreConnection 
version 1.2.1 using Spark classes.
15/12/08 04:12:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0
15/12/08 04:12:28 INFO client.ClientWrapper: Loaded 
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
15/12/08 04:12:28 INFO storage.BlockManagerMasterEndpoint: Registering block 
manager ip-10-222-0-230.us-west-2.compute.internal:56661 with 530.0 MB RAM, 
BlockManagerId(5, ip-10-222-0-230.us-west-2.compute.internal, 56661)
15/12/08 04:12:28 INFO storage.BlockManagerMasterEndpoint: Registering block 
manager ip-10-222-0-230.us-west-2.compute.internal:43290 with 530.0 MB RAM, 
BlockManagerId(9, ip-10-222-0-230.us-west-2.compute.internal, 43290)
15/12/08 04:12:28 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/12/08 04:12:28 INFO hive.metastore: Trying to connect to metastore with URI 
thrift://ip-10-222-0-145.us-west-2.compute.internal:9083
15/12/08 04:12:28 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Try Phoenix from Cloudera parcel distribution

https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/

They may have better Kerberos support.

On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Yes, it's a kerberized cluster and the ticket was generated using the kinit
> command before running the spark job. That's why Spark on hbase worked, but
> when phoenix is used to get the connection to hbase, it does not pass the
> authentication to all nodes. Probably it is not handled in Phoenix version
> 4.3, or Spark 1.3.1 does not provide integration with Phoenix for a
> kerberized cluster.
>
> Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured
> cluster or not?
>
> Thanks,
> Akhilesh
>
> On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov 
> wrote:
>
>> That error is not directly related to spark nor hbase
>>
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>>
>> Is this a kerberized cluster? You likely don't have a good (non-expired)
>> kerberos ticket for authentication to pass.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
>> pathodia.akhil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running spark job on yarn in cluster mode in secured cluster. I am
>>> trying to run Spark on Hbase using Phoenix, but Spark executors are
>>> unable to get an hbase connection using phoenix. I am running the kinit command to
>>> get the ticket before starting the job and also keytab file and principal
>>> are correctly specified in connection URL. But still spark job on each node
>>> throws below error:
>>>
>>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>>> failed. The most likely cause is missing or invalid credentials. Consider
>>> 'kinit'.
>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>> find any Kerberos tgt)]
>>> at
>>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>
>>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
>>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
>>> this link:
>>>
>>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>>>
>>> Also, I found that there is a known issue for yarn-cluster mode for
>>> Spark 1.3.1 version:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6918
>>>
>>> Has anybody been successful in running Spark on hbase using Phoenix in
>>> yarn cluster or client mode?
>>>
>>> Thanks,
>>> Akhilesh Pathodia
>>>
>>
>>
>


NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Sunil Tripathy
I am getting the following exception when I use spark-submit to submit a spark 
streaming job.

Exception in thread "main" java.lang.NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
at 
com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)

I tried with different versions of the jackson libraries but that does not seem to
help.
 libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % 
"2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-annotations" % 
"2.6.3"

Any pointers to resolve the issue?

Thanks



Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Yes, it's a kerberized cluster and the ticket was generated using the kinit command
before running the spark job. That's why Spark on hbase worked, but when phoenix
is used to get the connection to hbase, it does not pass the authentication
to all nodes. Probably it is not handled in Phoenix version 4.3, or Spark
1.3.1 does not provide integration with Phoenix for a kerberized cluster.

Can anybody confirm whether Spark 1.3.1 supports Phoenix on secured cluster
or not?

Thanks,
Akhilesh

On Tue, Dec 8, 2015 at 2:57 AM, Ruslan Dautkhanov 
wrote:

> That error is not directly related to spark nor hbase
>
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
>
> Is this a kerberized cluster? You likely don't have a good (non-expired)
> kerberos ticket for authentication to pass.
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
> pathodia.akhil...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running spark job on yarn in cluster mode in secured cluster. I am
>> trying to run Spark on Hbase using Phoenix, but Spark executors are
>> unable to get an hbase connection using phoenix. I am running the kinit command to
>> get the ticket before starting the job and also keytab file and principal
>> are correctly specified in connection URL. But still spark job on each node
>> throws below error:
>>
>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>> failed. The most likely cause is missing or invalid credentials. Consider
>> 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>> at
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>
>> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
>> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
>> this link:
>>
>> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>>
>> Also, I found that there is a known issue for yarn-cluster mode for Spark
>> 1.3.1 version:
>>
>> https://issues.apache.org/jira/browse/SPARK-6918
>>
>> Has anybody been successful in running Spark on hbase using Phoenix in
>> yarn cluster or client mode?
>>
>> Thanks,
>> Akhilesh Pathodia
>>
>
>


Spark with MapDB

2015-12-07 Thread Ramkumar V
Hi,

I'm running Java over Spark in cluster mode. I want to apply a filter on a
JavaRDD based on some values from a previous batch. If I store those values in
MapDB, is it possible to apply the filter during the current batch?
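For illustration, a sketch of one alternative that avoids an external store: collect the previous batch's keys on the driver, broadcast them, and filter the current batch against the broadcast value (Scala shown, the same pattern works from the Java API; previousBatch and currentBatch are hypothetical pair RDDs):

// previousBatch, currentBatch: RDD[(String, Long)] from the earlier and current batches
val previousKeys: Set[String] = previousBatch.keys.collect().toSet
val broadcastKeys = sc.broadcast(previousKeys)
val filtered = currentBatch.filter { case (k, _) => broadcastKeys.value.contains(k) }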

*Thanks*,



Unable to acces hive table (created through hive context) in hive console

2015-12-07 Thread Divya Gehlot
Hi,

I am a newbie to Spark and am using HDP 2.2, which comes with Spark 1.3.1.
I tried the following code example:

> import org.apache.spark.sql.SQLContext
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
>
> val personFile = "/user/hdfs/TestSpark/Person.csv"
> val df = sqlContext.load(
> "com.databricks.spark.csv",
> Map("path" -> personFile, "header" -> "true", "inferSchema" -> "true"))
> df.printSchema()
> val selectedData = df.select("Name", "Age")
> selectedData.save("NewPerson.csv", "com.databricks.spark.csv")
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("CREATE TABLE IF NOT EXISTS PersonTable (Name STRING, Age
> STRING)")
> hiveContext.sql("LOAD DATA  INPATH '/user/hdfs/NewPerson.csv' INTO TABLE
> PersonTable")
> hiveContext.sql("from PersonTable SELECT Name, Age
> ").collect.foreach(println)


I am able to access above table in HDFS

> [hdfs@sandbox ~]$ hadoop fs -ls /user/hive/warehouse/persontable
> Found 3 items
> -rw-r--r--   1 hdfs hdfs  0 2015-12-08 04:40
> /user/hive/warehouse/persontable/_SUCCESS
> -rw-r--r--   1 hdfs hdfs 47 2015-12-08 04:40
> /user/hive/warehouse/persontable/part-0
> -rw-r--r--   1 hdfs hdfs 33 2015-12-08 04:40
> /user/hive/warehouse/persontable/part-1


But when I run "show tables" in the hive console, I couldn't find the table.

> hive> use default ;
> OK
> Time taken: 0.864 seconds
> hive> show tables;
> OK
> dataframe_test
> sample_07
> sample_08
> Time taken: 0.521 seconds, Fetched: 3 row(s)
> hive> use xademo ;
> OK
> Time taken: 0.791 seconds
> hive> show tables;
> OK
> call_detail_records
> customer_details
> recharge_details
> Time taken: 0.256 seconds, Fetched: 3 row(s)


Can somebody point me in the right direction, in case something is wrong with the
code or I am misunderstanding the concepts?
I would really appreciate your help.

Thanks,
Divya


RE: parquet file doubts

2015-12-07 Thread Singh, Abhijeet
Yes, Parquet has min/max.

From: Cheng Lian [mailto:l...@databricks.com]
Sent: Monday, December 07, 2015 11:21 AM
To: Ted Yu
Cc: Shushant Arora; user@spark.apache.org
Subject: Re: parquet file doubts

Oh sorry... At first I meant to cc the spark-user list since Shushant and I had
discussed some Spark-related issues before. Then I realized that this is a
pure Parquet issue, but forgot to change the cc list. Thanks for pointing this
out! Please ignore this thread.

Cheng
On 12/7/15 12:43 PM, Ted Yu wrote:
Cheng:
I only see user@spark in the CC.

FYI

On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian 
> wrote:
cc parquet-dev list (it would be nice to always do so for these general 
questions.)

Cheng

On 12/6/15 3:10 PM, Shushant Arora wrote:
Hi

I have few doubts on parquet file format.

1. Does Parquet keep min/max statistics like ORC? How can I see the Parquet
version (whether it's 1.1, 1.2 or 1.3) for a Parquet file generated using Hive,
custom MR, or AvroParquetOutputFormat?
Yes, Parquet also keeps row group statistics. You may check the Parquet file 
using the parquet-meta CLI tool in parquet-tools (see 
https://github.com/Parquet/parquet-mr/issues/321 for details), then look for 
the "creator" field of the file. For programmatic access, check for 
o.a.p.hadoop.metadata.FileMetaData.createdBy.

2. How do I sort Parquet records while generating a Parquet file using
AvroParquetOutputFormat?
AvroParquetOutputFormat is not a format. It's just responsible for converting 
Avro records to Parquet records. How are you using AvroParquetOutputFormat? Any 
example snippets?

Thanks


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org




Re: Not able to receive data in spark from rsyslog

2015-12-07 Thread Akhil Das
Just make sure you are binding on the correct interface.

- java.net.ConnectException: Connection refused


This means Spark was not able to connect to that host/port. You can validate it
by telnetting to that host/port.
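As an illustration only (host and port here are placeholders), the receiver connects out to whatever address you give socketTextStream, so that address must be the interface rsyslog (or nc) is actually listening on:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SyslogReceiver")
val ssc = new StreamingContext(conf, Seconds(10))
// Must match the host/port the forwarder listens on; 127.0.0.1 won't be
// reachable from another machine, so bind the listener on 0.0.0.0 instead.
val lines = ssc.socketTextStream("10.0.0.5", 9999)
lines.count().print()
ssc.start()
ssc.awaitTermination()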


Thanks
Best Regards

On Fri, Dec 4, 2015 at 1:00 PM, masoom alam 
wrote:

> I am getting an error that I am not able to receive data in my spark streaming
> application. Please help with any pointers.
> 9 - java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)
> at java.net.Socket.connect(Socket.java:528)
> at java.net.Socket.<init>(Socket.java:425)
> at java.net.Socket.<init>(Socket.java:208)
> at
> org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:73)
> at
> org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:59)
>
> 15/12/04 02:21:29 INFO ReceiverSupervisorImpl: Stopped receiver 0
>
> However, nc -lk 999 gives the data, which is received perfectly... any
> clue?
>
> Thanks
>


[SPARK] Obtaining matrices of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team,

I need to present the Spark job performance to my management. I can get
the execution time by measuring the start and finish time of the job
(including overhead). However, I am not sure how to get the other metrics,
e.g. CPU, I/O, memory, etc.

I want to measure the individual job, not the whole cluster. Please let me
know the best way to do it. If there are any useful resources, please
provide links.


Thank you.
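One programmatic option, sketched here under the assumption that aggregating task-level metrics for the job is enough (the listener class name is made up), is a SparkListener registered on the SparkContext:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates a few TaskMetrics fields across all tasks that finish while the
// listener is registered.
class JobMetricsListener extends SparkListener {
  var totalRunTimeMs = 0L
  var totalMemorySpilled = 0L
  var totalDiskSpilled = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      totalRunTimeMs += m.executorRunTime
      totalMemorySpilled += m.memoryBytesSpilled
      totalDiskSpilled += m.diskBytesSpilled
    }
  }
}

// sc.addSparkListener(new JobMetricsListener)  // register before running the job

CPU and detailed memory usage are not in TaskMetrics, so those still have to come from an external monitor (Ganglia, a metrics sink, or the OS).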


spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, 

Is there a Spark SQL function which returns the current time stamp?

Example:- 
In Impala:- select NOW();
SQL Server:- select GETDATE();
Netezza:- select NOW();

Thanks
Sri



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How the cores are used in Directstream approach

2015-12-07 Thread Akhil Das
You will have to do a repartition after creating the dstream to utilize all
cores; the direct stream keeps exactly the same number of partitions as the Kafka topic.
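For example, a sketch along these lines (the broker, topic and ssc are placeholders for your own StreamingContext and Kafka setup; Spark 1.x Kafka direct API):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("mytopic")
// With a single-partition topic this stream has exactly one partition...
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
// ...so repartition to spread the work across the available cores.
val repartitioned = stream.repartition(4)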

Thanks
Best Regards

On Thu, Dec 3, 2015 at 9:42 AM, Charan Ganga Phani Adabala <
char...@eiqnetworks.com> wrote:

> Hi,
>
> We have 1 kafka topic. By using the direct stream approach in spark we
> are processing the data present in the topic with a one-node cluster,
> to understand how Spark will behave.
>
> My machine configuration is 4 cores, 16 GB RAM with 1 executor.
>
> My question is how many cores are used for this job while running.
>
> In the web console it shows 4 cores are used.
>
> How are the cores used in the direct stream approach?
>
> Command to run the Job :
>
> *./spark/bin/spark-submit --master spark://XX.XX.XX.XXX:7077 --class
> org.eiq.IndexingClient ~/spark/lib/IndexingClient.jar*
>
>
>
> Thanks & Regards,
>
> *Ganga Phani Charan Adabala | Software Engineer*
>
> o:  +91-40-23116680 | c:  +91-9491418099
>
> EiQ Networks, Inc. 
>
>
>
>
>
>
> *"This email is intended only for the use of the individual or entity
> named above and may contain information that is confidential and
> privileged. If you are not the intended recipient, you are hereby notified
> that any dissemination, distribution or copying of the email is strictly
> prohibited. If you have received this email in error, please destroy
> the original message."*
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
updateStateByKey state and your batch data could be filling up your executor
memory, and hence it might be hitting the disk. You can verify it by looking
at the memory footprint while your job is running. Looking at the executor
logs will also give you a better understanding of what's going on.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 8:24 AM, Steven Pearson  wrote:

> I'm running a Spark Streaming job on 1.3.1 which contains an
> updateStateByKey.  The job works perfectly fine, but at some point (after a
> few runs), it starts shuffling to disk no matter how much memory I give the
> executors.
>
> I have tried changing --executor-memory on
> spark-submit, spark.shuffle.memoryFraction, spark.storage.memoryFraction,
> and spark.storage.unrollFraction.  But no matter how I configure these, it
> always spills to disk around 2.5GB.
>
> What is the best way to avoid spilling shuffle to disk?
>
>


Re: Predictive Modeling

2015-12-07 Thread Akhil Das
You can write a simple Python script to process the 1.5 GB dataset and use the
pandas library to build your predictive model.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 3:02 PM, Chintan Bhatt <
chintanbhatt...@charusat.ac.in> wrote:

> Hi,
> I'm very much interested in making a predictive model using crime data
> (from 2001 to present; it is a big .csv file, about 1.5 GB) in Spark on
> Hortonworks.
> Can anyone tell me how to start?
>
> --
> CHINTAN BHATT 
> Assistant Professor,
> U & P U Patel Department of Computer Engineering,
> Chandubhai S. Patel Institute of Technology,
> Charotar University of Science And Technology (CHARUSAT),
> Changa-388421, Gujarat, INDIA.
> http://www.charusat.ac.in
> *Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/
>
> [image: IBM]
> 
>
>
> DISCLAIMER: The information transmitted is intended only for the person or
> entity to which it is addressed and may contain confidential and/or
> privileged material which is the intellectual property of Charotar
> University of Science & Technology (CHARUSAT). Any review, retransmission,
> dissemination or other use of, or taking of any action in reliance upon
> this information by persons or entities other than the intended recipient
> is strictly prohibited. If you are not the intended recipient, or the
> employee, or agent responsible for delivering the message to the intended
> recipient and/or if you have received this in error, please contact the
> sender and delete the material from the computer or device. CHARUSAT does
> not take any liability or responsibility for any malicious codes/software
> and/or viruses/Trojan horses that may have been picked up during the
> transmission of this message. By opening and solely relying on the contents
> or part thereof this message, and taking action thereof, the recipient
> relieves the CHARUSAT of all the liabilities including any damages done to
> the recipient's pc/laptop/peripherals and other communication devices due
> to any reason.
>


Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Sean Owen
You can't broadcast an RDD to begin with, and can't use RDDs inside
RDDs. They are really driver-side concepts.

Yes, that's how you'd use a broadcast of anything else, though
you need to reference ".value" on the broadcast. The 'if' is redundant
in that example, and if it's a map- or collection-like structure, you
don't even need the arg.

RDD2.filter(broadcasted.value.contains)
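Spelled out a little more (a sketch; the Set contents and rdd2 are placeholders standing in for whatever RDD1 held, collected on the driver):

val keep: Set[String] = Set("a", "b", "c")        // e.g. rdd1.collect().toSet on the driver
val broadcasted = sc.broadcast(keep)
val filtered = rdd2.filter(x => broadcasted.value.contains(x))
// or, since a Set is itself a predicate: rdd2.filter(broadcasted.value)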

On Mon, Dec 7, 2015 at 2:43 PM, Akhil Das  wrote:
> Something like this?
>
> val broadcasted = sc.broadcast(...)
>
> RDD2.filter(value => {
>
> //simply use broadcasted
> if(broadcasted.contains(value)) true
>
> })
>
>
>
> Thanks
> Best Regards
>
> On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar
>  wrote:
>>
>> Hi,
>>
>>  I have RDD1 that is broadcasted.
>>
>> I have a user defined method for the filter functionality of RDD2, written
>> as follows:
>>
>> RDD2.filter(my_func)
>>
>>
>> I want to access the values of RDD1 inside my_func. Is that possible?
>> Should I pass RDD1 as a parameter into my_func?
>>
>> Thanks
>> Abhishek S
>>
>> NOTICE AND DISCLAIMER
>>
>> This email (including attachments) is confidential. If you are not the
>> intended recipient, notify the sender immediately, delete this email from
>> your system and do not disclose or use for any purpose.
>>
>> Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United
>> Kingdom
>> Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United
>> Kingdom
>> Big Data Partnership Limited is a company registered in England & Wales
>> with Company No 7904824
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark applications metrics

2015-12-07 Thread Akhil Das
Usually your application is composed of jobs and jobs are composed of
tasks; at the stage/task level you can see how much read/write happened
from the Stages tab of your driver UI.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 6:20 PM, patcharee  wrote:

> Hi
>
> How can I see the summary of data read / write, shuffle read / write, etc
> of an Application, not per stage?
>
> Thanks,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-07 Thread Akhil Das
What's in your SparkIsAwesome class? Just make sure that you are giving
enough partitions to Spark to evenly distribute the job throughout the
cluster.
Try submitting the job this way:

~/spark/bin/spark-submit --executor-cores 10 --executor-memory 5G
--driver-memory 5G --class com.example.SparkIsAwesome awesome/spark.jar
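On the code side, a sketch of the partitioning point (the path and counts are illustrative): make sure the input RDD has at least as many partitions as the total cores you want busy, otherwise only one or two executors get tasks.

// e.g. 5 workers x ~10 cores => ask for 50 partitions up front
val data = sc.textFile("s3n://bucket/input", 50)
// or, for an RDD that already exists with too few partitions:
val spread = data.repartition(50)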


Thanks
Best Regards

On Sat, Dec 5, 2015 at 12:58 AM, Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> available).
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


FW: Managed to make Hive run on Spark engine

2015-12-07 Thread Mich Talebzadeh
For those interested

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 06 December 2015 20:33
To: u...@hive.apache.org
Subject: Managed to make Hive run on Spark engine

 

Thanks all, especially to Xuefu, for the contributions. Finally it works, which means
don't give up until it works :)

 

hduser@rhes564::/usr/lib/hive/lib> hive

Logging initialized using configuration in 
jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive> set spark.home= /usr/lib/spark-1.3.1-bin-hadoop2.6;

hive> set hive.execution.engine=spark;

hive> set spark.master=spark://50.140.197.217:7077;

hive> set spark.eventLog.enabled=true;

hive> set spark.eventLog.dir= /usr/lib/spark-1.3.1-bin-hadoop2.6/logs;

hive> set spark.executor.memory=512m;

hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;

hive> set hive.spark.client.server.connect.timeout=22ms;

hive> set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;

hive> use asehadoop;

OK

Time taken: 0.638 seconds

hive> select count(1) from t;

Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b

Total jobs = 1

Launching Job 1 out of 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=

In order to set a constant number of reducers:

  set mapreduce.job.reduces=

Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a

 

Query Hive on Spark job[0] stages:

0

1

 

Status: Running (Hive on Spark job[0])

Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1  Stage-1_0: 0/1

2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1

2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished

Status: Finished successfully in 8.10 seconds

OK

 

The versions used for this project

 

 

OS version: Linux version 2.6.18-92.el5xen
(brewbuil...@ls20-bc2-13.build.redhat.com) (gcc version 4.1.2 20071124
(Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008

 

Hadoop 2.6.0

Hive 1.2.1

spark-1.3.1-bin-hadoop2.6 (downloaded from prebuild 
spark-1.3.1-bin-hadoop2.6.gz for starting spark standalone cluster)

The Jar file used in $HIVE_HOME/lib to link Hive to Spark was
spark-assembly-1.3.1-hadoop2.4.0.jar

(built from the source downloaded as the zipped file spark-1.3.1.gz and built
with the command line: make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided")

 

Pretty picky on parameters, CLASSPATH, IP addresses or hostname etc to make it 
work

 

I will create a full guide on how to build and make Hive to run with Spark as 
its engine (as opposed to MR).

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 



Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team,

I need to present the Spark job performance to my management. I can get
the execution time by measuring the start and finish time of the job
(including overhead). However, I am not sure how to get the other metrics,
e.g. CPU, I/O, memory, etc.

I want to measure the individual job, not the whole cluster. Please let me
know the best way to do it. If there are any useful resources, please
provide links.


Thank you.


Re: spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
I found a way out.

import java.text.SimpleDateFormat
import java.util.Date

// Assuming the intended pattern was "yyyy-M-dd hh:mm:ss" (the year part of the
// pattern was lost in the archive).
val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss")

val testsql = sqlContext.sql(
  "select column1,column2,column3,column4,column5,'%s' as TIME_STAMP from TestTable limit 10"
    .format(format.format(new Date())))


Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Does unix_timestamp() satisfy your needs?
See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
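For example, a quick sketch (Spark 1.5+; the table and columns come from the earlier example): both unix_timestamp() and current_timestamp() can be used straight from SQL.

val df = sqlContext.sql(
  "select column1, unix_timestamp() as epoch_secs, current_timestamp() as ts " +
  "from TestTable limit 10")
df.show()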

On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> I found a way out.
>
> import java.text.SimpleDateFormat
> import java.util.Date
>
> val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss")
>
> val testsql = sqlContext.sql(
>   "select column1,column2,column3,column4,column5,'%s' as TIME_STAMP from TestTable limit 10"
>     .format(format.format(new Date())))
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs

Jonathan,
Did you ever get to the bottom of this? I have some users working with 
Spark in a classroom setting and our example notebooks run into problems 
where there is so much spilled to disk that they run out of quota. A 
1.5G input set becomes >30G of spilled data on disk. I looked into how I 
could unpersist the data so I could clean up the files, but I was 
unsuccessful.


We're using Spark 1.5.0

Yours,
Ewan

On 16/07/15 23:18, Stahlman, Jonathan wrote:

Hello all,

I am running the Spark recommendation algorithm in MLlib and I have 
been studying its output with various model configurations.  Ideally I 
would like to be able to run one job that trains the recommendation 
model with many different configurations to try to optimize for 
performance.  A sample code in python is copied below.


The issue I have is that each new model which is trained caches a set 
of RDDs and eventually the executors run out of memory.  Is there any 
way in Pyspark to unpersist() these RDDs after each iteration?  The 
names of the RDDs which I gather from the UI is:


itemInBlocks
itemOutBlocks
Products
ratingBlocks
userInBlocks
userOutBlocks
users

I am using Spark 1.3.  Thank you for any help!

Regards,
Jonathan




import itertools
from pyspark.mllib.recommendation import ALS, Rating

data_train, data_cv, data_test = data.randomSplit([99, 1, 1], 2)
functions = [rating]  # defined elsewhere
ranks = [10, 20]
iterations = [10, 20]
lambdas = [0.01, 0.1]
alphas = [1.0, 50.0]

results = []
for ratingFunction, rank, numIterations, m_lambda, m_alpha in \
        itertools.product(functions, ranks, iterations, lambdas, alphas):

    # train model
    ratings_train = data_train.map(lambda l: Rating(l.user, l.product, ratingFunction(l)))
    model = ALS.trainImplicit(ratings_train, rank, numIterations,
                              lambda_=float(m_lambda), alpha=float(m_alpha))

    # test performance on CV data
    ratings_cv = data_cv.map(lambda l: Rating(l.user, l.product, ratingFunction(l)))
    auc = areaUnderCurve(ratings_cv, model.predictAll)

    # save results
    result = ",".join(str(l) for l in
                      [ratingFunction.__name__, rank, numIterations, m_lambda, m_alpha, auc])
    results.append(result)



The information contained in this e-mail is confidential and/or 
proprietary to Capital One and/or its affiliates and may only be used 
solely in performance of work or services for Capital One. The 
information transmitted herewith is intended only for use by the 
individual or entity to which it is addressed. If the reader of this 
message is not the intended recipient, you are hereby notified that 
any review, retransmission, dissemination, distribution, copying or 
other use of, or taking of any action in reliance upon this 
information is strictly prohibited. If you have received this 
communication in error, please contact the sender and delete the 
material from your computer.






Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Sean Owen
I'm not sure if this is available in Python but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with level "none" to ask
it to unpersist when it is done.
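On the Scala side that would look something like this sketch (the rank/iterations/lambda/alpha values are just taken from the earlier example; ratingsTrain is an RDD[Rating]):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setAlpha(1.0)
  .setImplicitPrefs(true)
  .setFinalRDDStorageLevel(StorageLevel.NONE)  // don't keep the factor RDDs cached
val model = als.run(ratingsTrain)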

On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs  wrote:
> Jonathan,
> Did you ever get to the bottom of this? I have some users working with Spark
> in a classroom setting and our example notebooks run into problems where
> there is so much spilled to disk that they run out of quota. A 1.5G input
> set becomes >30G of spilled data on disk. I looked into how I could
> unpersist the data so I could clean up the files, but I was unsuccessful.
>
> We're using Spark 1.5.0
>
> Yours,
> Ewan
>
> On 16/07/15 23:18, Stahlman, Jonathan wrote:
>
> Hello all,
>
> I am running the Spark recommendation algorithm in MLlib and I have been
> studying its output with various model configurations.  Ideally I would like
> to be able to run one job that trains the recommendation model with many
> different configurations to try to optimize for performance.  A sample code
> in python is copied below.
>
> The issue I have is that each new model which is trained caches a set of
> RDDs and eventually the executors run out of memory.  Is there any way in
> Pyspark to unpersist() these RDDs after each iteration?  The names of the
> RDDs which I gather from the UI is:
>
> itemInBlocks
> itemOutBlocks
> Products
> ratingBlocks
> userInBlocks
> userOutBlocks
> users
>
> I am using Spark 1.3.  Thank you for any help!
>
> Regards,
> Jonathan
>
>
>
>
>   data_train, data_cv, data_test = data.randomSplit([99,1,1], 2)
>   functions = [rating] #defined elsewhere
>   ranks = [10,20]
>   iterations = [10,20]
>   lambdas = [0.01,0.1]
>   alphas  = [1.0,50.0]
>
>   results = []
>   for ratingFunction, rank, numIterations, m_lambda, m_alpha in
> itertools.product( functions, ranks, iterations, lambdas, alphas ):
> #train model
> ratings_train = data_train.map(lambda l: Rating( l.user, l.product,
> ratingFunction(l) ) )
> model   = ALS.trainImplicit( ratings_train, rank, numIterations,
> lambda_=float(m_lambda), alpha=float(m_alpha) )
>
> #test performance on CV data
> ratings_cv = data_cv.map(lambda l: Rating( l.user, l.product,
> ratingFunction(l) ) )
> auc = areaUnderCurve( ratings_cv, model.predictAll )
>
> #save results
> result = ",".join(str(l) for l in
> [ratingFunction.__name__,rank,numIterations,m_lambda,m_alpha,auc])
> results.append(result)
>
> 
>
> The information contained in this e-mail is confidential and/or proprietary
> to Capital One and/or its affiliates and may only be used solely in
> performance of work or services for Capital One. The information transmitted
> herewith is intended only for use by the individual or entity to which it is
> addressed. If the reader of this message is not the intended recipient, you
> are hereby notified that any review, retransmission, dissemination,
> distribution, copying or other use of, or taking of any action in reliance
> upon this information is strictly prohibited. If you have received this
> communication in error, please contact the sender and delete the material
> from your computer.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Akhil Das
Something like this?

val broadcasted = sc.broadcast(...)

RDD2.filter(value => {

//simply use *broadcasted*
if(broadcasted.contains(value)) true

})



Thanks
Best Regards

On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar <
abhishek.shivku...@bigdatapartnership.com> wrote:

> Hi,
>
>  I have RDD1 that is broadcasted.
>
> I have a user defined method for the filter functionality of RDD2, written
> as follows:
>
> RDD2.filter(my_func)
>
>
> I want to access the values of RDD1 inside my_func. Is that possible?
> Should I pass RDD1 as a parameter into my_func?
>
> Thanks
> Abhishek S
>
> *NOTICE AND DISCLAIMER*
>
> This email (including attachments) is confidential. If you are not the
> intended recipient, notify the sender immediately, delete this email from
> your system and do not disclose or use for any purpose.
>
> Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United
> Kingdom
> Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United
> Kingdom
> Big Data Partnership Limited is a company registered in England & Wales
> with Company No 7904824
>


Spark sql random number or sequence numbers ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All,

I implemented random numbers using Scala Spark; is there a function to
get a row_number equivalent in Spark SQL?

example:- 
sql server:-row_number() 
Netezza:- sequence number 
mysql:- sequence number

Example:-
val testsql=sqlContext.sql("select column1,column2,column3,column4,column5
,row_number() as random from TestTable limit 10")

Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-sql-random-number-or-sequence-numbers-tp25623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer but you can create a big fat jar combining all the
classes in the three jars and pass it.
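If the project builds with sbt, one way to get that fat jar is the sbt-assembly plugin (a sketch; the plugin version and merge rules are illustrative and usually need tuning for your dependencies):

// project/plugins.sbt
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

// build.sbt
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Running "sbt assembly" then produces a single jar that can be passed as the one entry after --jars (or as the main application jar).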

Thanks
Best Regards

On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan 
wrote:

> Hello
>
> I have a question about AWS CLI for people who use it.
>
> I create a spark cluster with the aws cli and I'm using a spark step with jar
> dependencies. But as you can see below I cannot set multiple jars because
> the AWS CLI replaces the comma with a space in ARGS.
>
> Is there a way of doing it? I can accept every kind of solutions. For
> example, i tried to merge these two jar dependencies but i could not manage
> it.
>
>
> aws emr create-cluster
> …..
> …..
> Args=[--class,com.blabla.job,
> —jars,"/home/hadoop/first.jar,/home/hadoop/second.jar",
> /home/hadoop/main.jar,--verbose]
>
>
> I also tried to escape comma with \\, but it did not work.
>


Re: Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Akhil Das
Did you read
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support



Thanks
Best Regards

On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
>
>
> I am trying to make Hive work with Spark.
>
>
>
> I have been told that I need to use Spark 1.3 and build it from source
> code WITHOUT HIVE libraries.
>
>
>
> I have built it as follows:
>
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
>
>
> Now the issue I have that I cannot start master node.
>
>
>
> When I try
>
>
>
> hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin>
> ./start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> failed to launch org.apache.spark.deploy.master.Master:
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
> full log in
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
>
>
> I get
>
>
>
> Spark Command: /usr/java/latest/bin/java -cp
> :/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
> --webui-port 8080
>
> 
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
>
> at java.lang.Class.getMethod0(Class.java:2764)
>
> at java.lang.Class.getMethod(Class.java:1653)
>
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
>
>
> Any advice will be appreciated.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: Where to implement synchronization is GraphX Pregel API

2015-12-07 Thread Robineast
Not sure exactly what you're asking, but:

1) if you are asking whether you need to implement synchronisation code - no,
that is built into the call to Pregel
2) if you are asking how is synchronisation implemented in GraphX - the
superstep starts and ends with the beginning and end of a while loop in the
Pregel implementation code (see
http://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
for pseudo-code or Pregel.scala in the source). This code will run in the
driver and orchestrates the receipt of messages, vertex update program and
send messages. All you need to do is supply the merge-message, vertex-update
and send-message functions to the Pregel method (a minimal example of the three
is sketched below). Since GraphX objects are backed by RDDs, and RDDs provide
distributed processing, you get synchronous distributed processing.
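The sketch below is the usual single-source shortest-path example, assuming an existing graph: Graph[VD, Double] with Double edge weights and a made-up source vertex id; the while-loop supersteps all happen inside the pregel call.

import org.apache.spark.graphx._

val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),           // vertex update program
  triplet =>                                                // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                                  // merge messages
)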



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-implement-synchronization-is-GraphX-Pregel-API-tp25612p25622.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: parquet file doubts

2015-12-07 Thread Shushant Arora
How do I read it using parquet-tools?
When I did
hadoop parquet.tools.Main meta parquetfilename

I didn't get any info about the min and max values.

How can I see the Parquet version of my file? Is min/max specific to some
Parquet version, or has it been available since the beginning?


On Mon, Dec 7, 2015 at 6:51 PM, Singh, Abhijeet 
wrote:

> Yes, Parquet has min/max.
>
>
>
> *From:* Cheng Lian [mailto:l...@databricks.com]
> *Sent:* Monday, December 07, 2015 11:21 AM
> *To:* Ted Yu
> *Cc:* Shushant Arora; user@spark.apache.org
> *Subject:* Re: parquet file doubts
>
>
>
> Oh sorry... At first I meant to cc spark-user list since Shushant and I
> had been discussed some Spark related issues before. Then I realized that
> this is a pure Parquet issue, but forgot to change the cc list. Thanks for
> pointing this out! Please ignore this thread.
>
> Cheng
>
> On 12/7/15 12:43 PM, Ted Yu wrote:
>
> Cheng:
>
> I only see user@spark in the CC.
>
>
>
> FYI
>
>
>
> On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian  wrote:
>
> cc parquet-dev list (it would be nice to always do so for these general
> questions.)
>
> Cheng
>
> On 12/6/15 3:10 PM, Shushant Arora wrote:
>
> Hi
>
> I have few doubts on parquet file format.
>
> 1. Does Parquet keep min/max statistics like ORC? How can I see the Parquet
> version (whether it's 1.1, 1.2 or 1.3) for a Parquet file generated using Hive
> or custom MR or AvroParquetOutputFormat?
>
> Yes, Parquet also keeps row group statistics. You may check the Parquet
> file using the parquet-meta CLI tool in parquet-tools (see
> https://github.com/Parquet/parquet-mr/issues/321 for details), then look
> for the "creator" field of the file. For programmatic access, check for
> o.a.p.hadoop.metadata.FileMetaData.createdBy.
>
>
> 2. How do I sort Parquet records while generating a Parquet file using
> AvroParquetOutputFormat?
>
> AvroParquetOutputFormat is not a format. It's just responsible for
> converting Avro records to Parquet records. How are you using
> AvroParquetOutputFormat? Any example snippets?
>
>
> Thanks
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Nisrina Luthfiyati
Hi Jacek, thank you for your answer. I looked at TaskSchedulerImpl and
TaskSetManager and it does look like tasks are sent directly to executors.
I would also love to be corrected if mistaken, as I have little knowledge of
Spark internals and am very new to Scala.

On Tue, Dec 1, 2015 at 1:16 AM, Jacek Laskowski  wrote:

> On Fri, Nov 27, 2015 at 12:12 PM, Nisrina Luthfiyati <
> nisrina.luthfiy...@gmail.com> wrote:
>
>> Hi all,
>> I'm trying to understand how yarn-client mode works and found these two
>> diagrams:
>>
>>
>>
>>
>> In the first diagram, it looks like the driver running in client directly
>> communicates with executors to issue application commands, while in the
>> second diagram it looks like application commands is sent to application
>> master first and then forwarded to executors.
>>
>
> My limited understanding tells me that regardless of deploy mode (local,
> standalone, YARN or mesos), drivers (using TaskSchedulerImpl) sends
> TaskSets to executors once they're launched. YARN and Mesos are only used
> until they offer resources (CPU and memory) and once executors start, these
> cluster managers are not engaged in the communication (driver and executors
> communicate using RPC over netty since 1.6-SNAPSHOT or akka before).
>
> I'd love being corrected if mistaken. Thanks.
>
> Jacek
>



-- 
Nisrina Luthfiyati - Ilmu Komputer Fasilkom UI 2010
http://www.facebook.com/nisrina.luthfiyati
http://id.linkedin.com/in/nisrina


RE: Broadcasting a parquet file using spark and python

2015-12-07 Thread Shuai Zheng
Hi Michael,

 

Thanks for feedback.

 

I am using version 1.5.2 now.

 

Can you tell me how to enforce the broadcast join? I don't want to let the
engine decide the execution path of the join. I want to use a hint or parameter to
enforce a broadcast join (because I also have some cases that are inner joins where I
want to use a broadcast join).

 

Or is there any ticket or roadmap for this feature?
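For what it's worth, one sketch of an explicit hint on the DataFrame side (Spark 1.5+; bigDF/smallDF and the key column are placeholders):

import org.apache.spark.sql.functions.broadcast

// Mark the small side so the planner prefers a broadcast join for this plan,
// independent of spark.sql.autoBroadcastJoinThreshold.
val joined = bigDF.join(broadcast(smallDF), bigDF("key") === smallDF("key"), "left_outer")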

 

Regards,

 

Shuai

 

 

From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Saturday, December 05, 2015 4:11 PM
To: Shuai Zheng
Cc: Jitesh chandra Mishra; user
Subject: Re: Broadcasting a parquet file using spark and python

 

I believe we started supporting broadcast outer joins in Spark 1.5.  Which 
version are you using? 

 

On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng  wrote:

Hi all,

 

Sorry to re-open this thread.

 

I have a similar issue, one big parquet file left outer join quite a few 
smaller parquet files. But the running is extremely slow and even OOM sometimes 
(with 300M , I have two questions here:

 

1, If I use outer join, will Spark SQL auto use broadcast hashjoin?

2, If not, in the latest documents: 
http://spark.apache.org/docs/latest/sql-programming-guide.html

 


spark.sql.autoBroadcastJoinThreshold (default: 10485760, i.e. 10 MB)

Configures the maximum size in bytes for a table that will be broadcast to all
worker nodes when performing a join. By setting this value to -1 broadcasting
can be disabled. Note that currently statistics are only supported for Hive
Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS
noscan has been run.

 

How can I do this (run command analyze table) in Java? I know I can code it by 
myself (create a broadcast val and implement lookup by myself), but it will 
make code super ugly.

 

I hope we can have either API or hint to enforce the hashjoin (instead of this 
suspicious autoBroadcastJoinThreshold parameter). Do we have any ticket or 
roadmap for this feature?

 

Regards,

 

Shuai

 

From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Wednesday, April 01, 2015 2:01 PM
To: Jitesh chandra Mishra
Cc: user
Subject: Re: Broadcasting a parquet file using spark and python

 

You will need to create a hive parquet table that points to the data and run 
"ANALYZE TABLE tableName noscan" so that we have statistics on the size.

 

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra  
wrote:

Hi Michael,

 

Thanks for your response. I am running 1.2.1. 

 

Is there any workaround to achieve the same with 1.2.1?

 

Thanks,

Jitesh

 

On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust  
wrote:

In Spark 1.3 I would expect this to happen automatically when the parquet table 
is small (< 10mb, configurable with spark.sql.autoBroadcastJoinThreshold).  If 
you are running 1.3 and not seeing this, can you show the code you are using to 
create the table?

 

On Tue, Mar 31, 2015 at 3:25 AM, jitesh129  wrote:

How can we implement a BroadcastHashJoin for spark with python?

My SparkSQL inner joins are taking a lot of time since it is performing
ShuffledHashJoin.

Tables on which join is performed are stored as parquet files.

Please help.

Thanks and regards,
Jitesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-a-parquet-file-using-spark-and-python-tp22315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 

 

 

 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Robin, 
Thanks for your reply and thanks for copying my question to the user mailing list.
Yes, we have a distributed C++ application that will store data on each node
in the cluster, and we hope to leverage Spark to do more fancy analytics on
that data. But we need high performance; that's why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

> -dev, +user (this is not a question about development of Spark itself so 
> you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m 
> sure it would be possible with enough tinkering but it’s not clear what you 
> are trying to achieve. Spark is a distributed processing system, it has 
> multiple JVMs running on different machines that each run a small part of the 
> overall processing. Unless you have some sort of idea to have multiple C++ 
> processes collocated with the distributed JVMs using named memory mapped 
> files doesn’t make architectural sense. 
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 6 Dec 2015, at 20:43, Jia  wrote:
>> 
>> Dears, for one project, I need to implement something so Spark can read data 
>> from a C++ process. 
>> To provide high performance, I really hope to implement this through shared 
>> memory between the C++ process and Java JVM process.
>> It seems it may be possible to use named memory mapped files and JNI to do 
>> this, but I wonder whether there is any existing efforts or more efficient 
>> approach to do this?
>> Thank you very much!
>> 
>> Best Regards,
>> Jia
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
> 



Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Mich Talebzadeh
 

Thanks sorted.

 

Actually I used version 1.3.1 and have now managed to make it work as the Hive 
execution engine.

 

Cheers,

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Akhil Das [mailto:ak...@sigmoidanalytics.com] 
Sent: 07 December 2015 14:31
To: Mich Talebzadeh  >
Cc: user  >
Subject: Re: Getting error when trying to start master node after building 
spark 1.3

 

Did you read 
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support

 

 




Thanks

Best Regards

 

On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh  > wrote:

Hi,

 

 

I am trying to make Hive work with Spark.

 

I have been told that I need to use Spark 1.3 and build it from source code 
WITHOUT HIVE libraries.

 

I have built it as follows:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

 

Now the issue I have that I cannot start master node.

 

When I try

 

hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin> ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to 
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out

failed to launch org.apache.spark.deploy.master.Master:

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

full log in 
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out

 

I get

 

Spark Command: /usr/java/latest/bin/java -cp 
:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077 
--webui-port 8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Any advice will be appreciated.

 

Thanks,

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, 

Re: spark.authenticate=true YARN mode doesn't work

2015-12-07 Thread Marcelo Vanzin
Prasad,

As I mentioned in my first reply, you need to enable
spark.authenticate in the shuffle service's configuration too for this
to work. It doesn't seem like you have done that.

On Sun, Dec 6, 2015 at 5:09 PM, Prasad Reddy  wrote:
> Hi Marcelo,
>
> I am attaching all container logs. can you please take a look at it when you
> get a chance.
>
> Thanks
> Prasad
>
> On Sat, Dec 5, 2015 at 2:30 PM, Marcelo Vanzin  wrote:
>>
>> On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy  wrote:
>> > I am running Spark YARN and trying to enable authentication by setting
>> > spark.authenticate=true. After enable authentication I am not able to
>> > Run
>> > Spark word count or any other programs.
>>
>> Define "I am not able to run". What doesn't work? What error do you get?
>>
>> None of the things Ted mentioned should affect this. Enabling that
>> option should be all that's needed. If you're using the external
>> shuffle service, make sure the option is also enabled in the service's
>> configuration.
>>
>> --
>> Marcelo
>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
How can I create an RDD from a SQL query against a SQL Server database? Here is the 
DataFrame example:

http://spark.apache.org/docs/latest/sql-programming-guide.html#overview


val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
  "dbtable" -> "schema.tablename")).load()

This code creates a DataFrame from a table. How can I create a DataFrame from a 
query, e.g. "select docid, title, docText from dbo.document where docid between 
10 and 1000"?

Ningjun



Re: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Sujit Pal
Hi Ningjun,

Haven't done this myself; I saw your question, was curious about the
answer, and found this article which you might find useful:
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/

According to this article, you can pass your SQL statement in the "dbtable"
mapping, i.e. something like:

val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "(select docid, title, docText from dbo.document where docid between 10 and 1000) as docs"
  ))
  .load()
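For SQL Server specifically, the same pattern should work with the Microsoft JDBC driver on the classpath; a sketch (the URL, driver class and credentials below are the usual ones but treat them as assumptions, and note the alias on the derived table, which most databases require):

val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url"     -> "jdbc:sqlserver://dbserver:1433;databaseName=mydb;user=myuser;password=secret",
    "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    // Spark embeds this value in its own SELECT, so the subquery needs an alias.
    "dbtable" -> "(select docid, title, docText from dbo.document where docid between 10 and 1000) as docs"
  ))
  .load()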

-sujit

On Mon, Dec 7, 2015 at 8:26 AM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

> How can I create a RDD from a SQL query against SQLServer database? Here
> is the example of dataframe
>
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
>
>
>
>
>
> *val* jdbcDF *=* sqlContext.read.format("jdbc").options(
>
>   *Map*("url" -> "jdbc:postgresql:dbserver",
>
>   "dbtable" -> "schema.tablename")).load()
>
>
>
> This code create dataframe from a table. How can I create dataframe from a
> query, e.g. “select docid, title, docText from dbo.document where docid
> between 10 and 1000”?
>
>
>
> Ningjun
>
>
>


Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Jacek Laskowski
Hi,

That's my understanding, too. Just spent an entire morning today to check
it out and would be surprised to hear otherwise.

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

On Mon, Dec 7, 2015 at 4:01 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote:

> Hi Jacek, thank you for your answer. I looked at TaskSchedulerImpl and
> TaskSetManager and it does looked like tasks are directly sent to
> executors. Also would love to be corrected if mistaken as I have little
> knowledge about Spark internals and very new at scala.
>
> On Tue, Dec 1, 2015 at 1:16 AM, Jacek Laskowski  wrote:
>
>> On Fri, Nov 27, 2015 at 12:12 PM, Nisrina Luthfiyati <
>> nisrina.luthfiy...@gmail.com> wrote:
>>
>>> Hi all,
>>> I'm trying to understand how yarn-client mode works and found these two
>>> diagrams:
>>>
>>>
>>>
>>>
>>> In the first diagram, it looks like the driver running in client
>>> directly communicates with executors to issue application commands, while
>>> in the second diagram it looks like application commands is sent to
>>> application master first and then forwarded to executors.
>>>
>>
>> My limited understanding tells me that regardless of deploy mode (local,
>> standalone, YARN or mesos), drivers (using TaskSchedulerImpl) sends
>> TaskSets to executors once they're launched. YARN and Mesos are only used
>> until they offer resources (CPU and memory) and once executors start, these
>> cluster managers are not engaged in the communication (driver and executors
>> communicate using RPC over netty since 1.6-SNAPSHOT or akka before).
>>
>> I'd love being corrected if mistaken. Thanks.
>>
>> Jacek
>>
>
>
>
> --
> Nisrina Luthfiyati - Ilmu Komputer Fasilkom UI 2010
> http://www.facebook.com/nisrina.luthfiyati
> http://id.linkedin.com/in/nisrina
>
>


Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 6 Dec 2015, at 20:43, Jia  wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data 
> from a C++ process. 
> To provide high performance, I really hope to implement this through shared 
> memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do 
> this, but I wonder whether there is any existing efforts or more efficient 
> approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 



RE: spark sql current time stamp function ?

2015-12-07 Thread Mich Talebzadeh
Or try this

 

cast(from_unixtime(unix_timestamp()) AS timestamp)
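For example, embedded in a query (table and column names are placeholders):

val withTs = sqlContext.sql(
  "select column1, cast(from_unixtime(unix_timestamp()) as timestamp) as time_stamp from TestTable")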

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: 07 December 2015 15:49
To: kali.tumm...@gmail.com
Cc: user 
Subject: Re: spark sql current time stamp function ?

 

Does unix_timestamp() satisfy your needs ?

See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala

 

On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com 
   > wrote:

I found a way out.

import java.text.SimpleDateFormat
import java.util.Date;

val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss")

val testsql = sqlContext.sql("select column1,column2,column3,column4,column5,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date())))


Thanks
Sri



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 
For additional commands, e-mail: user-h...@spark.apache.org 
 

 



How to change StreamingContext batch duration after loading from checkpoint

2015-12-07 Thread yam
Is there a way to change the streaming context batch interval after reloading
from checkpoint?

I would like to be able to change the batch interval after restarting the
application without losing the checkpoint, of course.

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-change-StreamingContext-batch-duration-after-loading-from-checkpoint-tp25624.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Have you tried using monotonicallyIncreasingId ?
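A rough sketch of both routes, assuming Spark 1.5, an existing DataFrame df, and a HiveContext for the SQL window function (table and column names are made up):

import org.apache.spark.sql.functions.monotonicallyIncreasingId

// Option 1: a unique (but not consecutive) id per row, pure DataFrame API.
val withId = df.withColumn("row_id", monotonicallyIncreasingId())

// Option 2: a real ROW_NUMBER() in plain SQL (needs a HiveContext in 1.5).
df.registerTempTable("test_table")
val numbered = sqlContext.sql(
  """SELECT column1, column2,
    |       ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2) AS rn
    |FROM test_table""".stripMargin)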

Cheers

On Mon, Dec 7, 2015 at 7:56 AM, Sri  wrote:

> Thanks , I found the right function current_timestamp().
>
> different Question:-
> Is there a row_number() function in spark SQL ? Not in Data frame just
> spark SQL?
>
>
> Thanks
> Sri
>
> Sent from my iPhone
>
> On 7 Dec 2015, at 15:49, Ted Yu  wrote:
>
> Does unix_timestamp() satisfy your needs ?
> See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
>
> On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com <
> kali.tumm...@gmail.com> wrote:
>
>> I found a way out.
>>
>> import java.text.SimpleDateFormat
>> import java.util.Date;
>>
>> val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss")
>>
>>  val testsql = sqlContext.sql("select column1,column2,column3,column4,column5,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date())))
>>
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: spark sql current time stamp function ?

2015-12-07 Thread Sri
Thanks , I found the right function current_timestamp().

different Question:-
Is there a row_number() function in spark SQL ? Not in Data frame just spark 
SQL?


Thanks
Sri

Sent from my iPhone

> On 7 Dec 2015, at 15:49, Ted Yu  wrote:
> 
> Does unix_timestamp() satisfy your needs ?
> See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> 
>> On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com 
>>  wrote:
>> I found a way out.
>> 
>> import java.text.SimpleDateFormat
>> import java.util.Date;
>> 
>> val format = new SimpleDateFormat("yyyy-M-dd hh:mm:ss")
>> 
>>  val testsql = sqlContext.sql("select column1,column2,column3,column4,column5,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date())))
>> 
>> 
>> Thanks
>> Sri
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 


Task hung on SocketInputStream.socketRead0 when reading large a mount of data from AWS S3

2015-12-07 Thread Sa Xiao
Hi,


We encounter a problem very similar to this one:

https://www.mail-archive.com/search?l=user@spark.apache.org=subject:%22Spark+task+hangs+infinitely+when+accessing+S3+from+AWS%22=newest=1


When reading a large amount of data from S3, one or several tasks hang. It
doesn't happen every time, but pretty consistently, at least about 1 out of
3 runs.


Spark 1.5

mesos slaves: 40 amazon 3r.xlarge (4 core, 30 GB) machines.

total data read from S3: ~380 GB


*spark config that's not default:*

spark.mesos.coarse = true

--conf spark.sql.shuffle.partitions=300

--conf spark.executor.memory=25G

--conf spark.sql.tungsten.enabled=false


*The thread dump of the hanging task:*

Executor task launch worker-3[1] where

  [1] java.net.SocketInputStream.socketRead0 (native method)

  [2] java.net.SocketInputStream.socketRead (SocketInputStream.java:116)

  [3] java.net.SocketInputStream.read (SocketInputStream.java:170)

  [4] java.net.SocketInputStream.read (SocketInputStream.java:141)

  [5] sun.security.ssl.InputRecord.readFully (InputRecord.java:465)

  [6] sun.security.ssl.InputRecord.read (InputRecord.java:503)

  [7] sun.security.ssl.SSLSocketImpl.readRecord (SSLSocketImpl.java:961)

  [8] sun.security.ssl.SSLSocketImpl.performInitialHandshake
(SSLSocketImpl.java:1,363)

  [9] sun.security.ssl.SSLSocketImpl.startHandshake
(SSLSocketImpl.java:1,391)

  [10] sun.security.ssl.SSLSocketImpl.startHandshake
(SSLSocketImpl.java:1,375)

  [11] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket
(SSLSocketFactory.java:533)

  [12] org.apache.http.conn.ssl.SSLSocketFactory.connectSocket
(SSLSocketFactory.java:401)

  [13]
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection
(DefaultClientConnectionOperator.java:177)

  [14] org.apache.http.impl.conn.ManagedClientConnectionImpl.open
(ManagedClientConnectionImpl.java:304)

  [15] org.apache.http.impl.client.DefaultRequestDirector.tryConnect
(DefaultRequestDirector.java:610)

  [16] org.apache.http.impl.client.DefaultRequestDirector.execute
(DefaultRequestDirector.java:445)

  [17] org.apache.http.impl.client.AbstractHttpClient.doExecute
(AbstractHttpClient.java:863)

  [18] org.apache.http.impl.client.CloseableHttpClient.execute
(CloseableHttpClient.java:82)

  [19] org.apache.http.impl.client.CloseableHttpClient.execute
(CloseableHttpClient.java:57)

  [20] com.amazonaws.http.AmazonHttpClient.executeHelper
(AmazonHttpClient.java:384)

  [21] com.amazonaws.http.AmazonHttpClient.execute
(AmazonHttpClient.java:232)

  [22] com.amazonaws.services.s3.AmazonS3Client.invoke
(AmazonS3Client.java:3,528)

  [23] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata
(AmazonS3Client.java:976)

  [24] com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata
(AmazonS3Client.java:956)

  [25] org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus
(S3AFileSystem.java:892)

  [26] org.apache.hadoop.fs.s3a.S3AFileSystem.open (S3AFileSystem.java:373)

  [27] org.apache.hadoop.fs.FileSystem.open (FileSystem.java:711)

  [28] org.apache.hadoop.mapred.LineRecordReader.
(LineRecordReader.java:93)

  [29] org.apache.hadoop.mapred.TextInputFormat.getRecordReader
(TextInputFormat.java:54)

  [30] org.apache.spark.rdd.HadoopRDD$$anon$1. (HadoopRDD.scala:239)

  [31] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:216)

  [32] org.apache.spark.rdd.HadoopRDD.compute (HadoopRDD.scala:101)

  [33] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)

  [34] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)

  [35] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:38)

  [36] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)

  [37] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)

  [38] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:38)

  [39] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)

  [40] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)

  [41] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:38)

  [42] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)

  [43] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)

  [44] org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:38)

  [45] org.apache.spark.rdd.RDD.computeOrReadCheckpoint (RDD.scala:297)

  [46] org.apache.spark.rdd.RDD.iterator (RDD.scala:264)

  [47] org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:73)

  [48] org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:41)

  [49] org.apache.spark.scheduler.Task.run (Task.scala:88)

  [50] org.apache.spark.executor.Executor$TaskRunner.run
(Executor.scala:214)

  [51] java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1,142)

  [52] java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:617)

  [53] java.lang.Thread.run (Thread.java:745)


*In the mesos-slave where the task is hung:*

*lsof -p 27391 | grep amazon*

java

persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
Hi,
I am a newbie to Spark.
Could somebody guide me on how to persist my Spark RDD results in Hive
using the saveAsTable API?
Would appreciate it if you could provide an example for a Hive external table.

Thanks in advance.


Re: mllib.recommendations.als recommendForAll not ported to ml?

2015-12-07 Thread Nick Pentreath
I can't find a JIRA for this, though there are some related to the existing
MLlib implementation (https://issues.apache.org/jira/browse/SPARK-10802 and
https://issues.apache.org/jira/browse/SPARK-11968) - would be good to port
it over, and in the process also speed it up as per SPARK-11968, and
possibly add ability to recommend for a subset as per SPARK-10802. Perhaps
file a JIRA ticket for this issue, and link the other two?


On Sun, Dec 6, 2015 at 9:59 PM, guillaume 
wrote:

> I have experienced very low performance with the ALSModel.transform method
> when feeding it even a small cartesian product of users x items.
>
> The former mllib implementation has a recommendForAll method to return topn
> items per users in an efficient way (using the blockify method to
> distribute
> parts of users and items factors).
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L271
>
> I could revert to mlib, but the ALS benefits nice optimization in ml
> (https://issues.apache.org/jira/browse/SPARK-3541). Do you guys consider
> to
> port the recommendForAll to ml?
>
> Thanks in advance!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-recommendations-als-recommendForAll-not-ported-to-ml-tp25609.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Available options for Spark REST API

2015-12-07 Thread sunil m
Dear Spark experts!

I would like to know the best practices used for invoking spark jobs via
REST API.

We tried out the hidden REST API mentioned here:
http://arturmkrtchyan.com/apache-spark-hidden-rest-api

It works fine for Spark standalone mode but does not seem to work
when I specify
 "spark.master" : "YARN-CLUSTER" or "mesos://..."
Did anyone encounter a similar problem?

Has anybody used:
https://github.com/spark-jobserver/spark-jobserver

If yes, please share your experience. Does it work well with both Scala and
Java classes? I saw only a Scala example. Are there any known disadvantages
of using it?

Is there anything better available, which is used in Production environment?

Any advise is appreciated. We are using Spark 1.5.1.

Thanks in advance.

Warm regards,
Sunil M.


Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread UMESH CHAUDHARY
Currently saveAsTable will create a Hive internal (managed) table by default, see here



If you want to save it as an external table, use saveAsParquetFile and create
an external Hive table on that parquet file.
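Roughly, and assuming a HiveContext (the path, database, and column list below are placeholders that must match your DataFrame's schema):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Write the DataFrame out as parquet first (df is the DataFrame built earlier)...
df.saveAsParquetFile("/data/output/mytable")

// ...then point an external Hive table at that location.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (id INT, name STRING)
    |STORED AS PARQUET
    |LOCATION '/data/output/mytable'""".stripMargin)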

On Mon, Dec 7, 2015 at 3:13 PM, Fengdong Yu 
wrote:

> If your RDD is JSON format, that’s easy.
>
> val df = sqlContext.read.json(rdd)
> df.saveAsTable(“your_table_name")
>
>
>
> > On Dec 7, 2015, at 5:28 PM, Divya Gehlot 
> wrote:
> >
> > Hi,
> > I am new bee to Spark.
> > Could somebody guide me how can I persist my spark RDD results in Hive
> using SaveAsTable API.
> > Would  appreciate if you could  provide the example for hive external
> table.
> >
> > Thanks in advance.
> >
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to config the log in Spark

2015-12-07 Thread Guillermo Ortiz
I can't manage to activate the logs for my classes. I'm using CDH 5.4 with
Spark 1.3.0.

I have a class in Scala with some log.debug, I create a class to log:

package example.spark
import org.apache.log4j.Logger
object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

And I use the log inside of a map function which it's executed in the
executors. I'm looking for the logs in the executors (YARN).


My log4j.properties is
log4j.appender.myConsoleAppender=org.apache.log4j.ConsoleAppender
log4j.appender.myConsoleAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.myConsoleAppender.layout.ConversionPattern=%d [%t] %-5p %c -
%m%n

log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.File=/opt/centralLog/log/spark.log
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n

log4j.rootLogger=INFO, myConsoleAppender, RollingAppender
log4j.logger.example.spark=DEBUG, RollingAppender, myConsoleAppender

And I created a script to execute Spark:
#!/bin/bash

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_CONF_DIR=/opt/centralLogs/conf
SPARK_CLASSPATH="file:/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf/"
for lib in `ls /opt/centralLogs/lib/*.jar`
do
if [ -z "$SPARK_CLASSPATH" ]; then
SPARK_CLASSPATH=$lib
else
SPARK_CLASSPATH=$SPARK_CLASSPATH,$lib
fi
done

spark-submit --name "CentralLog" --master yarn-client --conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--class example.spark.CentralLog --jars $SPARK_CLASSPATH,
file:/opt/centralLogs/conf/log4j.properties  --executor-memory 2g
/opt/centralLogs/libProject/paas.jar X kafka-topic3 X,X,X

I added file:/opt/centralLogs/conf/log4j.properties, but it's not
working; I can't see the debug logs.


Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
If your RDD is JSON format, that’s easy.

val df = sqlContext.read.json(rdd)
df.saveAsTable("your_table_name")



> On Dec 7, 2015, at 5:28 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am new bee to Spark.
> Could somebody guide me how can I persist my spark RDD results in Hive using 
> SaveAsTable API. 
> Would  appreciate if you could  provide the example for hive external table.
> 
> Thanks in advance.
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
I suppose your output data is ORC and you want to save it to the Hive database "test", 
with external table name "testTable":



import scala.collection.immutable

sqlContext.createExternalTable("test.testTable",
  "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata"))


> On Dec 7, 2015, at 5:28 PM, Divya Gehlot  wrote:
> 
> Hi,
> I am new bee to Spark.
> Could somebody guide me how can I persist my spark RDD results in Hive using 
> SaveAsTable API. 
> Would  appreciate if you could  provide the example for hive external table.
> 
> Thanks in advance.
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
My input format is CSV and I am using Spark 1.3 (HDP 2.2 comes with Spark
1.3, so ...).
I am using spark-csv to read my CSV file and the DataFrame API to process it.
I followed these steps
and was successfully able to read the ORC file.
But as these are temp tables, the data is not stored in Hive.
I am trying to figure out how to do that.

Thanks in advance
Divya


On 7 December 2015 at 17:43, Fengdong Yu  wrote:

> If your RDD is JSON format, that’s easy.
>
> val df = sqlContext.read.json(rdd)
> df.saveAsTable(“your_table_name")
>
>
>
> > On Dec 7, 2015, at 5:28 PM, Divya Gehlot 
> wrote:
> >
> > Hi,
> > I am new bee to Spark.
> > Could somebody guide me how can I persist my spark RDD results in Hive
> using SaveAsTable API.
> > Would  appreciate if you could  provide the example for hive external
> table.
> >
> > Thanks in advance.
> >
> >
>
>


How to use all available memory per worker?

2015-12-07 Thread George Sigletos
Hello,

In a 2-worker cluster (6 cores/30 GB RAM on one node, 24 cores/60 GB RAM on the other),

how can I tell my executors to use all 90 GB of available memory?

In the configuration you can set e.g. "spark.cores.max" to 30 (24+6),

but you cannot set "spark.executor.memory" to 90g (30+60).

Kind regards,
George


Re: how create hbase connect?

2015-12-07 Thread censj
ok! I try it. 
> 在 2015年12月7日,20:11,ayan guha  写道:
> 
> Kindly take a look https://github.com/nerdammer/spark-hbase-connector 
>  
> 
> On Mon, Dec 7, 2015 at 10:56 PM, censj  > wrote:
> hi all,
>   I want to update row on base. how to create connecting base on Rdd?
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Spark and Kafka Integration

2015-12-07 Thread Prashant Bhardwaj
Hi

Some Background:
We have a Kafka cluster with ~45 topics. Some of the topics contain logs in
JSON format and some in PSV (pipe-separated value) format. Now I want to
consume these logs using Spark Streaming and store them in Parquet format
in HDFS.

Now my questions are:
1. Can we create an InputDStream per topic in the same application?

Since the schema of the logs might differ for every topic, I want to process
some topics in a different way, and I want to store the logs in a different
output directory based on the topic name.

2. Also, how do I partition the logs based on timestamp?

-- 
Regards
Prashant


Re: Scala 2.11 and Akka 2.4.0

2015-12-07 Thread RodrigoB
Hi Manas,

Thanks for the reply. I've done that. The problem lies with Spark + akka
2.4.0 build. Seems the maven shader plugin is altering some class files and
breaking the Akka runtime.

Seems the Spark build on Scala 2.11 using SBT is broken. I'm getting build
errors using sbt due to the issues found in the below thread in July of this
year.
https://mail-archives.apache.org/mod_mbox/spark-dev/201507.mbox/%3CCA+3qhFSJGmZToGmBU1=ivy7kr6eb7k8t6dpz+ibkstihryw...@mail.gmail.com%3E

So I went back to maven and decided to risk building Spark on akka 2.3.11
and forcing the akka 2.4.0 jars onto the server's classpath. I consider this a
temporary solution while I cannot get a proper akka 2.4.0 runnable build.

If anyone has managed to get it working, please let me know.

tnks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-2-11-and-Akka-2-4-0-tp25535p25618.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how create hbase connect?

2015-12-07 Thread censj
hi all,
  I want to update rows in HBase. How do I create an HBase connection from an RDD?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how create hbase connect?

2015-12-07 Thread ayan guha
Kindly take a look https://github.com/nerdammer/spark-hbase-connector
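If you'd rather not pull in a connector, a bare-bones sketch with the plain HBase 1.0 client API (the table, column family and qualifier names are made up, and HBase config wiring is not shown):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// (rowKey, value) pairs to write; in practice this would be your real RDD.
val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

rdd.foreachPartition { partition =>
  // Create the connection on the executor, once per partition (it is not serializable).
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("my_table"))
  try {
    partition.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
  } finally {
    table.close()
    connection.close()
  }
}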

On Mon, Dec 7, 2015 at 10:56 PM, censj  wrote:

> hi all,
>   I want to update row on base. how to create connecting base on Rdd?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Re: Available options for Spark REST API

2015-12-07 Thread Василец Дмитрий
hello
if i correct understand - sparkui with rest api - for monitoring
spark-jobserver - for submit job.


On Mon, Dec 7, 2015 at 9:42 AM, sunil m <260885smanik...@gmail.com> wrote:

> Dear Spark experts!
>
> I would like to know the best practices used for invoking spark jobs via
> REST API.
>
> We tried out the hidden REST API mentioned here:
> http://arturmkrtchyan.com/apache-spark-hidden-rest-api
>
> It works fine for spark standalone mode but does not seem to be working
> when i specify
>  "spark.master" : "YARN-CLUSTER" or "mesos://..."
> did anyone encounter a similar problem?
>
> Has anybody used:
> https://github.com/spark-jobserver/spark-jobserver
>
> If yes please share your experience. Does it work good with both Scala and
> Java classes. I saw only scala example. Are there any known disadvantages
> of using it?
>
> Is there anything better available, which is used in Production
> environment?
>
> Any advise is appreciated. We are using Spark 1.5.1.
>
> Thanks in advance.
>
> Warm regards,
> Sunil M.
>


RE: Spark and Kafka Integration

2015-12-07 Thread Singh, Abhijeet
For Q2: the order of the logs within each Kafka partition is guaranteed, but there 
is no such thing as a global order across partitions.
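For Q1 and the output layout, a very rough sketch: one direct stream per topic, writing each batch under a topic/timestamp directory. The broker list, topic names and paths are placeholders; the JSON/PSV parsing and the Parquet write are left out, and text output is used for brevity.

import java.text.SimpleDateFormat
import java.util.Date
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(60))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One direct stream per topic, so each topic can have its own parsing logic.
Seq("json_topic", "psv_topic").foreach { topic =>
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))

  stream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // Partition the output by the batch time; a timestamp field parsed out of
      // the record could be used here instead.
      val hour = new SimpleDateFormat("yyyy/MM/dd/HH").format(new Date(time.milliseconds))
      rdd.values.saveAsTextFile(s"hdfs:///logs/$topic/$hour")
    }
  }
}

ssc.start()
ssc.awaitTermination()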

From: Prashant Bhardwaj [mailto:prashant2006s...@gmail.com]
Sent: Monday, December 07, 2015 5:46 PM
To: user@spark.apache.org
Subject: Spark and Kafka Integration

Hi

Some Background:
We have a Kafka cluster with ~45 topics. Some of topics contains logs in Json 
format and some in PSV(pipe separated value) format. Now I want to consume 
these logs using Spark streaming and store them in Parquet format in HDFS.

Now my question is:
1. Can we create a InputDStream per topic in the same application?

 Since for every topic Schema of logs might differ, so want to process some 
topics in different way.
I want to store logs in different output directory based on the topic name.

2. Also how to partition logs based on timestamp?

--
Regards
Prashant


Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
refer here: 
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html


of section:
Example 4-27. Python custom partitioner




> On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote:
> 
> I'm not a python expert, so I'm wondering if anybody has a working example of 
> a partitioner for the "partitionFunc" argument (default "portable_hash") to 
> rdd.partitionBy()?
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Best way to save key-value pair rdd ?

2015-12-07 Thread Anup Sawant
Hello,

What would be the best way to save a key-value pair RDD so that I don't have
to convert the saved records back into tuples when reading the RDD back into
Spark?

-- 

Best,
Anup
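One option, sketched here assuming simple key/value types and a made-up path, is a SequenceFile, which round-trips the pairs without any manual parsing; saveAsObjectFile/objectFile works similarly for arbitrary serializable types.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Keys and values are converted to Hadoop Writables automatically for common types.
pairs.saveAsSequenceFile("hdfs:///tmp/pairs_seq")

// Read it back directly as an RDD[(String, Int)], no string splitting needed.
val restored = sc.sequenceFile[String, Int]("hdfs:///tmp/pairs_seq")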


Kryo Serialization in Spark

2015-12-07 Thread prasad223
Hi All,

I'm unable to use the Kryo serializer in my Spark program.
I'm loading a graph from an edge-list file using GraphLoader and performing a
BFS using the Pregel API, but I get the error below when I run it.
Can anybody tell me the right way to serialize a class in Spark, and
which functions need to be implemented?


class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[GraphBFS])
kryo.register(classOf[Config])
kryo.register(classOf[Iterator[(Long, Double)]])
  }
}


class GraphBFS {

def vprog(id: VertexId, attr: Double, msg: Double): Double =
math.min(attr,msg) 

def sendMessage(triplet: EdgeTriplet[Double, Int]) :
Iterator[(VertexId, Double)] = {
var iter:Iterator[(VertexId, Double)] = Iterator.empty
val isSrcMarked = triplet.srcAttr != Double.PositiveInfinity
val isDstMarked = triplet.dstAttr != Double.PositiveInfinity
if(!(isSrcMarked && isDstMarked)){
if(isSrcMarked){
iter =
Iterator((triplet.dstId,triplet.srcAttr+1))
}else{
iter =
Iterator((triplet.srcId,triplet.dstAttr+1))
}
}
iter
}

def reduceMessage(a: Double, b: Double) = math.min(a,b)

 def main() {
..
  val bfs = initialGraph.pregel(initialMessage, maxIterations,
activeEdgeDirection)(vprog, sendMessage, reduceMessage)
.

 }
}



15/12/07 21:52:49 INFO BlockManager: Removing RDD 8
15/12/07 21:52:49 INFO BlockManager: Removing RDD 2
Exception in thread "main" org.apache.spark.SparkException: Task not
serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
at
org.apache.spark.graphx.impl.VertexRDDImpl.mapVertexPartitions(VertexRDDImpl.scala:96)
at
org.apache.spark.graphx.impl.GraphImpl.mapVertices(GraphImpl.scala:132)
at org.apache.spark.graphx.Pregel$.apply(Pregel.scala:122)
at org.apache.spark.graphx.GraphOps.pregel(GraphOps.scala:362)
at GraphBFS.main(GraphBFS.scala:241)
at run$.main(GraphBFS.scala:268)
at run.main(GraphBFS.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: GraphBFS
Serialization stack:
- object not serializable (class: GraphBFS, value:
GraphBFS@575c3e9b)
- field (class: GraphBFS$$anonfun$17, name: $outer, type: class
GraphBFS)
- object (class GraphBFS$$anonfun$17, )
- field (class: org.apache.spark.graphx.Pregel$$anonfun$1, name:
vprog$1, type: interface scala.Function3)
- object (class org.apache.spark.graphx.Pregel$$anonfun$1,
)
- field (class: org.apache.spark.graphx.impl.GraphImpl$$anonfun$5,
name: f$1, type: interface scala.Function2)
- object (class org.apache.spark.graphx.impl.GraphImpl$$anonfun$5,
)
- field (class:
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$1, name: f$1, type:
interface scala.Function1)
- object (class
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$1, )

Thanks,
Prasad
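For what it's worth, the stack trace says the closures passed to pregel capture the enclosing GraphBFS instance ("object not serializable ... GraphBFS"), and registering a class with Kryo does not by itself make it serializable for closure shipping. A minimal sketch of one common fix, assuming the rest of the class stays as posted (alternatively the three functions could be moved into a standalone object):

import org.apache.spark.SparkConf

// vprog/sendMessage/reduceMessage are methods, so the Pregel closures capture `this`;
// making the enclosing class Serializable lets those closures be shipped to executors.
class GraphBFS extends Serializable {
  // ... vprog, sendMessage, reduceMessage and main as before ...
}

// Kryo is enabled on the SparkConf and only controls how data is serialized;
// closures are still Java-serialized, so the Serializable requirement remains.
val conf = new SparkConf()
  .setAppName("GraphBFS")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")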




-
Thanks,
Prasad
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-Serialization-in-Spark-tp25628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin,
Maybe you didn't read my post in which I stated that Spark works on top of 
HDFS. What Jia wants is to have Spark interacts with a C++ process to read and 
write data.
I've never heard about Jia's use case in Spark. If you know one, please share 
that with me.
Thanks 


On Monday, December 7, 2015 1:57 PM, Robin East  
wrote:
 

 Annabel
Spark works very well with data stored in HDFS but is certainly not tied to it. 
Have a look at the wide variety of connectors to things like Cassandra, HBase, 
etc.
Robin

Sent from my iPhone
On 7 Dec 2015, at 18:50, Annabel Melongo  wrote:


Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia  wrote:
 

 Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,Jia 

On Dec 7, 2015, at 12:26 PM, Annabel Melongo  wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and used the 
data created by said application to do manipulation within Spark. 


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storages. However, because our data is also held in memory, 
I suspect that connecting to Spark directly may be more efficient in 
performance. But definitely I need to look at Tachyon more carefully, in case it 
has a very efficient C++ binding mechanism.

Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help, I see some sample c++ 
bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to user mailing list.
Yes, we have a distributed C++ application, that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy analytics 
on those data. But we need high performance, that’s why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









   



   


  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Hi Annabel

I certainly did read your post. My point was that Spark can read from HDFS but 
is in no way tied to that storage layer . A very interesting use case that 
sounds very similar to Jia's (as mentioned by another poster) is contained in 
https://issues.apache.org/jira/browse/SPARK-10399. The comments section 
provides a specific example of processing very large images using a 
pre-existing c++ library.

Robin

Sent from my iPhone

> On 7 Dec 2015, at 18:50, Annabel Melongo  
> wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
> What you're requesting, reading and writing to a C++ process, is not part of 
> that requirement.
> 
> 
> 
> 
> 
> On Monday, December 7, 2015 1:42 PM, Jia  wrote:
> 
> 
> Thanks, Annabel, but I may need to clarify that I have no intention to write 
> and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
> data to a C++ process with zero copy.
> 
> Best Regards,
> Jia
>  
> 
> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo  
>> wrote:
>> 
>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>> and R.
>> 
>> The best way to achieve this is to run your application in C++ and used the 
>> data created by said application to do manipulation within Spark.
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:15 PM, Jia  wrote:
>> 
>> 
>> Thanks, Dewful!
>> 
>> My impression is that Tachyon is a very nice in-memory file system that can 
>> connect to multiple storages.
>> However, because our data is also hold in memory, I suspect that connecting 
>> to Spark directly may be more efficient in performance.
>> But definitely I need to look at Tachyon more carefully, in case it has a 
>> very efficient C++ binding mechanism.
>> 
>> Best Regards,
>> Jia
>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful  wrote:
>>> 
>>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>>> bindings, not sure how much of the current functionality they support...
>>> Hi, Robin, 
>>> Thanks for your reply and thanks for copying my question to user mailing 
>>> list.
>>> Yes, we have a distributed C++ application, that will store data on each 
>>> node in the cluster, and we hope to leverage Spark to do more fancy 
>>> analytics on those data. But we need high performance, that’s why we want 
>>> shared memory.
>>> Suggestions will be highly appreciated!
>>> 
>>> Best Regards,
>>> Jia
>>> 
 On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
 
 -dev, +user (this is not a question about development of Spark itself so 
 you’ll get more answers in the user mailing list)
 
 First up let me say that I don’t really know how this could be done - I’m 
 sure it would be possible with enough tinkering but it’s not clear what 
 you are trying to achieve. Spark is a distributed processing system, it 
 has multiple JVMs running on different machines that each run a small part 
 of the overall processing. Unless you have some sort of idea to have 
 multiple C++ processes collocated with the distributed JVMs using named 
 memory mapped files doesn’t make architectural sense. 
 ---
 Robin East
 Spark GraphX in Action Michael Malak and Robin East
 Manning Publications Co.
 http://www.manning.com/books/spark-graphx-in-action
 
 
 
 
 
> On 6 Dec 2015, at 20:43, Jia  wrote:
> 
> Dears, for one project, I need to implement something so Spark can read 
> data from a C++ process. 
> To provide high performance, I really hope to implement this through 
> shared memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to 
> do this, but I wonder whether there is any existing efforts or more 
> efficient approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 


Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Hi,

I am running a Spark job on YARN in cluster mode on a secured cluster. I am
trying to run Spark on HBase using Phoenix, but the Spark executors are unable
to get an HBase connection through Phoenix. I run the kinit command to get the
ticket before starting the job, and the keytab file and principal are
correctly specified in the connection URL. But the Spark job on each node still
throws the error below:

15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)

I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark on
Hbase(without phoenix) successfully in yarn-client mode as mentioned in
this link:

https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos

Also, I found that there is a known issue for yarn-cluster mode for Spark
1.3.1 version:

https://issues.apache.org/jira/browse/SPARK-6918

Has anybody been successful in running Spark on hbase using Phoenix in yarn
cluster or client mode?

Thanks,
Akhilesh Pathodia


Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin,
To prove my point, this is an unresolved issue still in the implementation 
stage. 


On Monday, December 7, 2015 2:49 PM, Robin East  
wrote:
 

 Hi Annabel
I certainly did read your post. My point was that Spark can read from HDFS but 
is in no way tied to that storage layer . A very interesting use case that 
sounds very similar to Jia's (as mentioned by another poster) is contained in 
https://issues.apache.org/jira/browse/SPARK-10399. The comments section 
provides a specific example of processing very large images using a 
pre-existing c++ library.
Robin
Sent from my iPhone
On 7 Dec 2015, at 18:50, Annabel Melongo  
wrote:


Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia  wrote:
 

 Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,Jia 

On Dec 7, 2015, at 12:26 PM, Annabel Melongo  wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and used the 
data created by said application to do manipulation within Spark. 


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storages. However, because our data is also held in memory, 
I suspect that connecting to Spark directly may be more efficient in 
performance. But definitely I need to look at Tachyon more carefully, in case it 
has a very efficient C++ binding mechanism.

Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help, I see some sample c++ 
bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to user mailing list.
Yes, we have a distributed C++ application, that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy analytics 
on those data. But we need high performance, that’s why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









   



   


  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Nick Pentreath
SparkNet may have some interesting ideas - https://github.com/amplab/SparkNet. 
Haven't had a deep look at it yet but it seems to have some functionality 
allowing caffe to read data from RDDs, though I'm not certain the memory is 
shared.



—
Sent from Mailbox

On Mon, Dec 7, 2015 at 9:55 PM, Robin East  wrote:

> Hi Annabel
> I certainly did read your post. My point was that Spark can read from HDFS 
> but is in no way tied to that storage layer . A very interesting use case 
> that sounds very similar to Jia's (as mentioned by another poster) is 
> contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments 
> section provides a specific example of processing very large images using a 
> pre-existing c++ library.
> Robin
> Sent from my iPhone
>> On 7 Dec 2015, at 18:50, Annabel Melongo  
>> wrote:
>> 
>> Jia,
>> 
>> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
>> What you're requesting, reading and writing to a C++ process, is not part of 
>> that requirement.
>> 
>> 
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:42 PM, Jia  wrote:
>> 
>> 
>> Thanks, Annabel, but I may need to clarify that I have no intention to write 
>> and run Spark UDF in C++, I'm just wondering whether Spark can read and 
>> write data to a C++ process with zero copy.
>> 
>> Best Regards,
>> Jia
>>  
>> 
>> 
>>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo  
>>> wrote:
>>> 
>>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>>> and R.
>>> 
>>> The best way to achieve this is to run your application in C++ and used the 
>>> data created by said application to do manipulation within Spark.
>>> 
>>> 
>>> 
>>> On Monday, December 7, 2015 1:15 PM, Jia  wrote:
>>> 
>>> 
>>> Thanks, Dewful!
>>> 
>>> My impression is that Tachyon is a very nice in-memory file system that can 
>>> connect to multiple storages.
>>> However, because our data is also hold in memory, I suspect that connecting 
>>> to Spark directly may be more efficient in performance.
>>> But definitely I need to look at Tachyon more carefully, in case it has a 
>>> very efficient C++ binding mechanism.
>>> 
>>> Best Regards,
>>> Jia
>>> 
 On Dec 7, 2015, at 11:46 AM, Dewful  wrote:
 
 Maybe looking into something like Tachyon would help, I see some sample 
 c++ bindings, not sure how much of the current functionality they 
 support...
 Hi, Robin, 
 Thanks for your reply and thanks for copying my question to user mailing 
 list.
 Yes, we have a distributed C++ application, that will store data on each 
 node in the cluster, and we hope to leverage Spark to do more fancy 
 analytics on those data. But we need high performance, that’s why we want 
 shared memory.
 Suggestions will be highly appreciated!
 
 Best Regards,
 Jia
 
> On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
> 
> -dev, +user (this is not a question about development of Spark itself so 
> you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m 
> sure it would be possible with enough tinkering but it’s not clear what 
> you are trying to achieve. Spark is a distributed processing system, it 
> has multiple JVMs running on different machines that each run a small 
> part of the overall processing. Unless you have some sort of idea to have 
> multiple C++ processes collocated with the distributed JVMs using named 
> memory mapped files doesn’t make architectural sense. 
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 6 Dec 2015, at 20:43, Jia  wrote:
>> 
>> Dears, for one project, I need to implement something so Spark can read 
>> data from a C++ process. 
>> To provide high performance, I really hope to implement this through 
>> shared memory between the C++ process and Java JVM process.
>> It seems it may be possible to use named memory mapped files and JNI to 
>> do this, but I wonder whether there is any existing efforts or more 
>> efficient approach to do this?
>> Thank you very much!
>> 
>> Best Regards,
>> Jia
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 
>> 

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
These specific JIRAs don't exist yet, but watch SPARK- as we'll make
sure everything shows up there.

On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers  wrote:

> that's good news about plans to avoid unnecessary conversions, and allow
> access to more efficient internal types. could you point me to the jiras,
> if they exist already? i just tried to find them but had little luck.
> best, koert
>
> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust 
> wrote:
>
>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>>
>>> hello all,
>>> DataFrame internally uses a different encoding for values then what the
>>> user sees. i assume the same is true for Dataset?
>>>
>>
>> This is true.  We encode objects in the tungsten binary format using code
>> generated serializers.
>>
>>
>>> if so, does this means that a function like Dataset.map needs to convert
>>> all the values twice (once to user format and then back to internal
>>> format)? or is it perhaps possible to write scala functions that operate on
>>> internal formats and avoid this?
>>>
>>
>> Currently this is true, but there are plans to avoid unnecessary
>> conversions (back to back maps / filters, etc) and only convert when we
>> need to (shuffles, sorting, hashing, SQL operations).
>>
>> There are also plans to allow you to directly access some of the more
>> efficient internal types by using them as fields in your classes (mutable
>> UTF8 String instead of the immutable java.lang.String).
>>
>>
>
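As a concrete illustration of the conversions being discussed, here is a small
sketch against the Spark 1.6 Dataset API with a made-up Person class: each
lambda below receives the decoded user type, so today every step decodes from
and re-encodes to the tungsten binary format, and the planned optimization is
to skip that for back-to-back maps and filters.

case class Person(name: String, age: Int)

import sqlContext.implicits._

val people = Seq(Person("ann", 30), Person("bob", 17)).toDS()

// Each closure sees a Person / String, not the internal binary row.
val names = people
  .filter(_.age >= 18)        // decode -> predicate -> re-encode
  .map(_.name.toUpperCase)    // decode -> function  -> encode
names.show()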


Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar 
wrote:
>
> On a similar note, what is involved in getting native support for some
> user defined functions, so that they are as efficient as native Spark SQL
> expressions? I had one particular one - an arraySum (element wise sum) that
> is heavily used in a lot of risk analytics.
>

To get the best performance you have to implement a catalyst expression
with codegen.  This however is necessarily an internal (unstable) interface
since we are constantly making breaking changes to improve performance.  So
if its a common enough operation we should bake it into the engine.

That said, the code generated encoders that we created for datasets should
lower the cost of calling into external functions as we start using them in
more and more places (i.e. https://issues.apache.org/jira/browse/SPARK-11593
)
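For comparison, a plain (non-codegen) UDF version of the element-wise arraySum
mentioned above is the usual stopgap; it goes through the ordinary
external-function path and so pays the conversion cost described here. The
column names and sample data below are made up for illustration.

import org.apache.spark.sql.functions.{col, udf}
import sqlContext.implicits._

// Element-wise sum of two array<double> columns via a regular Scala UDF.
val arraySum = udf { (a: Seq[Double], b: Seq[Double]) =>
  a.zip(b).map { case (x, y) => x + y }
}

val df = Seq((Seq(1.0, 2.0, 3.0), Seq(0.5, 0.5, 0.5))).toDF("pnl_a", "pnl_b")
df.withColumn("pnl_total", arraySum(col("pnl_a"), col("pnl_b"))).show()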


Re: Dataset and lambas

2015-12-07 Thread Koert Kuipers
great thanks

On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust 
wrote:

> These specific JIRAs don't exist yet, but watch SPARK- as we'll make
> sure everything shows up there.
>
> On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers  wrote:
>
>> that's good news about plans to avoid unnecessary conversions, and allow
>> access to more efficient internal types. could you point me to the jiras,
>> if they exist already? i just tried to find them but had little luck.
>> best, koert
>>
>> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust 
>> wrote:
>>
>>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>>>
 hello all,
 DataFrame internally uses a different encoding for values then what the
 user sees. i assume the same is true for Dataset?

>>>
>>> This is true.  We encode objects in the tungsten binary format using
>>> code generated serializers.
>>>
>>>
 if so, does this means that a function like Dataset.map needs to
 convert all the values twice (once to user format and then back to internal
 format)? or is it perhaps possible to write scala functions that operate on
 internal formats and avoid this?

>>>
>>> Currently this is true, but there are plans to avoid unnecessary
>>> conversions (back to back maps / filters, etc) and only convert when we
>>> need to (shuffles, sorting, hashing, SQL operations).
>>>
>>> There are also plans to allow you to directly access some of the more
>>> efficient internal types by using them as fields in your classes (mutable
>>> UTF8 String instead of the immutable java.lang.String).
>>>
>>>
>>
>


Re: Dataset and lambas

2015-12-07 Thread Deenar Toraskar
Michael

Having VectorUnionSumUDAF implemented would be great. This is quite
generic, it does element-wise sum of arrays and maps
https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java
and would be massive benefit for a lot of risk analytics.

In general most of the brickhouse UDFs are quite useful
https://github.com/klout/brickhouse. Happy to help out.

On another note what would be involved to have arrays backed by a sparse
Array (I am assuming the current implementation is dense), sort of native
support for http://spark.apache.org/docs/latest/mllib-data-types.html

Regards
Deenar



Regards
Deenar

On 7 December 2015 at 20:21, Michael Armbrust 
wrote:

> On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar  > wrote:
>>
>> On a similar note, what is involved in getting native support for some
>> user defined functions, so that they are as efficient as native Spark SQL
>> expressions? I had one particular one - an arraySum (element wise sum) that
>> is heavily used in a lot of risk analytics.
>>
>
> To get the best performance you have to implement a catalyst expression
> with codegen.  This however is necessarily an internal (unstable) interface
> since we are constantly making breaking changes to improve performance.  So
> if its a common enough operation we should bake it into the engine.
>
> That said, the code generated encoders that we created for datasets should
> lower the cost of calling into external functions as we start using them in
> more and more places (i.e.
> https://issues.apache.org/jira/browse/SPARK-11593)
>


Re: Implementing fail-fast upon critical spark streaming tasks errors

2015-12-07 Thread Cody Koeninger
Personally, for jobs that I care about I store offsets in transactional
storage rather than checkpoints, which eliminates that problem (just
enforce whatever constraints you want when storing offsets).
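A rough sketch of that pattern, assuming a direct stream created elsewhere
(directStream) and a hypothetical JDBC-backed offsets table; the point is only
that the offset update and the output write share one transaction, so a failed
batch never advances the stored offsets.

import java.sql.DriverManager

import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... compute results from rdd and stage them for writing ...

  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/streaming")
  conn.setAutoCommit(false)
  try {
    // Write results and the new offsets in the same transaction.
    val stmt = conn.prepareStatement(
      "update kafka_offsets set until_offset = ? where topic = ? and partition = ?")
    offsetRanges.foreach { o =>
      stmt.setLong(1, o.untilOffset)
      stmt.setString(2, o.topic)
      stmt.setInt(3, o.partition)
      stmt.executeUpdate()
    }
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}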

Regarding the question of communication of errors back to the
streamingListener, there is an onReceiverError callback.  Direct stream
isn't a receiver, and I'm not sure it'd be appropriate to try to change the
direct stream to use that as a means of communication.  Maybe TD can chime
in if you get his attention.


On Sun, Dec 6, 2015 at 9:11 AM, yam  wrote:

> When a spark streaming task is failed (after exceeding
> spark.task.maxFailures), the related batch job is considered failed and the
> driver continues to the next batch in the pipeline after updating
> checkpoint
> to the next checkpoint positions (the new offsets when using Kafka direct
> streaming).
>
> I'm looking for a fail-fast implementation where one or more (or all) tasks
> are failed (a critical error case happens and there's no point in
> progressing to the next batch in line and also un-processed data should not
> be lost) - identify this condition and exit application before committing
> updated checkpoint.
>
> Any ideas? or more specifically:
>
> 1. Is there a way for the driver program to get notified upon a job failure
> (which tasks have failed / rdd metadata) before updating checkpoint.
> StreamingContext has a addStreamingListener method but its onBatchCompleted
> event has no indication on batch failure (only completion).
>
> 2. Stopping and exiting the application. when running in yarn-client mode,
> calling streamingContext.stop halts the process and does not exit.
>
> Thanks!
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-fail-fast-upon-critical-spark-streaming-tasks-errors-tp25606.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-07 Thread Cody Koeninger
Just to be clear, spark checkpoints have nothing to do with zookeeper,
they're stored in the filesystem you specify.
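For reference, a minimal sketch of checkpoint-based recovery with the direct
stream: the checkpoint directory is just a filesystem path (HDFS, S3, local),
and StreamingContext.getOrCreate restores the saved offsets from it on restart.
Broker, topic and path names below are made up.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
  stream.foreachRDD(rdd => println(rdd.count()))  // replace with real processing
  ssc
}

// Rebuilds the context (including stored offsets) from the checkpoint if one
// exists; otherwise builds it fresh with createContext().
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()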

On Sun, Dec 6, 2015 at 1:25 AM, manasdebashiskar 
wrote:

> When you enable check pointing your offsets get written in zookeeper. If
> you
> program dies or shutdowns and later restarted kafkadirectstream api knows
> where to start by looking at those offsets from zookeeper.
>
> This is as easy as it gets.
> However if you are planning to re-use the same checkpoint folder among
> different spark version that is currently not supported.
> In that case you might want to go for writing the offset and topic in your
> favorite database. Assuming that DB is high available you can later retried
> the previously worked offset and start from there.
>
> Take a look at the blog post of cody.(the guy who wrote kafkadirectstream)
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/streaming-KafkaUtils-createDirectStream-how-to-start-streming-from-checkpoints-tp25461p25597.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread ayan guha
One more thing I feel would help maintainability: create a DB view
and then use the view in Spark. This will avoid burying complicated SQL
queries within application code.
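A hedged sketch of that suggestion: define the view on the SQL Server side and
point the JDBC source's dbtable option at it. The view name, URL and credentials
below are placeholders.

// The view itself would be created once on the SQL Server side, e.g.
//   CREATE VIEW dbo.v_documents AS
//     SELECT docid, title, docText FROM dbo.document WHERE docid BETWEEN 10 AND 1000
val docsDF = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlserver://dbserver;databaseName=docs;user=appUser;password=secret",
  "dbtable" -> "dbo.v_documents"
)).load()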
On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)" 
wrote:

> This is a very helpful article. Thanks for the help.
>
>
>
> Ningjun
>
>
>
> *From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
> *Sent:* Monday, December 07, 2015 12:42 PM
> *To:* Wang, Ningjun (LNG-NPV)
> *Cc:* user@spark.apache.org
> *Subject:* Re: How to create dataframe from SQL Server SQL query
>
>
>
> Hi Ningjun,
>
>
>
> Haven't done this myself, saw your question and was curious about the
> answer and found this article which you might find useful:
>
>
> http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
>
>
>
> According this article, you can pass in your SQL statement in the
> "dbtable" mapping, ie, something like:
>
>
>
> val jdbcDF = sqlContext.read.format("jdbc")
>
> .options(
>
> Map("url" -> "jdbc:postgresql:dbserver",
>
> "dbtable" -> "(select docid, title, docText from
> dbo.document where docid between 10 and 1000)"
>
> )).load
>
>
>
> -sujit
>
>
>
> On Mon, Dec 7, 2015 at 8:26 AM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
> How can I create a RDD from a SQL query against SQLServer database? Here
> is the example of dataframe
>
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
>
>
>
>
>
> *val* jdbcDF *=* sqlContext.read.format("jdbc").options(
>
>   *Map*("url" -> "jdbc:postgresql:dbserver",
>
>   "dbtable" -> "schema.tablename")).load()
>
>
>
> This code create dataframe from a table. How can I create dataframe from a
> query, e.g. “select docid, title, docText from dbo.document where docid
> between 10 and 1000”?
>
>
>
> Ningjun
>
>
>
>
>


Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Davies Liu
Could you reproduce this problem in 1.5 or 1.6?

On Sun, Dec 6, 2015 at 12:29 AM, YaoPau  wrote:
> If anyone runs into the same issue, I found a workaround:
>
 df.where('state_code = "NY"')
>
> works for me.
>
 df.where(df.state_code == "NY").collect()
>
> fails with the error from the first post.
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-3-not-finding-attribute-in-DF-tp25599p25600.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark sql current time stamp function ?

2015-12-07 Thread sri hari kali charan Tummala
Hi Ted,

Gave an exception; am I following the right approach?

val test=sqlContext.sql("select *,  monotonicallyIncreasingId()  from kali")


On Mon, Dec 7, 2015 at 4:52 PM, Ted Yu  wrote:

> Have you tried using monotonicallyIncreasingId ?
>
> Cheers
>
> On Mon, Dec 7, 2015 at 7:56 AM, Sri  wrote:
>
>> Thanks , I found the right function current_timestamp().
>>
>> different Question:-
>> Is there a row_number() function in spark SQL ? Not in Data frame just
>> spark SQL?
>>
>>
>> Thanks
>> Sri
>>
>> Sent from my iPhone
>>
>> On 7 Dec 2015, at 15:49, Ted Yu  wrote:
>>
>> Does unix_timestamp() satisfy your needs ?
>> See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
>>
>> On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com <
>> kali.tumm...@gmail.com> wrote:
>>
>>> I found a way out.
>>>
>>> import java.text.SimpleDateFormat
>>> import java.util.Date;
>>>
>>> val format = new SimpleDateFormat("-M-dd hh:mm:ss")
>>>
>>>  val testsql=sqlContext.sql("select
>>> column1,column2,column3,column4,column5
>>> ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new
>>> Date(
>>>
>>>
>>> Thanks
>>> Sri
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


-- 
Thanks & Regards
Sri Tummala


Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
scala> val test=sqlContext.sql("select monotonically_increasing_id() from
t").show
+---+
|_c0|
+---+
|  0|
|  1|
|  2|
+---+

Cheers
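
Picking up the two questions from earlier in the thread, a hedged sketch that
combines them in plain Spark SQL: it assumes a HiveContext (window functions
such as row_number() need Hive support in the 1.5/1.6 line) and a made-up
trades table.

val withTs = sqlContext.sql(
  """select t.*,
    |       current_timestamp() as load_ts,
    |       row_number() over (partition by account order by trade_date) as rn
    |from trades t""".stripMargin)
withTs.show()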

On Mon, Dec 7, 2015 at 12:48 PM, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:

> Hi Ted,
>
> Gave and exception am I following right approach ?
>
> val test=sqlContext.sql("select *,  monotonicallyIncreasingId()  from kali")
>
>
> On Mon, Dec 7, 2015 at 4:52 PM, Ted Yu  wrote:
>
>> Have you tried using monotonicallyIncreasingId ?
>>
>> Cheers
>>
>> On Mon, Dec 7, 2015 at 7:56 AM, Sri  wrote:
>>
>>> Thanks , I found the right function current_timestamp().
>>>
>>> different Question:-
>>> Is there a row_number() function in spark SQL ? Not in Data frame just
>>> spark SQL?
>>>
>>>
>>> Thanks
>>> Sri
>>>
>>> Sent from my iPhone
>>>
>>> On 7 Dec 2015, at 15:49, Ted Yu  wrote:
>>>
>>> Does unix_timestamp() satisfy your needs ?
>>> See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
>>>
>>> On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com <
>>> kali.tumm...@gmail.com> wrote:
>>>
 I found a way out.

 import java.text.SimpleDateFormat
 import java.util.Date;

 val format = new SimpleDateFormat("-M-dd hh:mm:ss")

  val testsql=sqlContext.sql("select
 column1,column2,column3,column4,column5
 ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new
 Date(


 Thanks
 Sri



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com
 .

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>
>
> --
> Thanks & Regards
> Sri Tummala
>
>


Example of a Trivial Custom PySpark Transformer

2015-12-07 Thread Andy Davidson
FYI

Hopefully others will find this example helpful

Andy

Example of a Trivial Custom PySpark Transformer
ref:
* NLTKWordPunctTokenizer example
* pyspark.sql.functions.udf


In [12]:

from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param
from pyspark.ml.util import keyword_only

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.ml.pipeline import Transformer

class TrivialTokenizer(Transformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, constant=None):
        super(TrivialTokenizer, self).__init__()
        self.constant = Param(self, "constant", 0)
        self._setDefault(constant=0)
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, constant=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def setConstant(self, value):
        self._paramMap[self.constant] = value
        return self

    def getConstant(self):
        return self.getOrDefault(self.constant)

    def _transform(self, dataset):
        const = self.getConstant()

        def f(v):
            return v + const

        t = FloatType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))

sentenceDataFrame = sqlContext.createDataFrame([
    (0, 1.1, "Hi I heard who the about Spark"),
    (0, 1.2, "I wish Java could use case classes"),
    (1, 1.3, "Logistic regression models are neat")
], ["label", "x1", "sentence"])

testTokenizer = TrivialTokenizer(inputCol="x1", outputCol="x2", constant=1.0)

testTokenizer.transform(sentenceDataFrame).show()

+-----+---+--------------------+---+
|label| x1|            sentence| x2|
+-----+---+--------------------+---+
|    0|1.1|Hi I heard who th...|2.1|
|    0|1.2|I wish Java could...|2.2|
|    1|1.3|Logistic regressi...|2.3|
+-----+---+--------------------+---+




Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jian Feng
The only way I can think of is through some kind of wrapper. For java/scala, 
use JNI. For Python, use extensions. There should not be a lot of work if you 
know these tools. 

  From: Robin East 
 To: Annabel Melongo  
Cc: Jia ; Dewful ; "user @spark" 
; "d...@spark.apache.org" 
 Sent: Monday, December 7, 2015 10:57 AM
 Subject: Re: Shared memory between C++ process and Spark
   
Annabel
Spark works very well with data stored in HDFS but is certainly not tied to it. 
Have a look at the wide variety of connectors to things like Cassandra, HBase, 
etc.
Robin

Sent from my iPhone
On 7 Dec 2015, at 18:50, Annabel Melongo  wrote:


Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia  wrote:
 

Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDFs in C++; I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,
Jia



On Dec 7, 2015, at 12:26 PM, Annabel Melongo  wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the 
data created by said application to do manipulation within Spark. 


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storage systems. However, because our data is also held in 
memory, I suspect that connecting to Spark directly may be more efficient in 
terms of performance. But I definitely need to look at Tachyon more carefully, 
in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help, I see some sample C++ 
bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing 
list. Yes, we have a distributed C++ application that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy analytics 
on those data. But we need high performance, and that’s why we want shared 
memory. Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering, but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system: it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs, using named memory mapped files doesn’t 
make architectural sense.
---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org










Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Ali Tajeldin EDU
Checkout the Sameer Farooqui video on youtube for spark internals 
(https://www.youtube.com/watch?v=7ooZ4S7Ay6Y=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX)
Starting at 2:15:00, he describes YARN mode.

btw, highly recommend the entire video.  Very detailed and concise.

--
Ali


On Dec 7, 2015, at 8:38 AM, Jacek Laskowski  wrote:

> Hi,
> 
> That's my understanding, too. Just spent an entire morning today to check it 
> out and would be surprised to hear otherwise.
> 
> Pozdrawiam,
> Jacek
> 
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ | 
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
> 
> On Mon, Dec 7, 2015 at 4:01 PM, Nisrina Luthfiyati 
>  wrote:
> Hi Jacek, thank you for your answer. I looked at TaskSchedulerImpl and 
> TaskSetManager and it does look like tasks are directly sent to executors. 
> Also would love to be corrected if mistaken as I have little knowledge about 
> Spark internals and very new at scala.
> 
> On Tue, Dec 1, 2015 at 1:16 AM, Jacek Laskowski  wrote:
> On Fri, Nov 27, 2015 at 12:12 PM, Nisrina Luthfiyati 
>  wrote:
> Hi all,
> I'm trying to understand how yarn-client mode works and found these two 
> diagrams:
> 
> 
> 
> 
> In the first diagram, it looks like the driver running in client directly 
> communicates with executors to issue application commands, while in the 
> second diagram it looks like application commands is sent to application 
> master first and then forwarded to executors. 
> 
> My limited understanding tells me that regardless of deploy mode (local, 
> standalone, YARN or mesos), drivers (using TaskSchedulerImpl) sends TaskSets 
> to executors once they're launched. YARN and Mesos are only used until they 
> offer resources (CPU and memory) and once executors start, these cluster 
> managers are not engaged in the communication (driver and executors 
> communicate using RPC over netty since 1.6-SNAPSHOT or akka before).
> 
> I'd love being corrected if mistaken. Thanks.
> 
> Jacek
> 
> 
> 
> -- 
> Nisrina Luthfiyati - Ilmu Komputer Fasilkom UI 2010
> http://www.facebook.com/nisrina.luthfiyati
> http://id.linkedin.com/in/nisrina
> 
> 



Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers did you have that created those avro files?
Each reducer very likely creates its own avro part- file.

We normally use Parquet, but it should be the same for Avro, so this might
be
relevant
http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289




-- 
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 11:27 AM, Test One  wrote:

> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
>  |-- memberUuid: string (nullable = true)
>  |-- communityUuid: string (nullable = true)
>  |-- email: string (nullable = true)
>  |-- firstName: string (nullable = true)
>  |-- lastName: string (nullable = true)
>  |-- username: string (nullable = true)
>  |-- profiles: map (nullable = true)
>  ||-- key: string
>  ||-- value: string (valueContainsNull = true)
>
>
> When I write the file output as such with:
> originalDF.write.avro("masterNew.avro")
>
> The output location is a folder with masterNew.avro and with many many
> files like these:
> -rw-r--r--   1 kcsham  access_bpf 8 Dec  2 11:37 ._SUCCESS.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
> .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--   1 kcsham  access_bpf 0 Dec  2 11:37 _SUCCESS
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
> part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro
>
>
> Where there are ~10 record, it has ~28000 files in that folder. When I
> simply want to copy the same dataset to a new location as an exercise from
> a local master, it takes long long time and having errors like such as
> well.
>
> 22:01:44.247 [Executor task launch worker-21] WARN
>  org.apache.spark.storage.MemoryStore - Not enough space to cache
> rdd_112058_10705 in memory! (computed 496.0 B so far)
> 22:01:44.247 [Executor task launch worker-21] WARN
>  org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
> disk instead.
> [Stage 0:===>   (10706 + 1) /
> 28014]22:01:44.574 [Executor task launch worker-21] WARN
>  org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>
>
> I'm attributing that there are way too many files to manipulate. The
> questions:
>
> 1. Is this the expected format of the avro file written by spark-avro? and
> each 'part-' is not more than 4k?
> 2. My use case is to append new records to the existing dataset using:
> originalDF.unionAll(stageDF).write.avro(masterNew)
> Any sqlconf, sparkconf that I should set to allow this to work?
>
>
> Thanks,
> kc
>
>
>


Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
BTW I forgot to mention that this was added through SPARK-11736 which went
into the upcoming 1.6.0 release

FYI

On Mon, Dec 7, 2015 at 12:53 PM, Ted Yu  wrote:

> scala> val test=sqlContext.sql("select monotonically_increasing_id() from
> t").show
> +---+
> |_c0|
> +---+
> |  0|
> |  1|
> |  2|
> +---+
>
> Cheers
>
> On Mon, Dec 7, 2015 at 12:48 PM, sri hari kali charan Tummala <
> kali.tumm...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> Gave and exception am I following right approach ?
>>
>> val test=sqlContext.sql("select *,  monotonicallyIncreasingId()  from kali")
>>
>>
>> On Mon, Dec 7, 2015 at 4:52 PM, Ted Yu  wrote:
>>
>>> Have you tried using monotonicallyIncreasingId ?
>>>
>>> Cheers
>>>
>>> On Mon, Dec 7, 2015 at 7:56 AM, Sri  wrote:
>>>
 Thanks , I found the right function current_timestamp().

 different Question:-
 Is there a row_number() function in spark SQL ? Not in Data frame just
 spark SQL?


 Thanks
 Sri

 Sent from my iPhone

 On 7 Dec 2015, at 15:49, Ted Yu  wrote:

 Does unix_timestamp() satisfy your needs ?
 See
 sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala

 On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com <
 kali.tumm...@gmail.com> wrote:

> I found a way out.
>
> import java.text.SimpleDateFormat
> import java.util.Date;
>
> val format = new SimpleDateFormat("-M-dd hh:mm:ss")
>
>  val testsql=sqlContext.sql("select
> column1,column2,column3,column4,column5
> ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new
> Date(
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-current-time-stamp-function-tp25620p25621.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>>
>> --
>> Thanks & Regards
>> Sri Tummala
>>
>>
>


Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
That error is not directly related to Spark or HBase.

javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]

Is this a kerberized cluster? You likely don't have a valid (non-expired)
Kerberos ticket, so authentication cannot succeed.


-- 
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 12:54 PM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job on YARN in cluster mode in a secured cluster. I am
> trying to run Spark on HBase using Phoenix, but the Spark executors are
> unable to get an HBase connection through Phoenix. I am running the kinit command to
> get the ticket before starting the job, and the keytab file and principal
> are correctly specified in the connection URL. But still the Spark job on each node
> throws the error below:
>
> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> I am using Spark 1.3.1, Hbase 1.0.0, Phoenix 4.3. I am able to run Spark
> on Hbase(without phoenix) successfully in yarn-client mode as mentioned in
> this link:
>
> https://github.com/cloudera-labs/SparkOnHBase#scan-that-works-on-kerberos
>
> Also, I found that there is a known issue for yarn-cluster mode for Spark
> 1.3.1 version:
>
> https://issues.apache.org/jira/browse/SPARK-6918
>
> Has anybody been successful in running Spark on hbase using Phoenix in
> yarn cluster or client mode?
>
> Thanks,
> Akhilesh Pathodia
>


Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Jon Gregg
I'm working with a Hadoop distribution that doesn't support 1.5 yet, we'll
be able to upgrade in probably two months.  For now I'm seeing the same
issue with spark not recognizing an existing column name in many
hive-table-to-dataframe situations:

Py4JJavaError: An error occurred while calling o375.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes *state_code*
missing from
latitude,country_code,tim_zone_desc,longitude,dma_durable_key,submarket,dma_
code,dma_desc,county,city,zip_code,*state_code*;

On Mon, Dec 7, 2015 at 3:52 PM, Davies Liu  wrote:

> Could you reproduce this problem in 1.5 or 1.6?
>
> On Sun, Dec 6, 2015 at 12:29 AM, YaoPau  wrote:
> > If anyone runs into the same issue, I found a workaround:
> >
>  df.where('state_code = "NY"')
> >
> > works for me.
> >
>  df.where(df.state_code == "NY").collect()
> >
> > fails with the error from the first post.
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-3-not-finding-attribute-in-DF-tp25599p25600.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark will create one file per partition. Spark SQL defaults to
using 200 partitions. If you want to reduce the number of files written
out, repartition your dataframe using repartition and give it the desired
number of partitions.

originalDF.repartition(10).write.avro("masterNew.avro")

Deenar



On 7 December 2015 at 21:21, Ruslan Dautkhanov  wrote:

> How many reducers you had that created those avro files?
> Each reducer very likely creates its own avro part- file.
>
> We normally use Parquet, but it should be the same for Avro, so this might
> be
> relevant
>
> http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289
>
>
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Dec 7, 2015 at 11:27 AM, Test One  wrote:
>
>> I'm using spark-avro with SparkSQL to process and output avro files. My
>> data has the following schema:
>>
>> root
>>  |-- memberUuid: string (nullable = true)
>>  |-- communityUuid: string (nullable = true)
>>  |-- email: string (nullable = true)
>>  |-- firstName: string (nullable = true)
>>  |-- lastName: string (nullable = true)
>>  |-- username: string (nullable = true)
>>  |-- profiles: map (nullable = true)
>>  ||-- key: string
>>  ||-- value: string (valueContainsNull = true)
>>
>>
>> When I write the file output as such with:
>> originalDF.write.avro("masterNew.avro")
>>
>> The output location is a folder with masterNew.avro and with many many
>> files like these:
>> -rw-r--r--   1 kcsham  access_bpf 8 Dec  2 11:37 ._SUCCESS.crc
>> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
>> .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
>> .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf44 Dec  2 11:37
>> .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf 0 Dec  2 11:37 _SUCCESS
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro
>>
>>
>> Where there are ~10 record, it has ~28000 files in that folder. When
>> I simply want to copy the same dataset to a new location as an exercise
>> from a local master, it takes long long time and having errors like such as
>> well.
>>
>> 22:01:44.247 [Executor task launch worker-21] WARN
>>  org.apache.spark.storage.MemoryStore - Not enough space to cache
>> rdd_112058_10705 in memory! (computed 496.0 B so far)
>> 22:01:44.247 [Executor task launch worker-21] WARN
>>  org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
>> disk instead.
>> [Stage 0:===>   (10706 + 1) /
>> 28014]22:01:44.574 [Executor task launch worker-21] WARN
>>  org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
>> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>>
>>
>> I'm attributing that there are way too many files to manipulate. The
>> questions:
>>
>> 1. Is this the expected format of the avro file written by spark-avro?
>> and each 'part-' is not more than 4k?
>> 2. My use case is to append new records to the existing dataset using:
>> originalDF.unionAll(stageDF).write.avro(masterNew)
>> Any sqlconf, sparkconf that I should set to allow this to work?
>>
>>
>> Thanks,
>> kc
>>
>>
>>
>


issue creating pyspark Transformer UDF that creates a LabeledPoint: AttributeError: 'DataFrame' object has no attribute '_get_object_id'

2015-12-07 Thread Andy Davidson
Hi 

I am running into a strange error. I am trying to write a transformer that
takes in two columns and creates a LabeledPoint. I cannot figure out why I
am getting 

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

I am using spark-1.5.1-bin-hadoop2.6

Any idea what I am doing wrong? Is this a bug with data frames?

Also, I suspect the next problem I will run into is that I do not think UDFs
support LabeledPoint?

Comments and suggestions are greatly appreciated

Andy




In [37]:

from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param
from pyspark.ml.util import keyword_only

from pyspark.sql.functions import udf
from pyspark.ml.pipeline import Transformer

from pyspark.sql.types import BinaryType, DataType, ByteType, StringType
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint


class LabledPointTransformer(Transformer, HasInputCol, HasOutputCol):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, featureCol=None):
        super(LabledPointTransformer, self).__init__()
        self.featureCol = Param(self, "featureCol", "")
        self._setDefault(featureCol="feature")
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, featureCol=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def setFeatureCol(self, value):
        self._paramMap[self.featureCol] = value
        return self

    def getFeatureCol(self):
        return self.getOrDefault(self.featureCol)

    def _transform(self, dataset):  # dataset is a data frame
        out_col = self.getOutputCol()
        labelCol = self.getInputCol()
        featureCol = self.getFeatureCol()

        def f(lf):
            return str(LabeledPoint(lf[labelCol], lf[featureCol]))

        t = StringType()
        #data = dataset[labelCol, featureCol]
        data = dataset.select(labelCol, featureCol)
        return dataset.withColumn(out_col, udf(f, t)(data))

lpData = sqlContext.createDataFrame([
    (0, SparseVector(3, [0, 1], [1.0, 2.0])),
    (1, SparseVector(3, [1, 2], [3.0, 1.0])),
], ["label", "features"])

lpData.show()
lpt = LabledPointTransformer(inputCol="label", outputCol="labeledPoint", featureCol="features",)
tmp = lpt.transform(lpData)
tmp.collect()

+-----+-------------------+
|label|           features|
+-----+-------------------+
|    0|(3,[0,1],[1.0,2.0])|
|    1|(3,[1,2],[3.0,1.0])|
+-----+-------------------+

---
AttributeErrorTraceback (most recent call last)
 in ()
 51 lpData.show()
 52 lpt = LabledPointTransformer(inputCol="label",
outputCol="labeledPoint", featureCol="features",)
---> 53 tmp = lpt.transform(lpData)
 54 tmp.collect()

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/ml/pipeline
.py in transform(self, dataset, params)
105 return self.copy(params,)._transform(dataset)
106 else:
--> 107 return self._transform(dataset)
108 else:
109 raise ValueError("Params must be either a param map but
got %s." % type(params))

 in _transform(self, dataset)
 42 #data = dataset[labelCol, featureCol]
 43 data = dataset.select(labelCol, featureCol)
---> 44 return dataset.withColumn(out_col, udf(f, t)(data))
 45 
 46 lpData = sqlContext.createDataFrame([

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/functio
ns.py in __call__(self, *cols)
   1436 def __call__(self, *cols):
   1437 sc = SparkContext._active_spark_context
-> 1438 jc = self._judf.apply(_to_seq(sc, cols, _to_java_column))
   1439 return Column(jc)
   1440 

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/column.
py in _to_seq(sc, cols, converter)
 58 """
 59 if converter:
---> 60 cols = [converter(c) for c in cols]
 61 return sc._jvm.PythonUtils.toSeq(cols)
 62 

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/column.
py in (.0)
 58 """
 59 if converter:
---> 60 cols = [converter(c) for c in cols]
 61 return sc._jvm.PythonUtils.toSeq(cols)
 62 

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/column.
py in _to_java_column(col)
 46 jcol = col._jc
 47 else:
---> 48 jcol = _create_column_from_name(col)
 49 return jcol
 50 

/Users//workSpace/spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/column.
py in _create_column_from_name(name)
 39 def _create_column_from_name(name):
 40 sc = 

Re: Managed to make Hive run on Spark engine

2015-12-07 Thread Ashok Kumar
This is great news, sir. It shows perseverance pays off at last.
Can you let us know when the write-up is ready so I can set it up as well, please?
I know a bit about the advantages of having Hive use the Spark engine. However, 
the general question I have is: when should one use Hive on Spark as opposed to 
Hive on the MapReduce engine?
Thanks again


On Monday, 7 December 2015, 15:50, Mich Talebzadeh  
wrote:
 

For those interested:

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 06 December 2015 20:33
To: u...@hive.apache.org
Subject: Managed to make Hive run on Spark engine

Thanks all, especially to Xuefu, for contributions. Finally it works, which means don't give up until it works :)

hduser@rhes564::/usr/lib/hive/lib> hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
hive> set hive.execution.engine=spark;
hive> set spark.master=spark://50.140.197.217:7077;
hive> set spark.eventLog.enabled=true;
hive> set spark.eventLog.dir=/usr/lib/spark-1.3.1-bin-hadoop2.6/logs;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
hive> set hive.spark.client.server.connect.timeout=22ms;
hive> set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;
hive> use asehadoop;
OK
Time taken: 0.638 seconds
hive> select count(1) from t;
Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a

Query Hive on Spark job[0] stages:
0
1

Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1  Stage-1_0: 0/1
2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 8.10 seconds
OK

The versions used for this project:

OS version: Linux version 2.6.18-92.el5xen (brewbuil...@ls20-bc2-13.build.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008
Hadoop 2.6.0
Hive 1.2.1
spark-1.3.1-bin-hadoop2.6 (downloaded as the prebuilt spark-1.3.1-bin-hadoop2.6.gz for starting the Spark standalone cluster)

The jar file used in $HIVE_HOME/lib to link Hive to Spark was spark-assembly-1.3.1-hadoop2.4.0.jar (built from the source downloaded as the zipped file spark-1.3.1.gz, with the command line: make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided")

It is pretty picky on parameters, CLASSPATH, IP addresses or hostnames etc. to make it work.

I will create a full guide on how to build and make Hive run with Spark as its engine (as opposed to MR).

HTH

Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. Co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com/

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay maybe these errors are more helpful -

WARN server.TransportChannelHandler: Exception in connection from 
ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 ERROR client.TransportResponseHandler: Still have 5 requests 
outstanding when connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 is 
closed
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting 
block fetches
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 
39 outstanding blocks after 5000 ms
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting 
block fetches
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

That continues for a while.

There is also this error on the Stage status from the Spark History server:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 1
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 

Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Hi,

I've been running benchmarks against Spark in local mode in a long running
process. I'm seeing threads leaking each time it runs a job. It doesn't
matter if I recycle SparkContext constantly or have 1 context stay alive
for the entire application lifetime.

I see a huge accumulation ongoing of "pool--thread-1" threads with the
creating thread "Executor task launch worker-xx" where x's are numbers. The
number of leaks per launch worker varies but usually 1 to a few.

Searching the Spark code, the pool is created in the Executor class and
`.shutdown()` is called on it in the executor's stop(). I've wired up logging and also
tried shutdownNow() and awaitTermination() on the pools. Everything seems okay
there for every Executor that is called with `stop()`, but I'm still not
sure yet whether every Executor is stopped that way, which I am looking into now.

What I'm curious to know is if anyone has seen a similar issue?
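
For anyone trying to reproduce this, here is a quick, hedged way to count the
stranded pools from inside the same JVM; it just matches the default
"pool-N-thread-M" naming of java.util.concurrent thread factories, nothing
Spark-specific.

import scala.collection.JavaConverters._

val leaked = Thread.getAllStackTraces.keySet.asScala
  .count(_.getName.matches("""pool-\d+-thread-\d+"""))
println(s"pool-*-thread-* threads still alive: $leaked")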

-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-07 Thread Deenar Toraskar
Hi

I have a process that writes to multiple partitions of the same table in
parallel, using multiple threads sharing the same SQL context:
df.write.partitionedBy("partCol").insertInto("tableName"). I am
getting a FileNotFoundException on the _temporary directory. Each write only goes
to a single partition. Is there some way of not using dynamic partitioning
with df.write, without having to resort to .save, as I don't want to hardcode
a physical DFS location in my code?

This is similar to this issue listed here
https://issues.apache.org/jira/browse/SPARK-2984

Regards
Deenar

*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk


Re: Local Mode: Executor thread leak?

2015-12-07 Thread Shixiong Zhu
Which version are you using? Could you post these thread names here?

Best Regards,
Shixiong Zhu

2015-12-07 14:30 GMT-08:00 Richard Marscher :

> Hi,
>
> I've been running benchmarks against Spark in local mode in a long running
> process. I'm seeing threads leaking each time it runs a job. It doesn't
> matter if I recycle SparkContext constantly or have 1 context stay alive
> for the entire application lifetime.
>
> I see a huge accumulation ongoing of "pool--thread-1" threads with the
> creating thread "Executor task launch worker-xx" where x's are numbers. The
> number of leaks per launch worker varies but usually 1 to a few.
>
> Searching the Spark code the pool is created in the Executor class. It is
> `.shutdown()` in the stop for the executor. I've wired up logging and also
> tried shutdownNow() and awaitForTermination on the pools. Every seems okay
> there for every Executor that is called with `stop()` but I'm still not
> sure yet if every Executor is called as such, which I am looking into now.
>
> What I'm curious to know is if anyone has seen a similar issue?
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
> 
>


Re: Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Thanks for the response.

The version is Spark 1.5.2.

Some examples of the thread names:

pool-1061-thread-1
pool-1059-thread-1
pool-1638-thread-1

Hundreds and then thousands of these end up stranded in WAITING.

I added logging to try to track the lifecycle of the thread pool in
Executor as mentioned before. Here is an excerpt; everything seems fine
there. Every executor that starts is also shut down, and it seems like it
shuts down fine.

15/12/07 23:30:21 WARN o.a.s.e.Executor: Threads finished in executor
driver. pool shut down
java.util.concurrent.ThreadPoolExecutor@e5d036b[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
15/12/07 23:30:28 WARN o.a.s.e.Executor: Executor driver created, thread
pool: java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Running, pool size =
0, active threads = 0, queued tasks = 0, completed tasks = 0]
15/12/07 23:31:06 WARN o.a.s.e.Executor: Threads finished in executor
driver. pool shut down
java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 36]
15/12/07 23:31:11 WARN o.a.s.e.Executor: Executor driver created, thread
pool: java.util.concurrent.ThreadPoolExecutor@6e85ece4[Running, pool size =
0, active threads = 0, queued tasks = 0, completed tasks = 0]
15/12/07 23:34:35 WARN o.a.s.e.Executor: Threads finished in executor
driver. pool shut down
java.util.concurrent.ThreadPoolExecutor@6e85ece4[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]

Also here is an example thread dump of such a thread:

"pool-493-thread-1" prio=10 tid=0x7f0e60612800 nid=0x18c4 waiting on
condition [0x7f0c33c3e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f10b3e8fb60> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
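
For anyone wanting to reproduce the count, a small sketch that tallies these
stranded pool threads from inside the same JVM (the name-prefix match is a
heuristic, not an official API):

import scala.collection.JavaConverters._

// Count live threads whose names match the leaked "pool-N-thread-M" pattern.
val stranded = Thread.getAllStackTraces.keySet.asScala
  .filter(t => t.getName.startsWith("pool-") && t.getName.contains("-thread-"))
println(s"stranded pool threads: ${stranded.size}")
stranded.take(5).foreach(t => println(s"${t.getName} -> ${t.getState}"))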

On Mon, Dec 7, 2015 at 6:23 PM, Shixiong Zhu  wrote:

> Which version are you using? Could you post these thread names here?
>
> Best Regards,
> Shixiong Zhu
>
> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>
>> Hi,
>>
>> I've been running benchmarks against Spark in local mode in a long
>> running process. I'm seeing threads leaking each time it runs a job. It
>> doesn't matter if I recycle SparkContext constantly or have 1 context stay
>> alive for the entire application lifetime.
>>
>> I see a huge accumulation ongoing of "pool--thread-1" threads with
>> the creating thread "Executor task launch worker-xx" where x's are numbers.
>> The number of leaks per launch worker varies but usually 1 to a few.
>>
>> Searching the Spark code the pool is created in the Executor class. It is
>> `.shutdown()` in the stop for the executor. I've wired up logging and also
>> tried shutdownNow() and awaitForTermination on the pools. Every seems okay
>> there for every Executor that is called with `stop()` but I'm still not
>> sure yet if every Executor is called as such, which I am looking into now.
>>
>> What I'm curious to know is if anyone has seen a similar issue?
>>
>> --
>> *Richard Marscher*
>> Software Engineer
>> Localytics
>> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
>> 
>>
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn



Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNING or ERRORs - the 
executors just seem to stop logging.

I am running Spark 1.5.2 on YARN.
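
In case it helps, a minimal sketch of the two knobs discussed in this thread,
deduplicating on a subset of columns and raising the shuffle partition count
(shown in Scala for brevity even though the app above is PySpark; it assumes a
spark-shell style sqlContext and an existing DataFrame df, and the column names
and the value 400 are arbitrary):

// Raise the shuffle partition count so each reduce task handles less data.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Deduplicate on the columns that define uniqueness instead of the whole row.
val deduped = df.dropDuplicates(Seq("id", "timestamp"))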

On Dec 7, 2015, at 1:20 PM, Ted Yu 
> wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s) ?

Which release of Spark are you using ?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, 
> 
wrote:
I have pyspark app loading a large-ish (100GB) dataframe from JSON files and it 
turns out there are a number of duplicate JSON objects in the source data. I am 
trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql(‘’’SELECT DISTINCT *…’’’) 
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 100 and 500, 
respectively) but neither changes the behavior.
-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org





Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Annabel, but I should clarify that I have no intention of writing 
and running Spark UDFs in C++; I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.

Best Regards,
Jia
 


On Dec 7, 2015, at 12:26 PM, Annabel Melongo  wrote:

> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
> and R.
> 
> The best way to achieve this is to run your application in C++ and use the 
> data created by said application to do manipulation within Spark.
> 
> 
> 
> On Monday, December 7, 2015 1:15 PM, Jia  wrote:
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can 
> connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting 
> to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a 
> very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful  wrote:
> 
>> Maybe looking into something like Tachyon would help, I see some sample c++ 
>> bindings, not sure how much of the current functionality they support...
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing 
>> list.
>> Yes, we have a distributed C++ application, that will store data on each 
>> node in the cluster, and we hope to leverage Spark to do more fancy 
>> analytics on those data. But we need high performance, that’s why we want 
>> shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so 
>>> you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m 
>>> sure it would be possible with enough tinkering but it’s not clear what you 
>>> are trying to achieve. Spark is a distributed processing system, it has 
>>> multiple JVMs running on different machines that each run a small part of 
>>> the overall processing. Unless you have some sort of idea to have multiple 
>>> C++ processes collocated with the distributed JVMs using named memory 
>>> mapped files doesn’t make architectural sense. 
>>> ---
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
 On 6 Dec 2015, at 20:43, Jia  wrote:
 
 Dears, for one project, I need to implement something so Spark can read 
 data from a C++ process. 
 To provide high performance, I really hope to implement this through 
 shared memory between the C++ process and Java JVM process.
 It seems it may be possible to use named memory mapped files and JNI to do 
 this, but I wonder whether there is any existing efforts or more efficient 
 approach to do this?
 Thank you very much!
 
 Best Regards,
 Jia
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
>>> 
>> 
> 
> 
> 



Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line:
[Stage 4:>  (60 + 60) / 
200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkYarnAM@10.0.0.138:33822] has failed, address is now 
gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@ip-10-0-0-138.ec2.internal:54951] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 ERROR YarnScheduler: Lost executor 3 on 
ip-10-0-0-138.ec2.internal: remote Rpc client disassociated
15/12/07 18:59:41 WARN TaskSetManager: Lost task 62.0 in stage 4.0 (TID 2003, 
ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
15/12/07 18:59:41 WARN TaskSetManager: Lost task 65.0 in stage 4.0 (TID 2006, 
ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
…
…



On Dec 7, 2015, at 1:33 PM, Cramblit, Ross (Reuters News) 
> 
wrote:

I have looked through the logs and do not see any WARNING or ERRORs - the 
executors just seem to stop logging.

I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu 
> wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s) ?

Which release of Spark are you using ?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, 
> 
wrote:
I have pyspark app loading a large-ish (100GB) dataframe from JSON files and it 
turns out there are a number of duplicate JSON objects in the source data. I am 
trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql(‘’’SELECT DISTINCT *…’’’) 
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 100 and 500, 
respectively) but neither changes the behavior.
-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org






Re: How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread swetha kasireddy
OK. I think the following can be used.

mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-DskipTests clean package
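
Once the spark-ganglia-lgpl profile is built in, the GangliaSink still has to be
enabled, typically via conf/metrics.properties; a minimal sketch, where the
host, port, and reporting period are placeholders for your own gmond setup:

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=your-gmond-host
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds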


On Mon, Dec 7, 2015 at 10:13 AM, SRK  wrote:

> Hi,
>
> How to do a maven build to enable monitoring using Ganglia? What is the
> command for the same?
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Robin, you have a very good point!
We feel that the data copy and allocation overhead may become a performance 
bottleneck, and we are evaluating that right now.
We will do the shared-memory work only if we're sure about the potential 
performance gain and sure that there is nothing already in the Spark community 
that we can leverage to do this.

Best Regards,
Jia
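
For anyone exploring the memory-mapped-file route discussed in this thread, a
minimal Scala sketch of reading such a file from within an RDD. The paths and
the newline-delimited UTF-8 layout are assumptions, sc is an existing
SparkContext, and copying the bytes on-heap means this is not true zero-copy:

import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{Paths, StandardOpenOption}

// Hypothetical files that the C++ process keeps mapped on each node.
val mappedPaths = Seq("/dev/shm/cpp_out_0.bin", "/dev/shm/cpp_out_1.bin")

val records = sc.parallelize(mappedPaths, mappedPaths.size).flatMap { path =>
  val channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
  try {
    // The OS shares the mapped pages with the writer process.
    val buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    val bytes = new Array[Byte](buffer.remaining())
    buffer.get(bytes) // copies into the JVM heap, so not zero-copy
    new String(bytes, StandardCharsets.UTF_8).split("\n").iterator
  } finally {
    channel.close()
  }
}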


On Dec 7, 2015, at 11:56 AM, Robin East  wrote:

> I guess you could write a custom RDD that can read data from a memory-mapped 
> file - not really my area of expertise so I’ll leave it to other members of 
> the forum to chip in with comments as to whether that makes sense. 
> 
> But if you want ‘fancy analytics’ then won’t the processing time more than 
> out-weigh the savings from using memory mapped files? Particularly if your 
> analytics involve any kind of aggregation of data across data nodes. Have you 
> looked at a Lambda architecture which could involve Spark but doesn’t 
> necessarily mean you would go to the trouble of implementing a custom 
> memory-mapped file reading feature.
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 7 Dec 2015, at 17:32, Jia  wrote:
>> 
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing 
>> list.
>> Yes, we have a distributed C++ application, that will store data on each 
>> node in the cluster, and we hope to leverage Spark to do more fancy 
>> analytics on those data. But we need high performance, that’s why we want 
>> shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so 
>>> you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m 
>>> sure it would be possible with enough tinkering but it’s not clear what you 
>>> are trying to achieve. Spark is a distributed processing system, it has 
>>> multiple JVMs running on different machines that each run a small part of 
>>> the overall processing. Unless you have some sort of idea to have multiple 
>>> C++ processes collocated with the distributed JVMs using named memory 
>>> mapped files doesn’t make architectural sense. 
>>> ---
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
 On 6 Dec 2015, at 20:43, Jia  wrote:
 
 Dears, for one project, I need to implement something so Spark can read 
 data from a C++ process. 
 To provide high performance, I really hope to implement this through 
 shared memory between the C++ process and Java JVM process.
 It seems it may be possible to use named memory mapped files and JNI to do 
 this, but I wonder whether there is any existing efforts or more efficient 
 approach to do this?
 Thank you very much!
 
 Best Regards,
 Jia
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 
>>> 
>> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Kazuaki,

It’s very similar to my requirement, thanks!
It seems they want to write to a C++ process with zero copy, while I want to do 
both reads and writes with zero copy.
Does anyone know how to obtain more information, such as the current status of this 
JIRA entry?

Best Regards,
Jia




On Dec 7, 2015, at 12:26 PM, Kazuaki Ishizaki  wrote:

> Is this JIRA entry related to what you want?
> https://issues.apache.org/jira/browse/SPARK-10399
> 
> Regards,
> Kazuaki Ishizaki
> 
> 
> 
> From: Jia 
> To: Dewful 
> Cc: "user @spark" , d...@spark.apache.org, 
> Robin East 
> Date: 2015/12/08 03:17
> Subject: Re: Shared memory between C++ process and Spark
> 
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can 
> connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting 
> to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a 
> very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful  wrote:
> Maybe looking into something like Tachyon would help, I see some sample c++ 
> bindings, not sure how much of the current functionality they support...
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
> 
> -dev, +user (this is not a question about development of Spark itself so 
> you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m 
> sure it would be possible with enough tinkering but it’s not clear what you 
> are trying to achieve. Spark is a distributed processing system, it has 
> multiple JVMs running on different machines that each run a small part of the 
> overall processing. Unless you have some sort of idea to have multiple C++ 
> processes collocated with the distributed JVMs using named memory mapped 
> files doesn’t make architectural sense. 
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
> On 6 Dec 2015, at 20:43, Jia  wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data 
> from a C++ process. 
> To provide high performance, I really hope to implement this through shared 
> memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do 
> this, but I wonder whether there is any existing efforts or more efficient 
> approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 
> 
> 
> 



Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Dewful!

My impression is that Tachyon is a very nice in-memory file system that can 
connect to multiple storages.
However, because our data is also held in memory, I suspect that connecting to 
Spark directly may be more efficient in performance.
But definitely I need to look at Tachyon more carefully, in case it has a very 
efficient C++ binding mechanism.

Best Regards,
Jia

On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

> Maybe looking into something like Tachyon would help, I see some sample c++ 
> bindings, not sure how much of the current functionality they support...
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node 
> in the cluster, and we hope to leverage Spark to do more fancy analytics on 
> those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East  wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so 
>> you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m 
>> sure it would be possible with enough tinkering but it’s not clear what you 
>> are trying to achieve. Spark is a distributed processing system, it has 
>> multiple JVMs running on different machines that each run a small part of 
>> the overall processing. Unless you have some sort of idea to have multiple 
>> C++ processes collocated with the distributed JVMs using named memory mapped 
>> files doesn’t make architectural sense. 
>> ---
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia  wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read 
>>> data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared 
>>> memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory mapped files and JNI to do 
>>> this, but I wonder whether there is any existing efforts or more efficient 
>>> approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>> 
>> 
> 



SparkSQL AVRO

2015-12-07 Thread Test One
I'm using spark-avro with SparkSQL to process and output avro files. My
data has the following schema:

root
 |-- memberUuid: string (nullable = true)
 |-- communityUuid: string (nullable = true)
 |-- email: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- username: string (nullable = true)
 |-- profiles: map (nullable = true)
 ||-- key: string
 ||-- value: string (valueContainsNull = true)


When I write the file output as such with:
originalDF.write.avro("masterNew.avro")

The output location is a folder, masterNew.avro, containing many, many
files like these:
-rw-r--r--   1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
-rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--   1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
-rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro
-rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro
-rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro


Where there are ~10 records, it ends up with ~28000 files in that folder. When I
simply want to copy the same dataset to a new location as an exercise from
a local master, it takes a long, long time and hits errors like the ones below as
well.

22:01:44.247 [Executor task launch worker-21] WARN
 org.apache.spark.storage.MemoryStore - Not enough space to cache
rdd_112058_10705 in memory! (computed 496.0 B so far)
22:01:44.247 [Executor task launch worker-21] WARN
 org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
disk instead.
[Stage 0:===>   (10706 + 1) /
28014]22:01:44.574 [Executor task launch worker-21] WARN
 org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.


I'm attributing this to there being way too many files to manipulate. My
questions:

1. Is this the expected format of the avro output written by spark-avro, with
each 'part-' file no more than ~4 KB?
2. My use case is to append new records to the existing dataset using:
originalDF.unionAll(stageDF).write.avro(masterNew)
Are there any SQLConf or SparkConf settings I should set to allow this to work?


Thanks,
kc
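
One common way to keep the file count down is to reduce the number of partitions
before writing; a minimal sketch, assuming the originalDF and stageDF from the
question above (the target of 16 partitions and the output paths are arbitrary):

import com.databricks.spark.avro._

// Fewer partitions at write time means fewer part-* files under the output folder.
originalDF.coalesce(16).write.avro("masterNewCompacted.avro")

// The same applies when appending the staged records described above.
originalDF.unionAll(stageDF).coalesce(16).write.avro("masterCombined.avro")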

