Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
Hi Ewan, Thank you for your answer. I have already tried what you suggest. If I use: "hdfs://172.27.13.57:7077/user/hdfs/parquet-multi/BICC" I get the AssertionError exception: Exception in thread "main" java.lang.AssertionError: assertion failed: No predefined schema found, and no

Re: Recommendations using Spark

2016-01-08 Thread Jorge Machado
Hello anjali, You can start here: org.apache.spark.mllib.recommendation. Then you should build a “recommender”: you need to transform your trainData into Rating objects, then you can train a model with, for example: val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0) Jorge > On
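
A minimal Scala sketch of the approach Jorge describes, assuming an existing SparkContext sc and an input file of "userId,productId,count" lines; the path and the recommendProducts call are illustrative, not from the original mail:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical input: one "userId,productId,count" triple per line
    val trainData = sc.textFile("hdfs:///path/to/interactions.csv").map { line =>
      val Array(user, product, count) = line.split(",")
      Rating(user.toInt, product.toInt, count.toDouble)
    }.cache()

    // rank = 10, iterations = 5, lambda = 0.01, alpha = 1.0 (the values from the post)
    val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)

    // Recommend 5 products for user 42
    val recommendations = model.recommendProducts(42, 5)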

Re: write new data to mysql

2016-01-08 Thread Yasemin Kaya
Hi, There is no write function like the one Todd mentioned, or I can't find it. The code and error are in a gist. Could you check it out please? Best, yasemin 2016-01-08 18:23 GMT+02:00 Todd Nist : > It is not clear from the

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-08 Thread Umesh Kacha
Hi, for a 30 GB executor, how much off-heap memory should I give along with the YARN memory overhead? Is that OK? On Thu, Jan 7, 2016 at 4:24 AM, Ted Yu wrote: > Turns out that I should have specified -i to my former grep command :-) > > Thanks Marcelo > > But does this mean that

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Chris Fregly
Yeah, this confused me, as well. Good question, Umesh. As Ted pointed out: between Spark 1.5 and 1.6, o.a.s.shuffle.unsafe.UnsafeShuffleManager no longer exists as a separate shuffle manager. Here's the old code (notice the o.a.s.shuffle.unsafe package):

Re: Spark job uses only one Worker

2016-01-08 Thread Michael Pisula
Hi Annabel, I am using Spark in stand-alone mode (deployment using the ec2 scripts packaged with spark). Cheers, Michael On 08.01.2016 00:43, Annabel Melongo wrote: > Michael, > > I don't know what's your environment but if it's Cloudera, you should > be able to see the link to your master in

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Ted Yu
For "spark.shuffle.manager", the default is "sort" >From core/src/main/scala/org/apache/spark/SparkEnv.scala : val shuffleMgrName = conf.get("spark.shuffle.manager", "sort") "tungsten-sort" is the same as "sort" : val shortShuffleMgrNames = Map( "hash" ->

Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread unk1102
Hi, I was using Spark 1.5 with Tungsten sort and now I am using Spark 1.6. I don't see any difference; I was expecting Spark 1.6 to be faster. Anyway, do we need to enable the Tungsten and unsafe options, or are they enabled by default? I see in the documentation that the default shuffle manager is sort; I thought it

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread Shixiong(Ryan) Zhu
Could you disable `spark.kryo.registrationRequired`? Some classes may not be registered but they work well with Kryo's default serializer. On Fri, Jan 8, 2016 at 8:58 AM, Ted Yu wrote: > bq. try adding scala.collection.mutable.WrappedArray > > But the hint said registering
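
A minimal sketch of the suggestion, assuming the job already uses the Kryo serializer (the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Let unregistered classes fall back to Kryo's default serializer
      .set("spark.kryo.registrationRequired", "false")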

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Ted Yu
From sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala : case Some((SQLConf.Deprecated.TUNGSTEN_ENABLED, Some(value))) => val runFunc = (sqlContext: SQLContext) => { logWarning( s"Property ${SQLConf.Deprecated.TUNGSTEN_ENABLED} is deprecated and "

Re: write new data to mysql

2016-01-08 Thread Yasemin Kaya
When I changed the version to 1.6.0, it worked. Thanks. 2016-01-08 21:27 GMT+02:00 Yasemin Kaya : > Hi, > There is no write function that Todd mentioned or i cant find it. > The code and error are in gist > . Could you

Re: SparkContext SyntaxError: invalid syntax

2016-01-08 Thread Bryan Cutler
Hi Andrew, I know that older versions of Spark could not run PySpark on YARN in cluster mode. I'm not sure if that is fixed in 1.6.0 though. Can you try setting deploy-mode option to "client" when calling spark-submit? Bryan On Thu, Jan 7, 2016 at 2:39 PM, weineran <

Re: write new data to mysql

2016-01-08 Thread Todd Nist
Sorry, did not see your update until now. On Fri, Jan 8, 2016 at 3:52 PM, Todd Nist wrote: > Hi Yasemin, > > What version of Spark are you using? Here is the reference, it is off of > the DataFrame >

Re: write new data to mysql

2016-01-08 Thread Todd Nist
Hi Yasemin, What version of Spark are you using? Here is the reference, it is off of the DataFrame https://spark.apache.org/docs/latest/api/java/index.html#org.apache.spark.sql.DataFrame and provides a DataFrameWriter,

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Umesh Kacha
OK, thanks. So it will always be enabled by default? If yes, why does the documentation mention the default shuffle manager as sort? On Sat, Jan 9, 2016 at 1:55 AM, Ted Yu wrote: > From sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala > : > > case

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Is your Parquet data source partitioned by date ? Can you dedup within partitions ? Cheers On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue wrote: > I tried on Three day's data. The total input is only 980GB, but the > shuffle write Data is about 6.2TB, then the job failed

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey Ted, the Event table is like this: UserID, EventType, EventKey, TimeStamp, MetaData. I just parse it from JSON and save as Parquet; I did not change the partitioning. Annoyingly, every day's incoming Event data has duplicates among each other. One same event could show up in Day1 and Day2 and

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Benyi Wang
- I assume your parquet files are compressed. Gzip or Snappy? - What spark version did you use? It seems at least 1.4. If you use spark-sql and tungsten, you might have better performance. but spark 1.5.2 gave me a wrong result when the data was about 300~400GB, just for a simple

Re: SparkContext SyntaxError: invalid syntax

2016-01-08 Thread Andrew Weiner
Now for simplicity I'm testing with wordcount.py from the provided examples, and using Spark 1.6.0 The first error I get is: 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path at

How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey, I get every day's Event table and want to merge them into a single Event table. But there are so many duplicates among each day's data. I use Parquet as the data source. What I am doing now is EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file"). Each day's Event is
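
A sketch of one alternative to a full distinct(), assuming an existing sqlContext and the event schema given later in the thread (UserID, EventType, EventKey, TimeStamp, MetaData), with the first four columns identifying an event; paths and column choice are illustrative:

    val day1 = sqlContext.read.parquet("hdfs:///events/day1")
    val day2 = sqlContext.read.parquet("hdfs:///events/day2")

    // dropDuplicates on the identifying columns avoids comparing entire rows
    val merged = day1.unionAll(day2)
      .dropDuplicates(Seq("UserID", "EventType", "EventKey", "TimeStamp"))

    merged.write.parquet("hdfs:///events/merged")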

Re: Spark job uses only one Worker

2016-01-08 Thread Prem Sure
To narrow it down, you can try the below: 1) Is the job going to the same node every time (when you execute the job multiple times)? Enable the property spark.speculation, keep a thread.sleep for 2 mins, and see if the job goes to a different worker from the executor it posted on initially. (Trying to find whether there are

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
And the most frequent operation I am going to do is find the UserIDs who have some events, then retrieve all the events associated with the UserID. In this case, how should I partition to speed up the process? Thanks. On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue wrote: > hey

Create a n x n graph given only the vertices

2016-01-08 Thread praveen S
Is it possible in GraphX to create/generate an n x n graph given only n vertices?

Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Ophir Etzion
It didn't work, assuming I did the right thing. In the properties you could see

Unable to compile from source

2016-01-08 Thread Gaini Rajeshwar
Hi All, I am new to Apache Spark. I have downloaded *Spark 1.6.0 (Jan 04 2016) source code version*. I ran the following command as per the Spark documentation: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Benyi Wang
Just try giving 1000, even 2000, to see if it works. If you see something like "Lost Executor", you'd better stop your job; otherwise you are wasting time. Usually the container of the lost executor is killed by the NodeManager because there is not enough memory. You can check the NodeManager's log to

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
I tried it on three days' data. The total input is only 980GB, but the shuffle write data is about 6.2TB; the job then failed during the shuffle read step, which should be another 6.2TB of shuffle read. I think to dedup, the shuffling cannot be avoided. Is there anything I could do to stabilize this

Re: Date Time Regression as Feature

2016-01-08 Thread Chris Fregly
Here's a good blog post by Sandy Ryza @ Cloudera on Spark + Time Series Data: http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/ Might give you some things to try. On Thu, Jan 7, 2016 at 11:40 PM, dEEPU wrote: >

Standalone Scala Project 'sbt package erroring out"

2016-01-08 Thread srkanth devineni
Hi all, I am going over the official tutorial on a standalone Scala project in the Cloudera virtual machine. I am using Spark 1.5.0 and Scala 2.10.4, and I changed the parameters in the sparkpi.sbt file as follows: name := "SparkPi Project" version := "1.0" scalaVersion := "2.10.4"
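
For comparison, a complete sparkpi.sbt under those settings might look like the sketch below; the spark-core version should match the cluster, and marking it "provided" assumes the job is launched with spark-submit:

    name := "SparkPi Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"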

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Benyi: bq. spark 1.5.2 gave me a wrong result when the data was about 300~400GB, just for a simple group-by and aggregate Can you reproduce the above using Spark 1.6.0 ? Thanks On Fri, Jan 8, 2016 at 2:48 PM, Benyi Wang wrote: > >- I assume your parquet files are

[no subject]

2016-01-08 Thread Suresh Thalamati

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
gzip is relatively slow. It consumes much CPU. snappy is faster. LZ4 is faster than GZIP and smaller than Snappy. Cheers On Fri, Jan 8, 2016 at 7:56 PM, Gavin Yue wrote: > Thank you . > > And speaking of compression, is there big difference on performance > between
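
If you want to move off Parquet's default codec, a quick sketch using the standard Spark SQL setting, assuming an existing sqlContext and a placeholder DataFrame df; "snappy" could equally be "gzip" or "lzo":

    // Set before writing; applies to subsequent Parquet writes from this context
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    df.write.parquet("hdfs:///events/merged-snappy")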

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
bq. in an noSQL db such as Hbase +1 :-) On Fri, Jan 8, 2016 at 6:25 PM, ayan guha wrote: > One option you may want to explore is writing event table in an noSQL db > such as Hbase. One inherent problem in your approach is you always need to > load either full data set or

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
I used to maintain an HBase cluster. The experience with it was not a happy one. I just tried querying the data from each day first and deduping with a smaller set; the performance is acceptable. So I guess I will use this method. Again, could anyone give advice about: - Automatically determine the

How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Ascot Moss
Hi, Instead of using Spark-shell, does anyone know how to build .zip (or .egg) for Python and use Spark-submit to run? Regards

Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html: For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Thank you. And speaking of compression, is there a big difference in performance between gzip and snappy? And why is Parquet using gzip by default? Thanks. On Fri, Jan 8, 2016 at 6:39 PM, Ted Yu wrote: > Cycling old bits: > http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ > >

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey, thank you for the answer. I checked the settings you mentioned; they are all correct. I noticed that in the job there are always only 200 reducers for shuffle read; I believe it is the setting for SQL shuffle parallelism. In the doc, it mentions: - Automatically determine the number of
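
A small sketch of raising that setting, assuming the 200 reducers do come from the SQL shuffle default and an existing sqlContext; 1000 is just the figure suggested earlier in the thread:

    // Default is 200; more partitions means less shuffle data per reduce task
    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")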

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread ayan guha
One option you may want to explore is writing the event table to a NoSQL DB such as HBase. One inherent problem in your approach is that you always need to load either the full data set or a defined number of partitions to see if the event has already come (and there is no guarantee it is foolproof, but it leads to

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Cycling old bits: http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ Gavin: Which release of hbase did you play with ? HBase has been evolving and is getting more stable. Cheers On Fri, Jan 8, 2016 at 6:29 PM, Gavin Yue wrote: > I used to maintain a HBase cluster. The

how garbage collection works on parallelize

2016-01-08 Thread jluan
Hi, I am curious about garbage collection of an object which gets parallelized. Say we have a really large array (say 40GB in RAM) that we want to parallelize across our machines. I have the following function: def doSomething(): RDD[Double] = { val reallyBigArray = Array[Double[(some really
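
A Scala sketch of the situation being asked about, assuming an existing SparkContext sc; the array size and contents are placeholders:

    import org.apache.spark.rdd.RDD

    def doSomething(): RDD[Double] = {
      // Large local collection built on the driver
      val reallyBigArray = Array.fill(100000000)(math.random)
      // parallelize() keeps references to these elements inside the resulting RDD's
      // partitions (for fault tolerance), so the array cannot be garbage collected
      // while that RDD is still reachable
      sc.parallelize(reallyBigArray)
    }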

Re: how garbage collection works on parallelize

2016-01-08 Thread Josh Rosen
It won't be GC'd as long as the RDD which results from `parallelize()` is kept around; that RDD keeps strong references to the parallelized collection's elements in order to enable fault-tolerance. On Fri, Jan 8, 2016 at 6:50 PM, jluan wrote: > Hi, > > I am curious about

pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date; code below: def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) if (n == 7): return day else: return sun this
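
The likely issue is that n == 7 produces a Column expression, not a Python boolean, so a plain if cannot branch on it; the usual fix is to fold the condition into the column expression with when/otherwise. A Scala sketch of the equivalent (the thread is PySpark, where when/otherwise behave the same way):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{datediff, next_day, when}

    // Returns a Column holding the Sunday that ends the week of the given date column
    def getSunday(day: Column): Column = {
      val d = day.cast("date")
      val sun = next_day(d, "Sunday")
      // If the date is itself a Sunday, next_day jumps a full week ahead, so keep d
      when(datediff(sun, d) === 7, d).otherwise(sun)
    }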

Re: Unable to compile from source

2016-01-08 Thread hareesh makam
Are you behind a proxy? Or try disabling the SSL check while building. http://stackoverflow.com/questions/21252800/maven-trusting-all-certs-unlimited-java-policy Check the above link to learn how to disable the SSL check. - hareesh. On Jan 8, 2016 4:54 PM, "Gaini Rajeshwar"

Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-08 Thread Henrik Baastrup
I solved the problem. I needed to tell the SparkContext about my Hadoop setup, so now my program is as follows: SparkConf conf = new SparkConf() .setAppName("SparkTest") .setMaster("spark://172.27.13.57:7077") .set("spark.executor.memory", "2g") // We assign 2 GB ram
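
In Scala the same idea looks roughly like the sketch below; the master URL and memory setting are the ones from the post, while the HDFS port (8020) is a placeholder for the actual namenode port:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SparkTest")
      .setMaster("spark://172.27.13.57:7077")
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)
    // Point the underlying Hadoop configuration at the remote HDFS namenode
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://172.27.13.57:8020")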

subscribe

2016-01-08 Thread Jeetendra Gangele

Re: spark 1.6 Issue

2016-01-08 Thread kali.tumm...@gmail.com
Hi All, it worked OK after adding the below VM options: -Xms128m -Xmx512m -XX:MaxPermSize=300m -ea Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-Issue-tp25893p25920.html Sent from the Apache Spark User List mailing list archive at

how deploy pmml model in spark

2016-01-08 Thread Sangameshwar Swami
Hi, does anyone know how to deploy a PMML model in Spark machine learning or R? Please reply ASAP.

Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Edward Capriolo
You cannot 'add jar' input formats and SerDes. They need to be part of your auxlib. On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion wrote: > I tried now. still getting > > 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: >

Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join the mailing list. On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele wrote: > >

Re: write new data to mysql

2016-01-08 Thread Todd Nist
It is not clear from the information provided why the insertIntoJDBC failed in #2. I would note that method on the DataFrame has been deprecated since 1.4; not sure what version you're on. You should be able to do something like this:
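
A hedged Scala sketch of what that could look like with the DataFrameWriter API available since 1.4; the JDBC URL and credentials are placeholders, while peopleDataFrame and the table name come from the original question:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")

    // Append rows to the existing table instead of trying to create it
    peopleDataFrame.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:mysql://localhost:3306/mydb", "track_on_alarm", props)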

write new data to mysql

2016-01-08 Thread Yasemin Kaya
Hi, I want to write a DataFrame to an existing MySQL table, but when I use *peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE, "track_on_alarm",false)* it says "Table track_on_alarm already exists." And when I *use peopleDataFrame.insertIntoJDBC(MYSQL_CONNECTION_URL_WRITE,

Efficient join multiple times

2016-01-08 Thread Jason White
I'm trying to join a constant large-ish RDD to each RDD in a DStream, and I'm trying to keep the join as efficient as possible so each batch finishes within the batch window. I'm using PySpark on 1.6. I've tried the trick of keying the large RDD into (k, v) pairs and using
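
One common pattern for this, shown here as a Scala sketch (the thread itself is PySpark; largeRdd, pairDStream and keyOf are placeholders): partition and cache the static RDD once, then join inside transform() so its partitioning is reused on every batch.

    import org.apache.spark.HashPartitioner

    // Partition the constant RDD once and keep it in memory
    val partitioner = new HashPartitioner(64)
    val staticPairs = largeRdd.map(v => (keyOf(v), v))
      .partitionBy(partitioner)
      .cache()

    // Join every micro-batch against the cached, pre-partitioned RDD
    val joined = pairDStream.transform(batch => batch.join(staticPairs))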

Re: write new data to mysql

2016-01-08 Thread Ted Yu
Which Spark release are you using ? For case #2, was there any error / clue in the logs ? Cheers On Fri, Jan 8, 2016 at 7:36 AM, Yasemin Kaya wrote: > Hi, > > I want to write dataframe existing mysql table, but when i use >

Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Ophir Etzion
I tried now. still getting 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:

Re: Spark Context not getting initialized in local mode

2016-01-08 Thread Dean Wampler
ClassNotFoundException usually means one of a few problems: 1. Your app assembly is missing the jar files with those classes. 2. You mixed jar files from incompatible versions in your assembly. 3. You built with one version of Spark and deployed to another. Dean Wampler, Ph.D. Author:

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread jiml
(The point of this post is to see if anyone has ideas about the errors at the end of the post.) In addition, the real way to test whether it's working is to force serialization. In Java: create an array of all your classes: // for the Kryo serializer it wants to register all classes that need to be serialized Class[]
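
A Scala sketch of that registration step; MyRecord is an example class, not the poster's, and the WrappedArray$ofRef entry is the class named in the registration hint discussed in this thread:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Int, name: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // With registrationRequired=true, every class that gets serialized must be listed
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array[Class[_]](
        classOf[MyRecord],
        classOf[Array[MyRecord]],
        Class.forName("scala.collection.mutable.WrappedArray$ofRef")
      ))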

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread Ted Yu
bq. try adding scala.collection.mutable.WrappedArray But the hint said registering scala.collection.mutable.WrappedArray$ofRef.class , right ? On Fri, Jan 8, 2016 at 8:52 AM, jiml wrote: > (point of post is to see if anyone has ideas about errors at end of post) > >