Re: Imported CSV file content isn't identical to the original file

2016-02-14 Thread SLiZn Liu
This error message no longer appears now that I have upgraded to 1.6.0. -- Cheers, Todd Leo On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu <sliznmail...@gmail.com> wrote: > It at least works for me, though; I temporarily disabled the Kryo serializer until upgrading to 1.6.0. Appreciate your update. :) >

Is this Task Scheduler Error normal?

2016-02-10 Thread SLiZn Liu
Hi Spark Users, I’m running Spark jobs on Mesos, and sometimes I get a vast number of Task Scheduler errors: ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1161 because its task set is gone (this is likely the result of receiving duplicate task finished status updates). It

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
| 2015-11-0400:00:31| > |1446566431 | 2015-11-0400:00:31| > +--+------+ > > > > > On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com> wrote: > >> Hi Spark Users Group, >> >> I have a csv file to analysis wi

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
I’ve found the trigger of my issue: if I start spark-shell, or submit via spark-submit, with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer, the DataFrame content goes wrong, as I described earlier. On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmail...@gmail.com>
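
For reference, a minimal sketch of the configuration being compared (the app name and context setup are illustrative; only the spark.serializer setting is the point, and it corresponds to the --conf flag quoted above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Sketch: the only difference between the "good" and "bad" runs reported above
    // is whether this serializer setting is present.
    val conf = new SparkConf()
      .setAppName("csv-import-check") // illustrative name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // remove to fall back to the default Java serializer
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)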

Re: Imported CSV file content isn't identical to the original file

2016-02-08 Thread SLiZn Liu
It at least works for me, though; I temporarily disabled the Kryo serializer until upgrading to 1.6.0. Appreciate your update. :) Luciano Resende <luckbr1...@gmail.com> wrote on Tue, Feb 9, 2016 at 02:37: > Sorry, same expected results with trunk and Kryo serializer > > On Mon, Feb 8, 2016 at 4:1

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. I also tried HiveContext, but the result is exactly the same. On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu <sliznmail...@gmail.com> wrote: > Hi Spark Users Group, > > I have a csv file to analyze with Spark, but I’m having trouble

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
are missing. Good to know the way to show the whole content of a cell. — BR, Todd Leo On Sun, Feb 7, 2016 at 5:42 PM Igor Berman <igor.ber...@gmail.com> wrote: > show has a truncate argument > pass false so it won't truncate your results > > On 7 February 2016 at 11:01, SL
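
A minimal illustration of the suggestion above (`df` stands for any DataFrame; the row count is arbitrary):

    // show() truncates cell contents to 20 characters by default;
    // passing truncate = false prints each cell in full
    df.show(20, false)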

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread SLiZn Liu
and have great fortune in the Year of the Monkey! — BR, Todd Leo On Sun, Feb 7, 2016 at 6:09 PM SLiZn Liu <sliznmail...@gmail.com> wrote: > Hi Igor, > > In my case, it’s not a matter of *truncate*. As the entry for the show() function in > the Spark API doc reads, > > truncate: Whether trunca

Imported CSV file content isn't identical to the original file

2016-02-06 Thread SLiZn Liu
Hi Spark Users Group, I have a csv file to analyze with Spark, but I’m having trouble importing it as a DataFrame. Here’s a minimal reproducible example. Suppose I have a *10(rows)x2(cols)* *space-delimited csv* file, as shown below:
1446566430 2015-11-0400:00:30
1446566430 2015-11-0400:00:30
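
A sketch of how such a file would typically be loaded with spark-csv 1.3.0 (the HDFS path and the header/schema options are assumptions; the key point is the single-space delimiter):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")        // the sample file has no header row
      .option("delimiter", " ")         // space-delimited, as described above
      .option("inferSchema", "true")
      .load("hdfs:///path/to/file.csv") // illustrative path
    df.show(20, false)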

Re: Save GraphX to disk

2015-11-13 Thread SLiZn Liu
Hi Gaurav, Your graph can be saved to graph databases like Neo4j or Titan through their drivers, which eventually persist it to disk. BR, Todd Gaurav Kumar <gauravkuma...@gmail.com> wrote on Fri, Nov 13, 2015 at 22:08: > Hi, > > I was wondering how to save a graph to disk and load it back again. I know > how
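
If a graph database is not available, a plain-HDFS alternative is to persist the vertex and edge RDDs and rebuild the graph on load. This is only a sketch; `graph` and `sc` come from the asker's context, and the vertex/edge attribute types and paths are illustrative:

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // save: write the two underlying RDDs as Hadoop object files
    graph.vertices.saveAsObjectFile("hdfs:///graphs/demo/vertices")
    graph.edges.saveAsObjectFile("hdfs:///graphs/demo/edges")

    // load: read them back and reassemble the graph
    val vertices = sc.objectFile[(VertexId, String)]("hdfs:///graphs/demo/vertices")
    val edges = sc.objectFile[Edge[Int]]("hdfs:///graphs/demo/edges")
    val restored = Graph(vertices, edges)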

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread SLiZn Liu
Hi Jerry, I think you are referring to --no-switch_user. =) <chiling...@gmail.com> wrote on Mon, Oct 19, 2015 at 21:05: > Can you try setting SPARK_USER at the driver? It is used to impersonate > users at the executor. So if you have a user set up for launching Spark jobs > on the executor machines, simply

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread SLiZn Liu
ons($"col")) .rdd.map( x: Row => (k, v) ) > .combineByKey() > > Deenar > > On 14 October 2015 at 05:18, SLiZn Liu <sliznmail...@gmail.com> wrote: > >> Hey Spark Users, >> >> I kept getting java.lang.OutOfMemoryError: Java heap space as I read a &

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread SLiZn Liu
ks.com> >> wrote: >> >> import org.apache.spark.sql.functions._ >> >> df.groupBy("category") >> .agg(callUDF("collect_set", df("id")).as("id_list")) >> >> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
;category") > .agg(callUDF("collect_set", df("id")).as("id_list")) > > On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com> > wrote: > >> Hey Spark users, >> >> I'm trying to group by a dataframe, by appen

OutOfMemoryError When Reading Many json Files

2015-10-13 Thread SLiZn Liu
Hey Spark Users, I kept getting java.lang.OutOfMemoryError: Java heap space as I read a massive number of JSON files iteratively via read.json(). Even though the resulting RDD is rather small, I still get the OOM error. The brief structure of my program reads as follows, in pseudo-code:
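
One way to avoid accumulating state across many iterative read.json() calls is to let a single call read every file via a glob. A sketch, under the assumption that the files share a schema (the path and column name are placeholders):

    // Read every matching file in one pass; one DataFrame is built
    // instead of one per file in a loop.
    val df = sqlContext.read.json("hdfs:///data/events/*.json")
    val smallRdd = df.select("someField").rdd // "someField" is a placeholder column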

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
(r)) > > > > You can always convert the obtained RDD back to a DataFrame after the transformation and reduce. > > > Regards, > Rishitesh Mishra, > SnappyData . (http://www.snappydata.io/) > > > https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf

Spark DataFrame GroupBy into List

2015-10-13 Thread SLiZn Liu
Hey Spark users, I'm trying to group a DataFrame by appending occurrences into a list instead of counting them. Let's say we have a DataFrame as shown below:

| category | id |
|:--------:|:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

ideally, after
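
The replies quoted above solve this with the collect_set aggregate. A sketch against the example table (assuming a HiveContext on Spark 1.5.x, since collect_set is a Hive UDAF there):

    import org.apache.spark.sql.functions.callUDF

    // Collect all ids for each category into one list-like column,
    // instead of merely counting them.
    val grouped = df.groupBy("category")
      .agg(callUDF("collect_set", df("id")).as("id_list"))
    grouped.show()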

Re: Streaming Receiver Imbalance Problem

2015-09-23 Thread SLiZn Liu
m> wrote: Also, you could switch to the Direct Kafka API, which was first released as > experimental in 1.3. In 1.5 we graduated it from experimental, but it's > quite usable in Spark 1.3.1 > > TD > > On Tue, Sep 22, 2015 at 7:45 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:

Re: Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
es.apache.org/jira/browse/SPARK-8882 > > On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu <sliznmail...@gmail.com> > wrote: > >> Hi Spark users, >> >> In our Spark Streaming app with Kafka integration on Mesos, we initialized 3 >> receivers to receive 3 Kafka partition

Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Hi Spark users, In our Spark Streaming app with Kafka integration on Mesos, we initialized 3 receivers to receive 3 Kafka partitions, but an imbalance in the record-receiving rate has been observed: with spark.streaming.receiver.maxRate set to 120, sometimes one of the receivers receives very close to the limit
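
The reply above recommends the receiver-less Direct Kafka API, which sidesteps receiver imbalance by mapping Kafka partitions to RDD partitions directly. A sketch (assuming the spark-streaming-kafka artifact is on the classpath; `ssc` is the app's StreamingContext, and broker addresses and the topic name are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("mytopic")
    // One direct stream instead of three receivers; each Kafka partition
    // becomes one partition of every batch's RDD.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)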

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-07-01 Thread SLiZn Liu
org.apache.hbase:hbase:1.1.1, junit:junit:x --repositories http://some.other.repo,http://some.other.repo2 $YOUR_JAR Best, Burak On Mon, Jun 29, 2015 at 11:33 PM, SLiZn Liu <sliznmail...@gmail.com> wrote: Hi Burak, Is the `--packages` flag only available for Maven coordinates, with no sbt support? On Tue, Jun 30, 2015 at 2:26

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-30 Thread SLiZn Liu
29, 2015 at 10:46 PM, SLiZn Liu <sliznmail...@gmail.com> wrote: Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is package a **fat jar**: place dependencies in `build.sbt`, and use `sbt assembly` to package **all dependencies** into one big jar. The rest of the work is to copy

Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread SLiZn Liu
Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is package a **fat jar**: place dependencies in `build.sbt`, and use `sbt assembly` to package **all dependencies** into one big jar. The rest of the work is to copy the fat jar to the Spark master node and then launch it by
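
Whichever way the dependencies are resolved, one common way to keep the assembly manageable (a `build.sbt` sketch only; artifact versions are illustrative) is to mark the Spark artifacts as provided, so `sbt assembly` bundles only the application's own dependencies such as the HBase client:

    // build.sbt (fragment): Spark is already on the cluster, so exclude it from the fat jar
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"   % "1.4.0" % "provided", // version illustrative
      "org.apache.hbase"  % "hbase-client" % "1.1.1"
    )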

Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hi Spark Users, I'm trying to load a really big file (50GB when compressed as a gzip file, stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this file cannot fit in my memory. However, it looks like no RDD will be received until I copy this big file to a pre-specified

Re: Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
in this use case? 50 GB need not be in memory. Give it a try with a high number of partitions. On 11 Jun 2015 23:09, SLiZn Liu <sliznmail...@gmail.com> wrote: Hi Spark Users, I'm trying to load a really big file (50GB when compressed as a gzip file, stored in HDFS) by receiving a DStream using
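
The reply above points out that the 50 GB file does not need to fit in memory. A batch-oriented sketch of that idea (path and partition count are illustrative): a gzip file is not splittable, so it is read as a single partition and then redistributed:

    // Read the compressed file (one partition, because gzip is not splittable),
    // then spread the decompressed lines across many partitions for processing.
    val lines = sc.textFile("hdfs:///data/big-file.gz").repartition(400)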

Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
However, this returns a single column of c, without showing the original col1. On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha <sriharsha@gmail.com> wrote: df.groupBy($"col1").agg(count($"col1").as("c")).show On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu <sliznmail...@gmail.com> wrote: Hi Spark

Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
, 2015 at 11:22 PM, SLiZn Liu <sliznmail...@gmail.com> wrote: However, this returns a single column of c, without showing the original col1. On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha <sriharsha@gmail.com> wrote: df.groupBy($"col1").agg(count($"col1").as("c")).show On Thu, May 21, 2015 at 3

DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
Hi Spark Users Group, I’m doing a groupBy operation on my DataFrame *df* as follows, to get the count for each value of col1:

df.groupBy("col1").agg("col1" -> "count").show // I don't know if this is the right way to write it.

col1 COUNT(col1#347)
aaa  2
bbb  4
ccc  4
... and more ...

As I’d like to
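
A sketch of the aliasing suggested in the reply above (whether the grouping column col1 is retained alongside the aggregate depends on the Spark version, as the follow-up message notes):

    import org.apache.spark.sql.functions.count

    // Give the aggregate a readable alias instead of the generated COUNT(col1#347) name
    df.groupBy("col1").agg(count("col1").as("c")).show()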

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
= ... sqlContext.createDataFrame(rdd, schema) 2015-05-13 12:00 GMT+02:00 SLiZn Liu <sliznmail...@gmail.com>: Additionally, after I successfully packaged the code and submitted it via spark-submit webcat_2.11-1.0.jar, the following error was thrown at the line where toDF() was called: Exception in thread

Fwd: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
toDF is not a member of RDD object To: SLiZn Liu sliznmail...@gmail.com Are you sure that you are submitting it correctly? Can you post the entire command you are using to run the .jar file via spark-submit? On Wed, May 13, 2015 at 4:07 PM, SLiZn Liu sliznmail...@gmail.com wrote: No, creating

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
. What else should I try? REGARDS, Todd Leo On Wed, May 13, 2015 at 11:27 AM SLiZn Liu <sliznmail...@gmail.com> wrote: Thanks folks, I really appreciate all your replies! I tried each of your suggestions, and in particular *Animesh*’s second suggestion of *making the case class definition global* helped

value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
Hi User Group, I’m trying to reproduce the example in the Spark SQL Programming Guide https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, and got a compile error when packaging with sbt: [error] myfile.scala:30: value toDF is not a member of

Re: value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
wrote: you need to instantiate a SQLContext: val sc: SparkContext = ... val sqlContext = new SQLContext(sc) import sqlContext.implicits._ On Tue, May 12, 2015 at 12:29, SLiZn Liu <sliznmail...@gmail.com> wrote: I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.3.1"
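
The fix quoted above, expanded into a small end-to-end sketch (the case class and sample data are illustrative; the case class must be defined at the top level, outside any method, as noted elsewhere in the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Must be defined at the top level (not inside main) so toDF() can derive its schema by reflection
    case class Record(word: String, count: Int)

    object ToDfExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("toDF-example"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._ // this import is what brings toDF() into scope

        val df = sc.parallelize(Seq(Record("a", 1), Record("b", 2))).toDF()
        df.show()
      }
    }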

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi, I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL and DataFrame Guide https://spark.apache.org/docs/latest/sql-programming-guide.html step by step. However, my HiveQL query via sqlContext.sql() fails and a java.lang.OutOfMemoryError is raised. The expected result of such